Fine-tune LLMs on Your CPU with QLoRA

Finally a tutorial that doesn't require CUDA

Benjamin Marie
Jan 04, 2024

QLoRA is now the default method for fine-tuning large language models (LLMs) on consumer hardware. For instance, with QLoRA, we only need 8 GB of GPU VRAM to fine-tune Mistral 7B or Llama 2 7B, whereas standard fine-tuning would require at least 24 GB of VRAM.

QLoRA reduces memory consumption thanks to 4-bit quantization. This is usually performed with the bitsandbytes package, which optimizes quantization and QLoRA fine-tuning for GPUs.
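
For illustration, this is what 4-bit NF4 loading typically looks like with bitsandbytes through Transformers on a GPU. This is a minimal sketch: the model name matches the article, but the exact quantization settings are my assumptions, not necessarily the article's.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization config for bitsandbytes (a GPU-only backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the base model with its weights quantized to 4-bit NF4
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```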

Does this mean QLoRA fine-tuning isn't possible on a CPU?

As part of its work on the Intel Extension for Transformers, Intel has optimized QLoRA fine-tuning to make it possible on a CPU.


In this article, I show how to use the Intel Extension for Transformers to fine-tune Mistral 7B using only your CPU. I experimented with it on Google Colab's old and slow CPU. We will see that while fine-tuning on an old CPU is indeed possible, a powerful and recent CPU is needed to complete fine-tuning in a reasonable time.

I have implemented a demonstration notebook for Google Colab. It's available here:

Get the notebook (#36)

QLoRA with a CPU: How Is It Possible?

QLoRA fine-tunes a LoRA adapter on top of a base LLM quantized with the NormalFloat4 (NF4) data type. Only the parameters of the added LoRA adapter are trained. In other words, instead of updating the billions of parameters of the base LLM, QLoRA only updates the millions of parameters of the adapter.
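
As a minimal sketch of this setup with Hugging Face's peft library, continuing from the quantized model loaded above (the rank and target modules are illustrative choices, not necessarily the article's settings):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized base model for training
# (casts layer norms, enables gradients on inputs)
model = prepare_model_for_kbit_training(model)

# LoRA adapter: only these low-rank matrices will be trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Typically prints a few million trainable parameters out of ~7 billion
model.print_trainable_parameters()
```

Only the adapter's low-rank matrices receive gradients; the 4-bit base weights stay frozen.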

I recommend reading the following article for a more extensive explanation of how QLoRA works:

QLoRa: Fine-Tune a Large Language Model on Your GPU

Yet, even with the reduced number of trainable parameters, efficient QLoRA fine-tuning is challenging on a CPU, since CPUs don't natively support the NF4 data type.
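
To see what that implies, here is a small sketch of NF4 dequantization: each weight is stored as a 4-bit index into a table of 16 levels, and on hardware without native support every weight must be looked up and rescaled before any matrix multiplication can run. The level values below are the NF4 table hardcoded in bitsandbytes; the packing layout and block handling are simplified for illustration.

```python
import numpy as np

# The 16 NF4 levels (normal-distribution quantiles), as hardcoded in bitsandbytes
NF4_LEVELS = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=np.float32)

def dequantize_nf4(packed: np.ndarray, absmax: float) -> np.ndarray:
    """Unpack two 4-bit codes per byte and map them back to floats.

    `packed` holds 4-bit indices into NF4_LEVELS; `absmax` is the
    per-block scale saved at quantization time. The packing order
    (high nibble first) is a simplification for illustration.
    """
    high = packed >> 4    # first code in each byte
    low = packed & 0x0F   # second code in each byte
    codes = np.stack([high, low], axis=-1).reshape(-1)
    return NF4_LEVELS[codes] * absmax

# Example: one block of 4 packed bytes -> 8 dequantized weights
block = np.array([0x70, 0xF0, 0x07, 0x8F], dtype=np.uint8)
print(dequantize_nf4(block, absmax=0.5))
```

Doing this lookup-and-rescale efficiently for billions of weights is the kind of problem Intel's CPU optimizations have to solve.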
