QLoRA: Fine-Tune a Large Language Model on Your GPU
Fine-tuning models with billions of parameters is now possible on consumer hardware
Most large language models (LLMs) are too big to be fine-tuned on consumer hardware. For instance, fine-tuning a 65-billion-parameter model requires more than 780 GB of GPU memory, the equivalent of ten A100 80 GB GPUs. In other words, you would need cloud computing to fine-tune your models.
Now, with QLoRA (Dettmers et al., 2023), you can do it with a single A100.
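The memory figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate: the 12 bytes per parameter (16-bit weight, 16-bit gradient, and two 32-bit Adam moments) is an assumed breakdown that happens to match the 780 GB figure, the function names are hypothetical, and activations are ignored.

```python
import math

A100_GB = 80  # memory of one A100 80 GB card


def full_finetune_gb(n_params, bytes_per_param=12):
    # Assumed breakdown per parameter (not taken from the post):
    # 2 bytes (16-bit weight) + 2 bytes (16-bit gradient)
    # + 8 bytes (two 32-bit Adam moments) = 12 bytes.
    return n_params * bytes_per_param / 1e9


def four_bit_weights_gb(n_params):
    # 4 bits = 0.5 byte per parameter, weights only.
    return n_params * 0.5 / 1e9


n_params = 65e9
needed = full_finetune_gb(n_params)
print(f"Full fine-tuning: {needed:.0f} GB")
print(f"A100 80 GB cards needed: {math.ceil(needed / A100_GB)}")
print(f"4-bit weights only: {four_bit_weights_gb(n_params):.1f} GB")
```

With 4-bit weights, the frozen model takes roughly 32.5 GB, which leaves room on a single 80 GB card for the adapters, their optimizer states, and activations.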
In this blog post, I will introduce QLoRA. I will briefly describe how it works, and then we will see how to use it to fine-tune a GPT model with 20 billion parameters on your GPU.
Note: I used my own NVIDIA RTX 3060 12 GB to run all the commands in this post. You can also use a free instance of Google Colab to achieve the same results. If you want to use a GPU with less memory, you will have to use a smaller LLM.
QLoRA: Quantized LLMs with Low-Rank Adapters
In June 2021, Hu et al. (2021) introduced low-rank adapters (LoRA) for LLMs.
LoRA adds a small number of trainable parameters, i.e., adapters, to each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights, which significantly reduces the memory footprint.
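To make the adapter idea concrete, here is a minimal pure-Python sketch of a linear layer with a low-rank adapter. It assumes the usual LoRA formulation, output = W x + (alpha / r) B A x, with W frozen, A initialized randomly, and B initialized to zero. The class and helper names are hypothetical, and real implementations (for instance in PEFT) work on tensors rather than Python lists.

```python
import random


def matvec(M, x):
    # Multiply matrix M (list of rows) by vector x.
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    def __init__(self, W, r, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weights, never updated
        # Trainable adapter: A is r x d_in (small random), B is d_out x r (zeros).
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def lora_param_count(d_out, d_in, r):
    # Adapter size without building the matrices.
    return r * (d_in + d_out)


full = 4096 * 4096
adapter = lora_param_count(4096, 4096, r=8)
print(f"Full matrix: {full:,} parameters")
print(f"Rank-8 adapter: {adapter:,} parameters ({100 * adapter / full:.2f}%)")
```

Because B starts at zero, the adapted layer initially computes exactly the same output as the frozen layer. For a 4096 x 4096 matrix, a rank-8 adapter trains 65,536 parameters, about 0.39% of the original matrix.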
QLoRA goes three steps further by introducing 4-bit quantization, double quantization, and the exploitation of NVIDIA unified memory for paging.
In a few words, each of these steps works as follows:
4-bit NormalFloat quantization: This is a method that improves upon quantile quantization. It ensures an equal number of values in each quantization bin. This avoids computational issues and errors for outlier values.
Double quantization: The authors of QLoRA define it as follows: “the process of quantizing the quantization constants for additional memory savings.”
Paging with unified memory: It relies on the NVIDIA Unified Memory feature and automatically handles page-to-page transfers between the CPU and GPU. It ensures error-free GPU processing, especially in situations where the GPU may run out of memory.
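The quantization step can be illustrated with a simplified sketch. It builds 16 levels from the quantiles of a standard normal distribution, so that normally distributed weights fall roughly evenly into the bins, and then quantizes one block of weights with an absmax scale. This is an approximation for illustration only: the function names are hypothetical, and the actual NF4 data type in bitsandbytes uses a slightly different, asymmetric set of levels that includes an exact zero.

```python
from statistics import NormalDist


def quantile_levels(bits=4):
    # Midpoints of 2**bits equal-probability bins of N(0, 1), rescaled to [-1, 1].
    n = 2 ** bits
    normal = NormalDist()
    quantiles = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(block, levels):
    # One scaling constant (absmax) per block, then nearest-level lookup.
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    return [levels[c] * absmax for c in codes]


levels = quantile_levels()
weights = [0.03, -0.41, 0.12, 0.0, -0.07, 0.25]
codes, absmax = quantize_block(weights, levels)
restored = dequantize_block(codes, absmax, levels)
print("4-bit codes:", codes)
print("restored   :", [round(w, 3) for w in restored])
```

Each weight is stored as a 4-bit index, and each block additionally stores its absmax constant. Those per-block constants are what double quantization compresses further.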
All of these steps drastically reduce the memory requirements for fine-tuning, while performing almost on par with standard fine-tuning.
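As a rough illustration of how much double quantization saves, the sketch below computes the per-parameter overhead of the quantization constants. It assumes the settings reported in the QLoRA paper: blocks of 64 weights with one 32-bit constant each, and, with double quantization, 8-bit constants that are themselves grouped in blocks of 256 with one 32-bit constant per group. The function names are hypothetical.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    # One full-precision constant shared by `block_size` weights.
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               quantized_const_bits=8, second_const_bits=32):
    # 8-bit constants per block, plus one 32-bit constant
    # shared by `second_block_size` first-level constants.
    first_level = quantized_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


before = constant_overhead_bits()
after = double_quant_overhead_bits()
saved_bits = before - after
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9

print(f"Overhead without double quantization: {before:.3f} bits/parameter")
print(f"Overhead with double quantization:    {after:.3f} bits/parameter")
print(f"Saved on a 65B-parameter model:       {saved_gb_65b:.2f} GB")
```

Under these assumptions, the overhead drops from 0.5 to about 0.127 bits per parameter, which is roughly 3 GB for a 65-billion-parameter model.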
Fine-tuning a GPT model with QLoRA
All the following code can run on a free instance of Google Colab. You can try it using the notebook #2 that you will find here:
Hardware requirements for QLoRA:
GPU: The following demo works on a GPU with 12 GB of VRAM, for a model with fewer than 20 billion parameters, e.g., GPT-J. For instance, I ran it with my RTX 3060 12 GB. If you have a bigger card with 24 GB of VRAM, you can do it with a 20-billion-parameter model, e.g., GPT-NeoX-20b.
GPU cards with 24 GB of VRAM are getting quite cheap: for instance, the PNY GeForce RTX 3090 24 GB or, if you have a bigger budget, the PNY GeForce RTX 4090 24 GB, which is still affordable for a card of this size.
RAM: I recommend a minimum of 6 GB. Most recent computers have enough RAM.
If you lack RAM, I recommend buying a pair of RAM modules such as the Corsair VENGEANCE LPX DDR4 16GB (2x8GB).
Hard drive: GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 GB of free space.
I recommend an SSD, for instance, the SAMSUNG 970 EVO Plus 500 GB if your motherboard supports NVMe M.2 SSDs, or the SAMSUNG 870 EVO SATA SSD 500 GB.
If your machine doesn’t meet these requirements, a free instance of Google Colab is enough.
Note: All the links in this section are Amazon affiliate links.
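To relate the GPU and disk recommendations above to model size, here is a rough weights-only estimate. The parameter counts are rounded (about 6 billion for GPT-J and 20 billion for GPT-NeoX-20b), the helper name is hypothetical, and the numbers ignore adapters, optimizer states, activations, and file-format overhead.

```python
def weights_gb(n_params, bits_per_param):
    # Size of the weights alone, in GB (1 GB = 1e9 bytes).
    return n_params * bits_per_param / 8 / 1e9


# Rounded parameter counts, used here as an approximation.
MODELS = {"GPT-J": 6e9, "GPT-NeoX-20b": 20e9}

for name, n_params in MODELS.items():
    print(
        f"{name}: 4-bit {weights_gb(n_params, 4):.0f} GB, "
        f"16-bit {weights_gb(n_params, 16):.0f} GB, "
        f"32-bit {weights_gb(n_params, 32):.0f} GB"
    )
```

In 4-bit, a 20-billion-parameter model already takes about 10 GB for the weights alone, which leaves very little headroom on a 12 GB card and is consistent with recommending 24 GB of VRAM for GPT-NeoX-20b. The 16-bit and 32-bit sizes give an idea of how much disk space the downloaded checkpoints may need.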
Software requirements for QLoRA:
We need CUDA. Make sure it is installed on your machine.
We will also need to install all the dependencies:
bitsandbytes: A library that contains all we need to quantize an LLM.
Hugging Face Transformers and Accelerate: These are standard libraries that are used to efficiently train models from Hugging Face Hub.
PEFT: A library that implements various methods for fine-tuning only a small number of (extra) model parameters. We need it for LoRA.
Datasets: This one is not a requirement. We will only use it to get a dataset for fine-tuning. Of course, you can provide your own dataset instead.
We can get all of them with pip:
pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
# The current version of Accelerate on GitHub breaks QLoRA,
# so the following line is disabled and the standard pip release is used instead:
# pip install -q -U git+https://github.com/huggingface/accelerate.git
pip install -q -U accelerate
pip install -q -U datasets
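Once the commands above have run, a quick way to confirm that everything is in place is to query the installed versions. This is an optional check that only uses the Python standard library. The helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["bitsandbytes", "transformers", "peft", "accelerate", "datasets"]


def installed_versions(packages=REQUIRED):
    # Map each package name to its installed version, or None if it is missing.
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```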
Next, we can start writing the Python script.
Loading and Quantization of a GPT Model
We need the following imports to load and quantize an LLM.
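As a minimal sketch, and not necessarily the exact code used in this post, loading a model in 4-bit with Transformers and bitsandbytes could look like the following. It assumes a CUDA GPU and the libraries installed above. The helper names `nf4_config_kwargs` and `load_quantized` are hypothetical, and the bfloat16 compute type is an assumption (float16 is an alternative on GPUs without good bfloat16 support).

```python
def nf4_config_kwargs():
    # Options passed to transformers.BitsAndBytesConfig:
    # 4-bit loading, the NormalFloat (nf4) data type, and double quantization.
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    }


def load_quantized(model_name):
    # Heavy imports stay inside the function so that this file
    # can be imported on a machine without a GPU.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute type
        **nf4_config_kwargs(),
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # let Accelerate place the layers
    )
    return model, tokenizer


# Example (downloads the checkpoint and needs a CUDA GPU):
# model, tokenizer = load_quantized("EleutherAI/gpt-j-6b")
```

The weights are quantized on the fly while they are loaded, so no separately quantized checkpoint is needed.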