QLoRA: Fine-Tune a Large Language Model on Your GPU
Fine-tuning models with billions of parameters is now possible on consumer hardware
Most large language models (LLMs) are too big to be fine-tuned on consumer hardware. For instance, to fine-tune a 65 billion parameter model, we need more than 780 GB of GPU memory. This is equivalent to ten A100 80 GB GPUs. In other words, you would need cloud computing to fine-tune your models.
Now, with QLoRA (Dettmers et al., 2023), you can do it with only one A100.
In this article, I introduce QLoRA. I describe how it works and show how to use it to fine-tune a GPT model with 20 billion parameters on your GPU.
Note: I used my own NVIDIA RTX 3060 12 GB to run all the code described in this post. You can also use a free instance of Google Colab to achieve the same results. If your GPU has less memory, you will have to use a smaller LLM.
Last update: March 25th, 2024
QLoRA: Quantized LLMs with Low-Rank Adapters
In June 2021, Hu et al. (2021) introduced low-rank adapters (LoRA) for LLMs.
LoRA adds a small number of trainable parameters, i.e., an adapter, to each layer of the LLM and freezes all the original parameters. For fine-tuning, we only have to update the adapter weights, which significantly reduces memory consumption.
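To make this concrete, here is a minimal sketch of what attaching LoRA adapters looks like with Hugging Face's peft library. The base model (GPT-Neo 125M) and the hyperparameters are illustrative choices on my part, not the configuration used later in this article.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# A small base model, chosen only to illustrate the idea (illustrative choice).
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                        # scaling factor applied to the adapter update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-specific names)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Freeze the original weights and attach the trainable adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters are trainable
```

Only the adapter matrices receive gradients; the frozen base weights are untouched, which is what keeps the gradient and optimizer-state memory small.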
QLoRA goes three steps further by introducing 4-bit quantization, double quantization, and the exploitation of NVIDIA’s unified memory for paging.
In a few words, each of these steps works as follows (a short configuration sketch follows below):
4-bit NormalFloat quantization: This is a method that improves upon quantile quantization by ensuring an equal number of values in each quantization bin, which avoids the computational issues and errors that outlier values would otherwise cause.
Double quantization: The authors of QLoRA define it as follows: “the process of quantizing the quantization constants for additional memory savings.”
Paging with unified memory: It relies on the NVIDIA Unified Memory feature and automatically handles page-to-page transfers between the CPU and GPU. It ensures error-free GPU processing, especially in situations where the GPU may run out of memory.
All of these steps drastically reduce the memory consumption of fine-tuning, while performing almost on par with standard fine-tuning.
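In the Hugging Face ecosystem, these three ingredients map to a handful of configuration options in transformers and bitsandbytes. The snippet below is a minimal sketch; the compute dtype and the exact optimizer variant are assumptions on my part.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NormalFloat quantization with double quantization of the constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize the frozen base model to 4-bit
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the dequantized computations (assumption)
)

# The paged optimizer relies on NVIDIA unified memory to avoid out-of-memory
# errors during memory spikes.
training_args = TrainingArguments(
    output_dir="qlora-output",
    optim="paged_adamw_8bit",
)
```

The quantization config is passed to from_pretrained when loading the base model, and the paged optimizer is selected through the optim argument of TrainingArguments.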
Fine-tuning a GPT model with QLoRA
All the following code can run on a free instance of Google Colab. You can try it using notebook #2, which you will find here:
Hardware requirements for QLoRA:
Note: All the links in this section are Amazon affiliate links.
GPU: It works on a GPU with 12 GB of VRAM for a model with fewer than 20 billion parameters, e.g., GPT-J. For instance, I ran it with my RTX 3060 12 GB. If you have a bigger card with 24 GB of VRAM, you can do it with a 20 billion parameter model, e.g., GPT-NeoX-20b.
GPU cards with 24 GB are getting quite cheap, for instance the PNY GeForce RTX 3090 24 GB. If you have a bigger budget, the PNY GeForce RTX 4090 24 GB is still affordable for a card of this size.
RAM: I recommend a minimum of 6 GB. Most recent computers have enough RAM.
If you lack RAM, I recommend buying a pair of RAM modules such as the Corsair VENGEANCE LPX DDR4 16GB (2x8GB).
Hard drive: GPT-J and GPT-NeoX-20b are both very big models. I recommend at least 80 GB of free space.
I recommend an SSD, for instance, the SAMSUNG 970 EVO Plus 500 GB if your motherboard supports NVMe M2 SSD, or the SAMSUNG 870 EVO SATA SSD 500GB.
If your machine doesn’t meet these requirements, a free instance of Google Colab will be enough.
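To give you an idea of what the notebook does end to end, here is a condensed sketch of QLoRA fine-tuning for GPT-J with transformers, peft, bitsandbytes, and datasets. The dataset, target modules, and hyperparameters are placeholders for illustration, not the exact setup of the notebook.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "EleutherAI/gpt-j-6b"  # small enough for a 12 GB GPU once quantized to 4-bit

# Load the frozen base model in 4-bit NormalFloat with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # casts norms, enables gradient checkpointing

# Attach LoRA adapters to the attention projections (module names are model-specific).
lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder dataset: any causal language modeling corpus works here.
dataset = load_dataset("Abirate/english_quotes", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["quote"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="qlora-gptj",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        optim="paged_adamw_8bit",  # paged optimizer backed by unified memory
        logging_steps=10,
    ),
)
trainer.train()
model.save_pretrained("qlora-gptj-adapter")  # saves only the small adapter weights
```

At the end, only the adapter weights are saved, which keeps the checkpoint small and lets you load it later on top of the same quantized base model.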