Don’t Let QLoRA Merging Undo Your Fine-Tuning Work
Revisiting an old recipe with today's frameworks.
QLoRA fine-tuning is a highly effective and widely used parameter-efficient fine-tuning method. It works by quantizing a model to lower precision, typically 4-bit with bitsandbytes, and then training an adapter, a small set of new trainable parameters, on top of the quantized model.
Similar to LoRA fine-tuning, QLoRA only updates the adapter, so optimizer states and gradients are stored only for these parameters. As a result, memory consumption is significantly lower than with full fine-tuning. Since QLoRA quantizes the model, the memory required to load it is typically reduced by approximately 3.5x compared to loading it in 16-bit precision.
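For reference, here is a minimal sketch of a typical QLoRA setup with Transformers and PEFT. The base model name, LoRA rank, and target modules are illustrative assumptions, not the exact configuration used later in this article.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",      # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# The LoRA adapter: only these new parameters are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # adjust for the model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```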
What Happens to the Adapter After Fine-Tuning?
Once fine-tuning is complete, we have two main options for using the adapter:
1. Loading the adapter dynamically on top of the quantized base model whenever it is needed (a short sketch follows this list).
2. Merging the adapter into the base model permanently, eliminating the need to manage the adapter separately.
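The first option looks roughly like the following sketch, assuming the adapter was saved to a local directory (the model name and adapter path are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Reload the base model with the same 4-bit configuration used during fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",            # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Option 1: attach the trained adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")  # illustrative adapter path
```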
This article focuses on the second approach: merging the adapter into the base model. However, merging a QLoRA adapter is not as straightforward as many online tutorials suggest. A naive merge, similar to what is done with LoRA adapters, can severely degrade model accuracy.
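To make the problem concrete, the snippet below sketches the naive merge shown in many tutorials: the adapter is folded into the original 16-bit weights rather than the quantized weights it was trained against (model and adapter paths are placeholders). The rest of the article examines why this can hurt accuracy and how to merge correctly.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Naive merge: reload the base model in 16-bit, attach the adapter, and fold
# the LoRA weights into the base weights. These 16-bit weights are not the
# quantized weights the adapter was trained against.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",            # illustrative base model
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")  # illustrative adapter path
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```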
We will explore the correct merging process, revisiting an older merging approach we reviewed 18 months ago to determine if it remains necessary. Additionally, we will assess merging directly into the quantized model, as implemented in Hugging Face PEFT. Furthermore, we will examine why post-merging quantization should not be performed with bitsandbytes, and instead evaluate alternative quantization methods, comparing their effectiveness.
The full implementation of the optimal merging process can be found in this notebook: