LoRA Adapters: When a Naive Merge Leads to Poor Performance
The case of LoRA adapters fine-tuned with QLoRA
QLoRA is a memory-efficient way to fine-tune LLMs. It quantizes the base LLM to 4-bit and then fine-tunes a LoRA adapter on top of it. I have used this method many times in my previous articles to fine-tune GPT-NeoX, Falcon, and Llama 2 models.
QLoRA only saves the fine-tuned adapter, not the entire model, since the base model's parameters are kept frozen.
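To make this concrete, here is a minimal sketch of a QLoRA fine-tuning setup with transformers, bitsandbytes, and PEFT. The model name, LoRA hyperparameters, and output path are placeholders, not the exact settings from my experiments.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the base model, as in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapter trained on top of the frozen, quantized base model
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# ... fine-tuning loop (e.g., with transformers.Trainer) ...

# Only the adapter weights are written to disk; the frozen base model is not
model.save_pretrained("./qlora-adapter")
```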
But then, what do we do with this adapter?
We have two solutions to use it (both sketched in the snippet after this list):
Load it on top of the base model every time we need it
Merge it with the base model to get a new model
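Here is how both solutions look in their most straightforward form, using PEFT's `PeftModel.from_pretrained` and `merge_and_unload` (model name and paths are placeholders). As the next paragraph explains, this naive version is exactly what can hurt performance with QLoRA adapters.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Naive version: load the base model without reproducing the QLoRA quantization step
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
)

# Solution 1: attach the adapter on top of the base model at inference time
model = PeftModel.from_pretrained(base, "./qlora-adapter")

# Solution 2: merge the adapter into the base model and save a standalone model
merged = model.merge_and_unload()
merged.save_pretrained("./llama-2-7b-qlora-merged")
```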
For both solutions, we have to be careful. We can’t just naively load the base model and then load the adapter on top of it. We have to load the base model and preprocess it the same way it was preprocessed during QLoRA fine-tuning; otherwise, we may see a significant performance drop. The same applies if we want to merge the adapter.
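For the first solution, a reasonable way to avoid this mismatch is to reload the base model with the same quantization configuration it had during fine-tuning before attaching the adapter. This is only a sketch: it assumes the standard NF4 QLoRA setup and hypothetical paths, and it does not cover the merging case, which the rest of the article deals with.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Re-create the same quantization the base model had during QLoRA fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# The adapter now sits on top of a base model in the same state it saw during training
model = PeftModel.from_pretrained(base, "./qlora-adapter")
```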
In this article, I show you how to use the fine-tuned adapter. We will see that merging an adapter fine-tuned with QLoRA is not trivial, but there is a method that avoids the performance drop after merging, which I will explain and benchmark. All the code to reproduce my experiments, along with the optimal merging strategy, is available on the notebook page: