QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU
Bitsandbytes is not your only option
Thanks to QLoRA, we can fine-tune large language models (LLMs) on consumer hardware. This parameter-efficient fine-tuning method quantizes the model's parameters, freezes them, and then fine-tunes an adapter on top of the frozen, quantized model.
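For reference, here is a minimal sketch of this setup with Hugging Face Transformers and PEFT, using bitsandbytes NF4 quantization. The model ID and LoRA hyperparameters (rank, target modules, dropout) are illustrative choices, not necessarily the configuration used later in this article.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the base model to 4-bit NF4 at load time (bitsandbytes backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (the base weights stay frozen)
model = prepare_model_for_kbit_training(model)

# Attach a LoRA adapter; only the adapter's parameters will be trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```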
Originally, QLoRA was proposed by the author of the bitsandbytes quantization framework. bitsandbytes quantization performs very well, thanks to its NormalFloat4 (NF4) data type, and most of the QLoRA code that you will find online relies on it. However, bitsandbytes has several limitations: it can't quantize to a precision lower than 4-bit, and it makes the model significantly slower, as we saw in this article:
Moreover, since QLoRA was proposed, several better quantization methods have been published. For instance, we now have HQQ, AQLM, AutoRound, and AWQ.
With Hugging Face PEFT, it is possible to use these quantization methods for QLoRA instead of bitsandbytes, but their impact on fine-tuning performance is understudied.
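As an illustration, here is a sketch of the same setup with HQQ as the quantization backend instead of bitsandbytes. It assumes recent versions of transformers and peft (plus the hqq package) that support LoRA on HQQ-quantized models; for methods applied ahead of time, such as GPTQ or AutoRound exported to a GPTQ-compatible format, you would instead load the pre-quantized checkpoint and attach the adapter in the same way.

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Quantize the base model on the fly with HQQ instead of bitsandbytes
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # illustrative model ID
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The LoRA adapter is attached exactly as in the bitsandbytes sketch above
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```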
In this article, we will experiment with and compare HQQ, AQLM, AutoRound, bitsandbytes, and GPTQ for QLoRA fine-tuning. We will see how fast each method is for fine-tuning and how well the resulting QLoRA adapters perform. All the code examples presented in this article use Llama 3.1, but they would work the same way for other LLMs supported by these quantization methods.
You can find the code for fine-tuning LLMs (e.g., Llama 3.1) quantized with HQQ, AQLM, AutoRound, bitsandbytes, and GPTQ in this notebook: