QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU


Bitsandbytes is not your only option

Benjamin Marie
Aug 19, 2024

We can fine-tune large language models (LLMs) on consumer hardware thanks to QLoRA. This parameter-efficient fine-tuning method quantizes the model's parameters, freezes them, and then fine-tunes an adapter on top of the model.

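To make that concrete, here is a minimal QLoRA setup with Hugging Face Transformers and PEFT. It is only a sketch: the model name, LoRA rank, and target modules are illustrative choices, not the exact configuration used for the experiments in this article.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 quantization, as in the original QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the model with its parameters quantized and frozen
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Fine-tune a small LoRA adapter on top of the frozen base model
lora_config = LoraConfig(
    r=16,  # illustrative rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
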
Originally, QLoRA was proposed by the author of the bitsandbytes quantization framework. bitsandbytes quantization performs very well, thanks to the NormalFloat4 (NF4) data type, and most of the QLoRA code you will find online relies on it. However, bitsandbytes has several limitations: it can't quantize to a precision lower than 4-bit, and it makes the model significantly slower, as we saw in this article:

The Best Quantization Methods to Run Llama 3.1 on Your GPU (August 12, 2024)

Moreover, since QLoRA was proposed, several better quantization methods have been published. For instance, we now have HQQ, AQLM, AutoRound, and AWQ.

With Hugging Face PEFT, it is possible to use these quantization methods for QLoRA instead of bitsandbytes, but their impact on fine-tuning performance is understudied.

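In practice, swapping bitsandbytes for one of these methods mostly changes how the base model is loaded; the LoRA adapter is attached the same way. Here is a hedged sketch assuming a pre-quantized checkpoint (the repository name below is hypothetical; AutoRound and GPTQ models are typically saved with their quantization config, which Transformers reads automatically):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Hypothetical pre-quantized checkpoint; substitute any GPTQ-, AWQ-, or
# AutoRound-quantized model. Its quantization config is read from the repo.
model_id = "your-username/Meta-Llama-3.1-8B-AutoRound-4bit"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Same LoRA adapter as with bitsandbytes
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```
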
In this article, we will experiment with and compare HQQ, AQLM, AutoRound, bitsandbytes, and GPTQ for QLoRA fine-tuning. We will see how fast they are for fine-tuning and how well they perform with QLoRA. All the code examples presented in this article use Llama 3.1, but they would work the same for other LLMs supported by these quantization methods.

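Some of these methods can also quantize on the fly when the model is loaded. For instance, Transformers exposes an HqqConfig (this sketch assumes the hqq package is installed; the group size is an illustrative choice):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ quantization applied at load time; no calibration data needed
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="auto",
)
```

From there, a LoRA adapter can be attached exactly as in the previous sketches.
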
You can find the code for fine-tuning LLMs (e.g., Llama 3.1) quantized with HQQ, AQLM, AutoRound, bitsandbytes, and GPTQ in this notebook:

Get the notebook (#96)

Fine-tuning Quantized LLMs
