Fine-tune Llama 3 70B on Your GPU with AQLM 2-bit

It's possible to fine-tune Llama 3 70B with only 24 GB of GPU RAM

Benjamin Marie
May 13, 2024


Large language models (LLMs) can be quantized to significantly reduce their size. However, once quantized, the model can't be fine-tuned directly anymore: updating weights in low precision, such as 4-bit, is something we still don't know how to do well. Instead of updating the weights of the model itself, we can fine-tune an adapter on top of it. This is what QLoRA does:

QLoRA: Fine-Tune a Large Language Model on Your GPU
Benjamin Marie · May 30, 2023
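For readers who haven't seen it, the QLoRA recipe boils down to loading the base model in 4-bit with bitsandbytes and training a small LoRA adapter on top of the frozen quantized weights. Here is a minimal sketch with Transformers and PEFT; the model name and LoRA hyperparameters are placeholders for illustration, not the settings used in this article:

```python
# Minimal QLoRA sketch: a frozen 4-bit base model plus a trainable LoRA adapter.
# The model name and LoRA hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Only the adapter parameters are trainable; the 4-bit base stays frozen.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```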

Most implementations of QLoRA only support 4-bit quantization. We can't apply QLoRA to models quantized to a lower precision, which might be necessary to further reduce the size of the model. An alternative is to fine-tune an adapter on top of a model quantized with another algorithm, such as GPTQ, which supports 2-bit and 3-bit quantization. I've tried it but often couldn't get good results. We need a better quantization algorithm.

With AQLM, we can accurately quantize very large models to a low precision, such as 2-bit. Using Mixtral-8x7B, I showed that we can fine-tune a model quantized with AQLM to 2-bit with only 24 GB of GPU RAM:

Fine-tune Mixtral-8x7B Quantized with AQLM (2-bit) on Your GPU
Benjamin Marie · March 14, 2024

The main downside of AQLM is the cost of the quantization itself. For a 70B model such as Llama 3 70B, it can take weeks. Fortunately, the creators of AQLM released a 2-bit AQLM version of Llama 3 70B.
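Since Transformers supports AQLM once the aqlm package is installed, loading this pre-quantized checkpoint only takes a few lines. Here is a minimal sketch; the repository name below is my assumption of the released checkpoint, so check the ISTA-DASLab organization on the Hugging Face Hub for the exact model ID:

```python
# Sketch: load a pre-quantized 2-bit AQLM checkpoint of Llama 3 70B.
# Requires the `aqlm` package (pip install aqlm[gpu]).
# The repository name is an assumption; check the ISTA-DASLab organization
# on the Hugging Face Hub for the exact model ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"  # assumed repository name

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",   # the 2-bit weights fit on a single 24 GB GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```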


In this article, I show how to fine-tune Llama 3 70B quantized to 2-bit with AQLM. We will see that, thanks to 2-bit quantization and a careful choice of hyperparameter values, we can fine-tune Llama 3 70B on a 24 GB GPU. I also show how to use the fine-tuned adapter for inference.
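To give an idea of the kind of configuration involved, here is a hedged sketch of a LoRA fine-tuning setup on top of the 2-bit model with TRL's SFTTrainer. The dataset, LoRA rank, batch size, and sequence length are illustrative choices for a 24 GB budget, not necessarily the values used in the notebook, and the TRL API shown matches versions from around mid-2024 (newer releases moved some arguments to SFTConfig):

```python
# Sketch of a memory-frugal LoRA fine-tuning setup for the AQLM 2-bit model.
# Hyperparameter values are illustrative choices for a 24 GB GPU, not the
# exact settings used in the article's notebook.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"  # assumed repository name
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token   # Llama 3 has no pad token by default

# Example instruction dataset; replace with your own.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./llama3-70b-aqlm-lora",
    per_device_train_batch_size=1,    # keep activation memory small
    gradient_accumulation_steps=16,   # simulate a larger effective batch
    gradient_checkpointing=True,      # trade compute for memory
    optim="paged_adamw_8bit",         # 8-bit optimizer states
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,               # short sequences to stay within 24 GB
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model()                  # saves only the LoRA adapter
```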

The notebook implementing Llama 3 70B fine-tuning is here:

Get the notebook (#69)
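As for inference with the fine-tuned adapter, the usual PEFT pattern applies: load the 2-bit base model again and attach the saved adapter. A minimal sketch, with the same assumed repository name and the placeholder adapter path from the training sketch above:

```python
# Sketch: inference with the fine-tuned LoRA adapter on top of the 2-bit base model.
# The adapter path is the placeholder output directory from the training sketch.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"   # assumed repository name
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the fine-tuned adapter to the frozen 2-bit base model.
model = PeftModel.from_pretrained(base_model, "./llama3-70b-aqlm-lora")

prompt = "Explain 2-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```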
