Fine-tune Llama 3 70B on Your GPU with AQLM 2-bit
It's possible to fine-tune Llama 3 70B with only 24 GB of GPU RAM
Large language models (LLMs) can be quantized to significantly reduce their size. However, once quantized, a model can't be fine-tuned directly: we don't yet know how to update weights stored in low precision, such as 4-bit, without degrading quality. Instead of updating the model's weights, we can fine-tune an adapter on top of the frozen quantized model. This is what QLoRA does:
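To make this concrete, here is a minimal QLoRA-style sketch, not the article's exact code: the base model is loaded in 4-bit with bitsandbytes and kept frozen, and a LoRA adapter is added with PEFT. The model ID and LoRA hyperparameters are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # example model, assumed for illustration

# 4-bit NF4 quantization configuration for bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The quantized base weights stay frozen; only the adapter will be trained.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```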
Most implementations of QLoRA only support 4-bit quantization. We can't apply QLoRA to models quantized to a lower precision, which might be necessary to further reduce the size of the model. An alternative is to fine-tune an adapter on top of a model quantized with another algorithm, such as GPTQ, which supports 2-bit and 3-bit quantization. I've tried this but often couldn't get good results. We need a better quantization algorithm.
With AQLM, we can accurately quantize very large models to a low precision, such as 2-bit. Using Mixtral-8x7B, I showed that we can fine-tune a model quantized to 2-bit with AQLM with only 24 GB of GPU RAM:
The main downside of AQLM is the cost of the quantization itself. For a model as large as Llama 3 70B, it can take weeks. Fortunately, the creators of AQLM have released a 2-bit AQLM version of Llama 3 70B.
In this article, I show how to fine-tune Llama 3 70B quantized to 2-bit with AQLM. We will see that, thanks to 2-bit quantization and a careful choice of hyperparameter values, we can fine-tune Llama 3 70B on a 24 GB GPU. I also show how to use the fine-tuned adapter for inference.
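Below is a sketch of what such a fine-tuning setup can look like, assuming the 2-bit AQLM checkpoint released by the AQLM authors (the repository ID is an assumption), an example dataset, and TRL's SFTTrainer. It requires the aqlm package in addition to transformers, peft, and trl, and the exact argument names may differ depending on your TRL version. The hyperparameters are a memory-saving starting point, not necessarily the article's exact values.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# The 2-bit AQLM weights are dequantized on the fly and stay frozen during training.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")  # example dataset

# Settings chosen to fit a 24 GB GPU: batch size of 1, gradient accumulation,
# gradient checkpointing, and a paged 8-bit optimizer.
training_args = TrainingArguments(
    output_dir="./llama3-70b-aqlm-2bit-adapter",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()
```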
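For inference, the idea is to reload the 2-bit AQLM base model and attach the fine-tuned LoRA adapter with PEFT. A minimal sketch follows; the repository ID and adapter path are assumptions matching the training sketch above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"  # assumed AQLM checkpoint
adapter_path = "./llama3-70b-aqlm-2bit-adapter"          # where the adapter was saved

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA adapter on top of the frozen 2-bit base model
model = PeftModel.from_pretrained(base_model, adapter_path)

prompt = "Explain 2-bit quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```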
The notebook implementing Llama 3 70B fine-tuning is here: