Fine-tune Llama 3 70B on Your GPU with AQLM 2-bit
It's possible to fine-tune Llama 3 70B with only 24 GB of GPU RAM
Large language models (LLMs) can be quantized to significantly reduce their size. However, once quantized, a model can't be fine-tuned directly: we don't yet know how to update weights stored in low precision, such as 4-bit, without degrading quality. Instead of updating the model's weights, we can fine-tune an adapter on top of the frozen quantized model. This is what QLoRA does:
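To make this concrete, here is a minimal QLoRA-style sketch, not the article's exact code: the base model is loaded in 4-bit with bitsandbytes and kept frozen, and a LoRA adapter is added with PEFT. The model ID and LoRA hyperparameters are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # example model, assumed for illustration

# 4-bit NF4 quantization configuration for bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The quantized base weights stay frozen; only the adapter will be trained.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```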
Most implementations of QLoRA only support 4-bit quantization. We can't apply QLoRA to models quantized to a lower precision, which might be necessary to further reduce the size of the model. An alternative is to fine-tune an adapter on top of a model quantized with another algorithm, such as GPTQ, which supports 2-bit and 3-bit quantization. I've tried this but often couldn't get good results. We need a better quantization algorithm.
With AQLM, we can accurately quantize very large models to a low precision, such as 2-bit. Using Mixtral-8x7B, I showed that we can fine-tune a model quantized to 2-bit with AQLM with only 24 GB of GPU RAM:
The main downside of AQLM is the cost of the quantization itself. For a model as large as Llama 3 70B, it can take weeks. Fortunately, the creators of AQLM have released a 2-bit AQLM version of Llama 3 70B.
In this article, I show how to fine-tune Llama 3 70B quantized to 2-bit with AQLM. We will see that, thanks to 2-bit quantization and a careful choice of hyperparameter values, we can fine-tune Llama 3 70B on a 24 GB GPU. I also show how to use the fine-tuned adapter for inference.
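Below is a sketch of what such a fine-tuning setup can look like, assuming the 2-bit AQLM checkpoint released by the AQLM authors (the repository ID is an assumption), an example dataset, and TRL's SFTTrainer. It requires the aqlm package in addition to transformers, peft, and trl, and the exact argument names may differ depending on your TRL version. The hyperparameters are a memory-saving starting point, not necessarily the article's exact values.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# The 2-bit AQLM weights are dequantized on the fly and stay frozen during training.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")  # example dataset

# Settings chosen to fit a 24 GB GPU: batch size of 1, gradient accumulation,
# gradient checkpointing, and a paged 8-bit optimizer.
training_args = TrainingArguments(
    output_dir="./llama3-70b-aqlm-2bit-adapter",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()
```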
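For inference, the idea is to reload the 2-bit AQLM base model and attach the fine-tuned LoRA adapter with PEFT. A minimal sketch follows; the repository ID and adapter path are assumptions matching the training sketch above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"  # assumed AQLM checkpoint
adapter_path = "./llama3-70b-aqlm-2bit-adapter"          # where the adapter was saved

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA adapter on top of the frozen 2-bit base model
model = PeftModel.from_pretrained(base_model, adapter_path)

prompt = "Explain 2-bit quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```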
The notebook implementing Llama 3 70B fine-tuning is here: