Llama 3 is currently available in two versions: 8B and 70B. The 8B version, which has 8.03 billion parameters, is small enough to run locally on consumer hardware.
With parameter-efficient fine-tuning (PEFT) methods such as LoRA, we don’t need to fully fine-tune the model but instead can fine-tune an adapter on top of it. To further decrease memory consumption, we can apply LoRA on top of a quantized Llama 3; this combination is known as QLoRA.
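To make this concrete, here is a minimal sketch of a QLoRA setup with Transformers, bitsandbytes, and PEFT. The rank, alpha, dropout, and target modules below are illustrative choices, not necessarily the article's exact configuration; dropping the quantization config turns this into plain LoRA.

```python
# Minimal QLoRA sketch: 4-bit base model + trainable LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Meta-Llama-3-8B"

# 4-bit NF4 quantization: this is the "Q" in QLoRA.
# Remove quantization_config below to do plain LoRA on the 16-bit model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter: only these small low-rank matrices are trained;
# the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total
```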
In this article, I briefly present Llama 3 and the hardware requirements to fine-tune and run it locally. Then, I show how to fine-tune the model on a chat dataset, with the code fully explained. With LoRA, you need a GPU with 24 GB of VRAM to fine-tune Llama 3. With QLoRA, a GPU with 16 GB of VRAM is enough.
After the fine-tuning, I also show:
- How to merge the fine-tuned adapter into Llama 3 (sketched below).
- How to quantize the model to 4-bit with AWQ to reduce its size (sketched below).
- In the notebook only: how to fully fine-tune the model, i.e., without using an adapter, with GaLore (a brief sketch also follows below).
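For the merge step, the outline looks like the following. This is a minimal sketch, assuming the adapter was saved to a local directory (the paths are placeholders): the base model is reloaded in 16-bit and the LoRA weights are folded into it with PEFT's `merge_and_unload`.

```python
# Merge a trained LoRA adapter back into the Llama 3 base weights.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
# "./llama3-adapter" is a placeholder for wherever the adapter was saved.
model = PeftModel.from_pretrained(base, "./llama3-adapter")
model = model.merge_and_unload()  # folds the adapter into the base weights
model.save_pretrained("./llama3-merged")
```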
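The AWQ quantization step could then look like this with the AutoAWQ library. The quantization parameters shown are AutoAWQ's common defaults, not necessarily the article's exact settings, and the paths are again placeholders.

```python
# Quantize the merged model to 4-bit with AutoAWQ.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./llama3-merged"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Runs activation-aware calibration, then quantizes the weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("./llama3-awq")
tokenizer.save_pretrained("./llama3-awq")
```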
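As for the GaLore variant covered in the notebook, here is a hedged sketch of the key training arguments, assuming the GaLore optimizers built into recent versions of Transformers (4.39+); the hyperparameters are illustrative, not the notebook's exact values.

```python
# Full fine-tuning with GaLore: no adapter, but gradients are projected
# to a low-rank subspace to cut optimizer memory.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama3-galore",
    optim="galore_adamw",
    # Apply GaLore's low-rank gradient projection to modules whose
    # names match these patterns (attention and MLP layers).
    optim_target_modules=["attn", "mlp"],
    per_device_train_batch_size=1,
    learning_rate=1e-5,
)
```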
All the code explained in this article is also implemented in this notebook:
The code presented in this article also works for Llama 3.1.