Llama 3 is currently available in two versions: 8B and 70B. The 8B version, which has 8.03 billion parameters, is small enough to run locally on consumer hardware.
With parameter-efficient fine-tuning (PEFT) methods such as LoRA, we don’t need to fully fine-tune the model but instead can fine-tune an adapter on top of it. To further decrease memory consumption, we can apply LoRA on top of a quantized Llama 3; this combination is known as QLoRA.
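To make this concrete, here is a minimal sketch of a QLoRA setup with Transformers, bitsandbytes, and PEFT. The rank, alpha, dropout, and target modules below are illustrative choices, not necessarily the article's exact configuration; dropping the quantization config turns this into plain LoRA.

```python
# Minimal QLoRA sketch: 4-bit base model + trainable LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Meta-Llama-3-8B"

# 4-bit NF4 quantization: this is the "Q" in QLoRA.
# Remove quantization_config below to do plain LoRA on the 16-bit model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter: only these small low-rank matrices are trained;
# the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total
```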
In this article, I briefly present Llama 3 and the hardware requirements to fine-tune and run it locally. Then, I show how to fine-tune the model on a chat dataset, with the code fully explained. With LoRA, you need a GPU with 24 GB of VRAM to fine-tune Llama 3. With QLoRA, a GPU with 16 GB of VRAM is enough.
After the fine-tuning, I also show:
- How to merge the fine-tuned adapter into Llama 3 (sketched below).
- How to quantize the model to 4-bit with AWQ to reduce its size (sketched below).
- In the notebook only: how to fully fine-tune the model, i.e., without using an adapter, with GaLore (a brief sketch also follows below).
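For the merge step, the outline looks like the following. This is a minimal sketch, assuming the adapter was saved to a local directory (the paths are placeholders): the base model is reloaded in 16-bit and the LoRA weights are folded into it with PEFT's `merge_and_unload`.

```python
# Merge a trained LoRA adapter back into the Llama 3 base weights.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
# "./llama3-adapter" is a placeholder for wherever the adapter was saved.
model = PeftModel.from_pretrained(base, "./llama3-adapter")
model = model.merge_and_unload()  # folds the adapter into the base weights
model.save_pretrained("./llama3-merged")
```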
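The AWQ quantization step could then look like this with the AutoAWQ library. The quantization parameters shown are AutoAWQ's common defaults, not necessarily the article's exact settings, and the paths are again placeholders.

```python
# Quantize the merged model to 4-bit with AutoAWQ.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./llama3-merged"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Runs activation-aware calibration, then quantizes the weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("./llama3-awq")
tokenizer.save_pretrained("./llama3-awq")
```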
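As for the GaLore variant covered in the notebook, here is a hedged sketch of the key training arguments, assuming the GaLore optimizers built into recent versions of Transformers (4.39+); the hyperparameters are illustrative, not the notebook's exact values.

```python
# Full fine-tuning with GaLore: no adapter, but gradients are projected
# to a low-rank subspace to cut optimizer memory.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama3-galore",
    optim="galore_adamw",
    # Apply GaLore's low-rank gradient projection to modules whose
    # names match these patterns (attention and MLP layers).
    optim_target_modules=["attn", "mlp"],
    per_device_train_batch_size=1,
    learning_rate=1e-5,
)
```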
All the code explained in this article is also implemented in this notebook:
The code presented in this article also works for Llama 3.1.