Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA

What you can do with only 2x24 GB GPUs, and a lot of CPU RAM

Benjamin Marie
Aug 05, 2024


Fine-tuning large language models (LLMs) with up to 35B parameters is relatively easy and cheap since it can be done with a single consumer GPU. Fine-tuning larger models on a single consumer GPU is, in theory, possible by offloading parts of the model to CPU memory, but it would be extremely slow, even with a high-end CPU.

Using multiple GPUs is the only practical alternative to keep fine-tuning fast enough. A configuration with 2x24 GB GPUs opens up a lot of possibilities: 48 GB of GPU memory is enough to fine-tune 70B models such as Llama 3 70B and Qwen2 72B.
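
To see why 48 GB can be enough, here is a rough back-of-the-envelope estimate. The numbers below are illustrative assumptions (4-bit NF4 weights, LoRA adapters on about 0.5% of the parameters, AdamW states kept only for the adapters), not measurements; activations and CUDA overhead come on top, which is why gradient checkpointing and plenty of CPU RAM still matter.

```python
# Rough memory estimate for QLoRA fine-tuning of a 70B model on 2x24 GB GPUs.
# All numbers are illustrative assumptions, not measurements.

num_params = 70e9                        # total model parameters
weights_gb = num_params * 0.5 / 1e9      # ~0.5 byte per parameter in 4-bit NF4
adapter_params = num_params * 0.005      # assume ~0.5% of parameters trained via LoRA
adapter_gb = adapter_params * 2 / 1e9    # bf16 adapter weights
grads_gb = adapter_params * 2 / 1e9      # bf16 adapter gradients
optimizer_gb = adapter_params * 8 / 1e9  # fp32 AdamW moments (2 x 4 bytes)

total_gb = weights_gb + adapter_gb + grads_gb + optimizer_gb
per_gpu_gb = total_gb / 2                # FSDP shards all of this across the 2 GPUs

print(f"~{weights_gb:.0f} GB of quantized weights, "
      f"~{per_gpu_gb:.0f} GB per GPU before activations")
# -> ~35 GB of quantized weights, ~20 GB per GPU before activations
```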


In this article, I explain how to fine-tune 70B LLMs using only two GPUs thanks to FSDP and QLoRA.

For background on QLoRA itself, see the earlier article: QLoRA: Fine-Tune a Large Language Model on Your GPU (Benjamin Marie, May 30, 2023).

I first explain what FSDP is, and then we will see how to modify standard QLoRA fine-tuning code to run it on multiple GPUs. For the experiments and demonstrations, I use Llama 3.1 70B, but the approach would work similarly for other LLMs. For the hardware, I relied on two RTX 3090 GPUs provided by RunPod (here is my referral link), at only $0.66/hour. Using two RTX 4090 GPUs would be faster but more expensive.
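
As a preview of the kind of changes involved, here is a minimal sketch of an FSDP-compatible QLoRA training script using Transformers, PEFT, and TRL (assuming mid-2024 versions of these libraries). The dataset and LoRA hyperparameters are illustrative placeholders, not the exact settings from the notebook; the detail specific to multi-GPU FSDP is bnb_4bit_quant_storage, which must be set to a regular floating-point dtype so that FSDP can shard the quantized weights.

```python
# Minimal FSDP + QLoRA sketch (illustrative settings, not the notebook's exact configuration).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Meta-Llama-3.1-70B"

# 4-bit NF4 quantization; bnb_4bit_quant_storage must be a standard dtype
# (here bfloat16) so FSDP can wrap and shard the quantized parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters on all linear layers; only these weights are trained.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

# Placeholder instruction dataset with a "text" column.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="./Llama-3.1-70B-qlora-fsdp",
        dataset_text_field="text",
        max_seq_length=512,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        bf16=True,
        num_train_epochs=1,
    ),
)
trainer.train()
```

The script would then be launched on both GPUs with Accelerate, e.g. `accelerate launch --config_file fsdp_config.yaml train.py`, where the Accelerate FSDP configuration enables full sharding and CPU-RAM-efficient loading; staging the 70B checkpoint in CPU memory before it is sharded across the GPUs is one reason plenty of CPU RAM helps.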

My notebook for fine-tuning Llama 3.1 70B on two GPUs is available here:

Get the notebook (#92)

Fully Sharded Data Parallel (FSDP): How Does It Work?
