Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA

What you can do with only 2x24 GB GPUs, and a lot of CPU RAM

Benjamin Marie
Aug 05, 2024


Fine-tuning large language models (LLMs) with up to 35B parameters is relatively easy and cheap since it can be done with a single consumer GPU. Fine-tuning larger models on a single consumer GPU is, in theory, possible by offloading parts of the model to CPU memory, but it would be extremely slow, even with a high-end CPU.

Using multiple GPUs is the only practical alternative to keep fine-tuning fast enough. A configuration with 2x24 GB GPUs opens up a lot of possibilities: 48 GB of GPU memory is enough to fine-tune 70B models such as Llama 3 70B and Qwen2 72B.
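
To see why 48 GB can be enough, here is a rough back-of-the-envelope estimate. The numbers below are illustrative assumptions (4-bit NF4 weights, LoRA adapters on about 0.5% of the parameters, AdamW states kept only for the adapters), not measurements; activations and CUDA overhead come on top, which is why gradient checkpointing and plenty of CPU RAM still matter.

```python
# Rough memory estimate for QLoRA fine-tuning of a 70B model on 2x24 GB GPUs.
# All numbers are illustrative assumptions, not measurements.

num_params = 70e9                        # total model parameters
weights_gb = num_params * 0.5 / 1e9      # ~0.5 byte per parameter in 4-bit NF4
adapter_params = num_params * 0.005      # assume ~0.5% of parameters trained via LoRA
adapter_gb = adapter_params * 2 / 1e9    # bf16 adapter weights
grads_gb = adapter_params * 2 / 1e9      # bf16 adapter gradients
optimizer_gb = adapter_params * 8 / 1e9  # fp32 AdamW moments (2 x 4 bytes)

total_gb = weights_gb + adapter_gb + grads_gb + optimizer_gb
per_gpu_gb = total_gb / 2                # FSDP shards all of this across the 2 GPUs

print(f"~{weights_gb:.0f} GB of quantized weights, "
      f"~{per_gpu_gb:.0f} GB per GPU before activations")
# -> ~35 GB of quantized weights, ~20 GB per GPU before activations
```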


In this article, I explain how to fine-tune 70B LLMs using only two GPUs thanks to FSDP and QLoRA.

For background on QLoRA itself, see the earlier article: QLoRA: Fine-Tune a Large Language Model on Your GPU (Benjamin Marie, May 30, 2023).

I first explain what FSDP is, and then we will see how to modify standard QLoRA fine-tuning code to run it on multiple GPUs. For the experiments and demonstrations, I use Llama 3.1 70B, but the approach would work similarly for other LLMs. For the hardware, I relied on two RTX 3090 GPUs provided by RunPod (here is my referral link), at only $0.66/hour. Using two RTX 4090 GPUs would be faster but more expensive.
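
As a preview of the kind of changes involved, here is a minimal sketch of an FSDP-compatible QLoRA training script using Transformers, PEFT, and TRL (assuming mid-2024 versions of these libraries). The dataset and LoRA hyperparameters are illustrative placeholders, not the exact settings from the notebook; the detail specific to multi-GPU FSDP is bnb_4bit_quant_storage, which must be set to a regular floating-point dtype so that FSDP can shard the quantized weights.

```python
# Minimal FSDP + QLoRA sketch (illustrative settings, not the notebook's exact configuration).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Meta-Llama-3.1-70B"

# 4-bit NF4 quantization; bnb_4bit_quant_storage must be a standard dtype
# (here bfloat16) so FSDP can wrap and shard the quantized parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters on all linear layers; only these weights are trained.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

# Placeholder instruction dataset with a "text" column.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=SFTConfig(
        output_dir="./Llama-3.1-70B-qlora-fsdp",
        dataset_text_field="text",
        max_seq_length=512,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        bf16=True,
        num_train_epochs=1,
    ),
)
trainer.train()
```

The script would then be launched on both GPUs with Accelerate, e.g. `accelerate launch --config_file fsdp_config.yaml train.py`, where the Accelerate FSDP configuration enables full sharding and CPU-RAM-efficient loading; staging the 70B checkpoint in CPU memory before it is sharded across the GPUs is one reason plenty of CPU RAM helps.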

My notebook for fine-tuning Llama 3.1 70B on two GPUs is available here:

Get the notebook (#92)

Fully Sharded Data Parallel (FSDP): How Does It Work?
