Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA
What you can do with only 2x24 GB GPUs, and a lot of CPU RAM
Fine-tuning large language models (LLMs) with up to 35B parameters is relatively easy and cheap since it can be done with a single consumer GPU. Fine-tuning larger models with a single consumer GPU is theoretically possible, since we can offload parts of the model to CPU memory, but it would be extremely slow, even with a high-end CPU.
Using multiple GPUs is the only practical alternative to keep fine-tuning fast enough. A configuration with 2x24 GB GPUs opens up a lot of possibilities: 48 GB of GPU memory is enough to fine-tune 70B models such as Llama 3 70B and Qwen2 72B.
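To see why 48 GB can be enough, here is a rough back-of-the-envelope estimate. This is a sketch: the numbers are approximations, and real memory use also depends on sequence length, batch size, activations, and framework overhead.

```python
# Rough memory estimate for QLoRA fine-tuning of a 70B model.
# All figures are approximations, not measured values.

n_params = 70e9             # 70B parameters
bytes_per_param_4bit = 0.5  # NF4 quantization: 4 bits per weight
model_gb = n_params * bytes_per_param_4bit / 1e9  # ~35 GB of quantized weights

# LoRA adapters are a small fraction of the model; 1% is a
# pessimistic assumption for illustration.
lora_params = 0.01 * n_params
lora_gb = lora_params * 2 / 1e9  # bf16 adapters: 2 bytes per parameter

total_gb = model_gb + lora_gb
# With FSDP sharding the model across 2 GPUs, each GPU holds
# roughly half, leaving headroom for activations on 24 GB cards.
per_gpu_gb = total_gb / 2
print(f"quantized weights: {model_gb:.0f} GB")
print(f"per-GPU share with 2-way sharding: ~{per_gpu_gb:.1f} GB")
```

The takeaway: the 4-bit weights alone (~35 GB) would not fit on a single 24 GB GPU, but sharded across two GPUs each card holds a manageable share.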
In this article, I explain how to fine-tune 70B LLMs using only two GPUs thanks to FSDP and QLoRA.
I first explain what FSDP is, and then we will see how to modify standard QLoRA fine-tuning code to run on multiple GPUs. For the experiments and demonstrations, I use Llama 3.1 70B, but the same approach works for other LLMs. For the hardware, I relied on 2 RTX 3090 GPUs provided by RunPod (here is my referral link) (only $0.66/hour). Using 2 RTX 4090 GPUs would be faster but more expensive.
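To give a first idea of what the multi-GPU setup involves, here is a sketch of a Hugging Face Accelerate configuration for FSDP of the kind used with QLoRA. The field names follow Accelerate's FSDP options, but treat the exact values as assumptions to adapt to your own setup, not a drop-in config:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_machines: 1
num_processes: 2            # one process per GPU
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD          # shard parameters, gradients, and optimizer states
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true        # load the checkpoint once, then shard
  fsdp_offload_params: true                   # offload to CPU RAM (hence "a lot of CPU RAM")
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_use_orig_params: false                 # typically required with quantized weights
```

We will go through what each of these settings means in the rest of the article.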
My notebook fine-tuning Llama 3.1 70B using two GPUs is available here: