Fast and Memory-Efficient Full Fine-Tuning with Unsloth (single-GPU)
With the best hyperparameters for a cost-effective full fine-tuning
Fine-tuning large language models (LLMs) for specific tasks and domains can be extremely expensive. This process typically requires multiple high-end GPUs due to the significant memory demands of LLMs.
Unsloth, known as one of the fastest and most memory-efficient frameworks for fine-tuning, was previously limited to LoRA and QLoRA, meaning it only supported adapter-based fine-tuning, not full-model fine-tuning.
However, Unsloth now supports full fine-tuning as well. You can fully fine-tune models with 7–8 billion parameters, such as Llama 3.1 and Qwen2.5, using a single GPU with 48 GB of VRAM.
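As a rough sketch of what this looks like (not the exact code from the notebook), the model is loaded through Unsloth's FastLanguageModel with full fine-tuning enabled instead of a LoRA adapter. The full_finetuning flag, the model name, and the sequence length below are assumptions based on Unsloth's documented interface, not the article's benchmark settings:

```python
# Minimal sketch: loading Llama 3.1 8B for full fine-tuning with Unsloth.
# Assumes Unsloth's full_finetuning flag; values here are illustrative.
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B",  # any supported 7-8B model
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,       # no quantization: all weights are trained
    full_finetuning=True,     # full fine-tuning instead of LoRA/QLoRA
)
```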
In this article, we'll explore how to use Unsloth for full fine-tuning of LLMs. I’ll walk through example code for fine-tuning LLMs like Llama 3.1, analyze memory usage during training, and examine how different hyperparameters affect both memory consumption and training speed. Interestingly, due to Unsloth’s extensive optimizations, some hyperparameter changes that would normally accelerate training, such as increasing the batch size or not paging the optimizer states, might actually slow it down. We’ll dive into why that happens.
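To make those hyperparameters concrete, here is a hypothetical training configuration of the kind examined later, using TRL's SFTConfig with a paged 8-bit AdamW optimizer and a small per-device batch size. The values are illustrative only, not the settings benchmarked in this article:

```python
# Hypothetical configuration illustrating the hyperparameters discussed above.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./llama31-full-ft",
    per_device_train_batch_size=1,    # with Unsloth, larger batches don't always train faster
    gradient_accumulation_steps=16,   # keeps the effective batch size reasonable
    optim="paged_adamw_8bit",         # paged 8-bit AdamW: optimizer states can spill to CPU RAM
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)
```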
I'll also compare the performance of three GPUs: the NVIDIA L40S, RTX 6000 Ada, and RTX A6000. Using RunPod (referral link) pricing as a reference, we'll determine which card delivers the best performance-to-cost ratio. I'll also provide estimates for the H100.
The fine-tuning code discussed in this article is available in the accompanying notebook:
Full Fine-Tuning with Unsloth