Fast and Memory-Efficient Full Fine-Tuning with Unsloth (single-GPU)
With the best hyperparameters for a cost-effective full fine-tuning
Fine-tuning large language models (LLMs) for specific tasks and domains can be extremely expensive. This process typically requires multiple high-end GPUs due to the significant memory demands of LLMs.
Unsloth, known as one of the fastest and most memory-efficient frameworks for fine-tuning, was previously limited to LoRA and QLoRA, meaning it only supported adapter-based fine-tuning, not full-model fine-tuning.
However, Unsloth now supports full fine-tuning as well. You can fully fine-tune models with 7–8 billion parameters, such as Llama 3.1 and Qwen2.5, using a single GPU with 48 GB of VRAM.
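As a rough sketch of what this looks like (not the exact code from the notebook), the model is loaded through Unsloth's FastLanguageModel with full fine-tuning enabled instead of a LoRA adapter. The full_finetuning flag, the model name, and the sequence length below are assumptions based on Unsloth's documented interface, not the article's benchmark settings:

```python
# Minimal sketch: loading Llama 3.1 8B for full fine-tuning with Unsloth.
# Assumes Unsloth's full_finetuning flag; values here are illustrative.
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B",  # any supported 7-8B model
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,       # no quantization: all weights are trained
    full_finetuning=True,     # full fine-tuning instead of LoRA/QLoRA
)
```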
In this article, we'll explore how to use Unsloth for full fine-tuning of LLMs. I’ll walk through example code for fine-tuning LLMs like Llama 3.1, analyze memory usage during training, and examine how different hyperparameters affect both memory consumption and training speed. Interestingly, due to Unsloth’s extensive optimizations, some hyperparameter changes that would normally accelerate training, such as increasing the batch size or not paging the optimizer states, might actually slow it down. We’ll dive into why that happens.
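To make those hyperparameters concrete, here is a hypothetical training configuration of the kind examined later, using TRL's SFTConfig with a paged 8-bit AdamW optimizer and a small per-device batch size. The values are illustrative only, not the settings benchmarked in this article:

```python
# Hypothetical configuration illustrating the hyperparameters discussed above.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./llama31-full-ft",
    per_device_train_batch_size=1,    # with Unsloth, larger batches don't always train faster
    gradient_accumulation_steps=16,   # keeps the effective batch size reasonable
    optim="paged_adamw_8bit",         # paged 8-bit AdamW: optimizer states can spill to CPU RAM
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)
```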
I'll also compare the performance of three GPUs: the NVIDIA L40S, RTX 6000 Ada, and RTX A6000. Using RunPod (referral link) pricing as a reference, we'll determine which card delivers the best performance-to-cost ratio. I'll also provide estimates for the H100.
The fine-tuning code discussed in this article is available in the accompanying notebook:
Full Fine-Tuning with Unsloth