unsloth: Faster and Memory-Efficient QLoRA Fine-tuning
Unsloth Mistral 7B: 2x faster fine-tuning while saving 1.3 GB of VRAM
With quantization and adapters, fine-tuning large language models (LLMs) is now fast and memory-efficient on consumer hardware. Methods like QLoRA can reduce the memory required to fine-tune LLMs by 4x. Other methods, like Flash-Attention 2, also accelerate inference and fine-tuning.
Many frameworks take care of carefully optimizing these fine-tuning methods. Unsloth is one of them, and it’s open-source:
unslothai/unsloth (Apache 2.0 license)
With unsloth, fine-tuning LLMs can be up to 5x faster while reducing memory consumption by 60%.
In this article, I first present the main optimizations that make unsloth faster and more memory-efficient. Then, we will see how to use unsloth for fine-tuning Mistral 7B. I also compare this fine-tuning to the original Hugging Face implementation to confirm the improvements in training time and memory consumption brought by unsloth.
The notebook implementing this tutorial is available here:
Since this article uses QLoRA fine-tuning, I recommend reading the following article if you don’t know how QLoRA works:
The Optimizations Behind Unsloth
Intelligent Weight Upcasting
This is by far the most effective optimization.
For fine-tuning with QLoRA, the weights of some layers/modules of the model, e.g., the language modeling head, are upcast to float32. This upcasting is necessary because fine-tuning with only float16 weights can be unstable.
However, Hugging Face Transformers performs a naive upcasting to remain compatible with most model architectures. On the other hand, unsloth optimizes this weight upcasting specifically for the models it supports, such as Mistral and Llama 2.
Unsloth upcasts fewer weights, which results in lower memory consumption and fewer operations, hence faster fine-tuning.
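For illustration, here is roughly what the naive upcasting looks like on the Hugging Face side with PEFT's prepare_model_for_kbit_training, which upcasts the non-quantized parameters (layer norms, language modeling head) to float32. This is only a sketch of the standard pipeline, not unsloth's code:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Load Mistral 7B quantized to 4-bit with bitsandbytes (QLoRA setup)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# prepare_model_for_kbit_training upcasts the non-quantized parameters
# (e.g., layer norms and lm_head) to float32 for training stability.
model = prepare_model_for_kbit_training(model)

# Inspect which parameters ended up in float32: these are the weights
# that unsloth upcasts more selectively.
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        print(name, param.dtype)
```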
PyTorch’s Scaled Dot Product Attention
In the transformer model, the scaled dot product attention is a key operation that has to be performed numerous times during fine-tuning and inference.
Unsloth makes it faster by directly exploiting PyTorch’s fast implementation. Note that this optimization doesn’t seem to yield a significant acceleration (less than 2% faster, according to the unsloth documentation).
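For reference, PyTorch exposes this fast implementation through torch.nn.functional.scaled_dot_product_attention. A minimal sketch with dummy tensors:

```python
import torch
import torch.nn.functional as F

# Dummy query/key/value tensors: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)

# PyTorch dispatches to a fused kernel (FlashAttention or memory-efficient
# attention) when available; is_causal=True applies the causal mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```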
Better Exploitation of bfloat16
bfloat16 is a better data type than float16 for fine-tuning. It improves 16-bit training stability. bitsandbytes internally uses float16 and then converts to bfloat16. Note: bfloat16 is only supported from the Ampere generation of NVIDIA GPUs (RTX 30xx/40xx or A100).
Unsloth accelerates QLoRA fine-tuning by avoiding this conversion from float16 and directly using bfloat16. Note: I’m not 100% sure that bitsandbytes still performs this conversion internally; I didn’t check it carefully.
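In practice, the compute dtype used for the dequantized matrix multiplications is set when quantizing the model. A sketch checking bfloat16 support and passing it to bitsandbytes through Transformers:

```python
import torch
from transformers import BitsAndBytesConfig

# bfloat16 is only available on Ampere GPUs and newer.
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=compute_dtype,  # dtype used for the dequantized matmuls
)
```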
Integration of xFormers
Unsloth uses the xFormers framework to optimize many of the building blocks of the transformer model with custom Triton kernels.
Note that xFormers also exploits Flash-Attention 2 by default.
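A minimal sketch of xFormers’ memory-efficient attention, which dispatches to an optimized kernel (Flash-Attention 2 when available); the tensor shapes are dummy values:

```python
import torch
import xformers.ops as xops

# Dummy tensors in the (batch, sequence length, heads, head dimension)
# layout expected by xFormers.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

# LowerTriangularMask enforces causal attention.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```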
Causal Attention Masking
Unsloth targets the optimization of causal LLMs. Instead of using a separate attention mask, unsloth uses a causal mask to speed up fine-tuning.
In a causal attention mask, the attention is restricted to attending to the previous positions in the sequence and not the future positions. This means that during the self-attention computation, each position in the sequence can only attend to its preceding positions or itself, ensuring a causal dependency structure.
This type of attention mask is standard in tasks like language modeling and autoregressive generation, where the order of the input sequence matters and predictions are made one step at a time.
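To illustrate, a causal mask can be built explicitly as a lower-triangular matrix, although fused attention kernels usually take a simple is_causal flag instead of a materialized mask. A short sketch:

```python
import torch

seq_len = 8
# Lower-triangular boolean mask: position i may attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask.int())

# With PyTorch's fused attention, the same constraint is expressed with
# is_causal=True instead of building and passing a full mask tensor.
```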
Triton Implementation of RoPE Embeddings
RoPE, short for Rotary Position Embedding, integrates explicit relative positional relationships within the self-attention mechanism. It’s used by many recent LLMs (Llama 2, Mistral 7B, etc.).
RoPE stands out for its adaptability to sequences of any length, the decay of interactions between tokens as they get farther apart, and its ability to equip linear self-attention with relative position information.
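For reference, the rotation applied by RoPE can be written in a few lines of PyTorch. This is a simplified sketch of the Llama-style formulation, not unsloth’s Triton kernel:

```python
import torch

def rotate_half(x):
    # Swap and negate the two halves of the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # Rotate queries and keys by a position-dependent angle so that their
    # dot product depends on the relative distance between positions.
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# cos/sin are precomputed from the rotation frequencies and positions.
head_dim, seq_len, base = 64, 128, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
positions = torch.arange(seq_len).float()
freqs = torch.outer(positions, inv_freq)   # (seq_len, head_dim // 2)
emb = torch.cat((freqs, freqs), dim=-1)    # (seq_len, head_dim)
cos, sin = emb.cos(), emb.sin()

# Apply to dummy queries/keys: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, seq_len, head_dim)
k = torch.randn(1, 8, seq_len, head_dim)
q_rot, k_rot = apply_rope(q, k, cos, sin)
```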
Unsloth implements RoPE with Triton to accelerate models further.
What’s OpenAI’s Triton?
Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS—something that many GPU programmers can’t do—in under 25 lines of code.
Triton Implementation of the RMSNorm
RMSNorm regularizes the summed inputs to a neuron within a layer based on their root mean square (RMS). This gives the model re-scaling invariance and an implicit ability to adapt the learning rate. RMSNorm is simpler and more efficient than the standard LayerNorm.
Unsloth also implements RMSNorm with Triton, which brings a slight additional acceleration.
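For reference, here is what RMSNorm looks like in plain PyTorch; unsloth’s version is a fused Triton kernel, but the math is the same:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learnable re-scaling
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square of the activations (no mean
        # subtraction, unlike LayerNorm), then re-scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```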
Memory-Efficient Cross Entropy Loss
Unsloth optimized the cross entropy loss computation to significantly reduce memory consumption. It’s unclear to me exactly how they did it. It seems that they have also implemented part of this computation with Triton, but this needs further checking.
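As a generic illustration only (not necessarily what unsloth does), one common way to reduce the memory footprint of the loss is to compute the cross entropy over chunks of tokens so that the full float32 logits are never materialized at once:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits, labels, chunk_size=1024):
    # logits: (num_tokens, vocab_size), labels: (num_tokens,)
    # Accumulate the summed loss chunk by chunk instead of over all tokens
    # at once, which keeps the temporary float32 buffers small.
    total_loss, total_tokens = 0.0, 0
    for i in range(0, logits.shape[0], chunk_size):
        chunk_logits = logits[i : i + chunk_size].float()
        chunk_labels = labels[i : i + chunk_size]
        total_loss = total_loss + F.cross_entropy(
            chunk_logits, chunk_labels, reduction="sum", ignore_index=-100
        )
        total_tokens += (chunk_labels != -100).sum()
    return total_loss / total_tokens
```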
Manual Autograd for MLP and Self-Attention
In PyTorch, Autograd (automatic differentiation) is a core component that enables automatic computation of gradients.
PyTorch uses a dynamic computational graph, meaning that the graph is built on the fly as operations are executed. The autograd package keeps track of operations performed on tensors and can automatically compute gradients for input tensors.
However, this automatic computation is not optimal. As shown by unsloth, many of the operations performed by PyTorch’s Autograd can be factored into a smaller number of operations, hence reducing the computational cost of the gradients.
The team behind unsloth manually optimized the gradient computation of the MLP and self-attention modules for LoRA fine-tuning.
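To give an idea of what “manual autograd” means, here is a toy torch.autograd.Function with a hand-written backward pass. Unsloth does this, with much more involved derivations, for the LoRA MLP and self-attention modules:

```python
import torch

class ManualLinear(torch.autograd.Function):
    """y = x @ W.T with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        # Gradients written by hand instead of being traced op by op,
        # which is where redundant operations can be factored out.
        grad_x = grad_output @ weight
        grad_weight = grad_output.t() @ x
        return grad_x, grad_weight

x = torch.randn(4, 16, requires_grad=True)
w = torch.randn(8, 16, requires_grad=True)
ManualLinear.apply(x, w).sum().backward()
```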
Unsloth Mistral 7B: 2x Faster Fine-tuning, Saving 1.3 GB of VRAM
Several of the optimizations implemented by unsloth seem to be targeting recent GPUs (from the Ampere generation).
If you have a recent consumer GPU, such as an RTX 30xx or 40xx, it will benefit from all these optimizations. On Google Colab, only the A100 is recent enough.
In the notebook, I also show examples using an older GPU, Google Colab’s T4.
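To give a preview before the walkthrough, here is a minimal sketch of how Mistral 7B is loaded for QLoRA fine-tuning with unsloth’s FastLanguageModel API; the exact arguments used in the notebook may differ:

```python
from unsloth import FastLanguageModel

# Load Mistral 7B quantized to 4-bit (QLoRA) with unsloth's optimizations.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mistralai/Mistral-7B-v0.1",
    max_seq_length=2048,
    dtype=None,          # None lets unsloth pick bfloat16 on Ampere+, float16 otherwise
    load_in_4bit=True,   # 4-bit NF4 quantization with bitsandbytes
)

# Attach the LoRA adapters; the target modules below are the usual choice
# for Mistral/Llama architectures.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```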