LoRA Trainable Tokens: Save Memory, Improve Accuracy for Your Domain
How to teach an LLM to use your new tokens without fully retraining the token embeddings
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large pre-trained models. Unlike standard full fine-tuning, which updates all model parameters, LoRA freezes the entire model and introduces only a small set of trainable parameters: low-rank matrices attached to specific layers or modules, typically the attention projections. This allows efficient adaptation with minimal memory overhead.
Because LoRA only stores gradients and optimizer states for these trainable parameters, it consumes significantly less memory than full fine-tuning. However, since the token embeddings remain frozen, it cannot accommodate new tokens: any tokens added to the vocabulary keep their randomly initialized, untrained embeddings.
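To make the limitation concrete, here is a minimal sketch (not taken from the article's notebook) of a standard LoRA setup in which a newly added token ends up with an untrained, frozen embedding. The model name, the <domain_tag> token, and the target modules are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-1B"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Add a hypothetical domain-specific token and resize the embedding matrix.
# The new row is randomly initialized and therefore untrained.
tokenizer.add_special_tokens({"additional_special_tokens": ["<domain_tag>"]})
model.resize_token_embeddings(len(tokenizer))

# Standard LoRA: low-rank adapters on the attention projections only.
# The embedding matrix (and the new token's row) stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # embeddings and lm_head are not trainable
```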
In a previous article, we explored how to use LoRA fine-tuning while fully retraining the token embeddings and the language modeling head.
This approach enables the model to effectively handle special tokens, such as those used in chat templates or for specific domains. While effective, it requires significantly more memory, since all embedding and language modeling head parameters are made trainable.
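For reference, that earlier setup can be sketched as follows. It relies on PEFT's modules_to_save option to make full, trainable copies of the embedding matrix and the language modeling head; the module names assume a Llama-style architecture, and the hyperparameters are placeholders rather than the article's own configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative model; loaded as in the previous snippet.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Full copies of these modules become trainable alongside the LoRA adapters.
    # Their gradients and optimizer states are what drive up memory consumption.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```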
In this article, we will explore a new alternative from the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library. Rather than retraining the full embeddings and language modeling head, this method focuses solely on updating the embeddings for the special tokens the model needs to learn. We will first examine how this technique works, its limitations, and its memory efficiency. Finally, we will compare its performance to full retraining.
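As a quick preview before the detailed walkthrough, here is a hedged sketch of what this looks like with PEFT, assuming a recent PEFT release that supports the trainable_token_indices argument of LoraConfig. Only the embedding rows of the listed token ids are trained; the rest of the embedding matrix stays frozen. The model name and the <domain_tag> token are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-1B"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the new token and resize the embeddings, as before.
tokenizer.add_special_tokens({"additional_special_tokens": ["<domain_tag>"]})
model.resize_token_embeddings(len(tokenizer))
new_token_ids = tokenizer.convert_tokens_to_ids(["<domain_tag>"])

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Map the embedding module name (Llama-style) to the token ids whose
    # embedding rows should receive gradients.
    trainable_token_indices={"embed_tokens": new_token_ids},
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # far fewer trainable parameters than with modules_to_save
```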
The notebook that implements this fine-tuning approach, using Llama 3.1 and 3.2 as examples, is available here:
Trainable Tokens with LoRA: How Does It Work?