LoRA Trainable Tokens: Save Memory, Improve Accuracy for Your Domain
How to teach an LLM to use your new tokens without fully retraining the token embeddings
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large pre-trained models. Unlike standard full fine-tuning, which updates all model parameters, LoRA freezes the entire model and introduces only a small set of trainable parameters: low-rank matrices attached to specific layers or modules, typically the attention projections. This allows efficient adaptation with minimal memory overhead.
Because LoRA only stores gradients and optimizer states for these trainable parameters, it consumes significantly less memory than full fine-tuning. However, since the token embeddings remain frozen, it cannot accommodate new tokens: any tokens added to the vocabulary keep their randomly initialized, untrained embeddings.
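To make the limitation concrete, here is a minimal sketch (not taken from the article's notebook) of a standard LoRA setup in which a newly added token ends up with an untrained, frozen embedding. The model name, the <domain_tag> token, and the target modules are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-1B"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Add a hypothetical domain-specific token and resize the embedding matrix.
# The new row is randomly initialized and therefore untrained.
tokenizer.add_special_tokens({"additional_special_tokens": ["<domain_tag>"]})
model.resize_token_embeddings(len(tokenizer))

# Standard LoRA: low-rank adapters on the attention projections only.
# The embedding matrix (and the new token's row) stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # embeddings and lm_head are not trainable
```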
In a previous article, we explored how to use LoRA fine-tuning while fully retraining the token embeddings and the language modeling head.
This approach enables the model to effectively handle special tokens, such as those used in chat templates or for specific domains. While effective, it requires significantly more memory, since all embedding and language modeling head parameters are made trainable.
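For reference, that earlier setup can be sketched as follows. It relies on PEFT's modules_to_save option to make full, trainable copies of the embedding matrix and the language modeling head; the module names assume a Llama-style architecture, and the hyperparameters are placeholders rather than the article's own configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative model; loaded as in the previous snippet.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Full copies of these modules become trainable alongside the LoRA adapters.
    # Their gradients and optimizer states are what drive up memory consumption.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```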
In this article, we will explore a new alternative from the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library. Rather than retraining the full embeddings and language modeling head, this method focuses solely on updating the embeddings for the special tokens the model needs to learn. We will first examine how this technique works, its limitations, and its memory efficiency. Finally, we will compare its performance to full retraining.
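As a quick preview before the detailed walkthrough, here is a hedged sketch of what this looks like with PEFT, assuming a recent PEFT release that supports the trainable_token_indices argument of LoraConfig. Only the embedding rows of the listed token ids are trained; the rest of the embedding matrix stays frozen. The model name and the <domain_tag> token are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-1B"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the new token and resize the embeddings, as before.
tokenizer.add_special_tokens({"additional_special_tokens": ["<domain_tag>"]})
model.resize_token_embeddings(len(tokenizer))
new_token_ids = tokenizer.convert_tokens_to_ids(["<domain_tag>"])

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Map the embedding module name (Llama-style) to the token ids whose
    # embedding rows should receive gradients.
    trainable_token_indices={"embed_tokens": new_token_ids},
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # far fewer trainable parameters than with modules_to_save
```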
The notebook that implements this fine-tuning approach, using Llama 3.1 and 3.2 as examples, is available here:
Trainable Tokens with LoRA: How Does It Work?