LoRA Trainable Tokens: Save Memory, Improve Accuracy for Your Domain

How to teach an LLM to use your new tokens without fully retraining the token embeddings

Benjamin Marie · Apr 03, 2025

[Header image generated with ChatGPT]

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large pre-trained models. Unlike standard full fine-tuning, which updates all model parameters, LoRA freezes the entire model and introduces only a small set of trainable parameters. These parameters are added to specific layers or modules of the model, which allows for efficient adaptation with minimal memory overhead.
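To make this concrete, here is a minimal sketch of standard LoRA with Hugging Face PEFT. The model name, rank, and target modules below are illustrative choices, not a prescribed setup:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=16,                     # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor applied to the LoRA update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable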

Because LoRA stores optimizer states and gradients only for the trainable parameters, it consumes significantly less memory than full fine-tuning. However, since the token embeddings remain frozen, standard LoRA cannot accommodate new tokens: any tokens added to the vocabulary keep untrained, randomly initialized embeddings.
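The problem is easy to reproduce. In the sketch below, the added token (a hypothetical placeholder) gives the embedding matrix a new row, but with vanilla LoRA that row is frozen and never receives gradient updates:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Add a new special token and grow the embedding matrix accordingly.
tokenizer.add_tokens(["<domain_term>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# The new embedding row is randomly initialized. Since standard LoRA freezes
# the embedding matrix, it would stay random noise throughout fine-tuning.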

In a previous article, we explored how to use LoRA fine-tuning while fully retraining the token embeddings and the language modeling head.

Fine-tune the Token Embeddings and the Language Modeling Head of Llama 3 (Benjamin Marie, May 27, 2024)

This approach enables the model to effectively handle new special tokens, such as those used in chat templates or domain-specific vocabulary. While effective, it requires significantly more memory, since all the embedding and language modeling head parameters are made trainable.
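In PEFT terms, that earlier approach corresponds to marking whole modules as trainable via modules_to_save. This is a sketch of the idea, not the exact code from that article:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Full, trainable copies of the embedding matrix and LM head are kept,
    # which is what drives the extra memory cost.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)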


In this article, we will explore a new alternative from the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library. Rather than retraining the full embeddings and language modeling head, this method updates only the embeddings of the new tokens the model needs to learn. We will first examine how the technique works, its limitations, and its memory efficiency, and then compare its performance to full retraining.

The notebook that implements this fine-tuning approach, using Llama 3.1 and 3.2 as examples, is available here:

Get the notebook (#155)

Trainable Tokens with LoRA: How Does It Work?
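As a rough preview of the technique, recent PEFT releases (around v0.15.0 and later) expose a trainable_token_indices argument on LoraConfig that trains only the embedding rows of the listed token IDs while the rest of the embedding matrix stays frozen. The sketch below assumes that API and uses a hypothetical token; the full, tested walkthrough is in the notebook above:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer.add_tokens(["<domain_term>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
new_ids = tokenizer.convert_tokens_to_ids(["<domain_term>"])

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Train only the embedding rows of the new token IDs; all other rows
    # remain frozen, so there is no full copy of the embedding matrix.
    trainable_token_indices={"embed_tokens": new_ids},
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()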
