The Kaitchup – AI on a Budget

Fine-tune the Token Embeddings and the Language Modeling Head of Llama 3

If you have enough GPU RAM

Benjamin Marie
May 27, 2024

We can easily adapt a pre-trained large language model (LLM) to new tasks thanks to low-rank adaptation (LoRA). LoRA freezes the entire model and adds a small number of trainable parameters on top of it. By training only these new parameters instead of the entire model, LoRA and its quantized variant, QLoRA, save a lot of GPU memory and make it possible to fine-tune LLMs on consumer hardware.

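As a quick reminder, a typical QLoRA setup with Transformers, bitsandbytes, and PEFT looks roughly like the sketch below. The model name and hyperparameters are illustrative, not the exact configuration used in this article; note that, in this default setup, LoRA targets only the attention and MLP projections.

```python
# Minimal (Q)LoRA setup sketch: the base model is loaded in 4-bit and frozen,
# and only the small LoRA adapters added on top of it are trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA is applied only to the attention and MLP projections;
# embed_tokens and lm_head stay frozen in this default setup.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
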
QLoRA: Fine-Tune a Large Language Model on Your GPU (The Kaitchup, May 30, 2023)

LoRA is usually applied only to the attention and MLP modules. The token embeddings and the language modeling head remain unchanged after LoRA fine-tuning. This is often not ideal since the token embeddings learned during pre-training are very general, without any domain or task specialization. Some of them may even have been left untrained, as is the case, for instance, for some of the special tokens of Llama 3 8B.

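To illustrate this point, here is a small sketch (not taken from the article's notebook) that inspects the norms of Llama 3 8B's token embedding rows. The 256 special tokens occupy the ids from 128,000 upward; rows with a near-zero norm were most likely never updated during pre-training.

```python
# Sketch: compare the embedding norms of Llama 3's special tokens
# (token ids 128000 and above) with those of regular tokens.
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

embeddings = model.get_input_embeddings().weight.float()  # shape: [128256, 4096]
norms = embeddings.norm(dim=-1)

print("mean norm, regular tokens:", norms[:128000].mean().item())
print("mean norm, special tokens:", norms[128000:].mean().item())
# Rows with a near-zero norm were most likely never trained.
print("near-zero rows among special tokens:",
      (norms[128000:] < 1e-3).sum().item())
```
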
Ideally, we should retrain the token embeddings and the language modeling head to better adapt the model to a new task or domain.

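With PEFT, one way to do this, sketched below under the assumption that we keep the usual LoRA targets, is to list both modules in `modules_to_save`, so that full trainable copies of them are stored alongside the LoRA adapters. The exact configuration used for the experiments is in the notebook linked at the end of this article.

```python
# Sketch: retrain embed_tokens and lm_head alongside the LoRA adapters.
# modules_to_save keeps full trainable copies of these modules, which is
# why GPU memory consumption increases so much with Llama 3's large vocabulary.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```
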
But what is the cost of this retraining? Is it really worth it? When should we do it?


In this article, I investigate the impact of retraining the token embeddings and the language modeling head of Llama 3 during (Q)LoRA fine-tuning. We will see that, due to the large vocabulary of Llama 3, this retraining is indeed very costly in GPU memory, but still feasible on consumer hardware. More importantly, we will see that retraining the token embeddings and the language modeling head of Llama 3 can significantly improve the results of fine-tuning.

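As a back-of-the-envelope estimate (not the exact numbers measured in the notebook): with a vocabulary of 128,256 tokens and a hidden size of 4,096, the token embeddings and the language modeling head together hold roughly 1.05 billion parameters, and making them trainable adds their gradients and optimizer states on top of the weights.

```python
# Rough memory cost of making embed_tokens and lm_head trainable
# (naive estimate; actual usage depends on the training setup).
vocab_size, hidden_size = 128_256, 4_096
params = 2 * vocab_size * hidden_size           # embed_tokens + lm_head
weights_gb = params * 2 / 1e9                   # bf16 weights
grads_gb = params * 2 / 1e9                     # bf16 gradients
adam_gb = params * 8 / 1e9                      # fp32 AdamW states (m and v)
# An 8-bit optimizer (e.g., bitsandbytes' 8-bit AdamW) shrinks the state memory considerably.
print(f"{params / 1e9:.2f}B parameters")
print(f"~{weights_gb + grads_gb + adam_gb:.1f} GB extra, excluding activations")
```
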
The notebook demonstrating the impact of this retraining is available here:

Get the notebook (#73)
