The Kaitchup – AI on a Budget
TransMLA: Improve Qwen2.5 and Llama 3.x LLMs with DeepSeek's Multi-Head Latent Attention


Give your LLMs significantly more learning power

Benjamin Marie

Mar 05, 2025

Grouped Query Attention (GQA) modifies self-attention in large language models (LLMs) by having groups of query heads share a smaller number of key-value (KV) heads, which shrinks the KV cache and reduces memory usage. However, its impact on model quality varies, and the degradation can be significant depending on the model.
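
To make the grouping concrete, here is a minimal sketch of GQA's core mechanism. The dimensions are hypothetical (16 query heads sharing 4 KV heads), not taken from any specific model, and only the head-sharing logic is shown:

```python
# Minimal GQA sketch: 16 query heads share 4 key-value heads, so each KV head
# serves a group of 4 query heads. Dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 16, 4          # GQA: fewer KV heads than query heads
group_size = n_q_heads // n_kv_heads   # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # KV cache is 4x smaller
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # than with full MHA

# Each KV head is repeated so that every query head in its group attends to it.
k = k.repeat_interleave(group_size, dim=1)   # (1, 16, 8, 64)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)   # torch.Size([1, 16, 8, 64])
```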

Despite this, GQA remains a practical middle ground between Multi-Head Attention (MHA), which is computationally expensive, and Multi-Query Attention (MQA), which can significantly compromise quality. It enables faster inference, lower VRAM usage, and better scalability for large models. Most recent LLMs, like Llama 3.x and Qwen2.5, use GQA.
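
To put a number on the VRAM savings, here is a rough back-of-the-envelope comparison of KV-cache size for MHA, GQA, and MQA. The dimensions (32 layers, 32 query heads, 8 KV heads, head dimension 128, fp16 cache) are Llama-3-8B-like and are used here only as an illustrative assumption:

```python
# Approximate KV-cache size for MHA, GQA, and MQA at a given context length.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1024**3

seq_len, batch = 8192, 1
for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    print(f"{name}: {kv_cache_gib(32, n_kv, 128, seq_len, batch):.2f} GiB")
# MHA: 4.00 GiB, GQA: 1.00 GiB, MQA: 0.12 GiB
```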


Since the release of DeepSeek V3 and R1, an alternative approach, Multi-Head Latent Attention (MLA), has been gaining traction. MLA can outperform GQA in memory efficiency while delivering better accuracy than MHA. In this article, we will explore the mechanics of MLA and demonstrate how TransMLA, a recently proposed method, can efficiently convert a model's GQA layers into MLA, significantly enhancing the model's accuracy at minimal cost. We will then fine-tune the converted model with LoRA, optimizing it for its new MLA architecture, and analyze its memory consumption and performance compared with the original GQA model. We will do this with Qwen2.5, but the same could be done with Llama 3.x models.
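Before diving into the details, here is a minimal sketch of MLA's central idea: keys and values are compressed into a shared low-rank latent per token, and only that latent needs to be cached. The dimensions and layer names below are hypothetical, and the sketch omits DeepSeek's decoupled RoPE handling and other refinements; it is not TransMLA's actual implementation.

```python
# Core idea of Multi-Head Latent Attention (MLA): cache a small per-token
# latent instead of full keys and values. Dimensions are illustrative only.
import torch
import torch.nn as nn

d_model, d_latent = 2048, 256          # latent is much smaller than the model dim
n_heads, head_dim = 16, 128

W_down_kv = nn.Linear(d_model, d_latent, bias=False)          # compress to latent
W_up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # reconstruct keys
W_up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # reconstruct values

x = torch.randn(1, 8, d_model)         # (batch, seq_len, d_model)

c_kv = W_down_kv(x)                    # (1, 8, 256) -- only this goes in the KV cache
k = W_up_k(c_kv).view(1, 8, n_heads, head_dim)   # keys recovered from the latent
v = W_up_v(c_kv).view(1, 8, n_heads, head_dim)   # values recovered from the latent

print(c_kv.shape, k.shape, v.shape)
```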

The following notebook provides a full implementation of TransMLA, including a cost-effective fine-tuning process for MLA-based models:

Get the notebook (#148)
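
As a rough idea of what the LoRA stage of such a pipeline typically looks like with Hugging Face PEFT: the model path, rank, and target modules below are illustrative assumptions, not the notebook's actual settings.

```python
# Generic LoRA setup with Hugging Face PEFT; hyperparameters and module names
# are placeholders, not the notebook's configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/transmla-converted-model")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```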

Multi-Head Latent Attention: Why Is It Better?

This post is for paid subscribers
