The Kaitchup – AI on a Budget

Shrink LLMs with Vocabulary Reduction: From Gemma 3 270M to 141M

64k-token tokenizer, −129M params

Benjamin Marie
Sep 29, 2025

State-of-the-art language models increasingly rely on large token vocabularies to cover many writing systems. Today, the vocabulary size of popular open models typically sits in the ~128k–150k range. Google’s Gemma 3 goes further with a tokenizer of 262,144 entries, designed to improve coverage for non-Latin scripts.

A big vocabulary helps because it shortens sequences for Chinese, Japanese, Korean, and many Indic languages, which tends to boost multilingual performance. But it’s not free. At each decoding step, the model must compute logits for the entire vocabulary. There’s no language-specific gating in a standard decoder. So even if you never generate Bengali, Korean, or Arabic characters, those probabilities are still calculated.
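A quick way to see this, assuming access to the google/gemma-3-270m checkpoint on Hugging Face: the logits tensor has one entry per vocabulary item at every position, regardless of which languages appear in the prompt.

```python
# Sketch: the LM head produces one logit per vocabulary entry at every step,
# whatever language the input is in (assumes access to google/gemma-3-270m).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, seq_len, vocab_size) with vocab_size ≈ 262k
```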

The cost also shows up in memory: with Gemma 3 27B (hidden size 5,376, vocab 262,144), the tied embedding/LM-head matrix has 5,376 × 262,144 = 1,409,286,144 parameters. In bfloat16, that’s ≈ 2.82 GB just for that matrix (≈ 5.6 GB if the embeddings and LM head are untied).

For Gemma 3 27B, those ~1.409 B embedding/LM-head parameters are ~5.2% of the total. For Gemma 3 270M, the ratio flips: roughly 170 M parameters in the embedding table versus ~100 M across the transformer blocks, about 63% embeddings.
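To make these numbers concrete, here is the arithmetic as a short script. The 5,376 hidden size for Gemma 3 27B comes from the text above; 640 for Gemma 3 270M is my assumption based on the released config.

```python
# Back-of-the-envelope parameter/memory math from the paragraphs above.
VOCAB = 262_144

def embedding_params(hidden_size: int, vocab_size: int = VOCAB) -> int:
    """One row of `hidden_size` weights per vocabulary entry."""
    return hidden_size * vocab_size

# Gemma 3 27B: tied embedding/LM-head matrix
p27 = embedding_params(5_376)
print(p27, p27 * 2 / 1e9)     # 1,409,286,144 params, ~2.82 GB in bfloat16

# Gemma 3 270M: embeddings dominate the parameter budget
# (hidden size 640 is an assumption; ~100M params in the transformer blocks)
p270 = embedding_params(640)
print(p270 / (p270 + 100e6))  # ~0.63 -> about 63% of the model is embeddings
```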

So why spend so much compute and memory on tokens you’ll never use?

If your application targets a narrow language set, you can shrink the vocabulary and claw back speed, memory, and cost, often with no loss on your target tasks after fine-tuning. The recipe is straightforward (a minimal code sketch follows the list):

  • train a tokenizer tailored to your languages and domains;

  • keep the embeddings for any tokens that survive in the new tokenizer;

  • initialize embeddings for new tokens (randomly or with a smarter scheme);

  • resize the LM head (tied or untied);

  • then fine-tune on your data.
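
To make these steps concrete, here is a minimal sketch with Hugging Face transformers. It is illustrative rather than the notebook’s exact code: the 64k target size, the mean-of-old-embeddings initialization for new tokens, and the toy corpus iterator are placeholders, and it assumes the fast Gemma tokenizer can be retrained with train_new_from_iterator.

```python
# Minimal sketch of the recipe above with Hugging Face transformers.
# Assumptions (mine, not necessarily the notebook's settings): the fast Gemma
# tokenizer supports train_new_from_iterator, the target vocabulary is 64k,
# and unseen tokens are initialized with the mean of the old embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"
old_tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 1) Train a smaller tokenizer on your English/French corpus (iterator of strings).
def corpus_iterator():
    yield from ["Hello world.", "Bonjour le monde."]  # replace with real data

new_tok = old_tok.train_new_from_iterator(corpus_iterator(), vocab_size=64_000)

# 2) Keep embeddings of surviving tokens; mean-initialize the rest.
old_emb = model.get_input_embeddings().weight.data        # [old_vocab, hidden]
hidden = old_emb.shape[1]
new_emb = old_emb.mean(dim=0, keepdim=True).repeat(len(new_tok), 1)
old_vocab = old_tok.get_vocab()                           # token string -> old id
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

# 3) Swap the embedding table; re-tie the LM head (Gemma 3 ties them, so no
#    separate head resize is needed; an untied model would need one).
embedding = torch.nn.Embedding(len(new_tok), hidden, dtype=old_emb.dtype)
embedding.weight.data.copy_(new_emb)
model.set_input_embeddings(embedding)
model.tie_weights()
model.config.vocab_size = len(new_tok)
model.config.bos_token_id = new_tok.bos_token_id
model.config.eos_token_id = new_tok.eos_token_id
model.config.pad_token_id = new_tok.pad_token_id

model.save_pretrained("gemma-3-270m-64k")
new_tok.save_pretrained("gemma-3-270m-64k")
# 4) Fine-tune the saved model on your target data as usual.
```

Mean initialization keeps the new rows on the same scale as the pretrained embeddings, which tends to make the subsequent fine-tuning more stable than purely random initialization.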

This works very well. And for a small model like Gemma 3 270M, this is all feasible with a consumer GPU!

In this article, I walk through this approach end-to-end on Gemma 3 270M, reducing the vocabulary from ~262k to 64k tokens. The result is effectively a ~141M-parameter variant. To show that it can still perform well once fine-tuned, I trained it on English→French translation, preserving coverage for English and French when building the tokenizer; you can apply the same recipe to other languages or domains. The resulting model is tiny but delivers translation quality not too far from commercial translation systems.

You can find my code here:

Get the notebook (#184)

Training a New Vocabulary for Language Models

This post is for paid subscribers
