Shrink LLMs with Vocabulary Reduction: From Gemma 3 270M to 141M
64k-token tokenizer, −129M params
State-of-the-art language models increasingly rely on large token vocabularies to cover many writing systems. Today, the vocabulary size of popular open models typically sits in the ~128k–150k range. Google’s Gemma 3 goes further with a tokenizer of 262,144 entries, designed to improve coverage for non-Latin scripts.
A big vocabulary helps because it shortens sequences for Chinese, Japanese, Korean, and many Indic languages, which tends to boost multilingual performance. But it’s not free. At each decoding step, the model must compute logits for the entire vocabulary. There’s no language-specific gating in a standard decoder. So even if you never generate Bengali, Korean, or Arabic characters, those probabilities are still calculated.
The cost also shows up in memory: with Gemma 3 27B (hidden size 5,376, vocab 262,144), the tied embedding/LM-head matrix has 5,376 × 262,144 = 1,409,286,144 parameters. In bfloat16, that’s ≈ 2.82 GB just for that matrix (≈ 5.6 GB if the embeddings and LM head are untied).
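To make that concrete, here is a toy sketch of what a single decoding step has to do for the LM head, using the Gemma 3 27B shapes above with random stand-in weights (note that it allocates the full ≈2.8 GB bfloat16 matrix):

```python
import torch

# Gemma 3 27B shapes from above; random weights, purely illustrative.
hidden_size, vocab_size = 5_376, 262_144

h = torch.randn(1, hidden_size, dtype=torch.bfloat16)                 # final hidden state at one step
lm_head = torch.randn(vocab_size, hidden_size, dtype=torch.bfloat16)  # tied embedding / LM-head matrix (~2.8 GB)

logits = h @ lm_head.T  # shape (1, 262144): one logit per vocabulary entry, whether you use it or not
print(logits.shape)
print(f"~{2 * hidden_size * vocab_size / 1e9:.1f} GFLOPs per generated token for this projection alone")
```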
For Gemma 3 27B, those ~1.4B embedding/LM-head parameters are only ~5.2% of the total. For Gemma 3 270M, the ratio flips: roughly 170M parameters sit in the embedding table versus ~100M across the transformer blocks, so embeddings make up about 63% of the model.
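The same back-of-the-envelope numbers in a few lines of Python; the 270M hidden size of 640 and the ~268M parameter total are my assumptions, chosen to be consistent with the rounded figures above:

```python
BYTES_PER_PARAM_BF16 = 2

def embedding_share(hidden_size, vocab_size, total_params):
    embed = hidden_size * vocab_size
    return embed, embed * BYTES_PER_PARAM_BF16 / 1e9, embed / total_params

# Gemma 3 27B: hidden 5,376, vocab 262,144, ~27B parameters in total
print("27B :", embedding_share(5_376, 262_144, 27e9))   # -> 1,409,286,144 params, ~2.82 GB, ~5.2%

# Gemma 3 270M: hidden 640 (assumed), vocab 262,144, ~268M parameters in total (assumed)
print("270M:", embedding_share(640, 262_144, 268e6))    # -> 167,772,160 params, ~0.34 GB, ~63%
```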
So why spend so much compute and memory on tokens you’ll never use?
If your application targets a narrow language set, you can shrink the vocabulary and claw back speed, memory, and cost, often with no loss on your target tasks after fine-tuning. The recipe is straightforward (a code sketch follows the list):
1. train a tokenizer tailored to your languages and domains;
2. keep the embeddings for any tokens that survive in the new tokenizer;
3. initialize embeddings for new tokens (randomly or with a smarter scheme);
4. resize the LM head (tied or untied);
5. fine-tune on your data.
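Here is a minimal sketch of steps 1–4 with the Hugging Face stack. The checkpoint name, the 64k target size, the toy corpus iterator, and the mean-of-sub-token initialization for new tokens are illustrative assumptions, not the exact code from the repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"  # assumed checkpoint name
old_tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 1) Train a smaller tokenizer on text from your target languages/domains.
def corpus_iterator():
    # Replace with a stream over your English/French training corpus.
    yield from ["Hello world.", "Bonjour le monde."]

new_tok = old_tok.train_new_from_iterator(corpus_iterator(), vocab_size=64_000)

# 2) + 3) Build the new embedding matrix: copy rows for tokens that survive,
#         initialize new tokens from the mean of their old sub-token embeddings.
old_emb = model.get_input_embeddings().weight.data
old_vocab = old_tok.get_vocab()
new_emb = torch.empty(len(new_tok), old_emb.shape[1], dtype=old_emb.dtype)

for token, new_id in new_tok.get_vocab().items():
    if token in old_vocab:                                   # surviving token: reuse its row
        new_emb[new_id] = old_emb[old_vocab[token]]
    else:                                                    # new token: average its old pieces
        text = new_tok.convert_tokens_to_string([token])
        piece_ids = old_tok.convert_tokens_to_ids(old_tok.tokenize(text))
        new_emb[new_id] = old_emb[piece_ids].mean(0) if piece_ids else old_emb.mean(0)

# 4) Resize the model to the new vocabulary and copy the matrix in.
#    Gemma 3 ties input embeddings and LM head, so resizing covers both.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
model.tie_weights()

model.save_pretrained("gemma-3-141m-64k")
new_tok.save_pretrained("gemma-3-141m-64k")
# 5) Fine-tune on your data (see below).
```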
This works very well. And for a small model like Gemma 3 270M, the whole procedure is feasible on a consumer GPU!
In this article, I walk through this approach end-to-end on Gemma 3 270M, shrinking the vocabulary from ~262k to 64k tokens, which turns the model into an effectively ~141M-parameter variant. To show that the smaller model still performs well once fine-tuned, I trained it on English→French translation, preserving English and French coverage when building the new tokenizer; the same recipe applies to other languages or domains. The resulting model is tiny, yet its translation quality is not far from that of commercial translation systems.
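For reference, here is a sketch of the fine-tuning step, assuming the shrunken `model` and `new_tok` from the previous sketch; the Helsinki-NLP/opus_books en-fr split, the prompt template, and the hyperparameters are stand-ins, not the exact setup behind the reported results:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# `model` and `new_tok` come from the vocabulary-shrinking sketch above.
ds = load_dataset("Helsinki-NLP/opus_books", "en-fr", split="train")

def to_text(example):
    pair = example["translation"]
    # Simple prompt template; the exact format is a design choice.
    return {"text": f"Translate to French: {pair['en']}\n{pair['fr']}{new_tok.eos_token}"}

def tokenize(batch):
    return new_tok(batch["text"], truncation=True, max_length=512)

tokenized = (
    ds.map(to_text)
      .map(tokenize, batched=True, remove_columns=ds.column_names + ["text"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma-3-141m-enfr",
        per_device_train_batch_size=16,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(new_tok, mlm=False),
)
trainer.train()
```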
You can find my code here: