Gemma 3n: Fine-Tuning, Inference, and Submodel Extraction
Running Gemma 3n with vLLM and fine-tuning with TRL
The Gemma 3n models are optimized for low-resource environments through selective parameter activation, which lets them run inference efficiently with just 2B or 4B active parameters. Despite their compact footprint, they accept a wide range of multimodal inputs, including text, images, audio, and video, and generate text outputs with a context window of up to 32K tokens. The models were also trained on data spanning 140 languages.
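To make that concrete, here is a minimal sketch of multimodal inference with the Transformers pipeline API, assuming the google/gemma-3n-E2B-it checkpoint and a Transformers release with Gemma 3n support (more on that below); the image URL is only a placeholder and the generation settings are illustrative, not the exact setup used later in this article.

```python
# Minimal sketch: image + text prompt with the Transformers pipeline API.
# Assumes the "google/gemma-3n-E2B-it" checkpoint and a Transformers release
# that includes Gemma 3n support; the image URL below is a placeholder.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/some-image.jpg"},  # replace with a real image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# The pipeline applies the chat template, encodes the image, and generates text.
output = pipe(text=messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])
```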
Google released both base and instruct variants under the commercial-use-friendly Gemma license.
First released as a preview in May 2025, Gemma 3n has been trending on the Hugging Face Hub ever since, despite limited framework support at the beginning. That changed last week, when Google released safetensors versions of the models along with official support in the Transformers library. This means that popular inference frameworks like vLLM and SGLang, which can use Transformers as a backend, are now able to run Gemma 3n, although not without a few hiccups, as we’ll see.
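As a quick preview of what that unlocks, here is a minimal sketch of text-only inference with vLLM, assuming the google/gemma-3n-E2B-it checkpoint and a vLLM version recent enough to load Gemma 3n; the model_impl flag and the sampling settings are assumptions that may vary between releases.

```python
# Minimal sketch: chat-style generation with vLLM.
# Assumes the "google/gemma-3n-E2B-it" checkpoint and a vLLM release that can
# load Gemma 3n; model_impl="transformers" forces the Transformers backend and
# may be unnecessary (or behave differently) depending on your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3n-E2B-it",
    model_impl="transformers",
    max_model_len=32768,  # Gemma 3n supports a context window of up to 32K tokens
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

messages = [
    {"role": "user", "content": "Explain selective parameter activation in one paragraph."},
]

# LLM.chat applies the model's chat template before generating.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```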
In this article, we’ll explore how Gemma 3n and its Matformer architecture work. We’ll walk through inference with the instruct (*-it) variant using vLLM, and demonstrate how to fine-tune the base model with TRL. I’ll also show how to extract specific subsets of weights, leveraging Matformer’s modular design.
For hands-on experiments with Gemma 3n, including inference with vLLM and fine-tuning workflows, check out this notebook: