Gemma 4 31B and 26B A4B: Architecture and Memory Consumption
The Weekly Kaitchup #136
Hi everyone,
In this edition of The Weekly Kaitchup, let’s discuss:
Gemma 4
The release of Trinity-Large-Thinking
Gemma 4
Finally, Google has released Gemma 4.
There is a lot to discuss, especially the smaller E2B and E4B models. But in this post, I want to focus on the larger variants:
Gemma 4 31B, a dense model
Gemma 4 26B A4B, an MoE
I’ll publish a deeper analysis next week, including my own benchmarks, efficiency measurements, and deployment notes. The experiments are already running.
For now, I want to look at two things: how these models are built, and how much memory they will consume on your machine.
A Zero-Risk Architecture?
I’m a bit disappointed by the lack of novelty in these models. We won’t learn much from them. They are very standard.
Gemma 4 31B is mostly an evolution of Gemma 3 27B. The MLP/nonlinearity/norm stack is very similar, while the attention stack and multimodal stack are where the small changes are.
Gemma 4 31B keeps the same high-level 5-local : 1-global attention pattern, but makes the global layers more specialized. The global layers feature unified Keys and Values and use Proportional RoPE (p-RoPE) for long-context efficiency.
Gemma 4 still soft-caps final logits.
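If you haven't seen soft-capping before: the final logits are squashed through a tanh so that no single logit can grow without bound. Here is a minimal sketch of the usual tanh formulation from earlier Gemma releases; the cap value of 30.0 is a placeholder, not Gemma 4's actual config value:

```python
import torch

def soft_cap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Smoothly squash logits into (-cap, +cap); cap=30.0 is a placeholder value."""
    return cap * torch.tanh(logits / cap)
```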
As for the 26B A4B: yes, it is a real sparse MoE, but not as sparse as Qwen3.5 35B, which has 512 experts. Gemma 4 26B A4B activates 8 experts out of 128 total, plus 1 shared expert.
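To make that sparsity concrete, here is a generic top-k MoE sketch with 8 routed experts out of 128 plus one always-active shared expert. This illustrates the pattern only; it is not Gemma 4's actual implementation:

```python
import torch

def moe_forward(x, router, experts, shared_expert, top_k=8):
    """Generic top-k MoE layer sketch (illustrative, not Gemma 4's code).

    x: (tokens, hidden); router: Linear(hidden -> 128);
    experts: list of 128 small MLPs; shared_expert: one always-active MLP.
    """
    scores = torch.softmax(router(x), dim=-1)   # (tokens, 128) routing probabilities
    weights, idx = scores.topk(top_k, dim=-1)   # keep the 8 best experts per token
    outputs = []
    for t in range(x.size(0)):                  # naive per-token dispatch
        y = shared_expert(x[t])                 # the shared expert sees every token
        for w, e in zip(weights[t], idx[t]):
            y = y + w * experts[int(e)](x[t])   # weighted sum of the routed experts
        outputs.append(y)
    return torch.stack(outputs)
```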
GPU Memory Consumption of Gemma 4 31B and 26B
If you are going to use the models without quantization, at 16 bits per weight, they will consume (a quick sanity check in code follows the list):
google/gemma-4-31B-it: 61 GB
google/gemma-4-26B-A4B-it: 50 GB
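As that sanity check, the weight footprint is roughly parameter count times bytes per parameter; the small gap with the published checkpoint sizes comes down to the exact parameter counts:

```python
def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (2 bytes per parameter = bf16/fp16)."""
    return params_billion * bytes_per_param

print(weight_gb(31))       # ~62 GB, close to the ~61 GB checkpoint
print(weight_gb(26))       # ~52 GB, close to the ~50 GB checkpoint
print(weight_gb(31, 0.5))  # ~15.5 GB for idealized 4-bit weights; real 4-bit checkpoints are larger
```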
Since the models are standard, we already have some quantized checkpoints available:
nvidia/Gemma-4-31B-IT-NVFP4: 32.7 GB (very conservative quantization by NVIDIA)
cyankiwi/gemma-4-31B-it-AWQ-4bit: 20.5 GB
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit: 17.2 GB
I’ll evaluate them all, along with a subsample of the GGUF versions from different providers.
The relevant config values for the KV cache math (at max context length) are below; I also collect them into a small Python snippet right after the two lists:
31B:
max_position_embeddings=262144
num_hidden_layers=60
num_key_value_heads=16
head_dim=256
num_global_key_value_heads=4
global_head_dim=512
sliding_window=1024, with a 5-local / 1-global pattern in layer_types (so 50 sliding layers and 10 full layers)
26B-A4B:
max_position_embeddings=262144
num_hidden_layers=30
num_key_value_heads=8
head_dim=256
num_global_key_value_heads=2
global_head_dim=512
sliding_window=1024, again 5-local / 1-global (so 25 sliding layers and 5 full layers)
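For readers who want to redo the math, here are the same values collected into plain Python; the num_sliding_layers / num_full_layers split just follows the 5-local : 1-global pattern:

```python
# Config values from above, used for the KV-cache math further down.
GEMMA4_31B = dict(
    max_position_embeddings=262_144,
    num_hidden_layers=60,
    num_key_value_heads=16,        # KV heads of the sliding (local) layers
    head_dim=256,
    num_global_key_value_heads=4,  # KV heads of the full (global) layers
    global_head_dim=512,
    sliding_window=1024,
    num_sliding_layers=50,         # 5-local : 1-global over 60 layers
    num_full_layers=10,
)

GEMMA4_26B_A4B = dict(
    max_position_embeddings=262_144,
    num_hidden_layers=30,
    num_key_value_heads=8,
    head_dim=256,
    num_global_key_value_heads=2,
    global_head_dim=512,
    sliding_window=1024,
    num_sliding_layers=25,         # 5-local : 1-global over 30 layers
    num_full_layers=5,
)
```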
As with Qwen3.5, we can see that the MoE will consume much less memory: it has fewer KV heads and fewer layers, while both models use the same head dimensions.
For sliding layers, the sliding-window cache only keeps min(max_cache_len, sliding_window) tokens.
So the per-layer bf16 storage works out to:
31B sliding layer:
\(2 \times 1024 \times 16 \times 256 \times 2\) over the whole window, or 16,384 bytes per token per layer.
31B full layer:
\(2 \times 262144 \times 4 \times 512 \times 2\) over the whole context, or 8,192 bytes per token per layer. Gemma 4 full-attention layers use global_head_dim and num_global_key_value_heads.
26B sliding layer: 8,192 bytes per token per layer.
26B full layer: 4,096 bytes per token per layer.
That gives these steady-state, sliding-window-aware totals at max context:
Gemma 4 31B
10 × 262,144 × 8,192 + 50 × 1,024 × 16,384 = 22,313,697,280 bytes = 20.78 GiB
Gemma 4 26B-A4B
5 × 262,144 × 4,096 + 25 × 1,024 × 8,192 = 5,578,424,320 bytes = 5.20 GiB
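If you want to verify these numbers yourself, here is a small sketch that reproduces them from the config dictionaries above (bf16 cache, i.e., 2 bytes per stored value):

```python
GIB = 1024 ** 3

def kv_cache_bytes(cfg: dict, context_len: int, bytes_per_value: int = 2) -> int:
    """Steady-state KV cache for a mix of full-attention and sliding-window layers."""
    # Full (global) layers keep keys and values for the whole context.
    full = (2 * context_len * cfg["num_global_key_value_heads"]
            * cfg["global_head_dim"] * bytes_per_value) * cfg["num_full_layers"]
    # Sliding (local) layers only keep min(context_len, sliding_window) tokens.
    kept = min(context_len, cfg["sliding_window"])
    sliding = (2 * kept * cfg["num_key_value_heads"]
               * cfg["head_dim"] * bytes_per_value) * cfg["num_sliding_layers"]
    return full + sliding

print(kv_cache_bytes(GEMMA4_31B, 262_144) / GIB)      # ~20.78 GiB
print(kv_cache_bytes(GEMMA4_26B_A4B, 262_144) / GIB)  # ~5.20 GiB
```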
So the 31B’s KV cache consumes 25% more memory than Qwen3.5 27B, while the 26B consumes slightly less memory than Qwen3.5 35B.
Gemma 4: Weights + KV cache
With the original, unquantized model, you will need a 96 GB GPU (e.g., RTX Pro 6000) to fit a full sequence for Gemma 4 31B. With the same amount of memory, you should be able to run 8 or 9 concurrent queries with the 26B.
Quantized to 4-bit, e.g., as Unsloth Q4 or cyankiwi's AWQ, the 31B's weights consume between 17 and 20 GB, i.e., the full-context KV cache dominates memory consumption. If you also quantize the KV cache to 8-bit, the model will fit on a 32 GB GPU; otherwise, depending on the 4-bit variant you choose, the total will be between 37 and 40 GB.
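For a quick go/no-go check on a given GPU, a tiny helper like this is enough (treating GB and GiB loosely, and ignoring runtime/allocator overhead):

```python
def fits(gpu_gb: float, weights_gb: float, kv_cache_gb: float) -> bool:
    """Very rough check: do the weights plus one full-context KV cache fit?"""
    return weights_gb + kv_cache_gb <= gpu_gb

# cyankiwi AWQ 4-bit weights (~20.5 GB) with the full bf16 cache (~21 GiB):
print(fits(48, 20.5, 20.78))      # True: fine on a 48 GB GPU
print(fits(32, 20.5, 20.78))      # False: too much for a 32 GB GPU
# Same weights with the KV cache quantized to 8-bit (roughly half the size):
print(fits(32, 20.5, 20.78 / 2))  # True: fits on a 32 GB GPU, as noted above
```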
Under one apples-to-apples assumption set (batch 1, bf16 weights, bf16 cache, max text context = 262,144, and ignoring allocator/runtime overhead), Qwen3.5-27B comes out noticeably lighter than Gemma 4 31B. Using the released checkpoint sizes plus the long-context cache math, you get roughly:
Qwen3.5-27B: ~67.8 GiB
Gemma 4 31B: ~79.1 GiB
So Gemma 4 31B is about 11.3 GiB heavier at full context on these assumptions.
Full analysis next week!
Trinity-Large-Thinking
Trinity-Large-Thinking was released by Arcee.
The main change from Trinity-Large-Preview is that the model now uses a “thinking” stage before answering, with the goal of improving multi-turn tool use, context coherence, instruction following, and stability in long agent runs.
It’s a sparse MoE model with 400B total parameters and 13B active parameters per token.
60 transformer layers, with the first 6 as dense layers
model dimension 3072, FFN intermediate dimension 12288
48 attention heads, 8 KV heads
head dimension of 128, 256 routed experts, 1 shared expert, and top-4 expert activation per token
No MLA, no DSA, no hybrid architecture. To limit the KV cache size, it relies on a repeated 3:1 local/global attention pattern: sliding-window attention (4,096 tokens) in the local layers and no positional embeddings in the global layers (see the small sketch below for the resulting layer split).
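To see where the 45-local / 15-global split used in the math below comes from, here is a tiny sketch of a repeated 3-local : 1-global pattern over 60 layers (the function and labels are illustrative, not Arcee's config):

```python
def build_layer_types(num_layers: int = 60, locals_per_global: int = 3) -> list[str]:
    """Repeat `local, local, local, global` across the stack (illustrative only)."""
    pattern = ["local"] * locals_per_global + ["global"]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

layers = build_layer_types()
print(layers.count("local"), layers.count("global"))  # 45 15
```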
For KV-cache math, we have:
K+V bytes per token per layer = 2 × n_kv_heads × head_dim × bytes_per_value = 2 × 8 × 128 × 2 = 4,096 bytes
That is 4 KiB per token per layer. Trinity has 45 local sliding-window layers and 15 global layers. For the local layers, only the 4,096-token window has to be kept. For the global layers, the full 262,144-token context has to be kept. So the total per-sequence KV cache at max context is:
KV total = 2 × bytes × n_kv_heads × head_dim × (L_global × T + L_local × W)
= 2 × 2 × 8 × 128 × (15 × 262,144 + 45 × 4,096)
= 16,861,102,080 bytes = 15.703125 GiB
So the practical summary is: 1 GiB per global layer, 16 MiB per local layer, 15.70 GiB total per sequence at 256K in bf16/fp16. If all 60 layers were full attention, the same model would need 60 GiB of KV cache, so the sliding-window pattern cuts the KV-cache requirement by about 74%. This is roughly the same KV-cache size as Qwen3.5 27B.
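The same arithmetic in code, as a quick cross-check of the total and of the all-full-attention counterfactual (bf16 cache, 256K context):

```python
GIB = 1024 ** 3

def trinity_kv_cache_bytes(context_len=262_144, window=4_096,
                           n_global=15, n_local=45,
                           n_kv_heads=8, head_dim=128, bytes_per_value=2) -> int:
    """K+V cache for a mix of full-attention and sliding-window layers."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value          # 4,096 bytes
    return per_token * (n_global * context_len + n_local * min(window, context_len))

total = trinity_kv_cache_bytes()
print(total, total / GIB)                                    # 16861102080  ~15.70 GiB
print(trinity_kv_cache_bytes(n_global=60, n_local=0) / GIB)  # 60.0 GiB if every layer were global
```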
They didn’t directly compare the model to Qwen3.5 397B, even though they are of similar size.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we reviewed:
⭐Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!