The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared
A memory-first look at four efficient open LLM architectures.
Inference efficiency has become one of the main ways LLMs distinguish themselves. Two models can look similar on paper: roughly the same parameter count, similar context length, both marketed as fast, yet behave very differently once you actually try to serve them at scale. In this article, we focus on four recent open models built for efficient deployment using different architectures:
Qwen3.5-35B-A3B
GLM-4.7-Flash
Nemotron-3-Nano-30B-A3B
Qwen3-30B-A3B-Instruct-2507
All four are Mixture-of-Experts (MoE) models, but each uses a different attention architecture.
In an MoE, the model contains multiple specialized sub-networks called experts, but only a small subset of them is activated for each token. That gives the model access to a large total parameter count while keeping the amount of computation performed per token much lower than in a dense model of comparable size. In other words, MoE is one of the main tricks that lets these models stay strong in quality while remaining practical for inference.
But efficient inference is not only about how many parameters are active. It is also about memory movement, batchability, and especially the cost of the KV cache. During text generation, the model stores the attention keys and values produced for previous tokens so it can reuse them instead of recomputing them at every decoding step. This is what makes autoregressive generation fast enough to be usable, but it also creates a major memory bottleneck: as context length and concurrency grow, the KV cache can dominate serving cost. That is why recent architectures put so much emphasis not just on reducing compute, but on reducing, compressing, or avoiding KV storage wherever possible.
In this article, we will examine how these four MoE models approach inference efficiency in practice, beginning with the architectural ideas that make them attractive to serve and then zooming in on the specific KV-cache strategy used by each one. We will compare standard grouped-query attention (GQA), hybrid designs that use full attention only in some layers, and compressed Multi-head Latent Attention (MLA) caching, and we will translate each approach into a simple formula you can use to estimate memory usage for different context lengths and levels of concurrency.
I made a notebook to estimate the memory consumption at inference time for each one of these 4 models:
How to Compute the Standard KV Cache Memory Consumption
A useful starting point is the standard BF16 transformer with vanilla multi-head attention, where every query head has its own KV head. In that case, the KV cache size is:
KV cache bytes = 2 × B × T × L × H × D × 2

where B is the number of concurrent sequences, T the context length, L the number of layers, H the number of attention heads (each query head has its own KV head), and D the head dimension. The first 2 accounts for K and V, and the last 2 is the number of bytes per BF16 element.
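As a quick sanity check, here is a minimal Python sketch of that formula; the function name and the example configuration below are mine and purely illustrative, not taken from the notebook mentioned above.

```python
def kv_cache_bytes(batch, tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Standard attention KV cache: 2 (K and V) x B x T x L x H_kv x D x bytes per element."""
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Vanilla multi-head attention: kv_heads equals the total number of query heads.
# GQA reuses the same formula with a smaller kv_heads value.
# Hypothetical example: 32 layers, 32 heads, head_dim 128, 32k context, 1 sequence.
print(kv_cache_bytes(1, 32_768, 32, 32, 128) / 2**30, "GiB")  # 16.0 GiB
```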
From there, each of the four models mentioned in the introduction improves efficiency by deviating from that vanilla-attention baseline in a different way.
The KV Cache Math: GQA, MLA, and Hybrid LLMs
Qwen3-30B-A3B-Instruct-2507
Note: the same math applies to the “thinking” version.
Why inference is efficient
Qwen3-30B-A3B-Instruct-2507 is efficient in the classic MoE + GQA way:
48 layers
32 query heads / 4 KV heads
native 262,144-token context
Compared with a dense transformer of similar nominal size, MoE reduces active compute per token, and GQA reduces KV cache size.
What its KV cache really is
This is the cleanest case of the four. Every layer contributes to standard attention KV storage, but because the model has only 4 KV heads instead of 32, the cache is much smaller than in vanilla multi-head attention.
Let:
B = concurrent queries
T = context length
L = 48
Hkv = 4
D = 128
BF16 bytes = 2
Then:

KV cache bytes = 2 × B × T × L × Hkv × D × 2

For this model:

2 × 48 × 4 × 128 × 2 = 98,304
So the KV cache grows by 98,304 bytes per token per concurrent sequence.
At 32k context, batch/concurrency = 1, that is about 3.22 GB (or 3.0 GiB).
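As a quick check of these numbers, a minimal sketch using the values listed above:

```python
# Qwen3-30B-A3B-Instruct-2507: 48 layers, 4 KV heads, head_dim 128, BF16
per_token = 2 * 48 * 4 * 128 * 2     # K and V x layers x KV heads x head_dim x bytes
print(per_token)                     # 98,304 bytes per token per sequence
print(per_token * 32_768 / 1e9)      # ~3.22 GB at 32k context, 1 sequence
print(per_token * 32_768 / 2**30)    # ~3.0 GiB
```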
Among these four, Qwen3-30B-A3B-Instruct-2507 is the most conventional. GQA reduces KV memory. It is easy to understand and easy to serve, but it does not get the deeper architectural KV savings of the more specialized hybrid models.
GLM-4.7-Flash
Why inference is efficient
GLM-4.7-Flash is also a 30B-A3B MoE model, but its key memory-saving feature is MLA rather than standard GQA:
47 layers
20 attention heads
kv_lora_rank = 512
qk_rope_head_dim = 64
Instead of storing full K and V tensors for each attention head, MLA stores:
a compressed latent KV vector (kvc), and
a small decoupled positional key (kpe).
Under an optimized MLA runtime, the model reconstructs what it needs during attention rather than caching expanded per-head tensors.
What its KV cache really is
This is the subtle part. For GLM, “KV cache size” is not just a model property; it is also an implementation property.
In a naïve implementation, the model can look much more expensive because the latent representation is expanded back into large per-head K/V tensors. But in an optimized MLA runtime, the stored cache is closer to:

kvc (kv_lora_rank = 512 values) + kpe (qk_rope_head_dim = 64 values) per token per layer

So the stored per-token state per layer is roughly:

(512 + 64) × 2 bytes = 1,152 bytes

rather than:

the fully expanded per-head K and V tensors across all 20 attention heads (2 × 20 × Dhead × 2 bytes).
That is the core reason GLM can be memory-efficient despite not using the same kind of hybrid architecture as Nemotron or Qwen3.5.
Let:
B = concurrent queries
T = context length
L = 47
Lkv = kv_lora_rank = 512
R = qk_rope_head_dim = 64
BF16 bytes = 2
Then the compressed MLA cache is approximately:

MLA cache bytes ≈ B × T × L × (Lkv + R) × 2

For this model:

47 × (512 + 64) × 2 = 54,144
So the compressed cache grows by 54,144 bytes per token per concurrent sequence.
At 32k context, batch/concurrency = 1, that is about 1.77 GB (or 1.65 GiB).
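The same check for GLM, assuming the runtime stores only the compressed latent vector and the decoupled RoPE key per token per layer:

```python
# GLM-4.7-Flash under an optimized MLA runtime:
# 47 layers, kv_lora_rank 512, qk_rope_head_dim 64, BF16
per_token = 47 * (512 + 64) * 2      # layers x (latent rank + RoPE key dim) x bytes
print(per_token)                     # 54,144 bytes per token per sequence
print(per_token * 32_768 / 1e9)      # ~1.77 GB at 32k context, 1 sequence
print(per_token * 32_768 / 2**30)    # ~1.65 GiB
```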
With an optimized MLA runtime, an LLM can be much more memory-efficient than with standard GQA caching.
NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Why inference is efficient
Nemotron-3-Nano uses a hybrid Mamba-2 / Transformer / MoE design:
52 total layers
23 Mamba-2 layers
23 MoE layers
only 6 attention layers
This is extremely inference-friendly because the model keeps some attention for quality, but most of its depth is handled by mechanisms that do not require dense token-wise transformer KV storage.
What its KV cache really is
The important detail is that only the 6 attention layers contribute standard transformer KV cache.
The Mamba layers maintain their own recurrent state, and that state is real memory, but it is not the same thing as the usual token-indexed transformer KV cache. Its size is constant with respect to context length and usually smaller than 200 MB.
Let:
B = concurrent queries
T = context length
Lattn = 6
Hkv = 2
D = 128
BF16 bytes = 2
Then:

KV cache bytes = 2 × B × T × Lattn × Hkv × D × 2

For this model:

2 × 6 × 2 × 128 × 2 = 6,144
So the KV cache grows by only 6,144 bytes per token per concurrent sequence.
At 32k context, batch/concurrency = 1, that is about 0.20 GB (or 0.1875 GiB).
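A minimal sketch of the same accounting, counting only the 6 attention layers and ignoring the constant Mamba-2 state:

```python
# Nemotron-3-Nano-30B-A3B: 6 attention layers, 2 KV heads, head_dim 128, BF16
per_token = 2 * 6 * 2 * 128 * 2      # K and V x attention layers x KV heads x head_dim x bytes
print(per_token)                     # 6,144 bytes per token per sequence
print(per_token * 32_768 / 1e9)      # ~0.20 GB at 32k context, 1 sequence
print(per_token * 32_768 / 2**30)    # ~0.1875 GiB
```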
Among these four, Nemotron has the smallest standard attention KV cache because it has the fewest true attention layers and only 2 KV heads per attention layer.
Qwen3.5-35B-A3B
Why inference is efficient
Qwen3.5-35B-A3B uses a hybrid sequence model plus MoE sparsity. The 40-layer layout:
10 × (3 × Gated DeltaNet → MoE, then 1 × Gated Attention → MoE)
only 10 full-attention layers
2 KV heads at the attention layers
head_dim = 256
What its KV cache really is
For KV accounting, the key point is that only the full-attention layers use standard transformer K/V storage.
Like the Mamba-2 layers in Nemotron, the Gated DeltaNet (linear-attention) layers maintain a recurrent, state-space-style state rather than the usual token-indexed transformer KV cache, and that state’s size is negligible.
Let:
B = concurrent queries
T = context length
Lattn = 10
Hkv = 2
D = 256
BF16 bytes = 2
Then:

KV cache bytes = 2 × B × T × Lattn × Hkv × D × 2

For this model:

2 × 10 × 2 × 256 × 2 = 20,480
So the KV cache grows by 20,480 bytes per token per concurrent sequence.
At 32k context, batch/concurrency = 1, that is about 0.67 GB (or 0.625 GiB) of attention KV cache.
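And the same accounting for Qwen3.5, counting only the 10 Gated Attention layers and ignoring the DeltaNet state:

```python
# Qwen3.5-35B-A3B: 10 full-attention layers, 2 KV heads, head_dim 256, BF16
per_token = 2 * 10 * 2 * 256 * 2     # K and V x attention layers x KV heads x head_dim x bytes
print(per_token)                     # 20,480 bytes per token per sequence
print(per_token * 32_768 / 1e9)      # ~0.67 GB at 32k context, 1 sequence
print(per_token * 32_768 / 2**30)    # ~0.625 GiB
```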
Its standard attention cache is relatively small because full attention appears only sparsely across layers, but it has larger KV heads and more attention layers than Nemotron 3 Nano.
Comparison at Scale
As we can see in the figure below, the KV cache grows very differently depending on the model’s attention architecture:


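The comparison can also be reproduced numerically. Below is a minimal sketch that prints the attention KV footprint for a single sequence at a few context lengths, using the per-token growth rates derived above (the constant Mamba-2 / DeltaNet states are not included):

```python
# Bytes of attention KV cache per token per concurrent sequence, from the sections above.
per_token_bytes = {
    "Qwen3-30B-A3B-Instruct-2507": 98_304,  # GQA, 48 layers, 4 KV heads, D = 128
    "GLM-4.7-Flash": 54_144,                # MLA compressed cache, 47 layers
    "Qwen3.5-35B-A3B": 20_480,              # 10 attention layers, 2 KV heads, D = 256
    "Nemotron-3-Nano-30B-A3B": 6_144,       # 6 attention layers, 2 KV heads, D = 128
}

for ctx in (32_768, 131_072, 262_144):
    print(f"\ncontext = {ctx:,} tokens, 1 concurrent sequence")
    for name, per_token in per_token_bytes.items():
        print(f"  {name:30s} {per_token * ctx / 2**30:6.2f} GiB")
```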