The Kaitchup – AI on a Budget

The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared

A memory-first look at four efficient open LLM architectures.

Benjamin Marie
Mar 18, 2026

Inference efficiency has become one of the main ways LLMs distinguish themselves. Two models can look similar on paper: roughly the same parameter count, similar context length, both marketed as fast, yet behave very differently once you actually try to serve them at scale. In this article, we focus on four recent open models built for efficient deployment using different architectures:

  • Qwen3.5-35B-A3B

  • GLM-4.7-Flash

  • Nemotron-3-Nano-30B-A3B

  • Qwen3-30B-A3B-Instruct-2507

All four are Mixture-of-Experts (MoE) models, but each uses a different attention architecture.

In an MoE, the model contains multiple specialized sub-networks called experts, but only a small subset of them is activated for each token. That gives the model access to a large total parameter count while keeping the amount of computation performed per token much lower than in a dense model of comparable size. In other words, MoE is one of the main tricks that lets these models stay strong in quality while remaining practical for inference.

But efficient inference is not only about how many parameters are active. It is also about memory movement, batchability, and especially the cost of the KV cache. During text generation, the model stores the attention keys and values produced for previous tokens so it can reuse them instead of recomputing them at every decoding step. This is what makes autoregressive generation fast enough to be usable, but it also creates a major memory bottleneck: as context length and concurrency grow, the KV cache can dominate serving cost. That is why recent architectures put so much emphasis not just on reducing compute, but on reducing, compressing, or avoiding KV storage wherever possible.


In this article, we will examine how these four MoE models approach inference efficiency in practice, beginning with the architectural ideas that make them attractive to serve and then zooming in on the specific KV-cache strategy used by each one. We will compare standard grouped-query attention, hybrid designs that use full attention only in some layers, and compressed MLA-style caching, and we will translate each approach into a simple formula you can use to estimate memory usage for different context lengths and levels of concurrency.

I made a notebook to estimate the memory consumption at inference time for each one of these 4 models:

Get the notebook (#198)

How to Compute the Standard KV Cache Memory Consumption

A useful starting point is the standard BF16 transformer with vanilla multi-head attention, where every query head has its own KV head. In that case, the KV cache size is:

\(\text{KV bytes} = \text{batch_size} \times \text{context_tokens} \times \text{layers} \times 2 \times \text{num_heads} \times \text{head_dim} \times 2\)

The first factor of 2 accounts for K and V, and the last factor of 2 is the number of bytes per BF16 element.
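As a quick sanity check, the formula can be wrapped in a small Python helper. The example model config below is hypothetical, chosen only to illustrate the scale of a vanilla-attention cache:

```python
def kv_bytes(batch_size: int, context_tokens: int, layers: int,
             num_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size in bytes for vanilla multi-head attention.

    The first 2 accounts for K and V; bytes_per_elem is 2 for BF16.
    """
    return (batch_size * context_tokens * layers
            * 2 * num_heads * head_dim * bytes_per_elem)

# Hypothetical dense model: 32 layers, 32 heads of dim 128,
# one sequence at 32k context
print(kv_bytes(1, 32_768, 32, 32, 128) / 2**30)  # 16.0 GiB
```

Even a mid-sized dense model with full multi-head attention needs tens of GiB of cache at long context, which is why every model below deviates from this baseline.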

From there, each of the four models mentioned in the introduction improves efficiency by deviating from that vanilla-attention baseline in a different way.

The KV Cache Math: GQA, MLA, and Hybrid LLMs

Qwen3-30B-A3B-Instruct-2507

Note: the same math applies to the “thinking” version.

Why inference is efficient

Qwen3-30B-A3B-Instruct-2507 is efficient in the classic MoE + GQA way:

  • 48 layers

  • 32 query heads / 4 KV heads

  • native 262,144-token context

Compared with a dense transformer of similar nominal size, MoE reduces active compute per token, and GQA reduces KV cache size.

What its KV cache really is

This is the cleanest case of the four. Every layer contributes to standard attention KV storage, but because the model has only 4 KV heads instead of 32, the cache is much smaller than in vanilla multi-head attention.

Let:

  • B = concurrent queries

  • T = context length

  • L = 48

  • Hkv = 4

  • D = 128

  • BF16 bytes = 2

Then:

\(\text{KV bytes} = B \times T \times L \times 2 \times H_{\text{kv}} \times D \times 2\)

For this model:

\(\text{KV bytes} = B \times T \times 48 \times 2 \times 4 \times 128 \times 2 = B \times T \times 98{,}304\)

So the KV cache grows by 98,304 bytes per token per concurrent sequence.

At 32k context, batch/concurrency = 1, that is about 3.22 GB (or 3.0 GiB).
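These numbers can be reproduced directly from the config values listed above:

```python
B, T = 1, 32_768              # one concurrent sequence, 32k context
L, H_kv, D = 48, 4, 128       # Qwen3-30B-A3B attention config

per_token = L * 2 * H_kv * D * 2   # 2 for K and V, 2 bytes per BF16 element
total = B * T * per_token

print(per_token)                                        # 98304 bytes/token
print(round(total / 1e9, 2), round(total / 2**30, 2))   # 3.22 GB, 3.0 GiB
```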

Among these four, Qwen3-30B-A3B-Instruct-2507 is the most conventional. GQA shrinks the KV cache 8× relative to full multi-head attention (4 KV heads instead of 32), and the model is easy to understand and easy to serve, but it does not get the deeper architectural KV savings of the more specialized hybrid and MLA-based designs.

GLM-4.7-Flash

Why inference is efficient

GLM-4.7-Flash is also a 30B-A3B MoE model, but its key memory-saving feature is MLA rather than standard GQA:

  • 47 layers

  • 20 attention heads

  • kv_lora_rank = 512

  • qk_rope_head_dim = 64

Instead of storing full K and V tensors for each attention head, MLA stores:

  1. a compressed latent KV vector (kvc), and

  2. a small decoupled positional key (kpe).

Under an optimized MLA runtime, the model reconstructs what it needs during attention rather than caching expanded per-head tensors.

What its KV cache really is

This is the subtle part. For GLM, “KV cache size” is not just a model property; it is also an implementation property.

In a naïve implementation, the model can look much more expensive because the latent representation is expanded back into large per-head K/V tensors. But in an optimized MLA runtime, the stored cache is closer to:

\(L_{\text{kv}} = \text{kv_lora_rank}\)
\(R = \text{qk_rope_head_dim}\)

So the stored per-token state per layer is roughly:

\(L_{\text{kv}} + R\)

rather than:

\(2 \times \text{num_kv_heads} \times \text{head_dim}\)

That is the core reason GLM can be memory-efficient despite not using the same kind of hybrid architecture as Nemotron or Qwen3.5.

Let:

  • B = concurrent queries

  • T = context length

  • L = 47

  • Lkv = 512

  • R = 64

  • BF16 bytes = 2

Then the compressed MLA cache is approximately:

\(\text{KV bytes} = B \times T \times L \times (L_{\text{kv}} + R) \times 2\)

For this model:

\(\text{KV bytes} = B \times T \times 47 \times (512 + 64) \times 2 = B \times T \times 54{,}144\)

So the compressed cache grows by 54,144 bytes per token per concurrent sequence.

At 32k context, batch/concurrency = 1, that is about 1.77 GB (or 1.65 GiB).
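The same arithmetic in Python, assuming an optimized MLA runtime that stores only the compressed latent plus the decoupled positional key:

```python
B, T = 1, 32_768                          # one concurrent sequence, 32k context
L = 47                                    # GLM-4.7-Flash layers
kv_lora_rank, qk_rope_head_dim = 512, 64  # compressed latent + positional key

per_token = L * (kv_lora_rank + qk_rope_head_dim) * 2   # 2 bytes per BF16 element
total = B * T * per_token

print(per_token)                                        # 54144 bytes/token
print(round(total / 1e9, 2), round(total / 2**30, 2))   # 1.77 GB, 1.65 GiB
```

Note that a naïve runtime that expands the latent back into per-head K/V tensors would not see these savings; the formula only holds when the cache stays compressed.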

With MLA, an LLM can be much more memory-efficient than with GQA, provided the runtime keeps the latent representation compressed.

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Why inference is efficient

Nemotron-3-Nano uses a hybrid Mamba-2 / Transformer / MoE design:

  • 52 total layers

  • 23 Mamba-2 layers

  • 23 MoE layers

  • only 6 attention layers

This is extremely inference-friendly because the model keeps some attention for quality, but most of its depth is handled by mechanisms that do not require dense token-wise transformer KV storage.

What its KV cache really is

The important detail is that only the 6 attention layers contribute standard transformer KV cache.

The Mamba layers maintain their own recurrent state, and that state is real memory, but it is not the same thing as the usual token-indexed transformer KV cache. Its size is constant and usually smaller than 200 MB.

Let:

  • B = concurrent queries

  • T = context length

  • Lattn = 6

  • Hkv = 2

  • D = 128

  • BF16 bytes = 2

Then:

\(\text{KV bytes} = B \times T \times L_{\text{attn}} \times 2 \times H_{\text{kv}} \times D \times 2\)

For this model:

\(\text{KV bytes} = B \times T \times 6 \times 2 \times 2 \times 128 \times 2 = B \times T \times 6{,}144\)

So the KV cache grows by only 6,144 bytes per token per concurrent sequence.

At 32k context, batch/concurrency = 1, that is about 0.20 GB (or 0.1875 GiB).
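The attention-only part of the cache, computed from the values above (the roughly constant Mamba state is ignored here):

```python
B, T = 1, 32_768               # one concurrent sequence, 32k context
L_attn, H_kv, D = 6, 2, 128    # only the 6 attention layers cache K/V

per_token = L_attn * 2 * H_kv * D * 2   # 2 for K and V, 2 bytes per BF16 element
total = B * T * per_token

print(per_token)                 # 6144 bytes/token
print(round(total / 2**30, 4))   # 0.1875 GiB
```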

Among these four, Nemotron has the smallest standard attention KV cache because it has the fewest true attention layers and very few KV heads.

Qwen3.5-35B-A3B

Why inference is efficient

Qwen3.5-35B-A3B uses a hybrid sequence model plus MoE sparsity. The 40-layer layout:

  • 10 × (3 × Gated DeltaNet → MoE, then 1 × Gated Attention → MoE)

  • only 10 full-attention layers

  • 2 KV heads at the attention layers

  • head_dim = 256

What its KV cache really is

For KV accounting, the key point is that only the full-attention layers use standard transformer K/V storage.

Like Mamba-2 layers, the linear-attention (Gated DeltaNet) layers maintain a recurrent, state-space-style state rather than the usual token-indexed transformer KV cache, and its size is negligible.

Let:

  • B = concurrent queries

  • T = context length

  • Lattn = 10

  • Hkv = 2

  • D = 256

  • BF16 bytes = 2

Then:

\(\text{KV bytes} = B \times T \times L_{\text{attn}} \times 2 \times H_{\text{kv}} \times D \times 2\)

For this model:

\(\text{KV bytes} = B \times T \times 10 \times 2 \times 2 \times 256 \times 2 = B \times T \times 20{,}480\)

So the KV cache grows by 20,480 bytes per token per concurrent sequence.

At 32k context, batch/concurrency = 1, that is about 0.67 GB (or 0.625 GiB) of attention KV cache.
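The same calculation for Qwen3.5, using the config values above:

```python
B, T = 1, 32_768                # one concurrent sequence, 32k context
L_attn, H_kv, D = 10, 2, 256    # 10 full-attention layers, wide heads

per_token = L_attn * 2 * H_kv * D * 2   # 2 for K and V, 2 bytes per BF16 element
total = B * T * per_token

print(per_token)                 # 20480 bytes/token
print(round(total / 2**30, 3))   # 0.625 GiB
```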

Its standard attention cache is relatively small because full attention appears only sparsely across layers, but it has larger KV heads and more attention layers than Nemotron 3 Nano.

Comparison at Scale

As we can see in the figure below, the KV cache grows very differently depending on the model’s attention architecture.
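The per-token constants derived above also make the scaling easy to reproduce for any context length; a minimal sketch for a single sequence:

```python
# Per-token KV bytes per concurrent sequence, from the formulas above
per_token_bytes = {
    "Qwen3-30B-A3B-Instruct-2507": 98_304,  # GQA, 48 layers
    "GLM-4.7-Flash":               54_144,  # compressed MLA cache
    "Qwen3.5-35B-A3B":             20_480,  # 10 attention layers
    "Nemotron-3-Nano-30B-A3B":      6_144,  # 6 attention layers
}

for name, per_tok in per_token_bytes.items():
    row = ", ".join(f"{T // 1024}k: {per_tok * T / 2**30:.2f} GiB"
                    for T in (32_768, 131_072, 262_144))
    print(f"{name}: {row}")
```

Multiplying any of these figures by the number of concurrent sequences gives the total attention KV budget for a serving deployment.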
