The Kaitchup – AI on a Budget

Qwen3.5 9B, 4B, 2B & 0.8B: GPU Requirements, VRAM Usage & KV Cache Breakdown (262K Context)

How much memory do Qwen3.5 small models really need?

Benjamin Marie
Mar 03, 2026

After the large 397B models and the medium models, the Qwen team released the small Qwen3.5 variants: 9B, 4B, 2B, and 0.8B.

Unlike last week’s medium checkpoints, these are finally in the “actually runnable” category on consumer hardware in full precision, i.e., without quantization.
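To make “actually runnable” concrete, here is a minimal back-of-the-envelope sketch (mine, not a benchmark from this article), assuming bfloat16 weights at 2 bytes per parameter and parameter counts read literally from the model names:

```python
# Rough weight-memory estimate for the Qwen3.5 small models,
# assuming bfloat16 weights (2 bytes per parameter).
# Parameter counts are taken from the model names, so the
# real figures will differ slightly.
BYTES_PER_PARAM = 2  # bfloat16

for name, params in [("9B", 9e9), ("4B", 4e9), ("2B", 2e9), ("0.8B", 0.8e9)]:
    gib = params * BYTES_PER_PARAM / 1024**3
    print(f"Qwen3.5 {name}: ~{gib:.1f} GiB of weights")
```

Under these assumptions, even the 9B model needs only about 17 GiB for its weights, which fits in a 24 GB consumer GPU with room left for the KV cache and activations.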

Related post: Qwen3.5 Medium Models: Dense vs. MoE (Feb 25)


In this article, I’ll focus on their deployment-time memory footprint. Again, thanks to the use of Gated DeltaNet (a form of “linear attention”) in 75% of the layers, these models have a small KV cache: DeltaNet layers keep a fixed-size recurrent state instead of a per-token cache, so memory usage stays low even at long context lengths.
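To see why this matters, here is a rough sketch of how a hybrid model’s KV cache scales. The formula is the standard one for grouped-query attention; the config values (layer count, KV heads, head dimension) are illustrative placeholders, not the published Qwen3.5 numbers:

```python
# KV-cache estimate for a hybrid model where only a fraction of the
# layers use standard attention. Gated DeltaNet layers hold a
# constant-size recurrent state, so they are excluded here.
# All config values below are illustrative placeholders.
def kv_cache_gib(num_layers, full_attn_ratio, num_kv_heads, head_dim,
                 context_len, bytes_per_elem=2):
    full_attn_layers = round(num_layers * full_attn_ratio)
    # 2x for keys and values; one vector per token, per KV head, per layer.
    total = (2 * full_attn_layers * num_kv_heads * head_dim
             * context_len * bytes_per_elem)
    return total / 1024**3

# Hypothetical 9B-style config at the full 262K context (262,144 tokens):
print(f"{kv_cache_gib(48, 0.25, 8, 128, 262_144):.1f} GiB")  # 12.0 GiB
# Same shape, but with full attention in every layer:
print(f"{kv_cache_gib(48, 1.00, 8, 128, 262_144):.1f} GiB")  # 48.0 GiB
```

With only 25% of the layers carrying a growing cache, the 262K-context cache is a quarter of what a fully quadratic-attention model of the same shape would need.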
