The Kaitchup – AI on a Budget

Qwen3.5 Medium Models: Dense vs. MoE

75% linear attention layers, tiny KV cache, strong results.

Benjamin Marie
Feb 25, 2026

Following last week’s release of Qwen3.5 397B, Alibaba’s Qwen team has introduced three new “medium” models in the Qwen3.5 multimodal family: Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-27B, alongside a “base” variant of Qwen3.5-35B-A3B that is presumably only lightly post-trained, making it easier to fine-tune.

Related: Qwen3.5: Scaling Hybrid Attention to 397B Parameters (Feb 19)

In this article, I’ll focus on the models’ architectures and their deployment-time memory footprint. Thanks to the use of Gated DeltaNet (a form of “linear attention”) in 75% of the layers, these models offer high throughput and a small KV cache, keeping memory usage low even at long context lengths.
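To see why the hybrid design matters for memory, here is a minimal back-of-the-envelope sketch. The layer count, KV-head count, head dimension, and context length below are illustrative placeholders, not the real Qwen3.5 configurations, and the linear-attention layers are approximated as contributing no growing cache (their recurrent state has a fixed size): only the standard-attention layers accumulate keys and values as the sequence grows.

```python
# Rough, illustrative KV-cache estimate for a hybrid-attention model.
# All shapes below are made-up placeholders, NOT the real Qwen3.5 configs.
# The point: if only 25% of layers use standard attention, the KV cache
# that grows with context length is roughly 4x smaller.

def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len,
                 bytes_per_value=2, batch_size=1):
    """KV cache size in GiB for `num_layers` standard-attention layers.
    The factor of 2 accounts for storing both keys and values."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * bytes_per_value * batch_size)
    return total_bytes / 1024**3

# Hypothetical model shape, for illustration only
total_layers = 48
kv_heads = 4          # grouped-query attention
dim_per_head = 128
context = 131_072     # 128K tokens, FP16/BF16 cache

all_full_attn = kv_cache_gib(total_layers, kv_heads, dim_per_head, context)
hybrid = kv_cache_gib(total_layers // 4, kv_heads, dim_per_head, context)

print(f"All layers full attention  : {all_full_attn:.2f} GiB")
print(f"Hybrid (25% full attention): {hybrid:.2f} GiB")
```

The absolute numbers depend entirely on the real configuration, but the ratio between the two cases is what the hybrid layout buys you at long context.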


They’re still too large for consumer GPUs in full precision, but they should become practical once quantized. Next week, I’ll publish a detailed guide covering quantization and benchmarking across multiple low-bit checkpoints, primarily 4-bit, and possibly 2-bit as well. My early experiments suggest Qwen3.5 is unusually robust to aggressive low-bit quantization.
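If you want to try a quantized run yourself before next week’s guide, here is a minimal sketch of on-the-fly 4-bit loading with Transformers and bitsandbytes. The repository ID is an assumption about the naming scheme (check the Qwen organization on Hugging Face for the exact identifiers), and this generic NF4 loading is not the same as the pre-quantized checkpoints I’ll benchmark.

```python
# Hypothetical example: load a Qwen3.5 medium model in 4-bit (NF4) with
# bitsandbytes through Transformers. The repo ID below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3.5-35B-A3B"  # placeholder repo ID, verify on the Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                    # spread layers across available devices
)
```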
