Qwen3.5 Medium Models: Dense vs. MoE
75% linear attention layers, tiny KV cache, strong results.
Following last week’s release of Qwen3.5 397B, Alibaba’s Qwen team has introduced three new “medium” models in the Qwen3.5 multimodal family: Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-27B. They also released a “base” variant of Qwen3.5-35B-A3B, which is presumably only lightly post-trained and therefore easier to fine-tune.
In this article, I’ll focus on the models’ architectures and their deployment-time memory footprint. Thanks to the use of Gated DeltaNet (a form of “linear attention”) in 75% of the layers, these models offer high throughput and a small KV cache, keeping memory usage low even at long context lengths.
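To get an intuition for why the hybrid layout matters, here is a minimal back-of-the-envelope sketch comparing KV-cache size for a stack made entirely of standard attention layers against one where only 25% of the layers use standard attention (the rest being linear attention, which keeps a fixed-size recurrent state instead of a growing cache). The layer count, head count, and head dimension below are illustrative placeholders, not the published Qwen3.5 configuration:

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len,
                 bytes_per_elem=2, full_attn_fraction=1.0):
    """Rough KV-cache size in GiB. Linear-attention layers hold a
    fixed-size state rather than a per-token cache, so only the
    full-attention layers are counted (2x for keys and values)."""
    full_attn_layers = num_layers * full_attn_fraction
    elems = 2 * full_attn_layers * num_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1024**3

# Assumed hyperparameters for illustration only, not Qwen3.5's actual config.
cfg = dict(num_layers=48, num_kv_heads=8, head_dim=128, context_len=131_072)

print(f"all layers full attention : {kv_cache_gib(**cfg):.1f} GiB")
print(f"25% layers full attention : {kv_cache_gib(**cfg, full_attn_fraction=0.25):.1f} GiB")
```

With these placeholder numbers, the cache at a 128K context shrinks from roughly 24 GiB to 6 GiB in FP16, which is the kind of saving that makes long contexts viable on a single GPU.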
They’re still too large for consumer GPUs in full precision, but they should become practical once quantized. Next week, I’ll publish a detailed guide covering quantization and benchmarking across multiple low-bit checkpoints, primarily 4-bit, and possibly 2-bit as well. My early experiments suggest Qwen3.5 is unusually robust to aggressive low-bit quantization.
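Ahead of that guide, here is a minimal sketch of loading one of these models with on-the-fly 4-bit quantization via Transformers and bitsandbytes. The model ID is an assumption based on Qwen’s usual Hugging Face naming; check the Hub for the actual checkpoint names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3.5-35B-A3B"  # assumed Hub ID; verify before running

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 is usually more accurate than FP4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # quantizes the scales for extra savings
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Note that this quantizes the weights only; the KV cache and activations stay in 16-bit, which is exactly where the linear-attention layers help at long context lengths.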



