Qwen3.5: Scaling Hybrid Attention to 397B Parameters
With Qwen3.5's memory requirements and GGUF recommendations
Last year, Qwen released Qwen3-Next, an 80B-parameter model that signaled a shift away from full attention toward a more inference-efficient approach based largely on linear attention. Qwen3-Next used Gated DeltaNet (GDN) to speed up inference and reduce memory usage.
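To see why this matters for memory, here is a rough back-of-envelope sketch comparing the per-sequence decoding state of full attention (a KV cache that grows with context length) against a GDN-style linear-attention layer (a fixed-size recurrent state). All layer, head, and dimension values below are illustrative assumptions, not the actual Qwen3-Next or Qwen3.5 configuration.

```python
# Illustrative comparison of per-sequence decoding memory:
# full attention stores a KV cache that grows linearly with context length,
# while a linear-attention layer (e.g., Gated DeltaNet) keeps a constant-size
# recurrent state. All sizes are assumed for illustration only.

BYTES_PER_VALUE = 2   # fp16/bf16
NUM_LAYERS = 48       # assumed number of blocks
NUM_KV_HEADS = 8      # assumed (grouped-query attention)
HEAD_DIM = 128        # assumed

def kv_cache_bytes(context_len: int) -> int:
    """KV cache for full-attention layers: grows with context length."""
    # K and V tensors per layer, each [context_len, num_kv_heads, head_dim]
    return 2 * NUM_LAYERS * context_len * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def linear_state_bytes(state_dim: int = 128) -> int:
    """Recurrent state for linear-attention layers: constant in context length."""
    # one [num_kv_heads, head_dim, state_dim] state matrix per layer (simplified)
    return NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * state_dim * BYTES_PER_VALUE

for ctx in (8_192, 131_072):
    print(f"context={ctx:>7}: "
          f"full-attention KV cache ≈ {kv_cache_bytes(ctx) / 2**30:.2f} GiB, "
          f"linear-attention state ≈ {linear_state_bytes() / 2**30:.2f} GiB")
```

With these assumed numbers, the KV cache grows from roughly 1.5 GiB at an 8K context to about 24 GiB at 128K, while the linear-attention state stays around 0.01 GiB regardless of context length; the exact figures depend on the real model configuration, but the scaling behavior is the point.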
In practice, Qwen3-Next wasn’t particularly competitive: it underperformed Qwen3 32B (a much smaller model) and was trained on “only” 15T tokens versus 36T tokens for Qwen3. That left an open question: can LLMs using GDN truly scale, i.e., do they reliably improve with more parameters and more training data?
With the release of Qwen3.5, we now have a clear empirical answer. Linear attention with GDN scales well and can produce state-of-the-art LLMs, while remaining significantly more cost-effective at inference time.
In this article, I’ll cover what’s new in Qwen3.5 and how the Qwen team scaled from Qwen3-Next to Qwen3.5-397B-A17B. I’ll also break down memory consumption to show why linear attention makes inference more efficient. Finally, I’ll share results from testing several of the quantized GGUF variants already available, along with practical recommendations on which ones to use.


