The Kaitchup – AI on a Budget

Qwen3.5: Scaling Hybrid Attention to 397B Parameters

With memory requirements and GGUF recommendations for Qwen3.5

Benjamin Marie
Feb 19, 2026
Qwen3.5: Gated DeltaNet Can Scale

Last year, Qwen released Qwen3-Next, an 80B-parameter model that signaled a shift away from full attention toward a more inference-efficient approach based largely on linear attention. Qwen3-Next used Gated DeltaNet (GDN) to speed up inference and reduce memory usage.

In practice, Qwen3-Next wasn’t particularly competitive: it underperformed Qwen3 32B (a much smaller model) and was trained on “only” 15T tokens versus 36T tokens for Qwen3. That left an open question: can LLMs using GDN truly scale, i.e., do they reliably improve with more parameters and more training data?

With the release of Qwen3.5, we now have a clear empirical answer. Linear attention with GDN scales well and can produce state-of-the-art LLMs, while remaining significantly more cost-effective at inference time.
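The inference-cost advantage comes down to decoding memory: full attention must cache a key and value vector for every past token, so its KV cache grows linearly with context length, while a linear-attention layer such as Gated DeltaNet carries a fixed-size recurrent state regardless of how long the context gets. The sketch below illustrates this scaling difference; all layer counts and dimensions are hypothetical round numbers, not Qwen3.5's actual configuration.

```python
# Illustrative per-sequence decoding memory: a growing KV cache
# (full attention) vs. a fixed-size recurrent state (linear attention).
# All model dimensions below are hypothetical, for illustration only.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Full attention: K and V vectors cached for every past token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def linear_state_bytes(num_layers, num_heads, head_dim, bytes_per_elem=2):
    """Linear attention: one (head_dim x head_dim) state matrix per head,
    independent of sequence length."""
    return num_layers * num_heads * head_dim * head_dim * bytes_per_elem

for seq_len in (8_192, 131_072):
    full = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, seq_len=seq_len)
    lin = linear_state_bytes(num_layers=48, num_heads=16, head_dim=128)
    print(f"{seq_len:>7} tokens: full attention {full / 2**30:.2f} GiB, "
          f"linear state {lin / 2**30:.4f} GiB")
```

At 131K tokens the hypothetical KV cache is 16x larger than at 8K, while the linear-attention state is identical in both cases, which is why hybrid models that replace most full-attention layers with GDN decode long contexts much more cheaply.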

In this article, I’ll cover what’s new in Qwen3.5 and how they scaled from Qwen3-Next to Qwen3.5-397B-A17B. I’ll also break down memory consumption to show why linear attention makes inference more efficient. Finally, I’ll share results from testing several quantized GGUF variants already available, along with practical recommendations on which ones to use.
