Running Qwen3-Next: Hybrid Attention, MoE, and 4-Bit Quantization
Hybrid attention, sparse MoE, and 4-bit quantization tested with vLLM and evaluated on multilingual tasks.
Qwen3-Next-80B-A3B is an 81.3B-parameter model available in Thinking and Instruct variants.
It uses a sparse Mixture-of-Experts (MoE) design in which only 11 of 512 experts are active per token at inference. Unusually for a model of this scale, it also adopts a hybrid attention stack: mostly linear-time DeltaNet layers interleaved with periodic full (global) attention layers at roughly a 3:1 ratio, which keeps compute and memory growth well below that of pure self-attention on long contexts while preserving recall.
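To make the layer pattern and routing concrete, here is a minimal, illustrative Python sketch. It is not the model's actual implementation: the function names, the 4-layer period, and the toy router logits are assumptions used purely to show the 3:1 interleaving and top-k expert selection described above.

```python
import random

def layer_schedule(num_layers: int, period: int = 4) -> list[str]:
    """Every `period`-th layer uses full (global) attention;
    the rest use linear-time (DeltaNet-style) attention."""
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_layers)
    ]

def route_top_k(router_logits: list[float], k: int = 10) -> list[int]:
    """Pick the k highest-scoring routed experts for one token.
    (A shared expert that is always active would bring the total to 11.)"""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Three linear-attention layers between each full-attention layer.
    print(layer_schedule(num_layers=12))
    # Toy logits over 512 routed experts; only 10 indices are selected.
    logits = [random.gauss(0.0, 1.0) for _ in range(512)]
    print(route_top_k(logits))
```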
Qwen’s published benchmark scores look impressive, all the more so considering the model’s efficiency.
In this article, we’ll examine Qwen3-Next’s architecture and specifications, then walk through practical usage with vLLM, including quantized variants that run on a single 48 GB GPU with no CPU offloading and minimal accuracy loss. I’ll also discuss my own evaluation of the Instruct version across several languages.
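As a preview of the vLLM setup covered later, here is a hedged sketch using vLLM's offline Python API. The model path is a placeholder for whichever 4-bit quantized checkpoint you use, and the context length and memory fraction are assumptions to adapt to your GPU; recent vLLM versions typically pick up the quantization scheme from the checkpoint's config.

```python
from vllm import LLM, SamplingParams

# Placeholder path: substitute the quantized Qwen3-Next-80B-A3B-Instruct
# checkpoint you actually downloaded.
llm = LLM(
    model="path/or/repo-of-quantized-Qwen3-Next-80B-A3B-Instruct",
    max_model_len=32768,          # keep the KV cache within a 48 GB budget
    gpu_memory_utilization=0.92,  # leave headroom for activations
)

params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate(["Explain hybrid attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```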


