Running Qwen3-Next: Hybrid Attention, MoE, and 4-Bit Quantization
Hybrid attention, sparse MoE, and 4-bit quantization tested with vLLM and evaluated on multilingual tasks.
Qwen3-Next-80B-A3B is an 81.3B-parameter model available in Thinking and Instruct variants.
It uses a sparse Mixture-of-Experts (MoE) design in which only 11 of 512 experts are active per token at inference. Unusually for a model of this scale, it also adopts a hybrid attention stack: mostly linear-time DeltaNet layers interleaved with periodic full (global) attention layers at roughly a 3:1 ratio, which keeps compute and memory growth well below that of pure self-attention on long contexts while preserving recall.
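To make the layer pattern and routing concrete, here is a minimal, illustrative Python sketch. It is not the model's actual implementation: the function names, the 4-layer period, and the toy router logits are assumptions used purely to show the 3:1 interleaving and top-k expert selection described above.

```python
import random

def layer_schedule(num_layers: int, period: int = 4) -> list[str]:
    """Every `period`-th layer uses full (global) attention;
    the rest use linear-time (DeltaNet-style) attention."""
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_layers)
    ]

def route_top_k(router_logits: list[float], k: int = 10) -> list[int]:
    """Pick the k highest-scoring routed experts for one token.
    (A shared expert that is always active would bring the total to 11.)"""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Three linear-attention layers between each full-attention layer.
    print(layer_schedule(num_layers=12))
    # Toy logits over 512 routed experts; only 10 indices are selected.
    logits = [random.gauss(0.0, 1.0) for _ in range(512)]
    print(route_top_k(logits))
```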
Qwen’s published benchmark scores look impressive, all the more so considering the model’s efficiency.
In this article, we’ll examine Qwen3-Next’s architecture and specifications, then walk through practical usage with vLLM, including quantized variants that run on a single 48 GB GPU with no CPU offloading and minimal accuracy loss. I’ll also discuss my own evaluation of the Instruct version across several languages.
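As a preview of the vLLM setup covered later, here is a hedged sketch using vLLM's offline Python API. The model path is a placeholder for whichever 4-bit quantized checkpoint you use, and the context length and memory fraction are assumptions to adapt to your GPU; recent vLLM versions typically pick up the quantization scheme from the checkpoint's config.

```python
from vllm import LLM, SamplingParams

# Placeholder path: substitute the quantized Qwen3-Next-80B-A3B-Instruct
# checkpoint you actually downloaded.
llm = LLM(
    model="path/or/repo-of-quantized-Qwen3-Next-80B-A3B-Instruct",
    max_model_len=32768,          # keep the KV cache within a 48 GB budget
    gpu_memory_utilization=0.92,  # leave headroom for activations
)

params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate(["Explain hybrid attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```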


