The Kaitchup – AI on a Budget

Running Qwen3-Next: Hybrid Attention, MoE, and 4-Bit Quantization

Hybrid attention, sparse MoE, and quantization tested with vLLM and evaluated in multilingual tasks.

Benjamin Marie
Sep 18, 2025

Qwen3-Next-80B-A3B is an 81.3B-parameter model available in Thinking and Instruct variants.

It uses a sparse Mixture-of-Experts (MoE) setup, with only 11 of its 512 experts active per token at inference. Unusually for a model at this scale, it also adopts a hybrid attention stack: mostly linear-time Gated DeltaNet layers interleaved with full-attention layers at roughly a 3:1 ratio. This keeps compute and memory growth on long contexts far below that of pure self-attention while preserving recall.
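To make that arithmetic concrete, here is a toy sketch, not Qwen’s code: the layer count and the exact interleave pattern are assumptions, used only to show how a 3:1 layout and sparse expert routing shrink the per-token work.

```python
# Toy illustration only: the depth and interleave pattern below are assumptions,
# not the released Qwen3-Next configuration.
NUM_LAYERS = 48
PATTERN = ["deltanet", "deltanet", "deltanet", "full_attention"]  # ≈3:1 ratio

layout = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]
print(f"{layout.count('deltanet')} linear-time layers, "
      f"{layout.count('full_attention')} global-attention layers")

TOTAL_EXPERTS = 512
ACTIVE_EXPERTS = 11  # experts active per token, as described above
print(f"Fraction of experts used per token: {ACTIVE_EXPERTS / TOTAL_EXPERTS:.1%}")
```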

Qwen’s published benchmark scores look impressive, all the more so considering the model’s efficiency.

Figure: Qwen3-Next-80B-A3B-Instruct benchmark comparison. Qwen3 Instruct models evaluated on reasoning benchmarks.

In this article, we’ll examine Qwen3-Next’s architecture and specifications, then walk through practical usage with vLLM, including 4-bit quantized variants that run on a single 48 GB GPU with no CPU offloading and only a limited loss in accuracy. I’ll also present my own evaluation of the Instruct version on multilingual tasks.
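As a preview, here is a minimal sketch of loading the model with vLLM’s offline Python API. It assumes a recent vLLM build with Qwen3-Next support; the memory settings are illustrative, and you would swap in whichever 4-bit (e.g., AWQ) checkpoint you use, since the full-precision 80B model will not fit on a single 48 GB GPU.

```python
# Minimal sketch, assuming a recent vLLM release with Qwen3-Next support.
# Replace the model ID with your 4-bit quantized variant to fit a 48 GB GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # placeholder: use a 4-bit (e.g., AWQ) checkpoint on 48 GB
    max_model_len=8192,             # modest context length to limit KV-cache memory
    gpu_memory_utilization=0.92,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [{"role": "user", "content": "Summarize the Qwen3-Next architecture in two sentences."}]
out = llm.chat(messages, params)
print(out[0].outputs[0].text)
```

The same model can also be exposed as an OpenAI-compatible endpoint with `vllm serve`, which is what the rest of the walkthrough uses for benchmarking.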
