How Well Does Qwen3 Handle 4-bit and 2-bit Quantization?
Let's review the Qwen3 models and find out which one you should use
Qwen3 models have finally arrived, and they don’t disappoint!
Despite their compact sizes, they perform remarkably well across benchmarks. The 14B and 32B models are very promising and well suited to consumer-grade hardware. But perhaps the most intriguing is Qwen3-30B-A3B: a 30-billion-parameter model with only 3 billion parameters active at inference. This MoE design makes it very lightweight: a quantized version fits comfortably on a 24 GB GPU and runs efficiently when paired with GPU-friendly formats like GPTQ+Marlin.
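If you want to try this right away, here is a minimal sketch of loading a GPTQ-quantized Qwen3-30B-A3B with vLLM's Marlin kernels. The repository id is a placeholder (substitute any GPTQ checkpoint you trust), and the context length and memory settings are illustrative assumptions for a 24 GB card:

```python
from vllm import LLM, SamplingParams

# Minimal sketch: serve a GPTQ-quantized Qwen3-30B-A3B with vLLM.
# The repo id below is a placeholder; point it at a real GPTQ checkpoint.
llm = LLM(
    model="your-username/Qwen3-30B-A3B-GPTQ-4bit",  # hypothetical repo id
    quantization="gptq_marlin",   # use vLLM's Marlin kernels for GPTQ weights
    max_model_len=8192,           # illustrative; lower it if the KV cache doesn't fit
    gpu_memory_utilization=0.90,  # leave a little headroom on a 24 GB GPU
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Explain mixture-of-experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```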
In this article, we'll explore how well the Qwen3 models handle quantization, and the short answer is: surprisingly well. These models are particularly quantization-friendly, with even the 2-bit versions showing strong performance. I'll walk through the quantization process, share evaluation results, and demonstrate how to run the models efficiently with vLLM, both with and without reasoning mode enabled (a preview of the reasoning toggle follows below).
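As a preview of the inference section, the sketch below shows one way to toggle Qwen3's reasoning mode for offline inference: Qwen3's chat template exposes an `enable_thinking` switch, so we can format the prompt with the tokenizer and pass the resulting string to vLLM. The model id and sampling values here are illustrative:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen3-30B-A3B"  # swap in a quantized checkpoint if you prefer
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

messages = [{"role": "user", "content": "How many prime numbers are below 30?"}]

# Qwen3's chat template accepts an enable_thinking flag:
# True inserts the reasoning block, False suppresses it.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to let the model reason before answering
)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=1024))
print(outputs[0].outputs[0].text)
```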
Since this is my first deep dive into the Qwen3 series, I’ll also briefly introduce the models.
You’ll find a companion notebook below that walks through quantization, evaluation, and inference, with and without reasoning: