Accelerate Models with Quantization: Recipes for NVFP4, GPTQ, AWQ, SmoothQuant, AutoRound, and FP8
A focus on 4-bit and 8-bit quantization, with vLLM benchmarks of accuracy and inference throughput
Running LLMs is easy. Quantizing LLMs is also easy. But running quantized LLMs? That often doesn’t work as expected. This is one of the reasons GGUF is so popular: it’s a format that can be easily run by popular frameworks like Ollama and llama.cpp.
However, if you want state-of-the-art quantization accuracy and to take advantage of highly optimized CUDA kernels for INT4, FP8, and FP4 models, you often need to get your hands a bit dirty.
In this article, I explore six different quantization recipes that yield models optimized to run very fast with vLLM. We’ve already applied most of them in previous articles using different frameworks:
W4A16: INT4 quantized weights with 16-bit activations, using GPTQ, AWQ, and AutoRound (calibrated or tuned on sample data)
W8A8: INT8 quantized weights and INT8 quantized activations, calibrated with SmoothQuant
FP8-Dynamic: FP8 quantized weights and dynamically quantized activations
NVFP4: FP4 quantized weights and activations, calibrated
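To make the "W4" part of W4A16 concrete, here is a minimal sketch of symmetric, group-wise round-to-nearest INT4 weight quantization in plain Python. It is only an illustration of the storage format (4-bit integer codes plus one scale per group): GPTQ, AWQ, and AutoRound all choose the quantized values more carefully than simple rounding. The function names and the tiny group size are assumptions made for the example, not part of any library.

```python
from typing import List, Tuple


def quantize_w4_groupwise(
    weights: List[float], group_size: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric round-to-nearest INT4 quantization, one scale per group.

    Illustrative only: real recipes use much larger groups (e.g. 128)
    and smarter rounding than plain round-to-nearest.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to the INT4 code 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            code = round(w / scale)
            # Clamp to the signed 4-bit range [-8, 7].
            codes.append(max(-8, min(7, code)))
    return codes, scales


def dequantize_w4_groupwise(
    codes: List[int], scales: List[float], group_size: int = 4
) -> List[float]:
    """Recover approximate weights: code * scale of the group it belongs to."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [0.7, -0.3, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6]
codes, scales = quantize_w4_groupwise(weights, group_size=4)
restored = dequantize_w4_groupwise(codes, scales, group_size=4)
```

With this scheme, each restored weight differs from the original by at most half of its group's scale, which is why smaller groups (more scales) generally give better accuracy at the cost of extra storage.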
All these recipes can be run on a single consumer GPU, but you’ll need a recent one (for FP8 and NVFP4 in particular), such as an RTX 50xx. I used an RTX 5090 (from RunPod) and was able to quantize 8B models. None of these recipes took more than an hour.
I also provide a single customizable script capable of running each of these recipes. You can find it here:
In the following sections, we’ll test each recipe with Qwen3 4B Instruct and its Thinking variant to measure the impact on reasoning and long-sequence generation. I report both inference throughput and accuracy on popular benchmarks.
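For reference, throughput here means generated tokens per second. The sketch below shows one simple way such a number could be measured. It is not the benchmark harness used for the results in this article, and measure_throughput and fake_generate are hypothetical names introduced for the example. The generation function is passed in as a callable so that the timing logic can be run without a GPU.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    generate_fn: Callable[[Sequence[str]], List[List[int]]],
    prompts: Sequence[str],
) -> float:
    """Return generated tokens per second for one batch of prompts.

    generate_fn must return one list of generated token ids per prompt.
    """
    start = time.perf_counter()
    completions = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(token_ids) for token_ids in completions)
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompts: Sequence[str]) -> List[List[int]]:
    """Stand-in for a real engine: 'generates' 16 tokens per prompt."""
    time.sleep(0.01)
    return [[0] * 16 for _ in prompts]


tokens_per_second = measure_throughput(fake_generate, ["Hello", "World"])
```

With vLLM, generate_fn could be a small wrapper that calls LLM.generate with a SamplingParams object and returns the token_ids of the first output of each request. A real benchmark should also include warm-up runs and fixed output lengths so that the numbers are comparable across models.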
Note: I focused on Qwen3 in this article, but I was also able to quantize Olmo 3 with the same script. You can find my quantized Olmo 3 models here (still ongoing):


