NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs
How to quantize LLMs with NVFP4
As large language models (LLMs) continue to grow in size and complexity, quantization has become an essential technique for making inference more efficient, especially on consumer and enterprise-grade hardware. Among the emerging quantization formats, NVIDIA’s NVFP4 stands out for its tight integration with Blackwell GPUs and promise of significant speedups without major accuracy trade-offs.
How does NVFP4 compare to widely used 4-bit quantization methods such as AWQ, AutoRound, and bitsandbytes? Should you always reach for NVFP4 models if you have a Blackwell GPU?
In this article, I put NVFP4 to the test, evaluating it across key dimensions like accuracy, model size, and inference throughput, using publicly available models as well as a few custom-quantized variants.
I also share practical tips on using NVFP4 models with vLLM and explain why activation quantization is critical to maintaining NVFP4's performance edge.
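As a quick preview, running an NVFP4 checkpoint in vLLM looks like any other model load, since vLLM picks up the quantization format from the checkpoint's config. The snippet below is a minimal sketch using vLLM's offline API; the model ID is a placeholder for whichever NVFP4 checkpoint you want to serve.

```python
# Minimal sketch: generating with an NVFP4 checkpoint via vLLM's offline API.
# The model ID is a placeholder; substitute any NVFP4-quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP4",  # placeholder NVFP4 checkpoint
    max_model_len=4096,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain NVFP4 quantization in one paragraph."], sampling_params
)
print(outputs[0].outputs[0].text)
```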
The following notebook shows how to quantize models with NVFP4.
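If you just want the gist before opening the notebook, here is a minimal sketch of NVFP4 quantization with llm-compressor, assuming a recent version that ships the NVFP4 scheme. The model ID, calibration dataset, and sample counts are illustrative choices, not necessarily the exact settings used in the notebook.

```python
# Minimal sketch of NVFP4 quantization with llm-compressor (assumes a recent
# version that supports scheme="NVFP4"). Model ID, dataset, and sample counts
# are illustrative, not the notebook's exact settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example model

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize weights and activations of all Linear layers to NVFP4,
# keeping the LM head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# A small calibration set is enough to fit the NVFP4 activation scales.
oneshot(
    model=model,
    dataset="open_platypus",  # example calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in compressed format so vLLM can load the checkpoint directly.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```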
I used an RTX 6000 Pro from RunPod (referral link) for all my experiments. An RTX 5080 or 5090 would also work.