The Kaitchup – AI on a Budget

NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs

How to quantize LLMs with NVFP4

Benjamin Marie
Aug 25, 2025

As large language models (LLMs) continue to grow in size and complexity, quantization has become an essential technique for making inference more efficient, especially on consumer and enterprise-grade hardware. Among the emerging quantization formats, NVIDIA’s NVFP4 stands out for its tight integration with Blackwell GPUs and its promise of significant speedups without major accuracy trade-offs.

How does NVFP4 compare to widely used 4-bit quantization methods such as AWQ, AutoRound, and bitsandbytes? Should you systematically use NVFP4 models if you have a Blackwell GPU?


In this article, I put NVFP4 to the test, evaluating it across key dimensions like accuracy, model size, and inference throughput, using publicly available models as well as a few custom-quantized variants.

I also share practical tips for using NVFP4 models with vLLM and explain why activation quantization is critical to maintaining NVFP4’s performance edge.
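For example, serving an NVFP4 checkpoint with vLLM is straightforward because vLLM picks up the quantization method from the config stored in the checkpoint. Here is a minimal sketch, assuming a Blackwell GPU and a recent vLLM build with NVFP4 support; the model ID is just an example NVFP4 checkpoint from the Hugging Face Hub, not necessarily one used in this article:

```python
# Minimal sketch: serving an NVFP4 checkpoint with vLLM.
# Assumes a Blackwell GPU and a vLLM version with NVFP4 support;
# the model ID below is an example NVFP4 checkpoint.
from vllm import LLM, SamplingParams

# vLLM detects the quantization method from the checkpoint's quantization
# config, so no extra flag is needed for NVFP4 models.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP4")

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(
    ["Explain 4-bit quantization in one short paragraph."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

The Blackwell assumption matters: NVFP4's FP4 tensor cores are a Blackwell hardware feature, which is why the experiments below run on cards like the RTX 6000 Pro.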

The following notebook shows how to quantize models with NVFP4.

Get the notebook (#179)
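If you only want the gist of what the notebook does, here is a minimal sketch of NVFP4 quantization with llm-compressor, assuming a recent release that ships the NVFP4 scheme; the model ID, calibration dataset, and sample counts are illustrative choices, not necessarily the notebook's exact configuration:

```python
# Minimal sketch: NVFP4 quantization with llm-compressor.
# Assumes a recent llm-compressor release with the NVFP4 scheme;
# model ID and calibration settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 quantizes weights AND activations to FP4 with FP8 block scales,
# so a small calibration pass is needed to fit the activation scales.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # small built-in calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The calibration pass is what makes activation quantization work; the weight-only variant (the NVFP4A16 scheme) skips it, but, as discussed later, quantizing activations is what preserves NVFP4's throughput advantage.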

I used an RTX 6000 Pro from RunPod (referral link) for all my experiments. An RTX 5080 or 5090 would also work.
