Accurate 2-bit Quantization: Run Massive LLMs on a Single Consumer GPU
Running 70B-class models on consumer hardware
2-bit quantization for LLMs can be surprisingly accurate when using state-of-the-art techniques. In a previous article, we saw how to quantize Qwen2.5 72B to 2-bit, reducing the model size to 23.8 GB while retaining 88% of its original accuracy, all with AutoRound and minimal tuning of quantization hyperparameters.
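For context, here is a minimal sketch of what such a 2-bit AutoRound run looks like. The hyperparameter values (group size, symmetry, output format) are illustrative placeholders, not the exact recipe from the previous article:

```python
# Minimal 2-bit quantization sketch with AutoRound.
# Hyperparameter values are illustrative, not the tuned recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=2,         # target weight precision
    group_size=32,  # weights per quantization group (discussed below)
    sym=True,       # symmetric quantization
)
autoround.quantize()
autoround.save_quantized("Qwen2.5-72B-2bit", format="auto_round")
```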
That said, some of these hyperparameters, especially at such low-bit precision, play a critical role in quantization success. In particular, group size can significantly affect model stability, accuracy, and final size.
In this article, we explore practical recipes for 2-bit quantization, with a focus on understanding and properly setting the group size. The goal is to help you anticipate its effect on model size and accuracy before running the quantization.
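To anticipate the size effect, a rough model helps: with group-wise quantization, each group of `group_size` weights stores its own scale (and, for asymmetric schemes, a zero point), so the effective bits per weight is roughly `bits + metadata_bits / group_size`. The sketch below assumes 16-bit scales and zero points and ignores unquantized layers such as embeddings, so treat the numbers as estimates, not exact checkpoint sizes:

```python
# Back-of-the-envelope size estimate for group-wise 2-bit quantization.
# Assumes each group stores one 16-bit scale and one 16-bit zero point,
# and ignores unquantized layers (embeddings, lm_head) and metadata.
def estimated_size_gb(n_params: float, bits: int = 2, group_size: int = 32,
                      scale_bits: int = 16, zero_bits: int = 16) -> float:
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

for g in (32, 64, 128):
    print(f"group_size={g}: ~{estimated_size_gb(72e9, group_size=g):.1f} GB")
# group_size=32: ~27.0 GB
# group_size=64: ~22.5 GB
# group_size=128: ~20.2 GB
```

Smaller groups track local weight statistics more closely, which tends to help accuracy at 2-bit, but the metadata overhead grows quickly: under these assumptions, halving the group size from 64 to 32 adds roughly 0.5 bits per weight.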
Our best 2-bit 72B model retains 96.5% of the original 16-bit model's accuracy on the IFEval benchmark while being 5.3x smaller.
The quantization recipes and evaluations discussed here are available in this notebook: