2-bit VPTQ: 6.5x Smaller LLMs, >95% Accuracy
Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU
Significant advances in LLM quantization were made last year. Algorithms like AQLM and AutoRound have demonstrated that 4-bit quantization can maintain the accuracy of the original models across most tasks.
Recent developments in even lower-precision quantization, such as 2-bit and 3-bit, now show acceptable levels of degradation, except for some models, such as Llama 3, that are particularly challenging to quantize accurately.
That said, 2-bit quantization still introduces noticeable accuracy loss in most cases.
One promising algorithm for low-bit quantization is VPTQ (Vector Post-Training Quantization), proposed by Microsoft. Introduced in October 2024, it has since shown excellent performance and efficiency in quantizing large models.
In this article, we will:
Review the VPTQ quantization algorithm.
Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B (a minimal loading sketch follows this list).
Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
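As a quick preview of the second point, loading a VPTQ model looks much like loading any other model from the Hugging Face Hub, using the vptq package on top of Transformers. The sketch below is a minimal example, not the exact code from the notebook; in particular, the repository ID is an illustrative name in the style of the VPTQ-community models, so check the Hub for the exact model you want.

```python
# pip install vptq transformers
# Minimal sketch: load a 2-bit VPTQ model and generate text.
# The repository ID below is an assumption; look up the exact name
# in the VPTQ-community collections on the Hugging Face Hub.
import vptq
from transformers import AutoTokenizer

model_id = "VPTQ-community/Meta-Llama-3.3-70B-Instruct-v8-k65536-0-woft"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain vector quantization in one sentence.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```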
Remarkably, 2-bit quantization with VPTQ comes close to matching the performance of the original 16-bit models on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than a 16-bit 70B model!
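To make the memory claim concrete, here is a rough back-of-the-envelope calculation of the weight memory alone. It ignores the codebooks, the 16-bit embeddings, and the KV cache, which is why the effective compression ratio is closer to the 6.5x in the title than the theoretical 8x.

```python
# Back-of-the-envelope weight-memory estimates. Real footprints are higher:
# codebooks and embeddings stay in 16-bit, and the KV cache adds more.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"Llama 3.1 405B, 16-bit: ~{weight_gb(405, 16):.0f} GB")  # ~810 GB
print(f"Llama 3.1 405B,  2-bit: ~{weight_gb(405, 2):.0f} GB")   # ~101 GB, less than a 16-bit 70B
print(f"Llama 3.x 70B,  16-bit: ~{weight_gb(70, 16):.0f} GB")   # ~140 GB
print(f"Llama 3.x 70B,   2-bit: ~{weight_gb(70, 2):.1f} GB")    # ~17.5 GB, fits on a 24 GB GPU
```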
The notebook below shows the steps to run a VPTQ model and details my evaluation process: