2-bit VPTQ: 6.5x Smaller LLMs, >95% Accuracy
Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU
Significant advances in LLM quantization were made last year. Algorithms like AQLM and AutoRound have demonstrated that 4-bit quantization can maintain the accuracy of the original models across most tasks.
Recent developments in even lower-precision quantization, such as 2-bit and 3-bit, now show acceptable levels of degradation, except for some models, such as Llama 3, that are particularly challenging to quantize accurately.
That said, 2-bit quantization still introduces noticeable accuracy loss in most cases.
One promising algorithm for low-bit quantization is VPTQ (Vector Post-Training Quantization), proposed by Microsoft. Introduced in October 2024, it has since shown excellent performance and efficiency in quantizing large models.
In this article, we will:
Review the VPTQ quantization algorithm.
Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B (a minimal loading sketch follows this list).
Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
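As a quick preview of the second point, loading a VPTQ model looks much like loading any other model from the Hugging Face Hub, using the vptq package on top of Transformers. The sketch below is a minimal example, not the exact code from the notebook; in particular, the repository ID is an illustrative name in the style of the VPTQ-community models, so check the Hub for the exact model you want.

```python
# pip install vptq transformers
# Minimal sketch: load a 2-bit VPTQ model and generate text.
# The repository ID below is an assumption; look up the exact name
# in the VPTQ-community collections on the Hugging Face Hub.
import vptq
from transformers import AutoTokenizer

model_id = "VPTQ-community/Meta-Llama-3.3-70B-Instruct-v8-k65536-0-woft"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain vector quantization in one sentence.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```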
Remarkably, 2-bit quantization with VPTQ comes close to matching the performance of the original 16-bit models on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than a 16-bit 70B model!
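To make the memory claim concrete, here is a rough back-of-the-envelope calculation of the weight memory alone. It ignores the codebooks, the 16-bit embeddings, and the KV cache, which is why the effective compression ratio is closer to the 6.5x in the title than the theoretical 8x.

```python
# Back-of-the-envelope weight-memory estimates. Real footprints are higher:
# codebooks and embeddings stay in 16-bit, and the KV cache adds more.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"Llama 3.1 405B, 16-bit: ~{weight_gb(405, 16):.0f} GB")  # ~810 GB
print(f"Llama 3.1 405B,  2-bit: ~{weight_gb(405, 2):.0f} GB")   # ~101 GB, less than a 16-bit 70B
print(f"Llama 3.x 70B,  16-bit: ~{weight_gb(70, 16):.0f} GB")   # ~140 GB
print(f"Llama 3.x 70B,   2-bit: ~{weight_gb(70, 2):.1f} GB")    # ~17.5 GB, fits on a 24 GB GPU
```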
The notebook below shows the steps to run a VPTQ model and details my evaluation process: