The Kaitchup – AI on a Budget

2-bit VPTQ: 6.5x Smaller LLMs, >95% Accuracy

Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU

Benjamin Marie
Jan 27, 2025

Significant advancements in quantization for LLMs were made last year. Algorithms like AQLM and AutoRound have demonstrated that 4-bit quantization can maintain the accuracy of the original models across most tasks.

See also: The Best Quantization Methods to Run Llama 3.1 on Your GPU (Benjamin Marie, August 12, 2024)

Recent developments in even lower-precision quantization, such as 3-bit and 2-bit quantization, now show acceptable levels of degradation, except for some models, such as the Llama 3 family, that are particularly challenging to quantize accurately.

That said, 2-bit quantization still introduces noticeable accuracy loss in most cases.

One promising algorithm for low-bit quantization is VPTQ, proposed by Microsoft. It was introduced in October 2024 and has since shown excellent performance and efficiency in quantizing large models.
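Before the detailed review in the Vector Post-Training Quantization section, here is a toy sketch of the core idea behind vector quantization, heavily simplified: weights are grouped into short vectors and each vector is stored as an index into a shared codebook. The vector length, codebook size, and random codebook below are illustration-only assumptions; VPTQ itself learns its codebooks with second-order (Hessian-aware) optimization and adds further refinements.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(1024,)).astype(np.float32)  # stand-in for one weight matrix, flattened

vector_len = 8         # 8 weights per vector (illustrative choice)
num_centroids = 256    # 256 centroids -> an 8-bit index per 8 weights = 1 bit/weight for indices

vectors = weights.reshape(-1, vector_len)
codebook = rng.normal(size=(num_centroids, vector_len)).astype(np.float32)  # random stand-in for a learned codebook

# Assign each vector to its nearest centroid (squared Euclidean distance)
dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1).astype(np.uint8)   # this index stream is what gets stored
dequantized = codebook[indices].reshape(-1)       # what the inference kernel reconstructs on the fly

print("index bits per weight:", 8 / vector_len)
print("reconstruction MSE:", float(((weights - dequantized) ** 2).mean()))

With a real, learned codebook the reconstruction error is of course far lower than with this random one; the point is only how a low bits-per-weight budget can be spent on indices plus a small shared codebook.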

In this article, we will:

  1. Review the VPTQ quantization algorithm.

  2. Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B (a minimal loading sketch follows this list).

  3. Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
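As a preview of step 2, here is a minimal loading sketch using Microsoft's vptq package, which exposes a Transformers-style loader (pip install vptq; the kernels require a CUDA GPU). The model ID follows the VPTQ-community naming scheme on the Hugging Face Hub but is only an example; the exact checkpoints I use, and the full tested code, are in the notebook linked below.

import transformers
import vptq

# Example VPTQ-community checkpoint name (verify the exact repository on the Hub)
model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is vector quantization?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))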

Remarkably, 2-bit quantization with VPTQ comes close to matching the performance of the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than an unquantized 70B model!
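A back-of-the-envelope calculation makes this concrete. Counting weight storage only, and ignoring codebook overhead, layers kept at higher precision, and the KV cache:

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    # Approximate weight storage in GB: (params * bits) / 8 bits per byte
    return params_billions * bits_per_weight / 8

print(f"Llama 3.1 405B @ 16-bit: ~{weight_gb(405, 16):.0f} GB")  # ~810 GB
print(f"Llama 3.1 405B @  2-bit: ~{weight_gb(405, 2):.0f} GB")   # ~101 GB
print(f"Llama 3.1  70B @ 16-bit: ~{weight_gb(70, 16):.0f} GB")   # ~140 GB
print(f"Llama 3.1  70B @  2-bit: ~{weight_gb(70, 2):.1f} GB")    # ~17.5 GB

Roughly 100 GB of 2-bit weights for the 405B model is well under the ~140 GB of a 16-bit 70B model, and about 18 GB for a 2-bit 70B model is what makes the 24 GB GPU target in the subtitle plausible; real checkpoints are somewhat larger because of codebooks and unquantized layers.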

The notebook below shows the steps to run a VPTQ model and details my evaluation process:

Get the notebook (#139)

Vector Post-Training Quantization
