Accurate 2-bit Quantization: Run Massive LLMs on a Single Consumer GPU
Running 70B-class models on consumer hardware
2-bit quantization for LLMs can be surprisingly accurate when using state-of-the-art techniques. In a previous article, we saw how to quantize Qwen2.5 72B to 2-bit, reducing the model size to 23.8 GB while retaining 88% of its original accuracy, all with AutoRound and minimal tuning of quantization hyperparameters.
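For context, here is a minimal sketch of what such a 2-bit AutoRound run looks like. The hyperparameter values (group size, symmetry, output format) are illustrative placeholders, not the exact recipe from the previous article:

```python
# Minimal 2-bit quantization sketch with AutoRound.
# Hyperparameter values are illustrative, not the tuned recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=2,         # target weight precision
    group_size=32,  # weights per quantization group (discussed below)
    sym=True,       # symmetric quantization
)
autoround.quantize()
autoround.save_quantized("Qwen2.5-72B-2bit", format="auto_round")
```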
That said, some of these hyperparameters, especially at such low-bit precision, play a critical role in quantization success. In particular, group size can significantly affect model stability, accuracy, and final size.
In this article, we explore practical recipes for 2-bit quantization, with a focus on understanding and properly setting the group size. The goal is to help you anticipate its effect on model size and accuracy before running the quantization.
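To anticipate the size effect, a rough model helps: with group-wise quantization, each group of `group_size` weights stores its own scale (and, for asymmetric schemes, a zero point), so the effective bits per weight is roughly `bits + metadata_bits / group_size`. The sketch below assumes 16-bit scales and zero points and ignores unquantized layers such as embeddings, so treat the numbers as estimates, not exact checkpoint sizes:

```python
# Back-of-the-envelope size estimate for group-wise 2-bit quantization.
# Assumes each group stores one 16-bit scale and one 16-bit zero point,
# and ignores unquantized layers (embeddings, lm_head) and metadata.
def estimated_size_gb(n_params: float, bits: int = 2, group_size: int = 32,
                      scale_bits: int = 16, zero_bits: int = 16) -> float:
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

for g in (32, 64, 128):
    print(f"group_size={g}: ~{estimated_size_gb(72e9, group_size=g):.1f} GB")
# group_size=32: ~27.0 GB
# group_size=64: ~22.5 GB
# group_size=128: ~20.2 GB
```

Smaller groups track local weight statistics more closely, which tends to help accuracy at 2-bit, but the metadata overhead grows quickly: under these assumptions, halving the group size from 64 to 32 adds roughly 0.5 bits per weight.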
Our best 2-bit 72B model retains 96.5% of the original 16-bit model's accuracy on the IFEval benchmark while being 5.3x smaller.
The quantization recipes and evaluations discussed here are available in this notebook: