Intel AutoRound: Accurate Low-bit Quantization for LLMs

Between quantization-aware training and post-training quantization

Benjamin Marie · Jun 27, 2024
AutoRound with SignSGD — source (CC-BY)

There are many quantization methods to reduce the size of large language models (LLMs). Most of them are only good enough for 4-bit quantization: going down to 3-bit or 2-bit usually results in a significant accuracy drop, making the models unusable for most language generation tasks.
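For a sense of scale, here is a quick back-of-the-envelope calculation of the weight memory for a 7B-parameter model at different precisions (illustrative numbers only, ignoring quantization metadata such as scales and zero-points):

```python
# Approximate weight-only memory footprint of a 7B-parameter model
params = 7e9
for bits in (16, 4, 3, 2):
    print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
# 16-bit: ~14.0 GB, 4-bit: ~3.5 GB, 3-bit: ~2.6 GB, 2-bit: ~1.8 GB
```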

Recently, better low-bit quantization methods have been proposed. For instance, I reviewed and experimented with AQLM, which achieves 2-bit quantization while preserving most of the model's accuracy.

Run a 7.7x Smaller Mixtral-8x7B on Your GPU with AQLM 2-bit Quantization (February 22, 2024)

The main drawback of AQLM is that quantizing large models takes many days. HQQ is another good alternative for low-bit quantization, but it requires further fine-tuning to preserve accuracy.

1-bit and 2-bit Llama 3: Quantization with HQQ and Fine-tuning with HQQ+ (May 30, 2024)

Intel is also very active in the research on better quantization algorithms. They propose AutoRound, a new quantization method that adopts signed gradient descent (SignSGD). AutoRound is especially accurate at low bit widths and quantizes faster than most other methods.
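To make the idea concrete, here is a toy sketch of a signed-gradient rounding update. This is my own illustration, not Intel's implementation: it assumes a straight-through estimator for the non-differentiable rounding, a naive per-tensor scale, and made-up tensor shapes, but it shows how SignSGD can tune per-weight rounding offsets to minimize the quantization error on calibration data.

```python
import torch
import torch.nn.functional as F

def ste_round(t):
    # Straight-through estimator: round in forward, identity gradient in backward
    return (torch.round(t) - t).detach() + t

def fake_quant(w, scale, v, bits=4):
    # Quantize-dequantize with a learnable per-weight rounding offset v
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    return torch.clamp(ste_round(w / scale + v), qmin, qmax) * scale

torch.manual_seed(0)
w = torch.randn(64, 64)                      # frozen FP weight block
x = torch.randn(128, 64)                     # calibration activations
scale = w.abs().max() / (2 ** 3 - 1)         # naive per-tensor 4-bit scale
v = torch.zeros_like(w, requires_grad=True)  # learnable rounding offsets
lr = 5e-3

for _ in range(200):
    # Match the quantized layer's output to the full-precision output
    loss = F.mse_loss(x @ fake_quant(w, scale, v).T, x @ w.T)
    loss.backward()
    with torch.no_grad():
        v -= lr * v.grad.sign()              # SignSGD: step by the gradient's sign only
        v.clamp_(-0.5, 0.5)                  # stay within half a quantization step
        v.grad.zero_()
```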

In this article, I review AutoRound. We will see how it works and how to quantize LLMs, such as Llama 3, with a minimal accuracy drop. I found AutoRound to be a very good alternative to GPTQ and HQQ: it yields more accurate models.
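To give you an idea of how simple the workflow is, here is a minimal sketch of quantizing a model with Intel's auto-round library. The model ID is only an example, and argument names and defaults may differ across library versions, so check the official documentation before running it:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound  # pip install auto-round

model_id = "meta-llama/Meta-Llama-3-8B"  # example model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights with a group size of 128: common low-bit settings
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./Llama-3-8B-autoround-4bit")
```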

I implemented the following notebook showing how to quantize LLMs with AutoRound, and evaluate/benchmark the resulting models:

Get the notebook (#82)
