Intel AutoRound: Accurate Low-bit Quantization for LLMs
Between quantization-aware training and post-training quantization
There are many quantization methods to reduce the size of large language models (LLMs). Most of them only perform well down to 4-bit quantization. Quantizing to 3-bit or 2-bit usually results in a significant accuracy drop, making LLMs unusable for most language generation tasks.
Recently, better low-bit quantization methods have been proposed. For instance, I reviewed and experimented with AQLM, which achieves 2-bit quantization while preserving most of the model's accuracy.
The main drawback of AQLM is that quantizing large models takes many days. HQQ is another good alternative for low-bit quantization, but it requires further fine-tuning to preserve accuracy.
Intel is also very active in research on better quantization algorithms. They propose AutoRound, a new quantization method that adopts signed gradient descent (SignSGD). AutoRound is especially accurate for low-bit quantization and quantizes models faster than most other methods.
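For context, signed gradient descent updates each parameter by a fixed step in the direction of the sign of its gradient, discarding the gradient's magnitude. The sketch below illustrates a generic step of this optimizer family in PyTorch; it is only an illustration of the idea, not Intel's implementation:

```python
import torch

# Illustrative signed-gradient-descent (SignSGD) step: each parameter moves by a
# fixed step size in the direction of the sign of its gradient, ignoring the
# gradient's magnitude. This sketches the optimizer family AutoRound builds on,
# not Intel's actual implementation.
def signsgd_step(params, lr=1e-3):
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad.sign()

# Toy usage: one step on a simple quadratic loss.
w = torch.randn(10, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
signsgd_step([w], lr=0.01)
```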
In this article, I review AutoRound. We will see how it works and how to quantize LLMs, such as Llama 3, with minimal accuracy drop. I found AutoRound to be a very good alternative to GPTQ and HQQ. It yields more accurate models.
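To give an idea of how simple the quantization code is, here is a minimal sketch using the auto-round package. The model name, output directory, and hyperparameters are placeholders, and the exact class and argument names may differ between versions of the package and from the code used in the notebook:

```python
# Minimal sketch: 4-bit AutoRound quantization with the auto-round package
# (pip install auto-round). Names and arguments follow the project's documented
# usage at the time of writing and may change between versions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3-8B"  # example model; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,          # target bit width; lower bit widths are also supported
    group_size=128,  # quantization group size
    sym=False,       # asymmetric quantization
)
autoround.quantize()
autoround.save_quantized("./Meta-Llama-3-8B-autoround-4bit")
```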
I implemented the following notebook showing how to quantize LLMs with AutoRound and how to evaluate and benchmark the resulting models: