1-bit and 2-bit Llama 3: Quantization with HQQ and Fine-tuning with HQQ+
Replacing Llama 3's parameters with 0s and 1s: does it work?
1-bit quantization significantly reduces the size of large language models (LLMs) by replacing their weights with 0s and 1s. It is a very aggressive method compared to 4-bit quantization: applied naively, 1-bit quantization turns any LLM into a gibberish generator.
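To make the idea concrete, here is a minimal, illustrative sketch of naive 1-bit quantization in PyTorch (this is not HQQ itself): each row of weights keeps only a scale and a zero point, and every weight collapses to one of two values. The reconstruction error it prints hints at why the naive approach produces gibberish.

```python
# Illustrative naive 1-bit quantization (not HQQ): each weight becomes 0 or 1,
# reconstructed with a per-row scale and zero point.
import torch

def quantize_1bit(w: torch.Tensor):
    # Per-row min/max define the only two representable values.
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8)
    # Map each weight to 0 or 1, whichever endpoint is closer.
    q = torch.round((w - w_min) / scale)
    return q.to(torch.uint8), scale, w_min

def dequantize_1bit(q, scale, zero):
    # Reconstruct an approximation of the original weights.
    return q.to(scale.dtype) * scale + zero

w = torch.randn(4, 8)
q, scale, zero = quantize_1bit(w)
w_hat = dequantize_1bit(q, scale, zero)
print("mean absolute reconstruction error:", (w - w_hat).abs().mean().item())
```

HQQ refines this naive scheme by optimizing the quantization parameters to reduce the reconstruction error, but at 1-bit the information loss remains severe.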
Several approaches have been proposed to improve 1-bit quantization for LLMs. For instance, 1-bit HQQ quantization has been applied to Llama 2 7B. While the quantization damages the model, fine-tuning an adapter on top of the 1-bit model, the approach behind HQQ+, has been shown to recover a significant part of the lost accuracy.
In this article, I explore 1-bit and 2-bit quantization with HQQ for Llama 3 8B and 70B. We will see that while this quantization makes Llama 3 8B barely usable, fine-tuning an adapter on top of the quantized model improves the results. 1-bit quantization, even for Llama 3 70B, damages the model too much for it to generate language. 2-bit quantization with HQQ, on the other hand, works reasonably well for Llama 3 8B.
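As a preview of the workflow, here is a minimal sketch of how the quantization and the adapter could be set up, assuming the HqqConfig integration in Hugging Face Transformers and PEFT for the adapter. Exact argument names and supported options may differ across library versions, and the LoRA hyperparameters and target modules below are illustrative choices, not the article's exact configuration.

```python
# Sketch: load Llama 3 8B quantized with HQQ, then attach a LoRA adapter for fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"

# 2-bit HQQ quantization; set nbits=1 for 1-bit. group_size is the number of
# weights sharing one scale/zero point (smaller groups are more accurate but larger).
quant_config = HqqConfig(nbits=2, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapter trained on top of the frozen quantized weights, in the spirit of HQQ+.
# The target modules and rank are assumptions for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From here, the quantized model with its adapter can be passed to a standard training loop or trainer; only the adapter weights are updated, while the 1-bit or 2-bit base weights stay frozen.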
The following notebook shows how to quantize Llama 3 to 1-bit and 2-bit with HQQ and fine-tune an adapter on top of it: