The Kaitchup – AI on a Budget

1-bit and 2-bit Llama 3: Quantization with HQQ and Fine-tuning with HQQ+

Replacing Llama 3's parameters with 0s and 1s: does it work?

Benjamin Marie
May 30, 2024
∙ Paid
[Image: two cartoon llamas carrying an oversized '0' and '1' on their backs. Generated with DALL-E]

1-bit quantization dramatically reduces the size of large language models (LLMs) by encoding each weight with a single bit, i.e., replacing the weights with 0s and 1s. This is far more aggressive than 4-bit quantization: applied naively, 1-bit quantization turns any LLM into a gibberish generator.
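To see why, here is a minimal NumPy sketch of naive 1-bit quantization (an illustration, not HQQ's actual algorithm): each weight is reduced to its sign, with a single per-row scale to restore magnitudes. The large reconstruction error hints at why the naive approach destroys the model.

```python
import numpy as np

def quantize_1bit(W):
    """Naive 1-bit quantization: keep only the sign of each weight,
    plus one per-row float scale to restore magnitudes."""
    scale = np.mean(np.abs(W), axis=1, keepdims=True)  # per-row scale
    B = np.sign(W)                                     # entries in {-1, 0, +1}
    return B, scale

def dequantize(B, scale):
    return B * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)  # toy "layer weight"
B, s = quantize_1bit(W)
W_hat = dequantize(B, s)

# Each weight is stored as 1 bit plus one shared scale per row,
# so the reconstruction error is large compared to 4-bit schemes.
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.2f}")
```

For Gaussian-distributed weights, the relative error of this scheme is around 0.6, i.e., more than half of the signal is lost before the model even runs.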

Related: Fine-tune Llama 3 on Your Computer (Benjamin Marie, April 22, 2024)

Several approaches have been proposed to improve 1-bit quantization for LLMs. For instance, HQQ 1-bit quantization has been applied to Llama 2 7B. While it damages the model, it has been shown that fine-tuning an adapter on top of a 1-bit LLM quantized with HQQ can recover a significant part of the lost accuracy.
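The intuition behind this adapter trick (the HQQ+ idea) can be sketched in plain NumPy: freeze a 1-bit base matrix and add a small trainable low-rank correction on top of it, analogous to a LoRA adapter. In the sketch below the correction is fitted in closed form with an SVD purely for illustration; the actual method trains the adapter with gradient descent, and the sign-based binarization here is a simplification of HQQ's real quantizer.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16)).astype(np.float32)  # original layer weight

# Frozen 1-bit base: sign matrix with a per-row scale (simplified).
scale = np.mean(np.abs(W), axis=1, keepdims=True)
B = np.sign(W) * scale

# Low-rank adapter absorbing part of the quantization residual.
# Fitted via SVD here for illustration; HQQ+ trains it on calibration data.
r = 4
U, S, Vt = np.linalg.svd(W - B)
A = U[:, :r] * S[:r]   # (16, r) -- like one LoRA factor
C = Vt[:r, :]          # (r, 16) -- like the other LoRA factor

err_base = np.linalg.norm(W - B)            # 1-bit base alone
err_plus = np.linalg.norm(W - (B + A @ C))  # 1-bit base + adapter
print(f"error: 1-bit {err_base:.2f} -> with adapter {err_plus:.2f}")
```

The adapter cannot undo binarization entirely, but it provably shrinks the residual, which is why fine-tuning on top of the 1-bit base recovers part of the lost accuracy.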


In this article, I explore 1-bit and 2-bit quantization with HQQ for Llama 3 8B and 70B. We will see that while 1-bit quantization makes Llama 3 8B barely usable, fine-tuning an adapter on top of the quantized model improves the results. Even with Llama 3 70B, 1-bit quantization damages the model too much and leaves it unable to generate language. On the other hand, 2-bit quantization with HQQ works reasonably well for Llama 3 8B.

The following notebook shows how to quantize Llama 3 to 1-bit and 2-bit with HQQ and fine-tune an adapter on top of it:

Get the notebook (#74)
