1-bit and 2-bit Llama 3: Quantization with HQQ and Fine-tuning with HQQ+
Replacing Llama 3's parameters with 0s and 1s: does it work?
1-bit quantization significantly reduces the size of large language models (LLMs) by replacing their weights with 0s and 1s. It is a very aggressive method compared to 4-bit quantization: applied naively, 1-bit quantization turns any LLM into a gibberish generator.
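To make the idea concrete, here is a minimal, illustrative sketch of naive 1-bit quantization in PyTorch (this is not HQQ itself): each row of weights keeps only a scale and a zero point, and every weight collapses to one of two values. The reconstruction error it prints hints at why the naive approach produces gibberish.

```python
# Illustrative naive 1-bit quantization (not HQQ): each weight becomes 0 or 1,
# reconstructed with a per-row scale and zero point.
import torch

def quantize_1bit(w: torch.Tensor):
    # Per-row min/max define the only two representable values.
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8)
    # Map each weight to 0 or 1, whichever endpoint is closer.
    q = torch.round((w - w_min) / scale)
    return q.to(torch.uint8), scale, w_min

def dequantize_1bit(q, scale, zero):
    # Reconstruct an approximation of the original weights.
    return q.to(scale.dtype) * scale + zero

w = torch.randn(4, 8)
q, scale, zero = quantize_1bit(w)
w_hat = dequantize_1bit(q, scale, zero)
print("mean absolute reconstruction error:", (w - w_hat).abs().mean().item())
```

HQQ refines this naive scheme by optimizing the quantization parameters to reduce the reconstruction error, but at 1-bit the information loss remains severe.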
Several approaches have been proposed to improve 1-bit quantization for LLMs. For instance, 1-bit HQQ quantization has been applied to Llama 2 7B. While the quantization damages the model, fine-tuning an adapter on top of the 1-bit model, the approach behind HQQ+, has been shown to recover a significant part of the lost accuracy.
In this article, I explore 1-bit and 2-bit quantization with HQQ for Llama 3 8B and 70B. We will see that while this quantization makes Llama 3 8B barely usable, fine-tuning an adapter on top of the quantized model improves the results. 1-bit quantization, even for Llama 3 70B, damages the model too much for it to generate language. 2-bit quantization with HQQ, on the other hand, works reasonably well for Llama 3 8B.
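As a preview of the workflow, here is a minimal sketch of how the quantization and the adapter could be set up, assuming the HqqConfig integration in Hugging Face Transformers and PEFT for the adapter. Exact argument names and supported options may differ across library versions, and the LoRA hyperparameters and target modules below are illustrative choices, not the article's exact configuration.

```python
# Sketch: load Llama 3 8B quantized with HQQ, then attach a LoRA adapter for fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"

# 2-bit HQQ quantization; set nbits=1 for 1-bit. group_size is the number of
# weights sharing one scale/zero point (smaller groups are more accurate but larger).
quant_config = HqqConfig(nbits=2, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapter trained on top of the frozen quantized weights, in the spirit of HQQ+.
# The target modules and rank are assumptions for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From here, the quantized model with its adapter can be passed to a standard training loop or trainer; only the adapter weights are updated, while the 1-bit or 2-bit base weights stay frozen.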
The following notebook shows how to quantize Llama 3 to 1-bit and 2-bit with HQQ and fine-tune an adapter on top of it: