In the previous article, we explored how to accurately quantize Llama 3.3 70B so that it can run on a single GPU. The resulting 4-bit model delivers significant memory savings, occupying just 40 GB of GPU RAM compared to the original model's 141 GB. However, attempts to accurately quantize the model to even lower precision were unsuccessful.
Among the robust alternatives for low-bit quantization, Half-Quadratic Quantization (HQQ) stands out. Another technique, AQLM, is also accurate at low precision but is prohibitively expensive to apply to models as large as 70 billion parameters.
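For reference, here is a minimal sketch of how a model can be quantized on the fly with HQQ through the Hugging Face Transformers integration. The model ID, `nbits`, and `group_size` shown here are illustrative assumptions, not necessarily the exact settings used later in this article:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Assumed model ID for illustration; the article targets Llama 3.3 70B.
model_id = "meta-llama/Llama-3.3-70B-Instruct"

# 2-bit HQQ settings; nbits and group_size are illustrative values.
quant_config = HqqConfig(nbits=2, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,        # compute dtype for non-quantized modules
    device_map="auto",                 # place layers according to available memory
    quantization_config=quant_config,  # quantize weights on the fly with HQQ
)
```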
While 2-bit HQQ is unlikely to break the model catastrophically, it is expected to cause significant performance degradation. One effective strategy to address this is to recover the lost accuracy by fine-tuning an adapter on top of the quantized model, similar to QLoRA-style fine-tuning.
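As a rough illustration of what such an adapter setup might look like with PEFT, here is a short sketch. The LoRA rank, alpha, and target modules below are assumptions for illustration, not the configuration used in this article:

```python
from peft import LoraConfig, get_peft_model

# Assuming `model` is an HQQ-quantized causal LM loaded as in the previous sketch.
# These hyperparameters are illustrative placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable
```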
In this article, we will see how to quantize Llama 3.3 to lower precisions with HQQ and then fine-tune the resulting quantized models. 2-bit quantization works well, and fine-tuning is possible on a 32 GB GPU for short training sequences; 1-bit quantization, however, was unsuccessful. We will evaluate the accuracy of the fine-tuned models and analyze the associated costs in terms of training time and memory consumption.
The following notebook implements the methods described in this article, showing how to fine-tune Llama 3.3 70B with a single GPU: