In the previous article, we explored how to accurately quantize Llama 3.3 70B so that it can run on a single GPU. The resulting 4-bit model delivers significant memory savings, occupying just 40 GB of GPU RAM compared to the original model's 141 GB. However, attempts to accurately quantize the model to even lower precision were unsuccessful.
Among the robust alternatives for low-bit quantization, Half-Quadratic Quantization (HQQ) stands out. Another technique, AQLM, is also accurate at low precision but is prohibitively expensive to apply to models as large as 70 billion parameters.
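For reference, here is a minimal sketch of how a model can be quantized on the fly with HQQ through the Hugging Face Transformers integration. The model ID, `nbits`, and `group_size` shown here are illustrative assumptions, not necessarily the exact settings used later in this article:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# Assumed model ID for illustration; the article targets Llama 3.3 70B.
model_id = "meta-llama/Llama-3.3-70B-Instruct"

# 2-bit HQQ settings; nbits and group_size are illustrative values.
quant_config = HqqConfig(nbits=2, group_size=64)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,        # compute dtype for non-quantized modules
    device_map="auto",                 # place layers according to available memory
    quantization_config=quant_config,  # quantize weights on the fly with HQQ
)
```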
While 2-bit HQQ is unlikely to break the model catastrophically, it is expected to cause significant performance degradation. One effective strategy to address this is to recover the lost accuracy by fine-tuning an adapter on top of the quantized model, similar to QLoRA-style fine-tuning.
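As a rough illustration of what such an adapter setup might look like with PEFT, here is a short sketch. The LoRA rank, alpha, and target modules below are assumptions for illustration, not the configuration used in this article:

```python
from peft import LoraConfig, get_peft_model

# Assuming `model` is an HQQ-quantized causal LM loaded as in the previous sketch.
# These hyperparameters are illustrative placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable
```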
In this article, we will see how to quantize Llama 3.3 to lower precisions with HQQ and then fine-tune the resulting quantized models. 2-bit quantization works well, and fine-tuning is possible on a 32 GB GPU for short training sequences; 1-bit quantization, however, was unsuccessful. We will evaluate the accuracy of the fine-tuned models and analyze the associated costs in terms of training time and memory consumption.
The following notebook implements the methods described in this article, showing how to fine-tune Llama 3.3 70B with a single GPU: