The Kaitchup – AI on a Budget
Fine-Tuning Llama 3.3 70B with a Single GPU

And how to fix the poor accuracy of a 2-bit model

Benjamin Marie
Dec 12, 2024
(Header image generated with ChatGPT)

In the previous article, we explored how to accurately quantize Llama 3.3 70B so that it can run on a single GPU. The resulting 4-bit model delivers significant memory savings, occupying just 40 GB of GPU RAM versus 141 GB for the original model. However, attempts to accurately quantize the model to even lower precision were unsuccessful.
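
As a rough sanity check on those figures, raw weight storage scales linearly with bit-width. The short Python sketch below (no dependencies; the 70.6B parameter count is approximate) reproduces the 141 GB figure and shows why quantized checkpoints end up somewhat larger than the raw estimate: quantization metadata (scales and zero-points) and layers kept at higher precision add overhead.

```python
# Back-of-the-envelope weight-memory estimate for Llama 3.3 70B.
# Raw storage only: real checkpoints also carry quantization metadata
# (scales, zero-points) and some layers kept at higher precision.

NUM_PARAMS = 70.6e9  # approximate parameter count of Llama 3.3 70B

def weight_memory_gb(bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes) at a given precision."""
    return NUM_PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(bits):.0f} GB")
# 16-bit: ~141 GB | 4-bit: ~35 GB | 2-bit: ~18 GB (before metadata overhead)
```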

Quantize and Run Llama 3.3 70B Instruct on Your GPU
Benjamin Marie · December 9, 2024

Among the robust alternatives for low-precision quantization, half-quadratic quantization (HQQ) stands out. Another accurate low-bit technique, AQLM, is prohibitively expensive to run for models as large as 70 billion parameters.
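
HQQ is exposed through Hugging Face Transformers via HqqConfig. The snippet below is a minimal sketch of loading Llama 3.3 70B with 2-bit HQQ; the group_size value is an illustrative default rather than the setting used in this article's experiments, and argument names may differ across Transformers versions.

```python
# Minimal sketch: loading Llama 3.3 70B with 2-bit HQQ quantization through
# Transformers. Requires the `hqq` package. `group_size` is an illustrative
# default, not necessarily the article's setting.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-3.3-70B-Instruct"

quant_config = HqqConfig(
    nbits=2,        # target precision; 1-bit quantization was unsuccessful
    group_size=64,  # smaller groups are more accurate but add metadata
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```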

While 2-bit HQQ is unlikely to cause catastrophic failures in the model, it is expected to cause significant performance degradation. One effective way to address this is to recover the lost accuracy by fine-tuning an adapter on top of the quantized model, similar to QLoRA-style fine-tuning.
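
Concretely, QLoRA-style recovery freezes the quantized base weights and trains a small LoRA adapter on top of them. The peft sketch below illustrates the idea, continuing from the model loaded above; the rank, dropout, and target modules are illustrative assumptions, not the exact configuration evaluated later.

```python
# Sketch: attach a trainable LoRA adapter to the frozen 2-bit base model
# (QLoRA-style). Hyperparameters here are illustrative assumptions.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Freezes the base weights and prepares the quantized model for training.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                 # adapter rank (assumed; tune for your memory budget)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable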


In this article, we will see how to quantize Llama 3.3 to lower precisions using HQQ, and then fine-tune the resulting quantized models. 2-bit quantization works well: fine-tuning is possible on a 32 GB GPU with short training sequences. 1-bit quantization was unsuccessful. We will also evaluate the accuracy of the fine-tuned models and analyze the associated costs in training time and memory consumption.
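
To make that concrete, here is a hedged sketch of what such a run could look like with TRL's SFTTrainer: a short maximum sequence length keeps activation memory within a 32 GB budget, and gradient checkpointing plus gradient accumulation trade compute for memory. The dataset and hyperparameters are placeholders, not the article's configuration, and TRL argument names may differ across versions.

```python
# Illustrative fine-tuning run with TRL's SFTTrainer. Short sequences keep
# activation memory low enough for a 32 GB GPU. Dataset and hyperparameters
# are placeholders, not the article's configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

training_args = SFTConfig(
    output_dir="./Llama-3.3-70B-2bit-LoRA",
    max_seq_length=512,              # short sequences to fit in 32 GB
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # recover a usable effective batch size
    gradient_checkpointing=True,     # trade compute for activation memory
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,                 # the adapter-wrapped quantized model above
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()
```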

The following notebook implements the methods described in this article, showing how to fine-tune Llama 3.3 70B with a single GPU:

Get the notebook (#129)

Fine-Tuning a Quantized Llama 3.3 70B

This post is for paid subscribers.