Quantize and Run Llama 3.3 70B Instruct on Your GPU
4-bit, 3-bit, and 2-bit quantization
Meta launched Llama 3 70B in April 2024, followed by a first major update in July 2024 that introduced Llama 3.1 70B. Since the release of Llama 3.1, the 70B model has remained unchanged. Meanwhile, Qwen2.5 72B and derivatives of Llama 3.1, such as TULU 3 70B with its advanced post-training techniques, have significantly outperformed Llama 3.1 70B.
Llama 3.3 70B is a big step up from the earlier Llama 3.1 70B. The boost in performance comes from a better post-training process and probably newer training data. However, the model is very large, making it hard to run on a single GPU. Quantization can shrink the model enough to fit on one GPU, but doing so without losing accuracy is typically difficult, especially for Llama 3 models, which are notoriously hard to quantize accurately.
In this article, we'll take a quick look at what's new in Llama 3.3. I'll guide you through the process of quantizing the model to 4-bit precision while maintaining the same accuracy as the original model on benchmarks like MMLU. I'll also share my experiments with 2-bit and 3-bit quantization. Finally, we'll explore how much memory the quantized model saves and how to run it.
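To make the workflow concrete before we get into the details, here is a minimal sketch of what a 4-bit quantization run can look like. It assumes AutoRound as the quantization library and common default hyperparameters (bits=4, group_size=128); the actual recipe used in this article may differ in its calibration settings.

```python
# Minimal 4-bit quantization sketch with AutoRound (assumption: the article's
# exact recipe and hyperparameters may differ from what is shown here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "meta-llama/Llama-3.3-70B-Instruct"

# Load the full-precision model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits=4 with group_size=128 is a common setting; 2-bit and 3-bit runs
# only change the `bits` argument.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Save the quantized weights in a GPTQ-compatible format that vLLM can load.
autoround.save_quantized("Llama-3.3-70B-Instruct-4bit", format="auto_gptq")
```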
This article also includes a notebook that implements my quantization recipe and shows how to evaluate and run the quantized model with vLLM.
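As a preview of the serving step, the sketch below shows how a saved 4-bit checkpoint can be loaded with vLLM. The model path is a placeholder (not the repository linked below), and the context length and memory settings are illustrative values you would tune for your GPU.

```python
# Minimal sketch of running a 4-bit (GPTQ-format) checkpoint with vLLM.
# The model path is a placeholder; point it at your quantized directory or Hub repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Llama-3.3-70B-Instruct-4bit",  # placeholder path to the quantized model
    max_model_len=4096,                     # shorter context to fit a single GPU
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain 4-bit weight quantization in one paragraph."], sampling_params
)
print(outputs[0].outputs[0].text)
```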
The quantized model is available here for free: 4-bit Llama 3.3 70B Instruct (Llama license). You can support my work by subscribing to The Kaitchup:
If you are interested in fine-tuning Llama 3.3 70B with a single GPU, have a look at this article: