The Kaitchup – AI on a Budget

Quantize and Run Llama 3.3 70B Instruct on Your GPU

4-bit 👍, 3-bit 👎, and 2-bit 👎 quantization

Benjamin Marie
Dec 09, 2024
∙ Paid

Generated with Grok

Meta launched Llama 3 70B in April 2024, followed by a first major update in July 2024 with Llama 3.1 70B. The 70B model then remained unchanged, while Qwen2.5 72B and Llama 3.1 derivatives such as TULU 3 70B, which leverage advanced post-training techniques, pulled significantly ahead of Llama 3.1 70B.

Llama 3.3 70B is a big step up from the earlier Llama 3.1 70B. The boost in performance comes from a better post-training process and probably newer training data. However, the model is very large, making it hard to run on a single GPU. Quantization can shrink the model enough to fit on one GPU, but doing so without losing accuracy is tricky, especially for Llama 3 models, which are notoriously difficult to quantize accurately.
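To get an intuition for why the lower bit-widths degrade accuracy, here is a minimal, self-contained sketch of plain symmetric absmax quantization on a synthetic weight block. This is an illustration only, not the recipe used later in this article (which is more sophisticated than naive round-to-nearest): the reconstruction error grows sharply as the bit-width drops from 4 to 3 to 2 bits.

```python
import random

random.seed(0)
# A fake weight block, roughly Gaussian like real LLM weight blocks.
w = [random.gauss(0, 0.02) for _ in range(256)]

def quantize_dequantize(weights, bits):
    """Symmetric absmax round-to-nearest quantization of one block, then dequantize."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = max(abs(x) for x in weights) / levels
    return [scale * max(-levels, min(levels, round(x / scale))) for x in weights]

def mean_abs_error(weights, bits):
    """Average reconstruction error after quantizing to the given bit-width."""
    deq = quantize_dequantize(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, deq)) / len(weights)

for bits in (4, 3, 2):
    print(f"{bits}-bit: mean abs error = {mean_abs_error(w, bits):.5f}")
```

With only one positive level left at 2-bit, every weight collapses to one of three values, which matches the 👎 results reported below for 2-bit and 3-bit.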

In this article, we’ll take a quick look at what’s new in Llama 3.3. I’ll guide you through the process of quantizing the model to 4-bit precision while maintaining the same accuracy as the original model on benchmarks like MMLU. I’ll also share my experiments with 2-bit and 3-bit quantization. Finally, we’ll explore how much memory the quantized model saves and how to run it.
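For a rough sense of the memory savings, here is a back-of-the-envelope calculation. The parameter count (~70.6B) and the ~4.25 effective bits per weight (4-bit weights plus per-group scales, assuming a group size of 128) are my approximations, not figures from the article, and this counts weights only, ignoring activations and the KV cache:

```python
PARAMS = 70.6e9  # approximate parameter count of Llama 3.3 70B

def weight_memory_gb(params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

bf16_gb = weight_memory_gb(PARAMS, 16)    # original bfloat16 checkpoint
int4_gb = weight_memory_gb(PARAMS, 4.25)  # 4-bit + quantization metadata (rough)

print(f"bfloat16: {bf16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

That is roughly 141 GB for the original bfloat16 weights versus about 38 GB once quantized, which is why 4-bit quantization is what makes a single-GPU deployment plausible at all.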

This article also includes a notebook that implements my quantization recipe and shows how to evaluate and run the quantized model with vLLM.

Get the notebook (#128)

The quantized model is available here for free: 4-bit Llama 3.3 70B Instruct (llama license). You can support my work by subscribing to The Kaitchup:

Subscribe

If you are interested in fine-tuning Llama 3.3 70B with a single GPU, have a look at this article:

Fine-Tuning Llama 3.3 70B with a Single GPU


Llama 3.3: What’s New?

This post is for paid subscribers
