Quantize and Run Llama 3.3 70B Instruct on Your GPU
4-bit, 3-bit, and 2-bit quantization
Meta launched Llama 3 70B in April 2024, followed by a first major update in July 2024 that introduced Llama 3.1 70B. Since the release of Llama 3.1, the 70B model has remained unchanged. Meanwhile, Qwen2.5 72B and derivatives of Llama 3.1, such as TULU 3 70B with its advanced post-training techniques, have significantly outperformed Llama 3.1 70B.
Llama 3.3 70B is a big step up from the earlier Llama 3.1 70B. The boost in performance comes from a better post-training process and probably newer training data. However, the model is very large, making it hard to run on a single GPU. Quantization can shrink the model enough to fit on one GPU, but doing so without losing accuracy is typically difficult, especially for Llama 3 models, which are notoriously hard to quantize accurately.
In this article, we'll take a quick look at what's new in Llama 3.3. I'll guide you through the process of quantizing the model to 4-bit precision while maintaining the same accuracy as the original model on benchmarks like MMLU. I'll also share my experiments with 2-bit and 3-bit quantization. Finally, we'll explore how much memory the quantized model saves and how to run it.
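To make the workflow concrete before we get into the details, here is a minimal sketch of what a 4-bit quantization run can look like. It assumes AutoRound as the quantization library and common default hyperparameters (bits=4, group_size=128); the actual recipe used in this article may differ in its calibration settings.

```python
# Minimal 4-bit quantization sketch with AutoRound (assumption: the article's
# exact recipe and hyperparameters may differ from what is shown here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "meta-llama/Llama-3.3-70B-Instruct"

# Load the full-precision model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits=4 with group_size=128 is a common setting; 2-bit and 3-bit runs
# only change the `bits` argument.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Save the quantized weights in a GPTQ-compatible format that vLLM can load.
autoround.save_quantized("Llama-3.3-70B-Instruct-4bit", format="auto_gptq")
```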
This article also includes a notebook that implements my quantization recipe and shows how to evaluate and run the quantized model with vLLM.
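As a preview of the serving step, the sketch below shows how a saved 4-bit checkpoint can be loaded with vLLM. The model path is a placeholder (not the repository linked below), and the context length and memory settings are illustrative values you would tune for your GPU.

```python
# Minimal sketch of running a 4-bit (GPTQ-format) checkpoint with vLLM.
# The model path is a placeholder; point it at your quantized directory or Hub repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Llama-3.3-70B-Instruct-4bit",  # placeholder path to the quantized model
    max_model_len=4096,                     # shorter context to fit a single GPU
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain 4-bit weight quantization in one paragraph."], sampling_params
)
print(outputs[0].outputs[0].text)
```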
The quantized model is available here for free: 4-bit Llama 3.3 70B Instruct (Llama license). You can support my work by subscribing to The Kaitchup:
If you are interested in fine-tuning Llama 3.3 70B with a single GPU, have a look at this article: