The Recipe for Extremely Accurate and Cheap Quantization of 70B+ LLMs
Cost and accuracy for quantizing large models to 4-bit and 2-bit
Quantizing large language models (LLMs), such as Llama 3.1 70B and Qwen2.5 72B, can significantly reduce their size with minimal performance loss. Once quantized, these models can run on lower-end GPUs, drastically cutting inference costs.
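To put the size reduction in perspective, here is a quick back-of-envelope calculation of the weight memory footprint of a 70B-parameter model at different precisions (weights only; real quantized checkpoints carry some metadata overhead, and inference also needs memory for activations and the KV cache):

```python
# Back-of-envelope weight memory for a 70B-parameter model at different precisions.
# Ignores quantization metadata (scales, zero points) and runtime memory
# for activations and the KV cache.
num_params = 70e9

for bits in (16, 4, 2):
    gb = num_params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")

# 16-bit weights: ~140 GB
#  4-bit weights: ~35 GB
#  2-bit weights: ~18 GB
```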
However, the quantization process itself can be resource-intensive and expensive, especially if you need to search for good quantization hyperparameters. Depending on the algorithm, the process can take days (e.g., with AQLM) or only a few hours (e.g., with GPTQ), and the GPU memory requirements vary significantly. Without careful planning, quantization can easily cost more than $100 for the largest models.
In this article, I’ll demonstrate how to successfully quantize a 70B-class LLM (such as Llama 3.1 70B or Qwen2.5 72B) to both 4-bit and 2-bit precision for under $10. This recipe uses AutoRound, a fast, open-source, state-of-the-art quantization framework developed by Intel. AutoRound uses CPU RAM efficiently and requires minimal GPU memory, making it both cost-effective and accessible. The resulting models retain 99.4% of the original performance when quantized to 4-bit and 88.3% when quantized to 2-bit.
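As a preview of what the recipe looks like in code, here is a minimal sketch of an AutoRound quantization run. It assumes the auto-round package is installed; the model name and hyperparameters (bits, group_size, sym) are illustrative and not necessarily the exact settings used for the runs described in this article:

```python
# Minimal sketch: quantizing a large model with Intel's AutoRound (auto-round package).
# The model id and hyperparameters below are illustrative examples.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-70B-Instruct"  # example model id
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits=4 for 4-bit quantization; set bits=2 for 2-bit
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("Llama-3.1-70B-Instruct-AutoRound-4bit", format="auto_round")
```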
You can try to reproduce this quantization, or apply it to your own models, directly in this notebook: