The Kaitchup – AI on a Budget

The Recipe for Extremely Accurate and Cheap Quantization of 70B+ LLMs

Cost and accuracy for quantizing large models to 4-bit and 2-bit

Benjamin Marie · Nov 25, 2024


Quantizing large language models (LLMs), such as Llama 3.1 70B and Qwen2.5 72B, can significantly reduce their size with minimal performance loss. Once quantized, these models can run on lower-end GPUs, drastically cutting inference costs.

However, the quantization process can be resource-intensive and expensive, especially if you must search for good quantization hyperparameters. Depending on the algorithm used, the process might take days (e.g., with AQLM) or only a few hours (e.g., with GPTQ), and the GPU memory requirements can vary significantly. Without careful planning, quantization costs can easily exceed $100 for the largest models.

Related: Run a 7.7x Smaller Mixtral-8x7B on Your GPU with AQLM 2-bit Quantization (February 22, 2024)

In this article, I’ll demonstrate how to successfully quantize a 70B+ LLM (such as Llama 3.1 70B and Qwen2.5 72B) to both 4-bit and 2-bit precision for under $10. This recipe uses AutoRound, a fast, state-of-the-art, open-source quantization framework developed by Intel. AutoRound efficiently uses CPU RAM and requires minimal GPU memory, making it both cost-effective and accessible. The resulting models retain 99.4% and 88.3% of the original performance when quantized to 4-bit and 2-bit, respectively.
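
To give an idea of what such a run looks like, here is a minimal sketch of a 4-bit AutoRound quantization with the auto-round package. The model name, group size, number of calibration samples, and optimization steps below are illustrative assumptions, not the exact settings behind the results reported above; those are in the notebook.

```python
# Minimal sketch of 4-bit quantization with Intel's auto-round package.
# Hyperparameter values are illustrative, not the article's exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-70B-Instruct"  # or, e.g., Qwen/Qwen2.5-72B-Instruct

# Load the full-precision model into CPU RAM; AutoRound processes it block by block,
# so only a fraction of the model needs to sit in GPU memory at any time.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                 # use bits=2 (typically with a smaller group_size) for 2-bit
    group_size=128,
    sym=True,
    nsamples=128,           # calibration samples (illustrative)
    iters=200,              # optimization steps per block (illustrative)
    low_gpu_mem_usage=True, # trade speed for a smaller GPU memory footprint
)
autoround.quantize()

# Save the quantized checkpoint; "auto_gptq" export is also supported for 4-bit.
autoround.save_quantized("Llama-3.1-70B-Instruct-AutoRound-4bit", format="auto_round")
```

A model saved in the auto_round format can then be loaded with Transformers (with auto-round installed) like any other quantized checkpoint.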

You can try and reproduce this quantization, or apply it to your own models, directly in this notebook:

Get the notebook (#124)

CPU and GPU Memory Requirements for Large Model Quantization

This post is for paid subscribers
