A Comparison of 5 Quantization Methods for LLMs: GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound

8-bit, 4-bit, 3-bit, and 2-bit Qwen2.5 and Llama 3.x

Benjamin Marie
Mar 10, 2025

Large language models (LLMs) range in size from 100 million to over 100 billion parameters. Models exceeding 10 billion parameters typically require high-end professional GPUs for efficient inference.

To reduce inference costs, we can quantize the model. Quantization compresses LLMs by reducing the precision of their parameters, allowing them to fit on smaller GPUs. It typically converts 16-bit floating-point weights (FP16 or BF16) to low-bit representations, such as 8-bit or 4-bit. Over the past two years, advances in quantization techniques have made these low-bit conversions accurate enough to preserve model performance on target tasks. As a result, quantization has become a compelling way to reduce inference and fine-tuning costs, especially for larger models.
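For example, here is a minimal sketch of on-the-fly 4-bit quantization with bitsandbytes through its Transformers integration (the model name is only a placeholder; any causal LM from the Hub works the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

# Store weights in 4-bit NF4, run computations in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # dispatch layers to the available GPU(s)
)
```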

For instance, with state-of-the-art quantization methods, we can quantize Qwen2.5 72B to 4-bit without any performance degradation in downstream tasks, reducing the model size from 140 GB to 40 GB.
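The size reduction follows directly from the bit width. Here is a back-of-envelope estimate (it ignores the embedding layers, which are often kept in higher precision, and the per-group scales and zero-points that push the real 4-bit footprint closer to the 40 GB figure above):

```python
params = 72e9  # approximate parameter count of Qwen2.5 72B

bf16_gb = params * 2 / 1e9    # 16 bits = 2 bytes per parameter -> ~144 GB
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter -> ~36 GB

print(f"BF16: ~{bf16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")
```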

However, selecting the best quantization method for a given model size, architecture, and data type remains challenging. Different methods have varying trade-offs:

  • Some are highly accurate but computationally expensive.

  • Others are faster and more cost-efficient but less stable.

  • Some strike a good balance between speed, stability, and performance but may underperform in certain scenarios.

In Chapter 3 of The Kaitchup’s Book: LLMs on a Budget, I will explore these aspects in depth, covering popular quantization methods, including GPTQ, AWQ, GGUF, bitsandbytes, HQQ, AutoRound, and VPTQ. This chapter will be published this month, and you can get the book here:

Get the Book

In this article, I present some of the results I compiled for this chapter. I compared quantization algorithms applied to Llama 3.x and Qwen2.5 models. Specifically, I evaluated GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound for 8-bit, 4-bit, 3-bit, and 2-bit quantization across the following instruction-tuned models:

  • Llama 3.2 3B

  • Llama 3.1 8B

  • Llama 3.3 70B

  • Qwen2.5 1.5B

  • Qwen2.5 7B

  • Qwen2.5 14B

  • Qwen2.5 32B

  • Qwen2.5 72B

All quantization methods discussed in this article can quantize a 70B model in under 12 hours using a single GPU.
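To give a sense of what such a run involves, here is a minimal sketch using AutoRound, one of the methods compared in this article. The argument names follow the auto-round package's README and may change between versions; the model is just an example from the list above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # example model from the list above
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit symmetric quantization with a group size of 128
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./Qwen2.5-7B-Instruct-4bit", format="auto_round")
```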
