A Comparison of 5 Quantization Methods for LLMs: GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound
8-bit, 4-bit, 3-bit, and 2-bit Qwen2.5 and Llama 3.x
Large language models (LLMs) range in size from 100 million to over 100 billion parameters. Models exceeding 10 billion parameters typically require high-end professional GPUs for efficient inference.
To reduce inference cost, we can quantize the model. Quantization is a technique that compresses LLMs, allowing them to fit on smaller GPUs by reducing the precision of the model parameters. This typically means converting 16-bit floating-point weights (FP16 or BF16) to low-bit representations, such as 8-bit or 4-bit. Over the past two years, significant advances in quantization techniques have made these low-precision conversions accurate enough to avoid degrading model performance on target tasks. As a result, quantization has become a compelling solution for reducing inference and fine-tuning costs, especially for larger models.
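To make the idea concrete, here is a minimal sketch of round-to-nearest (RTN) quantization of a weight tensor, the simplest form of the conversion described above. It is not any of the specific methods compared below (GPTQ, AWQ, and the others all add calibration or optimization on top of this); the symmetric scheme and group size of 128 are illustrative assumptions.

```python
import torch

def quantize_rtn(weights: torch.Tensor, bits: int = 4, group_size: int = 128):
    # Symmetric round-to-nearest quantization, applied per group of weights.
    # Each group stores low-bit integers plus one FP16 scale.
    qmax = 2 ** (bits - 1) - 1                              # e.g. 7 for 4-bit
    w = weights.reshape(-1, group_size)
    scales = (w.abs().max(dim=1, keepdim=True).values / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales.half()

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Reconstruct an approximation of the original weights.
    return (q.float() * scales.float()).half()

w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = quantize_rtn(w)
err = (dequantize(q, s).reshape(w.shape).float() - w.float()).abs().mean()
print(err)  # mean absolute quantization error
```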
For instance, with state-of-the-art quantization methods, we can quantize Qwen2.5 72B to 4-bit without any performance degradation in downstream tasks, reducing the model size from 140 GB to 40 GB.
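The size reduction follows directly from the bit width. Here is a rough back-of-the-envelope estimate; the ~10% overhead for scales, zero-points, and layers kept in higher precision is an assumption, and actual file sizes vary by method.

```python
def approx_size_gb(n_params_billion: float, bits: int, overhead: float = 0.0) -> float:
    # parameters * (bits / 8) bytes, optionally inflated for quantization metadata
    return n_params_billion * bits / 8 * (1 + overhead)

print(approx_size_gb(72, 16))        # ~144 GB: Qwen2.5 72B in BF16
print(approx_size_gb(72, 4, 0.10))   # ~40 GB: the same model at 4-bit
```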
However, selecting the best quantization method for a given model size, architecture, and data type remains challenging. Different methods have varying trade-offs:
Some are highly accurate but computationally expensive.
Others are faster and more cost-efficient but less stable.
Some strike a good balance between speed, stability, and performance but may underperform in certain scenarios.
In Chapter 3 of The Kaitchup’s Book: LLMs on a Budget, I will explore these aspects in depth, covering popular quantization methods, including GPTQ, AWQ, GGUF, bitsandbytes, HQQ, AutoRound, and VPTQ. This chapter will be published this month, and you can get the book here:
In this article, I present some of the results I compiled for this chapter. I compared quantization algorithms applied to Llama 3.x and Qwen2.5 models. Specifically, I evaluated GPTQ, AWQ, bitsandbytes, HQQ, and AutoRound for 8-bit, 4-bit, 3-bit, and 2-bit quantization across the following instruction-tuned models:
Llama 3.2 3B
Llama 3.1 8B
Llama 3.3 70B
Qwen2.5 1.5B
Qwen2.5 7B
Qwen2.5 14B
Qwen2.5 32B
Qwen2.5 72B
All quantization methods discussed in this article can quantize a 70B model in under 12 hours using a single GPU.
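For readers who want to reproduce a baseline quickly, the sketch below shows on-the-fly 4-bit loading with bitsandbytes through Transformers. This is the simplest of the five methods to run, since it needs no calibration data; the model ID and NF4 settings are illustrative choices, not the exact configuration used for the benchmarks in this article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any of the models listed above works

# NF4 quantization with BF16 compute, the usual bitsandbytes 4-bit recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The other methods compared here (GPTQ, AWQ, HQQ, and AutoRound) instead produce a quantized checkpoint ahead of time, which is where the hours-long quantization runs for a 70B model come from.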