Make LLMs Faster and Lighter with W8A8 Quantization
Efficient Weight and Activation Quantization with llm-compressor
Quantization is one of the most widely used techniques to reduce the size of large language models (LLMs). It works by lowering the precision of the model’s weights, typically from 16-bit down to 8-bit, 4-bit, or even lower. For example, a 70B model like Llama 3.3 requires around 141 GB in 16-bit precision, but this can be reduced to just 37 GB when quantized to 4-bit using ExLlamaV3.
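Those numbers are easy to sanity-check with a back-of-envelope calculation: weight memory is roughly the parameter count multiplied by the bits per parameter. The small helper below is purely illustrative (Llama 3.3 70B has about 70.6 billion parameters, and 4-bit formats add some overhead for scales and other metadata):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Back-of-envelope weight footprint: parameter count times bits per parameter, in GB."""
    return num_params * bits_per_param / 8 / 1e9

# Llama 3.3 70B has roughly 70.6 billion parameters
print(f"{weight_memory_gb(70.6e9, 16):.0f} GB")  # ~141 GB in 16-bit precision
print(f"{weight_memory_gb(70.6e9, 4):.0f} GB")   # ~35 GB in 4-bit, before quantization metadata
```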
A variety of quantization methods exist, each with its own strengths and trade-offs, which I explore in detail in Chapter 3 of my book LLMs on a Budget.
However, most quantization techniques focus solely on the model’s weights. During inference, the model also produces many intermediate tensors, commonly referred to as activations. Their size depends largely on the sequence length and the model’s hidden size, and they are typically left unquantized, in 16-bit or full precision. As a result, with quantized weights and long input sequences, activation memory can actually surpass the memory taken by the model weights!
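To get a feel for the numbers, here is a rough, purely illustrative estimate of how one major source of activation memory, the KV cache, grows with context length. The shapes below loosely follow a Llama-3.3-70B-style architecture (80 layers, 8 grouped-query KV heads with a head dimension of 128); real inference also allocates per-layer hidden states and attention buffers on top of this:

```python
def kv_cache_gb(seq_len: int, batch_size: int = 1, num_layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Memory for the keys and values cached at every layer for every token, in GB."""
    return 2 * num_layers * batch_size * seq_len * kv_heads * head_dim * bytes_per_value / 1e9

print(f"{kv_cache_gb(8_000):.1f} GB")    # ~2.6 GB at an 8k context
print(f"{kv_cache_gb(128_000):.1f} GB")  # ~42 GB at a 128k context, more than the 4-bit weights
```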
Quantizing activations is a more complex task due to their dynamic nature. Unlike weights, which remain constant during inference, activation values vary significantly depending on the input data.
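A tiny sketch makes the problem concrete: with symmetric int8 quantization, the scale is derived from the largest observed magnitude, and that magnitude can change dramatically from one input to the next (the tensors and values below are made up for illustration):

```python
import torch

def int8_scale(x: torch.Tensor) -> float:
    """Symmetric per-tensor scale mapping the observed max magnitude onto the int8 range."""
    return x.abs().max().item() / 127

# The same layer can see very different activation ranges for different inputs,
# so a single scale calibrated offline may clip outliers or waste precision.
print(int8_scale(torch.randn(1, 4096)))        # e.g. ~0.03
print(int8_scale(100 * torch.randn(1, 4096)))  # e.g. ~3, two orders of magnitude larger
```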
In this article, we’ll explore how to quantize activations effectively to further reduce the memory footprint of inference, and we’ll evaluate how activation quantization impacts memory usage, inference speed, and model accuracy. Our focus will be on llm-compressor and SmoothQuant.
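As a preview of the workflow covered in the tutorial, llm-compressor expresses this as a one-shot recipe that chains a SmoothQuant pass with W8A8 quantization. The sketch below follows the pattern from the library’s examples; the model and calibration dataset are placeholders, and exact import paths and argument names may vary between versions:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuant shifts activation outliers into the weights, then GPTQ applies
# 8-bit weight and activation (W8A8) quantization to the Linear layers.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model
    dataset="open_platypus",                   # placeholder calibration dataset
    recipe=recipe,
    output_dir="Llama-3.2-1B-Instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The resulting checkpoint is saved in the compressed-tensors format and can be loaded directly by vLLM, which runs INT8 kernels for both the weights and the activations.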
You’ll also find a hands-on tutorial in the accompanying notebook, demonstrating how to apply activation quantization to LLMs in practice: