Make LLMs Faster and Lighter with W8A8 Quantization

Efficient Weight and Activation Quantization with llm-compressor

Benjamin Marie
Apr 21, 2025

Image generated with ChatGPT

Quantization is one of the most widely used techniques to reduce the size of large language models (LLMs). It works by lowering the precision of the model’s weights, typically from 16-bit down to 8-bit, 4-bit, or even lower. For example, a 70B model like Llama 3.3 requires around 141 GB in 16-bit precision, but this can be reduced to just 37 GB when quantized to 4-bit using ExLlamaV3.
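
To see where these numbers come from, here is a quick back-of-the-envelope estimate. It only counts weight storage and ignores quantization overhead such as scales, zero-points, and layers left in higher precision, which is why the real 4-bit checkpoint is a bit larger than the raw estimate:

```python
# Back-of-the-envelope weight memory for Llama 3.3 70B (~70.6B parameters).
# Quantization overhead (scales, zero-points, unquantized layers) is ignored,
# which is why the actual 4-bit checkpoint is closer to 37 GB than 35 GB.
num_params = 70.6e9

def weight_memory_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9

print(f"16-bit: {weight_memory_gb(16):.0f} GB")  # ~141 GB
print(f" 8-bit: {weight_memory_gb(8):.0f} GB")   # ~71 GB
print(f" 4-bit: {weight_memory_gb(4):.0f} GB")   # ~35 GB (+ overhead ≈ 37 GB)
```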

Related: Run Llama 3.3 70B on Your GPU with ExLlamaV3 (Benjamin Marie, Apr 17)

A variety of quantization methods exist, each with its own strengths and trade-offs, which I explore in detail in Chapter 3 of my book LLMs on a Budget.

However, most quantization techniques focus solely on the model’s weights. During inference, many intermediate tensors, commonly referred to as activations, are generated. The size of these activations largely depends on the sequence length and the model’s hidden size, and they are typically left in full precision. As a result, in scenarios involving quantized weights and long input sequences, activation memory usage can actually surpass that of the model weights!
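
To get a feel for the scale, here is a rough, illustrative estimate of the activation memory produced by a single transformer layer during prefill. The shapes are loosely based on Llama 3.3 70B, and the actual footprint depends on the inference engine, the KV cache, and how many intermediate tensors are kept alive at the same time:

```python
# Rough activation memory for ONE transformer layer during prefill, assuming
# activations are kept in 16-bit (2 bytes per value). Shapes are loosely based
# on Llama 3.3 70B (hidden_size=8192, intermediate_size=28672); real usage also
# depends on the KV cache and on the inference engine's memory management.
def activation_memory_gb(batch_size: int, seq_len: int,
                         hidden_size: int = 8192,
                         intermediate_size: int = 28672,
                         bytes_per_value: int = 2) -> float:
    hidden = batch_size * seq_len * hidden_size        # hidden states
    mlp = batch_size * seq_len * intermediate_size     # MLP intermediate tensor
    return (hidden + mlp) * bytes_per_value / 1e9

# With a long prompt, a single layer's activations already reach several GB:
print(f"{activation_memory_gb(batch_size=4, seq_len=32768):.1f} GB per layer")
```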

Quantizing activations is a more complex task due to their dynamic nature. Unlike weights, which remain constant during inference, activation values vary significantly depending on the input data.


In this article, we’ll explore how to quantize activations effectively to further reduce memory consumption during inference. We’ll evaluate how activation quantization impacts memory usage, inference speed, and model accuracy. Our focus will be on llm-compressor and SmoothQuant.
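
As a preview, here is a minimal sketch of a W8A8 recipe adapted from llm-compressor's published INT8 example: SmoothQuant first shifts the quantization difficulty from activations to weights, then GPTQ quantizes the weights. Import paths, dataset names, and defaults may differ depending on the llm-compressor version you install; the accompanying notebook covers the complete workflow.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot  # newer versions: from llmcompressor import oneshot

# SmoothQuant rescales activations so they are easier to quantize, pushing the
# difficulty into the weights; GPTQ then quantizes the weights.
# scheme="W8A8" produces INT8 weights and INT8 activations; lm_head stays in
# higher precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",  # any Hugging Face causal LM
    dataset="open_platypus",                    # calibration data for activation scales
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Llama-3.3-70B-Instruct-W8A8",
)
```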

You’ll also find a hands-on tutorial in the accompanying notebook, demonstrating how to apply activation quantization to LLMs in practice:

Get the notebook (#159)
