The Kaitchup – AI on a Budget

KV Cache Quantization for Memory-Efficient Inference with LLMs

Process longer sequences with Llama 3

Benjamin Marie
Jun 17, 2024

Quantization reduces the size of a large language model (LLM) by lowering the precision of its parameters, e.g., from 16-bit to 4-bit.

However, quantizing the model only reduces the memory consumed by its parameters. A quantized LLM can be loaded on a smaller GPU, but inference still requires additional memory, primarily to store:

  • The activations, i.e., tensors created during the forward pass

  • The KV cache

Depending on the inference hyperparameters and the model’s architecture, each of these two components can consume much more memory than the model itself. Notably, the size of the KV cache grows linearly with the context length, the batch size, and the number of layers. This is especially an issue for applications processing long contexts, such as RAG systems.
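To make this growth concrete, here is a rough back-of-the-envelope sketch of the KV cache size. It assumes the publicly documented Llama 3 8B configuration (32 layers, 8 key-value heads, head dimension of 128), and the 4-bit figure ignores the small overhead of quantization scales and zero-points. The function name `kv_cache_bytes` is only illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bits=16):
    """Approximate KV cache size in bytes.

    Each layer stores two tensors (keys and values), each of shape
    [batch_size, num_kv_heads, seq_len, head_dim].
    """
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits // 8


# Assumed architecture values for Llama 3 8B
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

fp16 = kv_cache_bytes(**LLAMA3_8B, seq_len=8192)          # 16-bit cache
int4 = kv_cache_bytes(**LLAMA3_8B, seq_len=8192, bits=4)  # 4-bit cache
print(fp16 / GIB, int4 / GIB)  # 1.0 0.25
```

With a batch of 8 sequences of 8,192 tokens, the same 16-bit cache would need about 8 GiB, which is more than the roughly 4 GB occupied by the 4-bit weights of an 8B-parameter model.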

Related article: RAG for Mistral 7B Instruct with LlamaIndex and Transformers (Benjamin Marie, March 25, 2024)


In this article, I explain what the KV cache is and how to quantize it for Llama 3. We will see that 4-bit quantization significantly reduces memory consumption during inference, especially when processing long contexts.
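As a rough intuition for what 4-bit quantization does to cached values, the sketch below applies a generic min-max (affine) quantization to a small group of floats. This is an illustrative assumption, not necessarily the scheme used in the notebook: real implementations work on tensors, quantize them group by group, and pack two 4-bit codes per byte. The helper names are hypothetical.

```python
def quantize_4bit(values):
    """Map a group of floats to integer codes in [0, 15] (min-max affine quantization)."""
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels, hence 15 intervals; fall back to 1.0 for constant groups
    scale = (hi - lo) / 15 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


group = [-1.5, -0.25, 0.0, 0.4, 1.5]  # toy stand-in for a slice of cached keys
codes, scale, zero = quantize_4bit(group)
restored = dequantize_4bit(codes, scale, zero)

# The reconstruction error per value is at most about half a quantization step
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, max_error)
```

Each value now needs only 4 bits plus a shared scale and offset per group, which is where the memory saving comes from, at the cost of a bounded rounding error.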

The notebook applying KV cache quantization to Llama 3 is here:

Get the notebook (#79)
