KV Cache Quantization for Memory-Efficient Inference with LLMs
Process longer sequences with Llama 3
Quantization reduces the size of a large language model (LLM) by lowering the precision of its parameters, e.g., from 16-bit to 4-bit.
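For reference, this is how weight quantization is commonly set up with Transformers and bitsandbytes; the model ID and settings below are illustrative choices, not the only way to do it:

```python
# A minimal sketch of 4-bit weight quantization with Transformers + bitsandbytes.
# The model ID and quantization settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quant_config,
    device_map="auto",
)
```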
However, quantizing the model only reduces the memory consumption of its parameters. While this makes it possible to load an LLM on a smaller GPU, the inference process still demands additional memory, mainly to store:
The activations, i.e., tensors created during the forward pass
The KV cache
These two components can both consume much more memory than the model itself, depending on the inference hyperparameters and the model’s architecture. Notably, the size of the KV cache grows rapidly with longer contexts and deeper models. This is especially problematic for applications that process long contexts, such as RAG systems.
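To make this growth concrete, here is a back-of-the-envelope estimate of the KV cache size for a Llama 3 8B-like architecture (32 layers, 8 KV heads with grouped-query attention, head dimension 128). The 4-bit figures ignore quantization overheads such as scaling factors:

```python
# Rough KV cache size for a Llama 3 8B-like architecture.
def kv_cache_bytes(seq_len, batch_size=1, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for the key and value tensors stored in each layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

for seq_len in (2_048, 8_192, 32_768):
    fp16 = kv_cache_bytes(seq_len) / 1024**3                       # 16-bit cache
    int4 = kv_cache_bytes(seq_len, bytes_per_elem=0.5) / 1024**3   # 4-bit cache
    print(f"{seq_len:>6} tokens: {fp16:.2f} GiB (16-bit) vs {int4:.2f} GiB (4-bit)")
```

At 8,192 tokens the 16-bit cache already takes about 1 GiB per sequence, and it scales linearly with both context length and batch size.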
In this article, I explain what the KV cache is and how to quantize it for Llama 3. We will see that 4-bit quantization significantly reduces memory consumption during inference, especially when processing long contexts.
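As a preview, here is a minimal sketch of KV cache quantization with Transformers’ quantized cache and the quanto backend; it assumes a recent Transformers release with this feature and the optimum-quanto package installed, and the prompt and generation settings are illustrative:

```python
# A minimal sketch: generate with a 4-bit quantized KV cache (quanto backend).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Summarize the following document:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="quantized",                 # use a quantized KV cache
    cache_config={"backend": "quanto", "nbits": 4},   # 4-bit keys and values
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```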
The notebook applying KV cache quantization to Llama 3 is here: