KV Cache Quantization for Memory-Efficient Inference with LLMs
Process longer sequences with Llama 3
Quantization reduces the size of a large language model (LLM) by lowering the precision of its parameters, e.g., from 16-bit to 4-bit.
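For reference, this is how weight quantization is commonly set up with Transformers and bitsandbytes; the model ID and settings below are illustrative choices, not the only way to do it:

```python
# A minimal sketch of 4-bit weight quantization with Transformers + bitsandbytes.
# The model ID and quantization settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quant_config,
    device_map="auto",
)
```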
However, quantizing the model only reduces the memory consumption of its parameters. While this makes it possible to load an LLM on a smaller GPU, the inference process still demands additional memory, mainly to store:
The activations, i.e., tensors created during the forward pass
The KV cache
These two components can both consume much more memory than the model itself, depending on the inference hyperparameters and the model’s architecture. Notably, the size of the KV cache grows rapidly with longer contexts and deeper models. This is especially problematic for applications that process long contexts, such as RAG systems.
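To make this growth concrete, here is a back-of-the-envelope estimate of the KV cache size for a Llama 3 8B-like architecture (32 layers, 8 KV heads with grouped-query attention, head dimension 128). The 4-bit figures ignore quantization overheads such as scaling factors:

```python
# Rough KV cache size for a Llama 3 8B-like architecture.
def kv_cache_bytes(seq_len, batch_size=1, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # 2x for the key and value tensors stored in each layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

for seq_len in (2_048, 8_192, 32_768):
    fp16 = kv_cache_bytes(seq_len) / 1024**3                       # 16-bit cache
    int4 = kv_cache_bytes(seq_len, bytes_per_elem=0.5) / 1024**3   # 4-bit cache
    print(f"{seq_len:>6} tokens: {fp16:.2f} GiB (16-bit) vs {int4:.2f} GiB (4-bit)")
```

At 8,192 tokens the 16-bit cache already takes about 1 GiB per sequence, and it scales linearly with both context length and batch size.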
In this article, I explain what the KV cache is and how to quantize it for Llama 3. We will see that 4-bit quantization significantly reduces memory consumption during inference, especially when processing long contexts.
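As a preview, here is a minimal sketch of KV cache quantization with Transformers’ quantized cache and the quanto backend; it assumes a recent Transformers release with this feature and the optimum-quanto package installed, and the prompt and generation settings are illustrative:

```python
# A minimal sketch: generate with a 4-bit quantized KV cache (quanto backend).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Summarize the following document:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="quantized",                 # use a quantized KV cache
    cache_config={"backend": "quanto", "nbits": 4},   # 4-bit keys and values
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```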
The notebook applying KV cache quantization to Llama 3 is here: