Estimating Memory Usage for LLMs During Inference (V2)
KV cache, GQA, FlashAttention, activations, batching...
Efficient memory management is essential for deploying large language models (LLMs), particularly when running them locally on hardware with limited resources. Understanding how LLMs consume memory helps you avoid out-of-memory errors and make better use of the hardware you have.
This article shows how to estimate the memory consumption of LLMs during inference under various conditions, including different batch sizes and sequence lengths. I’ll also explain how optimization techniques like GQA, FlashAttention, and KV caching affect memory usage. To illustrate, I’ll use Llama 3.3 70B as an example and estimate its memory footprint during inference. Additionally, I’ll introduce a new dedicated notebook that automates this estimation.
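As a rough preview of the kind of calculation the notebook automates, here is a minimal Python sketch of the two largest contributors for a decoder-only model during inference: the weights and the KV cache. It assumes FP16 (2 bytes per value) for both, and uses Llama 3.3 70B’s published configuration (80 layers, 8 KV heads thanks to GQA, head dimension 128); the function names and defaults are mine, for illustration only.

```python
# A rough preview of the estimate developed in this article, assuming FP16
# (2 bytes per value) for both the weights and the KV cache, and the published
# Llama 3.3 70B configuration: 80 layers, 8 KV heads (GQA), head dimension 128.
# Function names and defaults are illustrative, not taken from the notebook.

def weights_memory_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """Memory required just to load the model weights."""
    return num_params * bytes_per_param / 1e9


def kv_cache_memory_gb(batch_size: int, seq_len: int, num_layers: int = 80,
                       num_kv_heads: int = 8, head_dim: int = 128,
                       bytes_per_value: float = 2) -> float:
    """KV cache: 2 tensors (keys and values) per layer, one entry per cached token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * batch_size * seq_len * bytes_per_value) / 1e9


weights = weights_memory_gb(70.6e9)  # Llama 3.3 70B in FP16
kv_cache = kv_cache_memory_gb(batch_size=1, seq_len=8192)
print(f"Weights: {weights:.1f} GB, KV cache: {kv_cache:.2f} GB")
# Roughly 141 GB for the weights plus ~2.7 GB of KV cache for one 8K-token sequence.
```

The rest of the article breaks down where these numbers come from and how batch size, sequence length, and the optimizations above change them.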
You can access this notebook here: