The Kaitchup – AI on a Budget

Estimating Memory Usage for LLMs During Inference (V2)

KV cache, GQA, FlashAttention, activations, batching...

Benjamin Marie
Jan 20, 2025

Efficient memory management is essential for the optimal deployment of large language models (LLMs), particularly when running models locally on hardware with limited resources. By understanding how LLMs consume memory, you can ensure smoother performance and better resource utilization.

This article shows how to estimate the memory consumption of LLMs under various conditions, including different batch sizes and sequence lengths. I’ll also explain how GQA, FlashAttention, and KV caching affect memory usage. To illustrate, I’ll use Llama 3.3 70B as an example, estimating its memory footprint during inference. Additionally, I’ll introduce a new dedicated notebook designed to automate this estimation process.

You can access this notebook here:

Get the notebook (#137)
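
As a preview of the kind of calculation the notebook automates, here is a minimal sketch, assuming Llama 3.3 70B’s published configuration (80 transformer layers, 8 KV heads with GQA, a head dimension of 128) and 16-bit weights and KV cache. The function name and the default batch size and sequence length are illustrative choices, not taken from the notebook.

```python
# Rough estimate of inference memory for Llama 3.3 70B: weights + KV cache.
# Architecture values below are assumptions based on the model's published
# configuration; batch_size and seq_len defaults are illustrative.

def estimate_inference_memory(
    num_params: float = 70e9,       # total parameters
    num_layers: int = 80,           # transformer blocks
    num_kv_heads: int = 8,          # KV heads (GQA), not the 64 query heads
    head_dim: int = 128,            # dimension per attention head
    batch_size: int = 1,
    seq_len: int = 8192,            # tokens held in the KV cache
    bytes_per_weight: int = 2,      # FP16/BF16 weights
    bytes_per_cache_elem: int = 2,  # FP16/BF16 KV cache
) -> dict:
    # Weights: one stored value per parameter.
    weights_bytes = num_params * bytes_per_weight

    # KV cache: 2 tensors (K and V) per layer, each of shape
    # [batch_size, num_kv_heads, seq_len, head_dim].
    kv_cache_bytes = (
        2 * num_layers * num_kv_heads * head_dim
        * batch_size * seq_len * bytes_per_cache_elem
    )

    gib = 1024 ** 3
    return {
        "weights_gib": weights_bytes / gib,
        "kv_cache_gib": kv_cache_bytes / gib,
        "total_gib": (weights_bytes + kv_cache_bytes) / gib,
    }


if __name__ == "__main__":
    # Example: batch size 4 with 8K-token sequences.
    for name, gib in estimate_inference_memory(batch_size=4).items():
        print(f"{name}: {gib:.1f} GiB")
```

This sketch deliberately ignores activation memory, the attention implementation (e.g., FlashAttention), and framework overhead; those refinements are what the rest of the article and the notebook cover.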
