Is local RAG dead?
Definitely not, but that’s the question many are asking after Meta released Llama 4, which supports a context window of up to 10 million tokens. That’s large enough to hold entire databases.
If Llama 4 can actually retrieve and use information across such a long context, then the main purpose of RAG (selecting and injecting only the relevant information into a smaller context) might no longer be necessary. Instead of retrieving chunks, we could simply load everything into the context from the start.
In this article, we’ll break down what it actually takes to work with a 10M-token context. How much memory does it require? How many GPU nodes do you need? Yes, you need multiple nodes!
We’ll also cover some practical, easy-to-apply techniques for reducing memory usage. Along the way, we’ll look at KV cache memory estimates, compare theoretical vs. real-world requirements using vLLM, and examine the cost of running such large contexts on cloud platforms like RunPod. We’ll also briefly explore how Llama 4 enables extended context lengths, and whether this makes RAG obsolete or whether it remains the smarter choice for many tasks.
I’ve put together a notebook that estimates KV cache memory usage for a given context size, for Llama 4 and other models that use Grouped Query Attention (GQA).
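To give a sense of what such a notebook computes, here is a minimal sketch of the standard KV cache size formula for a GQA model: keys and values are each stored once per layer and per KV head, so memory grows linearly with sequence length. The `num_layers`, `num_kv_heads`, and `head_dim` values below are illustrative assumptions in the spirit of Llama 4 Scout; the real values should be read from the model’s `config.json`.

```python
# Sketch: KV cache memory estimate for a GQA model (not the notebook's exact code).

def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # FP16/BF16 cache
) -> int:
    # Factor of 2: one tensor for keys and one for values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Example: a 10M-token context with assumed GQA dimensions (check config.json).
gib = kv_cache_bytes(
    seq_len=10_000_000,
    num_layers=48,
    num_kv_heads=8,
    head_dim=128,
) / 1024**3
print(f"Estimated KV cache: {gib:.0f} GiB")
```

Even with GQA shrinking the cache by the ratio of query heads to KV heads, a 10M-token cache in 16-bit precision runs into the terabyte range, which is why multiple GPU nodes come into play.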
Estimating the Maximum Sequence Length Your GPU Can Handle with Llama 4
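Before diving in, here is the inverse question in sketch form: given the memory left on a GPU after loading the weights, how long a sequence can the KV cache accommodate? The memory budget and model dimensions below are hypothetical placeholders, not measured values.

```python
# Sketch, assuming the same GQA cache formula as above.

def max_seq_len(
    free_memory_gib: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # FP16/BF16 cache
) -> int:
    free_bytes = free_memory_gib * 1024**3
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * batch_size * bytes_per_element
    return int(free_bytes // per_token_bytes)

# Hypothetical budget: ~30 GiB left on an 80 GB GPU after loading quantized weights.
print(max_seq_len(30, num_layers=48, num_kv_heads=8, head_dim=128))
```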