The Kaitchup – AI on a Budget
Llama 4 with 10M Tokens: How Much Does It Cost and Is It Worth It?

A KV Cache Story

Benjamin Marie
Apr 08, 2025
Image generated with ChatGPT

Is local RAG dead?

Definitely not, but that’s the question coming up after Meta released Llama 4, which supports a context window of up to 10 million tokens. That’s large enough to hold entire databases.

Related: Llama 4 Scout, Maverick, and Behemoth: MoE, VLMs, and Very Long Context (Benjamin Marie, Apr 5)

If Llama 4 can actually retrieve and use information across such a long context, then the main purpose of RAG (selecting and injecting relevant information into a smaller context) might no longer be necessary. Instead of retrieving chunks, we could simply load everything into the context from the start.

In this article, we’ll break down what it actually takes to work with a 10M-token context. How much memory does it require? How many GPU nodes do you need? Yes, nodes, plural!

We’ll also cover some practical and easy-to-apply techniques to reduce memory usage. Along the way, we’ll look at KV cache memory estimates, compare theoretical vs. real-world requirements using vLLM, and examine the cost of running such large contexts on cloud platforms like RunPod. We’ll also briefly explore how Llama 4 enables extended context lengths, and whether this makes RAG obsolete or whether it remains the smarter choice for many tasks.

I’ve put together a notebook that estimates KV cache memory usage for a given context size, for Llama 4 and other models that use Grouped Query Attention (GQA).
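To make this concrete, here is a minimal sketch of the kind of estimate the notebook performs: the KV cache stores one K and one V tensor per layer, and GQA shrinks the head count to the number of KV heads. The Scout-like configuration values used below (48 layers, 8 KV heads, head dimension 128) are assumptions for illustration only; check the model’s config.json for the actual numbers.

```python
# Minimal sketch: KV cache memory for a GQA model at a given context length.
# The "Scout-like" numbers below are illustrative assumptions, not the
# official Llama 4 configuration.

def kv_cache_gib(context_len: int,
                 num_layers: int,
                 num_kv_heads: int,
                 head_dim: int,
                 bytes_per_elem: int = 2,   # FP16/BF16
                 batch_size: int = 1) -> float:
    """KV cache size in GiB: two tensors (K and V) per layer, each of shape
    [batch_size, num_kv_heads, context_len, head_dim]."""
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size
    return elems * bytes_per_elem / 1024**3

# Assumed Scout-like configuration (illustrative only)
layers, kv_heads, head_dim = 48, 8, 128

for ctx in (128_000, 1_000_000, 10_000_000):
    print(f"{ctx:>12,} tokens -> {kv_cache_gib(ctx, layers, kv_heads, head_dim):8.1f} GiB")
```

With these assumed values, a 10M-token context alone works out to roughly 1.8 TiB of KV cache in FP16, before even counting the model weights, which is why multiple GPU nodes come into play.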

Get the notebook (#156): Estimating the Maximum Sequence Length Your GPU Can Handle with Llama 4
