Is local RAG dead?
Definitely not, but that’s the question many are asking after Meta released Llama 4, which supports a context window of up to 10 million tokens. That’s large enough to hold entire databases.
If Llama 4 can actually retrieve and use information across such a long context, then the main purpose of RAG (selecting and injecting only the relevant information into a smaller context) might no longer be necessary. Instead of retrieving chunks, we could simply load everything into the context from the start.
In this article, we’ll break down what it actually takes to work with a 10M-token context. How much memory does it require? How many GPU nodes do you need? Yes, you need multiple nodes!
We’ll also cover some practical, easy-to-apply techniques for reducing memory usage. Along the way, we’ll look at KV cache memory estimates, compare theoretical vs. real-world requirements using vLLM, and examine the cost of running such large contexts on cloud platforms like RunPod. We’ll also briefly explore how Llama 4 enables extended context lengths, and whether this makes RAG obsolete or whether it remains the smarter choice for many tasks.
I’ve put together a notebook that estimates KV cache memory usage for a given context size, for Llama 4 and other models that use Grouped Query Attention (GQA).
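To give a sense of what such a notebook computes, here is a minimal sketch of the standard KV cache size formula for a GQA model: keys and values are each stored once per layer and per KV head, so memory grows linearly with sequence length. The `num_layers`, `num_kv_heads`, and `head_dim` values below are illustrative assumptions in the spirit of Llama 4 Scout; the real values should be read from the model’s `config.json`.

```python
# Sketch: KV cache memory estimate for a GQA model (not the notebook's exact code).

def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # FP16/BF16 cache
) -> int:
    # Factor of 2: one tensor for keys and one for values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Example: a 10M-token context with assumed GQA dimensions (check config.json).
gib = kv_cache_bytes(
    seq_len=10_000_000,
    num_layers=48,
    num_kv_heads=8,
    head_dim=128,
) / 1024**3
print(f"Estimated KV cache: {gib:.0f} GiB")
```

Even with GQA shrinking the cache by the ratio of query heads to KV heads, a 10M-token cache in 16-bit precision runs into the terabyte range, which is why multiple GPU nodes come into play.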
Estimating the Maximum Sequence Length Your GPU Can Handle with Llama 4
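Before diving in, here is the inverse question in sketch form: given the memory left on a GPU after loading the weights, how long a sequence can the KV cache accommodate? The memory budget and model dimensions below are hypothetical placeholders, not measured values.

```python
# Sketch, assuming the same GQA cache formula as above.

def max_seq_len(
    free_memory_gib: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # FP16/BF16 cache
) -> int:
    free_bytes = free_memory_gib * 1024**3
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * batch_size * bytes_per_element
    return int(free_bytes // per_token_bytes)

# Hypothetical budget: ~30 GiB left on an 80 GB GPU after loading quantized weights.
print(max_seq_len(30, num_layers=48, num_kv_heads=8, head_dim=128))
```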