The Kaitchup – AI on a Budget

Stop Paying for FP16 KV Cache

Near-zero quality drop, big wins for long sequences and high concurrency

Benjamin Marie
Jan 26, 2026

When a language model generates text, it has to keep “looking back” at everything it has already produced in order to decide what comes next. To avoid doing the same work over and over, it saves some intermediate results in a small running memory called the KV cache, usually stored on the GPU.

The catch is that this cache grows with every new token the model generates. The longer the response, the more GPU memory the cache consumes. As that memory fills up, you can handle fewer requests at the same time. And if you run out of space entirely, what happens next depends on your setup: the system may stop the generation early, fall back to a much slower mode, or simply crash.

This can become an issue surprisingly fast. Take an 8B model like Qwen3 8B (36 layers, GQA with 8 KV heads, and head_dim=128). With the KV cache stored in bf16/fp16 (2 bytes per value), the model is storing about:

  • ~144 KB of KV cache per generated token (K+V across all layers)

  • So ~10,000 tokens ≈ 1.5 GB of GPU memory per request, just for the KV cache

  • And 10 concurrent users ≈ 15 GB reserved for KV cache alone (before you even count model weights and other overhead)
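The per-token figure follows directly from the cache layout: two tensors (K and V) per layer, each holding num_kv_heads × head_dim values. Here is a quick sanity check in Python; the shapes are Qwen3 8B's from above, and the helper function is just for illustration:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes of K+V entries written to the cache for one generated token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V

# Qwen3 8B: 36 layers, GQA with 8 KV heads, head_dim=128, bf16/fp16 = 2 bytes
per_token = kv_cache_bytes_per_token(36, 8, 128, 2)
print(per_token / 1024)               # 144.0 -> ~144 KB per token
print(per_token * 10_000 / 1e9)       # ~1.47 GB for a 10,000-token request
print(per_token * 10_000 * 10 / 1e9)  # ~14.7 GB for 10 concurrent users
```

Multiply the same formula by your own layer count, KV head count, and head dimension to get the per-token cost for any GQA model.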


There’s also a performance hit. Many inference workloads are limited by memory traffic, not raw compute. As the KV cache grows, the GPU has more and more data to read each step, and throughput can drop noticeably. In practice, the speed you get is often tied to the GPU’s memory bandwidth, so bigger cache → more data movement → slower generation.
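To see why, a crude bandwidth-bound estimate helps: at low batch sizes, each decode step has to read the model weights plus the entire KV cache from GPU memory, so an upper bound on tokens/second is memory bandwidth divided by bytes moved per step. The weight size and bandwidth numbers below are illustrative assumptions, not measurements:

```python
def max_tokens_per_sec(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_sec):
    """Bandwidth-bound ceiling: one full read of weights + cache per decode step."""
    return bandwidth_bytes_per_sec / (weight_bytes + kv_cache_bytes)

bw = 1.0e12             # assume ~1 TB/s of usable GPU memory bandwidth
weights = 16e9          # assume ~16 GB of fp16 weights for an 8B model
per_token_kv = 147_456  # ~144 KB of KV cache per token (Qwen3 8B arithmetic)

for context_len in (1_000, 10_000, 100_000):
    kv = context_len * per_token_kv
    print(context_len, round(max_tokens_per_sec(weights, kv, bw), 1))
```

At short contexts the weights dominate, so the cache barely matters; by 100k tokens of context, the cache alone is roughly the size of the weights and the ceiling drops by about half.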

The good news: the tooling here is getting much better, and modern inference frameworks can mitigate these problems in a few ways.

In this article, we’ll focus on compressing (quantizing) the KV cache and measuring the trade-off between speed/memory savings and model quality. This approach can improve throughput, increase the number of concurrent users you can serve, and reduce overall inference cost.
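To make the concurrency claim concrete: quantizing cache entries from 2 bytes (fp16) down to 1 byte (e.g., an fp8 or int8 cache format) halves the per-token footprint, so a fixed KV-cache budget fits roughly twice as many concurrent requests. A sketch using the same Qwen3 8B shape as above; the 24 GB budget is an assumption for illustration:

```python
def max_concurrent_requests(budget_bytes, tokens_per_request,
                            n_layers=36, n_kv_heads=8, head_dim=128,
                            bytes_per_value=2):
    """How many requests of a given length fit in a fixed KV-cache budget."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return budget_bytes // (tokens_per_request * per_token)

budget = 24 * 10**9  # assume ~24 GB of GPU memory set aside for the KV cache
print(max_concurrent_requests(budget, 10_000, bytes_per_value=2))  # fp16 cache: 16
print(max_concurrent_requests(budget, 10_000, bytes_per_value=1))  # 8-bit cache: 32
```

The same budget serves twice as many 10k-token requests with an 8-bit cache; the rest of the article measures what that compression costs in quality.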
