Stop Paying for FP16 KV Cache
Near-zero quality drop, big wins for long sequences and high concurrency
When a language model generates text, it has to keep “looking back” at everything it has already produced in order to decide what comes next. To avoid redoing that work for every new token, it saves the attention layers’ intermediate results (the key and value tensors) in a running buffer called the KV cache, usually kept in GPU memory.
The catch is that this cache grows with every new token the model generates. The longer the response, the more GPU memory the cache consumes. As that memory fills up, you can handle fewer requests at the same time. And if you run out of space entirely, what happens next depends on your setup: the system may stop the generation early, fall back to a much slower mode, or simply crash.
This can become an issue surprisingly fast. Take an 8B model like Qwen3-8B (36 layers, GQA with 8 KV heads and head_dim=128). With the KV cache stored in bf16/fp16 (2 bytes per value), you’re saving about:
- ~144 KB of KV cache per generated token: 2 (K+V) × 36 layers × 8 KV heads × 128 dims × 2 bytes
- So ~10,000 tokens ≈ 1.5 GB of GPU memory per request, just for the KV cache
- And 10 concurrent users ≈ 15 GB reserved for KV cache alone (before you even count model weights and other overhead)
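The arithmetic above is easy to reproduce; here is a small back-of-the-envelope calculator using the Qwen3-8B numbers:

```python
# Back-of-the-envelope KV cache sizing for Qwen3-8B
# (36 layers, 8 KV heads, head_dim=128, 2-byte bf16/fp16 values).

def kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store kv_heads * head_dim values per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()
print(f"{per_token / 1024:.0f} KB per token")                     # 144 KB
print(f"{per_token * 10_000 / 1e9:.2f} GB per 10k-token request")  # ~1.47 GB
print(f"{per_token * 10_000 * 10 / 1e9:.1f} GB for 10 such requests")
```

Swap in your own model’s layer count, KV head count, and head dimension to see how quickly the cache adds up.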
There’s also a performance hit. Many inference workloads are limited by memory traffic, not raw compute. As the KV cache grows, the GPU has more and more data to read each step, and throughput can drop noticeably. In practice, the speed you get is often tied to the GPU’s memory bandwidth, so bigger cache → more data movement → slower generation.
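To make the bandwidth argument concrete, here is a rough ceiling estimate with illustrative numbers (a hypothetical 1 TB/s GPU and ~16 GB of bf16 weights for an 8B model); it is not a benchmark, just the observation that each decode step must stream the weights plus the whole cache through memory at least once:

```python
# Rough memory-bandwidth ceiling on decode speed (illustrative numbers,
# not a benchmark): each generated token must read the model weights plus
# the entire KV cache from GPU memory at least once.

BANDWIDTH = 1e12        # 1 TB/s, a hypothetical GPU
WEIGHTS = 16e9          # ~16 GB of bf16 weights for an 8B model (assumed)
KV_PER_TOKEN = 147_456  # bytes per cached token, from the Qwen3-8B numbers

def max_tokens_per_s(cache_tokens, batch=1):
    # Bytes that must move through memory for one decode step.
    bytes_per_step = WEIGHTS + batch * cache_tokens * KV_PER_TOKEN
    return BANDWIDTH / bytes_per_step

print(f"{max_tokens_per_s(1_000):.0f} tok/s ceiling at 1k cached tokens")
print(f"{max_tokens_per_s(100_000):.0f} tok/s ceiling at 100k cached tokens")
```

The model is crude (it ignores compute, kernel overheads, and caching effects), but it shows the trend: the ceiling drops as the cache grows.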
The good news: the tooling here is getting much better, and modern inference frameworks can mitigate these problems in a few ways.
In this article, we’ll focus on compressing (quantizing) the KV cache and measuring the trade-off between speed/memory savings and model quality. This approach can improve throughput, increase the number of concurrent users you can serve, and reduce overall inference cost.
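As a minimal sketch of the idea (illustrative only, not any particular framework’s implementation): quantizing the cache to int8 means storing each K/V tensor as int8 values plus one floating-point scale per head, and reconstructing approximate values on read. That halves (or quarters, versus fp32) the bytes per cached token at the cost of a small rounding error:

```python
# Minimal sketch of per-head symmetric int8 quantization for a KV tensor
# (illustrative, not a specific framework's implementation).
import numpy as np

def quantize(x):
    # One scale per head (last axis reduced), symmetric around zero.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((8, 128)).astype(np.float32)  # 8 KV heads x head_dim 128

q, scale = quantize(k)
k_hat = dequantize(q, scale)

print(f"memory: {k.nbytes} B -> {q.nbytes + scale.nbytes} B")
print(f"max abs error: {np.abs(k - k_hat).max():.4f}")
```

Real implementations differ in granularity (per-token, per-channel, per-group scales) and target formats (int8, fp8, int4), which is exactly the trade-off space we’ll measure.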


