TurboQuant: Finally, Fast and Widely Available Low-Bit KV Cache Quantization?
The Weekly Kaitchup #135
Hi everyone,
In this edition of The Weekly Kaitchup, let’s discuss TurboQuant, the KV cache quantization method proposed by Google and New York University.
I’ve already discussed it here: There are many quantization methods that promise major reductions in KV-cache memory usage and, in some cases, faster inference.
Yet most inference engines still do not implement them and instead rely on limited, fairly naive quantization schemes. In vLLM, the main branch only supports 8-bit KV-cache quantization (FP8 and INT8). llama.cpp supports several GGML KV-cache quantization types, mostly block-based, down to 4-bit, but none of them implement KV-specific techniques such as explicit outlier-token tracking or context-sensitive token handling. In 2026, that feels underwhelming: the literature already contains well over 100 methods that can deliver better trade-offs in speed, memory efficiency, and accuracy.
The core problem is that integrating state-of-the-art quantization methods into highly optimized inference engines is a difficult engineering challenge. It has to be done without hurting inference speed, while still supporting a wide variety of LLM architectures, GPUs, and backends.
If we want better quantization to make its way into these engines, we need real momentum behind it. Typically, that means a major lab publishing a new simple method with clear benchmark results and clear practical use cases, followed by a few influential voices in the AI community amplifying it.
I think that is exactly what may be happening now with TurboQuant, a paper that was published almost a year ago, and recently promoted by Google with an easy-to-understand blog post.
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
To be presented at ICLR 2026 (Rio de Janeiro, next month).
TurboQuant is a training-free way to turn each key/value vector from a chunk of FP16 numbers into a much smaller code, while trying to preserve the geometry that attention cares about. Its core trick is to first apply a fixed random rotation that makes the vector’s coordinates look much more uniform and predictable, then quantize those rotated coordinates with a precomputed scalar codebook.
Google says it can push cache precision into the 2.5–3.5 bit range without badly hurting long-context behavior.
How TurboQuant works for KV-cache quantization
KV cache grows linearly with sequence length, so during decoding it becomes both a memory problem and a bandwidth problem. A useful and fast KV-cache quantizer therefore has to work online, without calibration or expensive per-vector optimization.
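To make the memory and bandwidth pressure concrete, here is a quick back-of-the-envelope calculation, assuming Llama 3.1 8B-like dimensions (32 layers, 8 KV heads with GQA, head dimension 128) and an FP16 cache:

```python
# Rough KV-cache footprint, assuming Llama 3.1 8B-like dimensions.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # FP16

# One token stores a key and a value vector per layer and per KV head.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token // 1024, "KiB per token")        # 128 KiB

# At a 128k-token context, the cache alone dominates memory...
print(f"{bytes_per_token * 128_000 / 1e9:.1f} GB")     # ~16.8 GB
# ...and every decode step has to stream all of it through the GPU,
# which is why lower-bit caches also mean faster attention.
```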
The most practical part of TurboQuant for this use case is TurboQuant_mse.
For each K or V vector, TurboQuant applies a fixed random orthogonal rotation, then quantizes each rotated coordinate to the nearest value in a small precomputed codebook. To reconstruct the vector, it looks up those codebook values and applies the inverse rotation. The rotation itself does not distort the vector, since it is orthogonal. It simply changes the basis so that basic scalar quantization works much better.
So the core algorithm is very simple: generate one rotation matrix, precompute the centroids, rotate the vector, store centroid indices, then reconstruct with a centroid lookup and inverse rotation.
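Here is a minimal NumPy sketch of that loop, my own illustration rather than the authors' code. The codebook below uses standard 2-bit Lloyd-Max centroids for a Gaussian coordinate; the paper derives its own distortion-optimal codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension

# One fixed random orthogonal rotation, generated once and shared by all KV vectors.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Precomputed scalar codebook (2-bit Lloyd-Max centroids for a standard normal).
codebook = np.array([-1.510, -0.453, 0.453, 1.510])

def quantize(v):
    """One key/value vector -> (norm, per-coordinate centroid indices)."""
    norm = np.linalg.norm(v)
    r = (Q @ (v / norm)) * np.sqrt(d)     # rotated coordinates, roughly N(0, 1)
    idx = np.abs(r[:, None] - codebook).argmin(axis=1)
    return norm, idx.astype(np.uint8)     # indices get bit-packed in a real cache

def dequantize(norm, idx):
    """Centroid lookup, undo the scaling, apply the inverse (transposed) rotation."""
    return norm * (Q.T @ (codebook[idx] / np.sqrt(d)))

v = rng.standard_normal(d)
norm, idx = quantize(v)
print(np.linalg.norm(v - dequantize(norm, idx)) / np.linalg.norm(v))  # relative error
```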
Why does this help?
After rotation, high-dimensional unit vectors have coordinates that are much more uniform and close to independent. In practice, for typical attention head sizes, the coordinates are tightly concentrated around zero and behave roughly like a Gaussian. That means a single fixed scalar codebook can work well across many KV vectors, without calibration.
In effect, TurboQuant makes hard-to-quantize KV vectors look statistically regular enough that a cheap fixed quantizer works nearly optimally.
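A quick experiment shows why the rotation matters. Take a deliberately spiky vector whose energy sits in a handful of coordinates, rotate it, and look at how that energy is distributed afterwards (again, my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# A hard-to-quantize vector: four coordinates carry almost all of the energy.
v = rng.standard_normal(d)
v[:4] += 25.0

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # fixed random rotation

def top4_energy(x):
    """Share of the squared norm carried by the 4 largest coordinates."""
    return np.sort(x ** 2)[-4:].sum() / (x ** 2).sum()

print(f"before rotation: {top4_energy(v):.2f}")      # ~0.95
print(f"after  rotation: {top4_energy(Q @ v):.2f}")  # far smaller: energy is spread out
# The norm is unchanged, but no single coordinate dominates anymore, so one
# fixed scalar codebook (scaled by the stored norm) covers every coordinate.
```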
In a KV cache, this means each vector can be stored as a norm plus packed low-bit indices, rather than as a dense FP16 vector.
For the paper’s generation experiments, the authors also use an outlier-aware mixed-precision scheme across channels. Their 2.5-bit format is not a literal 2.5-bit scalar type: outlier channels simply get more bits than the rest. One example in the paper is 32 outlier channels at 3 bits and 96 regular channels at 2 bits for a head dimension of 128, which works out to 2.25 bits per channel for the codes alone, before the per-vector norm (and the optional residual sketch described next) is counted. They apply the same idea again for 3.5-bit quantization.
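For a rough sense of what this norm-plus-packed-indices layout costs per cached vector, here is the storage arithmetic (my own accounting, not the paper's exact format; the size of the optional residual sketch, explained in the next paragraph, is an assumption):

```python
# Storage for one 128-dim KV vector, back-of-the-envelope.
head_dim = 128
fp16_bits = head_dim * 16                # 2048 bits = 256 bytes today

# Mixed-precision codes: 32 outlier channels at 3 bits, 96 channels at 2 bits.
code_bits = 32 * 3 + 96 * 2              # 288 bits, i.e. 2.25 bits/channel
norm_bits = 16                           # one FP16 norm per vector
qjl_bits = head_dim                      # optional 1-bit residual sketch (assumed size)

for label, bits in [("codes + norm", code_bits + norm_bits),
                    ("codes + norm + residual bits", code_bits + norm_bits + qjl_bits)]:
    print(f"{label:29s}: {bits:3d} bits, {fp16_bits / bits:.1f}x smaller than FP16")
```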
The full TurboQuant method adds a second stage called TurboQuant_prod, which is designed for inner products. After the MSE-optimized quantization step, it computes the residual error and stores a 1-bit QJL sketch (a sign-only random-projection summary of a vector) of that residual. The reason is that minimizing reconstruction error alone can still bias inner-product estimates, and attention scores depend on inner products. So TurboQuant first reconstructs the vector well, then uses one extra bit of residual information to reduce inner-product bias.
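The QJL part is easiest to see on a toy example. Below is a generic sign-sketch estimator of the kind QJL builds on, a minimal sketch of the idea rather than TurboQuant's exact construction; the √(π/2) factor makes the estimate of the residual inner product unbiased for a Gaussian projection. To keep the output readable, the demo averages many independent sketches; a real cache would store a single sketch per vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 128  # vector dimension, sketch dimension (assumed equal here)

# Toy setup: a key, a reconstruction with shrinkage bias (the kind of bias
# MSE-optimized quantization can leave in inner products), and a query.
k = rng.standard_normal(d)
k_hat = 0.9 * k
residual = k - k_hat
q = rng.standard_normal(d)

def qjl_estimate(q, residual):
    """Estimate <q, residual> from m sign bits plus the residual norm."""
    S = rng.standard_normal((m, d))   # Gaussian projection (fixed and shared in practice)
    signs = np.sign(S @ residual)     # the stored sketch: 1 bit per row of S
    return np.sqrt(np.pi / 2) * np.linalg.norm(residual) * np.mean((S @ q) * signs)

correction = np.mean([qjl_estimate(q, residual) for _ in range(200)])
print(f"true <q, k>        : {q @ k: .3f}")
print(f"biased <q, k_hat>  : {q @ k_hat: .3f}")
print(f"with QJL correction: {q @ k_hat + correction: .3f}")
# A single sketch matches the true value only in expectation (unbiased but noisy),
# which is one reason naive reconstruction-side corrections can backfire.
```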
Note: Several early implementations are converging on the view that a drop-in cache format should usually start with the MSE stage alone, while the QJL residual is more natural when you also own the attention kernel and can consume the two-part representation directly. One independent Triton implementation reports that naively adding the QJL correction back into reconstructed cache vectors hurt quality badly, and that switching to the MSE variant fixed the issue; a llama.cpp discussion reaches a similar conclusion for its early prototypes.
Evaluation: In Progress…
This is an “old” paper, and that shows in the evaluation.
The authors focus only on long-context benchmarks, specifically LongBench and Needle-in-a-Haystack.
Those benchmarks are useful for showing that KV-cache compression does not harm long-context retrieval or cause the model to forget information from the prompt. But they are relatively simple tasks, and they do not answer the more important question: does the model remain as capable overall?
A second limitation is that the evaluation only uses older models, namely Ministral and Llama 3.1 8B.
The third issue is that the baselines are also quite old. Most newer methods, such as the many presented at NeurIPS 2025, are absent from the comparisons.
So we don’t really know how well it performs on newer models (think hybrid architectures like Qwen3.5), or how it compares against other recent methods.
What we know for sure: it’s better than naive quantization and simple to implement. And that is what matters most for improving the state of KV-cache quantization in inference frameworks.
TurboQuant is especially attractive for serving because it is online and calibration-free. You do not need to train a per-model codebook for the cache or run a big offline fitting step before generation. That makes it fundamentally different from many quantization methods that work well offline but are awkward for live KV-cache writes during decoding.
Another implication is that memory savings alone are not enough. You need the right kernels. Google’s later write-up says TurboQuant reduced KV memory by at least 6x and, in an optimized setup, 4-bit TurboQuant reached up to 8x speedup for attention-logit computation on H100 versus unquantized 32-bit keys.
Community implementation and replication
The community is already moving, but it is still early and fragmented.
In the llama.cpp ecosystem, there is an open upstream feature request for TurboQuant support, plus a discussion thread with prototype implementations, and the prototypes appear to work.
One standalone implementation reports matching the paper’s per-vector MSE numbers for 3- and 4-bit quantization and lays out both a safe dequantize-then-attend path and a future fused-attention path. Separately, the turboquant_plus project reports a llama.cpp/Metal integration on Apple Silicon, with turbo3 and turbo4 KV-cache types, end-to-end serving, and prefill throughput roughly at q8_0 parity while compressing the KV cache by about 4.6x. None of that is merged upstream yet, but it is a credible sign that the format is implementable in a real engine.
On the mainstream serving-engine side, I found open requests rather than mature support. vLLM has an open feature request that describes the KV-cache integration points and is getting regular updates, which is promising. I’ll run a large-scale evaluation once an implementation is stable.
Finally, a good KV cache quantizer that will be widely available?
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we reviewed:
⭐Efficient Exploration at Scale
Efficient Reasoning with Balanced Thinking
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!