MMLU-Pro Has an Answer Leak (and It’s Just Whitespace)
The Weekly Kaitchup #126
Hi everyone,
In this edition of The Weekly Kaitchup, we discuss:
Cheaper Inference at Scale: vLLM’s KV Cache Offloading Updates
The leading spaces in MMLU-Pro
FineTranslations and TranslateGemma
vLLM’s KV Cache Offloading Updates
A few months ago, vLLM 0.11.0 introduced a new KV offloading connector aimed at keeping inference throughput high by moving KV cache between GPU memory and a larger tier, especially CPU DRAM.
The motivation is twofold:
Reuse KV for shared prefixes to avoid expensive prefill recomputation
Avoid “preemption pain” when GPU KV space runs out: rather than discarding KV and later recomputing it, vLLM can offload KV to CPU RAM and reload it when the request resumes.
What’s new versus earlier KV cache offloading is mainly the connector architecture and execution model. The older Connector API was synchronous: loading/storing KV would block the engine, preventing parallel batch handling. The offloading connector is built on the newer asynchronous Connector API (added in vLLM 0.9.0), so KV transfers can overlap with ongoing model compute. It also formalizes offloading as a pluggable backend where you implement a transfer function between storage media.
vLLM ships a native CPU backend for DRAM offloading (with simple CLI enablement via --kv_offloading_backend native --kv_offloading_size ...).
The other big “new” piece is that the vLLM team redesigned KV’s physical memory layout to make transfers efficient. Previously, a logical KV block was fragmented across layers (and sometimes split into separate K and V chunks), which is fine for attention compute but terrible for offloading because it forces many tiny copies (effective block sizes of just a few KB). In 0.12.0 they upstreamed a layout that packs a logical block’s KV for all layers contiguously, increasing the physical block size by a factor of ~2 × num_layers (moving many models from KB-scale to ~0.5–2+ MB blocks).
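To make the layout change concrete, here is a quick back-of-the-envelope calculation in Python. The model dimensions are illustrative assumptions, not numbers from the vLLM team; the point is the ×(2 · num_layers) factor on the contiguous chunk that each offload copy can move.

# Illustrative only: exact sizes depend on the model config.
tokens_per_block = 16     # vLLM's default KV block size in tokens
num_kv_heads = 8          # e.g., a GQA model
head_dim = 128
dtype_bytes = 2           # fp16 / bf16
num_layers = 32

# Old layout: each contiguous chunk holds one layer's K (or V) for one block.
per_copy_old = tokens_per_block * num_kv_heads * head_dim * dtype_bytes
# New layout (0.12.0): one logical block packs K and V for all layers contiguously.
per_copy_new = per_copy_old * 2 * num_layers

print(f"old contiguous chunk: {per_copy_old / 1024:.0f} KiB")            # 32 KiB
print(f"new contiguous chunk: {per_copy_new / (1024 * 1024):.1f} MiB")   # 2.0 MiB

With these assumptions, each copy goes from ~32 KiB to ~2 MiB per block, in line with the ~0.5–2+ MB range quoted above.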
Why is this good?
Their numbers are very convincing:
It turns KV reuse into a throughput maximizer while keeping the latency benefits. Reported results show CPU-cache hits can cut TTFT by ~2x–22x depending on prompt size, but the bigger win is system throughput: with many concurrent requests, tokens/s can improve by up to ~9x as the CPU hit rate rises, because the GPU spends far less time recomputing prefill KV and can serve more parallel work. Since offloading itself is asynchronous, cache misses see minimal TTFT impact. End-to-end tests also show that DMA-based copying generally yields better throughput than custom copy kernels because it interferes less with model computation, especially after the 0.12.0 layout change and the additional robustness/performance fixes targeted for 0.14.0.
I’m planning to benchmark all of this in an upcoming article.
Speaking of the KV cache: NVIDIA just published a new KV cache pruning method that looks promising, KVzap:
KVzap: Fast, Adaptive, and Faithful KV Cache Pruning
KVzap is a KV-cache pruning method that shrinks the stored keys/values along the time axis by predicting which past tokens will matter for future attention, then discarding the rest. Its starting point is an expensive “oracle” idea: you can score each token’s KV importance by making the model reconstruct the prompt (a copy/paste-style self-consistency test) and measuring how strongly later positions attend back to that token (with an improved, contribution-normalized score sometimes called KVzip+). KVzap avoids running that costly procedure at inference by training a tiny per-layer surrogate (either a linear layer or a small 2-layer MLP) that takes the layer’s hidden states and directly predicts per-head importance scores (in log-space).
At inference, during prefill and decoding, it runs this lightweight scorer, keeps a sliding window of the most recent tokens (e.g., 128) to preserve local context, and then threshold-prunes: any KV pair whose predicted score is below a fixed threshold is dropped, yielding input-adaptive compression (dense/complex prompts keep more, repetitive prompts keep fewer) with minimal overhead.
This is not implemented in vLLM/SGLang, but I’m hopeful it will be.
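To make the mechanism more concrete, here is a rough PyTorch sketch of my reading of KVzap’s scorer and pruning step. The class name, MLP shape, activation, and threshold value are all illustrative assumptions, not the authors’ code.

import torch
import torch.nn as nn

class KVImportanceScorer(nn.Module):
    # Tiny per-layer surrogate: hidden states -> per-head log-importance scores.
    # The paper describes either a linear layer or a small 2-layer MLP; this is the MLP variant.
    def __init__(self, hidden_size: int, num_kv_heads: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.SiLU(),
            nn.Linear(hidden_size // 4, num_kv_heads),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [seq_len, hidden_size] -> [seq_len, num_kv_heads]
        return self.mlp(hidden_states)

def keep_mask(log_scores: torch.Tensor, window: int = 128, threshold: float = -4.0) -> torch.Tensor:
    # True = keep that (token, head) KV pair; False = prune it.
    keep = log_scores > threshold   # fixed threshold -> input-adaptive compression
    keep[-window:, :] = True        # always keep a sliding window of the most recent tokens
    return keep

The actual method applies this per layer during prefill and decoding, then physically drops the pruned entries from the cache.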
Benchmarks are easy to game.
Case in point: leading spaces in MMLU-Pro.
MMLU-Pro is one of the most widely used LLM benchmarks. You’ll find MMLU-Pro scores in nearly every technical report for newly released models. But like most benchmarks, it isn’t flawless.
This week’s example comes from a post by Eric W. Tramel on X:
In some MMLU-Pro categories, the “correct answer” strings include a leading space.
I manually inspected the dataset on Hugging Face (TIGER-Lab/MMLU-Pro) and confirmed this. For example, here are items from the Math category:
A muscle fiber contracts by $3.5 \mathrm{~cm}$ and in doing so lifts a weight. Calculate the work performed by the fiber. Assume the muscle fiber obeys Hooke's law $F=-k x$ with a force constant $k$ of $750 . \mathrm{N} \mathrm{m}^{-1}$.
[
"0.25 $\\mathrm{~J}$",
" 0.46$\\mathrm{~J}$",
"0.85 J",
"0.52 J",
"1.00 J",
"0.30 J",
"0.75 $\\mathrm{~J}$",
"0.90 J",
"0.60 $\\mathrm{~J}$",
"0.65 J"
]
A system consisting of $82.5 \mathrm{~g}$ of liquid water at $300 . \mathrm{K}$ is heated using an immersion heater at a constant pressure of 1.00 bar. If a current of $1.75 \mathrm{~A}$ passes through the $25.0 \mathrm{ohm}$ resistor for 100 .s, what is the final temperature of the water?
[
" 322$\\mathrm{~K}$",
"335$\\mathrm{~K}$",
"340$\\mathrm{~K}$",
"310$\\mathrm{~K}$",
"345$\\mathrm{~K}$",
"330$\\mathrm{~K}$",
"300$\\mathrm{~K}$",
"325$\\mathrm{~K}$",
"315$\\mathrm{~K}$",
"350$\\mathrm{~K}$"
]
The correct answer is indeed the second option for the first question, and the first option for the second.
I then asked ChatGPT whether anything in the formatting might be leaking the answer, and it spotted the issue immediately:
Yes — the second item has a pretty strong “giveaway,” and both items have some formatting/whitespace quirks that can accidentally leak the key.
If it’s visible to humans, it’s visible to LLMs during evaluation, too.
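If you want to check this at scale rather than by eye, a short script like the one below counts how often an option has a leading space and whether it lands on the gold answer. It assumes the options, answer_index, and category columns from the TIGER-Lab/MMLU-Pro dataset; adjust if the schema differs.

from collections import Counter
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

counts = Counter()
for row in ds:
    # indices of options that start with whitespace
    flagged = [i for i, opt in enumerate(row["options"]) if opt != opt.lstrip()]
    if flagged:
        key = "leading space on gold answer" if row["answer_index"] in flagged else "leading space on distractor"
        counts[(row["category"], key)] += 1

for (category, key), n in sorted(counts.items()):
    print(f"{category}: {key}: {n}")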
FineTranslations and TranslateGemma
Hugging Face just released FineTranslations: 1T+ tokens of parallel text spanning 500+ languages, created by taking non-English documents from FineWeb2 and translating them into English with Gemma 3 27B.
So to be precise: this release is multilingual → English (X→EN) parallel data. There isn’t a separate “English→X” version of the dataset. If you want EN→X, you’d typically swap the pairs (use the English translation as source and the original non-English as target).
That said, the direction matters:
The source side (X) is mostly human-written web text.
The English side (EN) is LLM-generated.
That makes FineTranslations an excellent fit for:
training or adapting models for X→EN translation,
mining multilingual signals while keeping an English target,
and even English-only training on the translated side (depending on your setup).
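For the EN→X case, the pair swap mentioned above might look like the sketch below with the datasets library. The repository path and column names are placeholders; check the FineTranslations dataset card for the actual schema.

from datasets import load_dataset

# Placeholder repo id and column names: verify them on the FineTranslations dataset card.
ds = load_dataset("HuggingFaceFW/finetranslations", split="train", streaming=True)

def to_en_x(example):
    return {
        "source": example["english_translation"],  # synthetic English becomes the source
        "target": example["original_text"],        # human-written non-English becomes the target
    }

en_x = ds.map(to_en_x)
print(next(iter(en_x)))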
About the “Edu” part
They also released an Edu-filtered variant (FineTranslations-Edu).
Importantly, this is not “English source text from FineWeb-Edu.”¹
Instead, it’s FineTranslations with an extra education/quality filtering step applied on top of the translated English content.
Caveats I Want to Point Out
If your goal is EN→X, swapping the pairs means your English source is synthetic, while your target is human text. In practice, this can be a tougher training regime: models may need more careful filtering, tuning, or longer training to get stable gains.
If you’re generating your own synthetic parallel data specifically for EN→X, a common approach is:
start from high-quality human English (e.g., curated English corpora),
then translate into the target language with a model that’s excellent at generating that target language.
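Here is a minimal version of that recipe. Helsinki-NLP/opus-mt-en-fr is used purely as an example of a model that generates the target language well; TranslateGemma (below) would be another candidate for the translation step.

from transformers import pipeline

# Translate curated, human-written English into the target language (French here).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

english_sentences = [
    "The experiment was repeated three times to confirm the result.",
    "The final temperature of the water depends on the heater's power and duration.",
]
pairs = [(en, translator(en)[0]["translation_text"]) for en in english_sentences]
for en, fr in pairs:
    print(en, "->", fr)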
Related release: TranslateGemma
On the same topic, Google released TranslateGemma (4B / 12B / 27B): Gemma 3–based models tuned specifically for translation across 55 languages, and designed to work in both directions depending on the source/target language codes you provide.
The models: TranslateGemma
More details here: TranslateGemma: A new suite of open translation models
I’m building specialized calibration datasets to make good quantized versions of TranslateGemma. I’ll probably publish everything this weekend.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we review:
⭐Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation
Token-Level LLM Collaboration via FusionRoute
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!
¹ I’m writing this because I got confused for a moment. I really thought they had also translated FineWeb-Edu into different languages.