The Kaitchup Index: A Leaderboard for Quantized LLMs
Comparing formats like GGUF, GPTQ, and AWQ across different bitwidths
GGUF, AWQ, GPTQ, bitsandbytes, 16-bit, 4-bit, 3-bit, 2-bit... Which model is best for your setup?
New large language models (LLMs) are released regularly. Often, the community or the model provider publishes quantized versions in various formats and bitwidths to reduce memory consumption. However, the vast majority of these variants are never properly evaluated, making it extremely difficult to know which one performs best for your specific hardware.
That’s why I created The Kaitchup Index, a leaderboard that benchmarks LLMs and their most popular quantized versions.
You can check it out here (via Airtable):
The benchmark behind The Kaitchup Index mainly evaluates factual accuracy, world knowledge, and instruction following. Since a high-quality LLM shouldn't only perform well in English, the benchmark is also heavily multilingual, though limited to high-resource languages: French, German, Chinese, Korean, Arabic, Japanese, and others. Consequently, the rankings can differ significantly from those of standard English-only benchmarks.
The Kaitchup Index also introduces a “Quantization Fidelity” metric, designed to measure how closely a quantized model replicates the behavior of its original counterpart. Unlike traditional methods that rely on indicators such as logits or KL divergence, this metric evaluates fidelity based on the actual tokens and sequences generated across a wide range of inference hyperparameter settings. I plan to write a technical report about this. Stay tuned!
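The exact scoring recipe is private (more on that below), but to make the idea concrete, here is a minimal sketch of what a generation-based fidelity score could look like. Everything in it is a hypothetical illustration, not the actual metric: the prompts, the model IDs, and the strict exact-match criterion are placeholder assumptions, and a real metric would use softer sequence matching.

```python
# Minimal sketch of a generation-based fidelity score (NOT the actual
# Kaitchup metric). Model IDs, prompts, and the exact-match criterion
# are placeholder assumptions for illustration only.
from vllm import LLM, SamplingParams

PROMPTS = ["Translate to French: The weather is nice today."]  # placeholder

# Sweep several inference hyperparameter settings, as described above.
SETTINGS = [
    SamplingParams(temperature=0.0, max_tokens=128),
    SamplingParams(temperature=0.7, top_p=0.9, seed=0, max_tokens=128),
]

def generate_all(model_id: str) -> list[str]:
    # In practice, run each model in its own process to free GPU memory.
    llm = LLM(model=model_id, max_model_len=4096)
    texts = []
    for params in SETTINGS:
        for out in llm.generate(PROMPTS, params):
            texts.append(out.outputs[0].text)
    return texts

ref = generate_all("meta-llama/Llama-3.1-8B-Instruct")     # original model
quant = generate_all("someone/llama-3.1-8b-awq")           # placeholder quantized variant

# Fraction of (prompt, setting) pairs where the quantized model reproduces
# the original output; a real metric would score partial agreement too.
fidelity = sum(r == q for r, q in zip(ref, quant)) / len(ref)
print(f"Quantization fidelity: {fidelity:.2%}")
```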
Note: I'm also working on a per-language leaderboard, which breaks down results by language. The evaluations are already complete. I'm just finalizing the formatting. Expect it to go live later next week.
I’ll regularly publish updates, with comments on performance, when new models are released.
The primary purpose of The Kaitchup Index is to compare the quantization variants of the same model. You can use the “filter” function of the table to help you with that.
Benchmark Characteristics
All benchmark sequences are under 4096 tokens.
The dataset used is private and unreleased.
The Kaitchup Index does not assess performance on:
Long-context tasks
Coding
Creative writing or open-ended generation
If a model has a reasoning mode, it is switched off to reduce inference cost. I may add results with reasoning “on” later.
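For illustration, here is one way switching reasoning off can look in practice, assuming a model family whose chat template exposes a thinking switch (Qwen3 does, via the enable_thinking argument). Other families use different mechanisms, so this is just one example, not necessarily how every model on the Index is handled.

```python
# One way to switch a reasoning mode off, for models whose chat template
# supports it (e.g., Qwen3). Other model families use different mechanisms.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of Australia?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disables the <think>...</think> block in Qwen3
)
```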
Why the Dataset and Method Are Not Public
The data and scoring methodology used in The Kaitchup Index are intentionally kept private. This is to avoid a major issue with public benchmarks: there’s no guarantee that LLMs haven’t been exposed to them during training, or that providers didn’t use benchmark results to influence model development or checkpoint selection.
As a result, the Kaitchup Index is not reproducible, and I don’t recommend relying on it alone. Use it as one of several tools to guide your decisions, alongside your own evaluations and other public benchmarks where appropriate.
Currently, due to budget constraints, I can only evaluate models that fit on an H100 GPU. However, I plan to expand coverage regularly with new models and quantized variants.
How To Support The Kaitchup Index
For now, all results are public and freely accessible. However, given the high cost of running evaluations, this approach may not be sustainable in the long term, though I'll do my best to keep it that way.
If you’d like to support the project, you can:
Subscribe to The Kaitchup
Or, if you're already a subscriber (or prefer a one-time gesture), you can simply “buy me a coffee”
☕ $5 = 5 H100 GPU hours, enough to evaluate one more small model.
With enough support, I’ll be able to expand the benchmark to cover larger models and more quantized variants.
“Please, Evaluate This Model!”
Evaluating every model and its quantized variants is simply too expensive and time-consuming, so I have to make choices. That said, I welcome suggestions!
There are just two requirements:
The model or quantization format must be supported by vLLM
It must fit on a single 96 GB GPU (ideally, even 80 GB)
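If you want to sanity-check both requirements before suggesting a model, a quick loading test with vLLM is usually enough. This is a rough sketch, and the model ID below is a placeholder:

```python
# Quick eligibility check: if the model loads on a single GPU with a 4k
# context (the benchmark's maximum sequence length), it is a candidate.
from vllm import LLM

try:
    llm = LLM(
        model="your-org/your-quantized-model",  # placeholder model ID
        max_model_len=4096,
        gpu_memory_utilization=0.90,
    )
    print("Eligible: model loads on this GPU")
except Exception as err:  # out of memory, or unsupported quantization format
    print(f"Not eligible here: {err}")
```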
📢 Note: I don’t take suggestions via private messages.
If you have a request, publish it publicly (in the comments below, or on LinkedIn, X, …), tag me, and I’ll respond there.
Hardware
RunPod
I built the Kaitchup Index on RunPod, primarily using RTX 4090 and RTX 6000 Pro GPUs.
Larger or slower models are evaluated on the RTX 6000 Pro.
Hyperbolic
I also use an H100 spot instance from Hyperbolic.
Transparency
The Kaitchup Index is entirely self-funded. I don’t accept any payments or incentives from model providers to evaluate their models.
That said, many of the companies whose models appear on the Index do have employees who are paid subscribers to The Kaitchup.
Some of the evaluated models were quantized by me; you'll see those listed with the “kaitchup/” prefix in their model ID. I've considered removing them to avoid any perceived conflict of interest, but since they're not particularly strong performers, I decided to leave them in.
I only evaluate models with open weights that are already publicly available.
Please don’t send me your private checkpoints. I won’t evaluate them.