The Kaitchup – AI on a Budget

Serving ExLlamaV3 Models with tabbyAPI: Accuracy, Speed, and Recommendations

With comparisons against AutoRound and GGUF models served with vLLM

Benjamin Marie
Jan 19, 2026

Quantizing LLMs has two main benefits: smaller models that fit on smaller GPUs, and faster inference (especially for single prompts and small batches) when using a uniform bit-width like 4-bit or 8-bit.

But uniform quantization isn’t ideal. Some layers matter more than others, and some modules (notably self-attention) are more sensitive than MLP blocks. That’s why some “natively quantized” releases (e.g., GPT-OSS) keep self-attention at higher precision.

Mixed-precision quantization tackles this by assigning different bit-widths across layers/modules. GGUF supports mixed precision, and Unsloth’s GGUF models use it to improve accuracy, but these approaches can be slower and/or harder to deploy without the right inference stack.
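
To make the idea concrete, here is a minimal Python sketch of what a per-module bit-width assignment implies for the average bits-per-weight. The module names and parameter counts are made up for illustration and do not correspond to any particular model or quantization format:

```python
# Hypothetical per-module bit widths: attention kept at higher precision
# than the MLP, reflecting its higher sensitivity to quantization error.
bit_widths = {"self_attn": 6, "mlp": 3}

# Made-up parameter counts for one transformer block (weights only).
param_counts = {"self_attn": 10_000_000, "mlp": 30_000_000}

total_bits = sum(bit_widths[m] * param_counts[m] for m in bit_widths)
average_bpw = total_bits / sum(param_counts.values())
print(f"average bpw: {average_bpw:.2f}")  # 3.75 with these numbers
```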

ExLlama is designed for this trade-off: given a target average bits-per-weight (bpw), it automatically allocates higher precision to sensitive parts and lower precision elsewhere to maximize accuracy at the target bpw. At low targets like ~2.5 bpw, ExLlamaV3 can significantly outperform uniform methods.
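
The allocation can be pictured as a budgeted optimization. The toy sketch below is not ExLlamaV3's actual algorithm; it simply upgrades the most sensitive layers first, one bit at a time, while the weighted average stays within the target bpw. Layer names, sizes, and sensitivity scores are invented for the example:

```python
def allocate_bits(sensitivity, params, target_bpw, low=2, high=8):
    """Toy mixed-precision allocator (not ExLlamaV3's real method):
    start every layer at `low` bits, then repeatedly upgrade the most
    sensitive layers while the weighted average stays within target_bpw."""
    bits = {name: low for name in params}
    budget = target_bpw * sum(params.values())  # total bit budget

    def bits_used():
        return sum(bits[n] * params[n] for n in params)

    # Visit layers in order of decreasing sensitivity, one extra bit per pass.
    order = sorted(params, key=lambda n: -sensitivity[n])
    upgraded = True
    while upgraded:
        upgraded = False
        for name in order:
            if bits[name] < high and bits_used() + params[name] <= budget:
                bits[name] += 1
                upgraded = True
    return bits


# Invented example: two attention modules and two MLP modules.
params = {"attn.0": 10_000_000, "mlp.0": 30_000_000,
          "attn.1": 10_000_000, "mlp.1": 30_000_000}
sensitivity = {"attn.0": 0.9, "mlp.0": 0.2, "attn.1": 0.8, "mlp.1": 0.1}

print(allocate_bits(sensitivity, params, target_bpw=3.5))
# {'attn.0': 5, 'mlp.0': 3, 'attn.1': 5, 'mlp.1': 3}  -> average 3.5 bpw
```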

ExLlamaV3 is also fast, though it’s typically served via tabbyAPI rather than mainstream stacks like vLLM or SGLang.
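
Since tabbyAPI exposes an OpenAI-compatible endpoint, a served ExLlamaV3 model can be queried with the standard openai Python client. A minimal sketch, assuming a server is already running locally; the port, API key, and model name below are placeholders rather than values from this tutorial:

```python
# Query a locally running tabbyAPI server through its OpenAI-compatible API.
# Port, API key, and model name are assumptions: check your tabbyAPI config.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # adjust to your server's port
    api_key="YOUR_TABBYAPI_KEY",
)

response = client.chat.completions.create(
    model="Qwen3-4B-Instruct-exl3",  # hypothetical name of the loaded model
    messages=[{"role": "user",
               "content": "Explain mixed-precision quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```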

In this article, we’ll quantify accuracy and speed for ExLlamaV3 + tabbyAPI. We’ll quantize Qwen3 4B Instruct, then compare it against GGUF and AutoRound models at roughly the same average bit-width, served with vLLM.

ExLlamaV3 and tabbyAPI are very user-friendly. This tutorial can be applied to any LLM supported by ExLlamaV3.

Notebook (ExLlamaV3 quantization + tabbyAPI serving):

Get the notebook (#195)

Related article:

Run Llama 3.3 70B on Your GPU with ExLlamaV3 (April 17, 2025)

How to Quantize LLMs with ExLlamaV3?
