Serving ExLlamaV3 Models with tabbyAPI: Accuracy, Speed, and Recommendations
With comparisons against AutoRound and GGUF models served with vLLM
Quantizing LLMs has two main benefits: smaller models that fit on smaller GPUs, and faster inference, especially for single prompts and small batches, when using a uniform bit-width like 4-bit or 8-bit.
But uniform quantization isn’t ideal. Some layers matter more than others, and some modules (notably self-attention) are more sensitive than MLP blocks. That’s why some “natively quantized” releases (e.g., GPT-OSS) keep self-attention at higher precision.
Mixed-precision quantization tackles this by assigning different bit-widths across layers/modules. GGUF supports mixed precision, and Unsloth’s GGUF models use it to improve accuracy, but these approaches can be slower and/or harder to deploy without the right inference stack.
ExLlama is designed for this trade-off: given a target average bits-per-weight (bpw), it automatically allocates higher precision to sensitive parts and lower precision elsewhere to maximize accuracy at the target bpw. At low targets like ~2.5 bpw, ExLlamaV3 can significantly outperform uniform methods.
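To give a sense of the workflow, here is a minimal sketch of quantizing a model with ExLlamaV3's convert.py at a target average bpw. The flag names and the 2.5 bpw target follow the ExLlamaV3 repository's conventions but may differ across versions (check the repo's README), and the paths are placeholders:

```python
# Minimal sketch: run ExLlamaV3's convert.py with a target average bits-per-weight.
# Flag names follow the ExLlamaV3 repository's convert script and may change between
# versions; the paths below are placeholders.
import subprocess

subprocess.run(
    [
        "python", "convert.py",
        "-i", "./Qwen3-4B-Instruct",       # local directory with the original HF model
        "-o", "./Qwen3-4B-Instruct-exl3",  # where the quantized model is written
        "-w", "./exl3_work",               # scratch directory used during conversion
        "-b", "2.5",                       # target average bits per weight
    ],
    check=True,
)
```

Given the target, ExLlamaV3 decides on its own which layers get more or fewer bits; you only set the average.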
ExLlamaV3 is also fast, though it’s typically served via tabbyAPI rather than mainstream stacks like vLLM or SGLang.
In this article, we’ll quantify accuracy and speed for ExLlamaV3 + tabbyAPI. We’ll quantize Qwen3 4B Instruct, then compare it against GGUF and AutoRound models at (roughly) the same average bit-width, served with vLLM.
ExLlamaV3 and tabbyAPI are very user-friendly, and this tutorial can be applied to any LLM supported by ExLlamaV3.
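Once tabbyAPI is running with the quantized model loaded, it exposes an OpenAI-compatible API. Here is a minimal sketch of querying it with the openai Python client; the port (5000 is tabbyAPI's usual default), the API key, and the model name are placeholders to adapt to your own tabbyAPI configuration:

```python
# Minimal sketch: query a model served by tabbyAPI through its OpenAI-compatible API.
# The port, API key, and model name are placeholders; take them from your tabbyAPI
# config and its generated API tokens.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="YOUR_TABBY_API_KEY",
)

response = client.chat.completions.create(
    model="Qwen3-4B-Instruct-exl3",  # name of the model loaded by tabbyAPI
    messages=[
        {"role": "user", "content": "Explain mixed-precision quantization in one sentence."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the API is OpenAI-compatible, the same client code works unchanged against vLLM, which makes the accuracy and speed comparisons later in the article straightforward to run.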
Notebook (ExLlamaV3 quantization + tabbyAPI serving):
Related article:


