Get the Best from GGUF Models: Optimize Your Inference Hyperparameters
The default hyperparameters are suboptimal for quantized models
When running LLMs locally, GGUF is by far the most popular format. It’s compact, easy to distribute, and works seamlessly with inference frameworks like llama.cpp or user-friendly tools like ollama.
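For context, loading a GGUF file for local inference only takes a few lines. The sketch below uses the llama-cpp-python bindings; the file name is a placeholder and the sampling values are purely illustrative, not recommendations.

```python
from llama_cpp import Llama

# Placeholder path: any local GGUF file works here.
llm = Llama(model_path="Qwen3-8B-Q4_K_M.gguf", n_ctx=4096)

# Sampling hyperparameters are set per request; these values are illustrative.
out = llm(
    "Summarize what GGUF quantization does in one sentence.",
    max_tokens=128,
    temperature=0.7,
    top_p=0.8,
)
print(out["choices"][0]["text"])
```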
But it’s important to realize that GGUF models are not the same as the original releases. Unless you’re using a full 16-bit version, what you’re running is a quantized version of the model. That means the weights have been compressed, usually down to 8-bit, 4-bit, or even 2-bit precision, to reduce size and often speed up inference.
This kind of quantization tends to preserve accuracy quite well, especially at 4-bit and for English tasks with short to medium-length inputs. But previous research shows that in some settings, such as long sequences or multilingual inputs, quality can drop noticeably.
There's another problem that’s easy to overlook: the inference settings. When a model is released, the authors usually recommend default values for temperature, top-p, and so on. These are tuned to get good results on standard benchmarks. But once the model is quantized, its output probability distribution shifts, and the original hyperparameters may no longer be optimal.
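With vLLM, which I used for the experiments below, the sampling settings are passed explicitly at generation time, so overriding the released defaults is straightforward. This is only a sketch: the checkpoint names and the temperature/top-p values are placeholders, not recommendations.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; recent vLLM versions can load a single-file GGUF
# directly (depending on the version, pointing `tokenizer=` at the original
# model may also be required).
llm = LLM(model="Qwen3-8B-Q4_K_M.gguf", tokenizer="Qwen/Qwen3-8B")

# Values recommended in a model card are tuned for the full-precision weights;
# after quantization they are worth re-checking.
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.generate(["Write one sentence about sampling temperature."], params)
print(outputs[0].outputs[0].text)
```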
In this article, I show that this is especially true for very low-bit models, such as those quantized to 2-bit (e.g., Q2_K format). I found that quantized models tend to be much more sensitive to temperature and top-p. Even small changes can lead to noticeable drops in accuracy.
To explore this, I ran 300 hyperparameter combinations across Qwen3 models, covering several quantization methods: AWQ, bitsandbytes 4-bit, and GGUF at both 2-bit and 4-bit. I used vLLM for inference and measured accuracy with IFEval.
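As a rough illustration of what such a sweep looks like, here is a stripped-down sketch. The model ID, the grid values, and the `load_ifeval_prompts`/`score_ifeval` helpers are hypothetical placeholders; the full setup, including the actual IFEval scoring, is in the notebook.

```python
import itertools
from vllm import LLM, SamplingParams

# Placeholder model; the actual sweep covered several Qwen3 variants and
# quantization methods (AWQ, bitsandbytes 4-bit, GGUF 2-bit and 4-bit).
llm = LLM(model="Qwen/Qwen3-8B")

# Illustrative grid; the article's sweep covered 300 combinations in total.
temperatures = [0.0, 0.3, 0.6, 0.9, 1.2]
top_ps = [0.5, 0.7, 0.8, 0.9, 1.0]

prompts = load_ifeval_prompts()  # hypothetical helper returning IFEval prompts

scores = {}
for temperature, top_p in itertools.product(temperatures, top_ps):
    params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=1024)
    outputs = llm.generate(prompts, params)
    completions = [o.outputs[0].text for o in outputs]
    # score_ifeval is a hypothetical stand-in for the IFEval accuracy computation.
    scores[(temperature, top_p)] = score_ifeval(prompts, completions)

best = max(scores, key=scores.get)
print(f"Best: temperature={best[0]}, top_p={best[1]} -> accuracy={scores[best]:.3f}")
```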
The notebook below shows how the tests were run and how each model responded to different hyperparameter settings:
All the experiments were conducted on RunPod (referral link).