The Kaitchup – AI on a Budget
Get the Best from GGUF Models: Optimize Your Inference Hyperparameters

The default hyperparameters are suboptimal for quantized models

Benjamin Marie
Jun 23, 2025


When running LLMs locally, GGUF is by far the most popular format. It’s compact, easy to distribute, and works seamlessly with inference frameworks like llama.cpp or user-friendly tools like ollama.
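
For readers who haven't run a GGUF file directly, here is a minimal sketch using the llama-cpp-python bindings (the Python wrapper around llama.cpp). The model path, context size, and sampling values are placeholders, not settings from this article's experiments.

```python
# Minimal sketch: load a local GGUF file with llama-cpp-python and generate text.
# The model path below is a placeholder, not a file shipped with this article.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-8b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm(
    "Explain GGUF quantization in one sentence.",
    max_tokens=128,
    temperature=0.7,   # the sampling hyperparameters discussed in this article
    top_p=0.9,
)
print(out["choices"][0]["text"])
```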

See also: GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU (September 9, 2024).

But it’s important to realize that GGUF models are not the same as the original releases. Unless you’re using a full 16-bit version, what you’re running is a quantized version of the model. That means the weights have been compressed, usually down to 8-bit, 4-bit, or even 2-bit precision, to reduce size and often speed up inference.

This kind of quantization tends to preserve accuracy quite well, especially at 4-bit and for English tasks with short to medium-length inputs. But past research shows that in certain settings, such as long sequences or multilingual inputs, quality can drop noticeably.

There's another problem that's easy to overlook: the inference settings. When a model is released, the authors usually recommend a set of default values for temperature, top-p, and so on. These are tuned to get good results on standard benchmarks. But once the model is quantized, its internal probability distribution shifts, and the original hyperparameters may no longer be optimal.
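
To make the role of these hyperparameters concrete, here is a toy illustration (not taken from the article's notebook) of how temperature and top-p reshape a next-token distribution. The logits are made up; the point is only that a distribution shifted by quantization will be filtered differently by the same temperature and top-p values.

```python
# Toy illustration: temperature scaling and top-p (nucleus) filtering
# applied to a hypothetical next-token distribution.
import numpy as np

def sample_probs(logits, temperature=1.0, top_p=1.0):
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])  # made-up logits
print(sample_probs(logits, temperature=0.6, top_p=0.9))
print(sample_probs(logits, temperature=1.0, top_p=0.9))
```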

In this article, I show that this is especially true for very low-bit models, such as those quantized to 2-bit (e.g., Q2_K format). I found that quantized models tend to be much more sensitive to temperature and top-p. Even small changes can lead to noticeable drops in accuracy.

To explore this, I ran 300 different hyperparameter combinations across Qwen3 models, using several quantization methods, including AWQ, bitsandbytes 4-bit, and GGUF at both 2-bit and 4-bit levels. I used vLLM for inference and measured accuracy with IFEval.
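
For context, a sweep of this kind can be sketched with vLLM as below. This is only an illustrative skeleton under assumed names, not the article's notebook: the model ID is a placeholder, the prompt stands in for the IFEval set, and the scoring step is omitted.

```python
# Illustrative sketch of a temperature / top-p sweep with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B-AWQ")  # hypothetical quantized checkpoint ID

prompts = ["Write a haiku about quantization."]  # stand-in for IFEval prompts

for temperature in (0.2, 0.6, 1.0):
    for top_p in (0.8, 0.95):
        params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=256)
        outputs = llm.generate(prompts, params)
        # In the real evaluation, the generations would be scored with IFEval.
        print(temperature, top_p, outputs[0].outputs[0].text[:60])
```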

The notebook below shows how the tests were run and how each model responded to different hyperparameter settings:

Get the notebook (#172)

All the experiments have been conducted with RunPod (referral link).
