The Kaitchup – AI on a Budget
Get the Best from GGUF Models: Optimize Your Inference Hyperparameters

The default hyperparameters are suboptimal for quantized models

Benjamin Marie
Jun 23, 2025


When running LLMs locally, GGUF is by far the most popular format. It’s compact, easy to distribute, and works seamlessly with inference frameworks like llama.cpp or user-friendly tools like ollama.
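
For readers who haven't run a GGUF file directly, here is a minimal sketch using the llama-cpp-python bindings (the Python wrapper around llama.cpp). The model path, context size, and sampling values are placeholders, not settings from this article's experiments.

```python
# Minimal sketch: load a local GGUF file with llama-cpp-python and generate text.
# The model path below is a placeholder, not a file shipped with this article.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-8b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm(
    "Explain GGUF quantization in one sentence.",
    max_tokens=128,
    temperature=0.7,   # the sampling hyperparameters discussed in this article
    top_p=0.9,
)
print(out["choices"][0]["text"])
```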

See also: GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU (September 9, 2024).

But it’s important to realize that GGUF models are not the same as the original releases. Unless you’re using a full 16-bit version, what you’re running is a quantized version of the model. That means the weights have been compressed, usually down to 8-bit, 4-bit, or even 2-bit precision, to reduce size and often speed up inference.

This kind of quantization tends to preserve accuracy quite well, especially at 4-bit and for English tasks with short to medium-length inputs. But past research shows that in certain settings, such as long sequences or multilingual inputs, quality can drop noticeably.

There's another problem that's easy to overlook: the inference settings. When a model is released, the authors usually recommend a set of default values for temperature, top-p, and so on. These are tuned to get good results on standard benchmarks. But once the model is quantized, its internal probability distribution shifts, and the original hyperparameters may no longer be optimal.
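
To make the role of these hyperparameters concrete, here is a toy illustration (not taken from the article's notebook) of how temperature and top-p reshape a next-token distribution. The logits are made up; the point is only that a distribution shifted by quantization will be filtered differently by the same temperature and top-p values.

```python
# Toy illustration: temperature scaling and top-p (nucleus) filtering
# applied to a hypothetical next-token distribution.
import numpy as np

def sample_probs(logits, temperature=1.0, top_p=1.0):
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])  # made-up logits
print(sample_probs(logits, temperature=0.6, top_p=0.9))
print(sample_probs(logits, temperature=1.0, top_p=0.9))
```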

In this article, I show that this is especially true for very low-bit models, such as those quantized to 2-bit (e.g., Q2_K format). I found that quantized models tend to be much more sensitive to temperature and top-p. Even small changes can lead to noticeable drops in accuracy.

To explore this, I ran 300 different hyperparameter combinations across Qwen3 models, using several quantization methods, including AWQ, bitsandbytes 4-bit, and GGUF at both 2-bit and 4-bit levels. I used vLLM for inference and measured accuracy with IFEval.
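
For context, a sweep of this kind can be sketched with vLLM as below. This is only an illustrative skeleton under assumed names, not the article's notebook: the model ID is a placeholder, the prompt stands in for the IFEval set, and the scoring step is omitted.

```python
# Illustrative sketch of a temperature / top-p sweep with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B-AWQ")  # hypothetical quantized checkpoint ID

prompts = ["Write a haiku about quantization."]  # stand-in for IFEval prompts

for temperature in (0.2, 0.6, 1.0):
    for top_p in (0.8, 0.95):
        params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=256)
        outputs = llm.generate(prompts, params)
        # In the real evaluation, the generations would be scored with IFEval.
        print(temperature, top_p, outputs[0].outputs[0].text[:60])
```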

The notebook below shows how the tests were run and how each model responded to different hyperparameter settings:

Get the notebook (#172)

All the experiments have been conducted with RunPod (referral link).
