Lessons from GGUF Evaluations: Ternary Qwen3.5, Bricked Minimax
The Weekly Kaitchup #132
Hi everyone,
In this edition of The Weekly Kaitchup:
Ternary Qwen3.5 and Minimax M2.5 GGUF evaluations
My first quantized checkpoints for Qwen3.5 (only the 27B variant for now).
A few words about the new LFM2 24B A2B by LiquidAI.
After nearly two weeks of painfully slow evaluations, I’ve collected results for about a dozen GGUF checkpoints, across multiple precisions and quantization formats, for two models: Qwen3.5 397B-A17B and Minimax M2.5.
This was slow for a simple reason: a proper evaluation means prompting each GGUF model with many questions from real benchmarks. Running those prompts strictly sequentially (one request at a time) takes forever. But pushing heavy concurrency isn’t a great option either: llama.cpp, the backend serving these GGUF models, isn’t designed as a high-throughput inference engine the way vLLM or SGLang are.
Because running full suites was impractical, I evaluated subsets of the following benchmarks:
MMLU-Pro: 500 questions
GPQA Diamond: 50 questions
LiveCodeBench v6: 100 problems
Math-500: 100 questions
On H200s (rented via Hyperbolic at $2.59/hour), each GGUF evaluation took ~10–20 hours.
Painful, but worth it.
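To make the setup concrete, here is a minimal sketch of the kind of loop I’m describing: a llama.cpp server exposing its OpenAI-compatible endpoint, queried with a small fixed concurrency and temperature 0. The endpoint URL, concurrency level, and questions are illustrative placeholders, not my actual harness.

```python
# Minimal sketch of a GGUF evaluation loop against a llama.cpp server
# (started with something like: llama-server -m model.gguf --port 8080).
# The endpoint, questions, and concurrency below are illustrative placeholders.
import asyncio
import httpx

API_URL = "http://localhost:8080/v1/chat/completions"  # llama.cpp's OpenAI-compatible API
CONCURRENCY = 4  # keep this low: llama.cpp is not a high-throughput engine

async def ask(client: httpx.AsyncClient, sem: asyncio.Semaphore, question: str) -> str:
    async with sem:
        resp = await client.post(API_URL, json={
            "messages": [{"role": "user", "content": question}],
            "temperature": 0,      # deterministic-ish decoding to isolate quantization effects
            "max_tokens": 2048,
        }, timeout=600)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

async def run(questions: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        return list(await asyncio.gather(*(ask(client, sem, q) for q in questions)))

if __name__ == "__main__":
    answers = asyncio.run(run(["What is 12 * 7?", "Name the capital of Japan."]))
    print(answers)
```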
Results: Qwen3.5 397B-A17B
Note: I only tested Unsloth’s GGUF variants, because those were the only ones available when I ran these evaluations.
The results were so surprising that I reran the evaluations with different configuration hyperparameters to double-check. Same conclusion: no mistake.
Ternary weights (TQ1_0), where parameters (excluding some layers) become {-1, 0, +1}, still track the original model closely.
The benchmark error increased by only ~18.4%, while memory dropped from ~800 GB to ~94 GB.
At 2-bit (e.g., UD-IQ2_M, ~137 GB), the performance difference from the original model is barely visible (within the benchmarks’ margin of error).
For Qwen3.5 397B-A17B, these GGUF quantizations look remarkably safe: you’re still getting something very close to the original model.
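As a rough sanity check on those memory numbers, here is a back-of-envelope estimate. The bits-per-weight values are my own assumptions (roughly 1.69 bpw for TQ1_0 and around 2.7 bpw for an IQ2_M-style mix); the actual files come out somewhat larger because some layers are kept at higher precision.

```python
# Back-of-envelope size estimate for Qwen3.5 397B-A17B at different precisions.
# Bits-per-weight values are assumed approximations, not measured from the files;
# real GGUFs keep some layers at higher precision, so the files on disk are a bit larger.
PARAMS = 397e9

def size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"BF16   (16 bpw):    ~{size_gb(16):.0f} GB")      # ~794 GB, matching the ~800 GB above
print(f"TQ1_0  (~1.69 bpw): ~{size_gb(1.6875):.0f} GB")  # ~84 GB; ~94 GB on disk with mixed layers
print(f"IQ2_M  (~2.7 bpw):  ~{size_gb(2.7):.0f} GB")     # ~134 GB; close to the ~137 GB UD-IQ2_M file
```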
Results: Minimax M2.5
I ran the same evaluation protocol on Minimax M2.5, this time including GGUFs produced by multiple groups (not just Unsloth).
Here, the picture flips completely.
The degradation is striking: even Q4 variants can land far from the original model’s performance. In practice, the quantized model feels severely impaired.
What makes this particularly important is that the quantization approach is broadly the same as what produced excellent results on Qwen3.5 above.
A note on comparability
This evaluation is designed to measure post-quantization degradation, so I used temperature = 0 to keep behavior roughly deterministic. It can be tempting to directly compare absolute numbers across models (Qwen3.5 397B-A17B vs Minimax M2.5), but I wouldn’t recommend it. Different models should be compared on larger suites and using provider-recommended decoding settings and hyperparameters.
What’s going on?
This isn’t the first time I’ve seen a model collapse under otherwise “good” quantization. A well-known example is Llama 3.1 8B, which degrades noticeably at 4-bit, whereas Qwen2 (released around the same period) remains much more robust.
The empirical lesson:
Not all models are equally robust to quantization.
Some tolerate aggressive low-bit formats surprisingly well while others fall apart.
That also means we can’t rely on a single rule of thumb like:
“Just use Q4_K_M, it’ll be fine.”
You might end up with a model that performs dramatically worse than the original, and you may not notice unless you also test the full-precision original model, which is often expensive and time-consuming.
The bigger problem: we’re mostly blind
We lack systematic, difficult-benchmark evaluation for GGUF models.
Metrics like perplexity or KL divergence aren’t sufficient. A tiny change in perplexity can look negligible and still translate into a large difference in real task performance. For quantization, what matters is what happens when the model actually generates tokens in response to hard prompts, and whether the outputs still hold up.
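To make that distinction concrete, here is a toy contrast between the two styles, using a small placeholder model (the model ID, prompt, and answer check are illustrative, not part of my protocol): perplexity needs only a forward pass over reference text, while a task-style check requires the model to actually generate an answer that can be marked right or wrong.

```python
# Toy contrast: a logit-only metric (perplexity) vs a generation-based check.
# The model ID and prompts are placeholders chosen to keep the example small.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 1) Perplexity: one forward pass over reference text, no generation involved.
text = "The capital of France is Paris."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss
print("perplexity:", torch.exp(loss).item())

# 2) Generation-based check: does the model actually produce the right answer?
prompt = "Q: What is 17 * 23? Answer with the number only.\nA:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
answer = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("generated:", answer.strip(), "| correct:", "391" in answer)
```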
Doing some research on LLM quantization?
If you work on quantization and submit papers to top-tier conferences, use real benchmarks that require generating tokens (NOT just exploiting logits) to evaluate quantization quality. Papers relying on perplexity are increasingly rejected, and that’s a good thing. I’ve seen way too many quantization methods that preserve perplexity but produce completely useless models.
So what do we do?
This is a hard problem:
GGUF evaluation is expensive because inference is slow.
There’s a race to be first to publish GGUF conversions, and those releases get downloaded heavily and then promoted downstream by all the tools that consume GGUF models.
That creates weak incentives to run careful, time-consuming, expensive benchmark evaluations before publishing.
On my side, I’ll keep evaluating more models, as many as I can. My next targets are the Qwen3.5 Medium models released this week.
Quantized Qwen3.5 27B: NVFP4, MXFP4, and INT4
This wasn’t straightforward: Qwen3.5 is still poorly supported across most quantization and inference stacks.
AutoRound can quantize Qwen3.5 MoE, but the output can’t be loaded in vLLM yet.
LLM Compressor isn’t compatible with Transformers v5, which is required to load Qwen3.5.
vLLM stable doesn’t support Qwen3.5 either.
Codex quickly helped me work through the tooling gaps and fix most frameworks. After some trial and error, I was able to produce:
an INT4 version of Qwen3.5 27B with AutoRound, and
NVFP4/MXFP4 versions with LLM Compressor.
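For the INT4 path, the rough shape of the AutoRound call is sketched below. Treat it as a sketch under assumptions: the model ID, output directory, and hyperparameters are placeholders, and the exact quantize/save API can differ between AutoRound versions. The full recipe, including the LLM Compressor NVFP4/MXFP4 runs, will be in next week’s write-up.

```python
# Sketch of weight-only INT4 quantization with AutoRound (intel/auto-round).
# Model ID, output path, and hyperparameters are placeholders; check your
# AutoRound version for the exact quantize/save API before running.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3.5-27B"  # placeholder identifier
output_dir = "./qwen3.5-27b-int4-autoround"

# Note: newer Transformers versions rename torch_dtype to dtype.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,          # INT4 weights
    group_size=128,  # common group size for weight-only quantization
    sym=True,        # symmetric quantization
)
autoround.quantize()
# Export the quantized checkpoint (AutoRound supports several export formats).
autoround.save_quantized(output_dir, format="auto_round")
```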
These quantized models are available here, and they’re all vLLM-compatible (tested with the nightly version on Verda’s H200s and Google Colab’s A100):
Acknowledgments
Thank you to Verda for providing the compute; I used their H200s for this. Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core. Check them out if you’re interested.
AutoRound should improve MoE support soon (a PR is open). It can already quantize the MoE models, but since vLLM can’t load the result yet, they’re not very useful in practice.
Next week (assuming everything goes smoothly), I’ll publish a full write-up covering how to quantize Qwen3.5, along with benchmark results comparing the quantized variants to the original models.
LFM2 24B A2B: The Largest LFM2
Liquid AI released LFM2 24B A2B just a few hours before Qwen3.5 arrived. Unfortunate timing...
It’s their largest LFM2 release to date, and training is still ongoing (the model has seen 17T training tokens so far). Liquid has said they plan to ship an LFM2.5 update within the next few months.
LFM2 (and 2.5) models are built around efficient gated short-convolution blocks plus a smaller number of grouped-query attention blocks; this release scales up depth and expert count while keeping the per-token compute path lean. It has 40 layers and 64 experts per MoE block with top-4 routing, roughly 2.3 billion active parameters per forward pass, and ships as an instruct model without reasoning traces, produced with lightweight post-training.

The model is very fast, but it’s unclear how well it performs, since Liquid AI only published comparisons against older LFM2 checkpoints. Still, if you liked LFM2 8B A1B and have the hardware to run a larger variant, this one seems much smarter and is definitely worth trying.
Reference: LFM2-24B-A2B: Scaling Up the LFM2 Architecture
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I published a review of GLM-5:
This week, we review:
⭐The Art of Efficient Reasoning: Data, Reward, and Optimization
Does Your Reasoning Model Implicitly Know When to Stop Thinking?
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!