Hi Everyone,
I’ve launched the Summer Sale for The Kaitchup: get 30% off forever on the yearly subscription.
The same deal applies to my other newsletter, The Salt, which focuses more on scientific papers and research insights.
Both offers are valid until June 30.
In this edition of The Weekly Kaitchup:
Survey: Which Quantization Format Do You Use?
Mistral Small Updated for Better Instruction-Following
Gemma 3n, SGLang with Transformers, Tower+, …
Survey: Which Quantization Format Do You Use?
Last week, I ran a survey asking which quantization formats people are currently using.
As expected, GGUF came out on top. It’s clearly the dominant format, even among experienced LLM users, who make up most of The Kaitchup’s audience. What surprised me more was how close AWQ came to it. I suspect this is partly due to how easy it is to obtain and use. Models quantized with AWQ are widely available, and the main quantization method associated with it is quite robust, largely thanks to its activation-aware design. Even though AutoAWQ is now deprecated, tools like LLM Compressor continue to make it very easy to generate high-quality AWQ models without much tuning or risk of failure.
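If you haven’t tried it yourself, here is roughly what producing an AWQ model with LLM Compressor looks like. I’m sketching this from memory, so treat the import paths, the modifier arguments, and the model ID as assumptions to check against the library’s current documentation:

```python
# Sketch: 4-bit AWQ quantization with LLM Compressor.
# Import paths, modifier arguments, and the model ID are assumptions;
# verify them against the llm-compressor docs for your installed version.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example model

# Quantize all Linear layers to 4-bit weights, keep the LM head in full precision.
recipe = AWQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",          # small calibration set
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-AWQ-W4A16",
    max_seq_length=2048,
    num_calibration_samples=256,
)
```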
This stands in stark contrast to GPTQ. Most of the GPTQ models on Hugging Face were produced using older toolchains and default settings, and a lot of them don’t hold up under evaluation. While GPTQ is technically more flexible, supporting more group sizes and quantization widths beyond 4-bit, the quality of community-made GPTQ models is inconsistent. They often require more careful inspection and benchmarking before being usable.
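When I pick up a community GPTQ checkpoint, a quick pass with lm-evaluation-harness is usually enough to tell whether it holds up. A minimal sketch below; the repo ID is a placeholder, and loading GPTQ weights also requires the appropriate kernel package (e.g., GPTQModel) to be installed:

```python
# Quick sanity check of a quantized checkpoint with lm-evaluation-harness.
# The repo ID is a placeholder; pick tasks you actually care about.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=someuser/some-model-GPTQ",  # hypothetical GPTQ repo
    tasks=["arc_challenge", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metrics to compare against the unquantized model.
for task, metrics in results["results"].items():
    print(task, metrics)
```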
It was also interesting to see bitsandbytes quantization rank higher in this survey than formats like GPTQ and FP8. Outside of QLoRA fine-tuning, I don’t see a compelling reason to use it, since it runs slower than AWQ and usually delivers slightly worse accuracy. Its popularity probably reflects its integration into the QLoRA training stack more than actual inference use.
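For context, this is the role bitsandbytes typically plays in a QLoRA setup: the base model is loaded in 4-bit NF4 and LoRA adapters are trained on top of it. A minimal loading sketch, with an example model ID:

```python
# Typical 4-bit NF4 loading step used for QLoRA fine-tuning.
# The model ID is an example; swap in the checkpoint you actually train on.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters (e.g., with peft) are then attached to this quantized model.
```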
The point of this survey was to help me decide where to focus for my upcoming large-scale evaluation of quantized models. I’ll be publishing a leaderboard in July, based on a new benchmark set that no existing model could have been tuned against. Expect to see mostly Llama 3, Qwen3, Gemma 3, and Mistral variants in the first round.
For compute, I’ve been testing Hyperbolic’s spot H100s, which are priced at $0.99/hour. That’s less than half the cost of RunPod’s community H100 instances. I was skeptical at first, since spot pricing usually comes with interruptions, but so far I’ve been able to run uninterrupted for several days straight. Honestly, at that price, it’s hard not to recommend it. If you want to try it out and support my work, you can use my referral link (you’ll also get $6 in credits, which covers about 6 hours of H100 time).
Mistral Small Updated for Better Instruction-Following
The first version of Mistral Small 3 was released back in February and has since gone through several updates, first within the 3.1 series, and now to the current 3.2 release.
With version 3.2, Mistral AI primarily improved the model’s instruction-following.
Mistral Small is designed specifically for efficient inference. It follows a wide-and-shallow architecture, meaning it uses a larger hidden size than most other models in the 24B parameter range, but with significantly fewer layers. Despite this unconventional structure, the model performs well across a broad range of benchmarks.
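You can see the width/depth trade-off directly in the model configs. A quick sketch, assuming the repo IDs below (worth double-checking on the Hugging Face Hub):

```python
# Compare width (hidden_size) and depth (num_hidden_layers) of Mistral Small
# against a deeper model of a similar scale. Repo IDs are assumptions;
# check the exact names on the Hugging Face Hub.
from transformers import AutoConfig

for repo in (
    "mistralai/Mistral-Small-3.2-24B-Instruct-2506",  # assumed repo ID
    "Qwen/Qwen3-32B",                                  # deeper model for comparison
):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "hidden_size:", cfg.hidden_size, "layers:", cfg.num_hidden_layers)
```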
I published a detailed article reviewing the initial release, along with a guide on how to fine-tune the base model. You can read it here:
Mistral Small 3.2 is released as an instruction-tuned model. Because of that, I don’t recommend fine-tuning it directly; doing so could interfere with the careful post-training alignment and optimization already applied by Mistral AI.
More News This Week
Google has released a safetensors version of Gemma 3n, whose “preview” version has been trending on Hugging Face for several months, even before it reached broad compatibility with major inference frameworks. I’ll be publishing a full review of the model next week, along with a guide on how to fine-tune the base version.
SGLang is now fully compatible with all models supported by the Hugging Face Transformers library. This mirrors a recent update in vLLM, which added support for Transformers as a backend, enabling broader model interoperability with minimal configuration changes.
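For reference, forcing the Transformers backend in vLLM looks roughly like this; the model_impl argument is how I remember the option being exposed, so verify it against the vLLM documentation:

```python
# Rough sketch: serving a model through vLLM's Transformers backend.
# The model_impl argument name is an assumption; check the vLLM docs.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B", model_impl="transformers")
outputs = llm.generate(
    ["Explain AWQ quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```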
LLM Compressor is approaching a more user-friendly implementation of FP4 quantization, and the models it produces are already compatible with vLLM. Looking ahead, FP4 is likely to gain significant traction by the end of the year and could become the default quantization format by 2026. On Blackwell GPUs, FP4 models will likely be the standard, given the hardware-level optimizations for that format.
If you're looking for a high-quality translation model to run locally, Tower+ is currently the state-of-the-art. It's released under a CC-BY-NC-SA license, so it's not suitable for commercial products, but for personal or internal use, it performs exceptionally well. It delivers strong translation quality while keeping your data private and fully on-device.
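Running it locally is the usual chat-style workflow with Transformers. A sketch below; the repo ID is my guess at the Tower+ naming on the Hub, so check the exact checkpoint name:

```python
# Local translation with a chat-style prompt via Transformers.
# The repo ID is an assumption; look up the exact Tower+ checkpoint on the Hub.
import torch
from transformers import pipeline

translator = pipeline(
    "text-generation",
    model="Unbabel/Tower-Plus-9B",   # assumed repo ID
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Translate the following text from English to German: "
                   "The results will be published in July.",
    },
]
# The pipeline returns the conversation with the assistant's reply appended.
result = translator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])
```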
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in The Weekly Salt, I reviewed:
⭐ RLPR: Extrapolating RLVR to General Domains without Verifiers
All is Not Lost: LLM Recovery without Checkpoints
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!