Hi Everyone,
I’ve launched the Summer Sale for The Kaitchup: get 30% off the yearly subscription, forever.
The same deal applies to my other newsletter, The Salt, which focuses more on scientific papers and research insights.
Both offers are valid until June 30.
It’s been a quiet week, and summer is approaching. I’d bet the major labs are putting the final touches on big releases before their researchers head off for the summer break.
In this edition of The Weekly Kaitchup:
Survey: Which Quantization Format Do You Use?
Axolotl + LLM Compressor: Fine-Tune, Prune, and Quantize with a Single Command
Survey: Which Quantization Format Do You Use?
I'm currently running a large-scale evaluation of publicly available quantized LLMs using a completely new benchmark. The first results will likely be published in July.
To guide this work, I’d like to know which quantization format The Kaitchup readers use most. I'll prioritize evaluation in that format.
MLX is another popular option, but I don’t currently have the hardware to support it.
I’ll publish the results of this survey in the next Weekly Kaitchup.
Note: I already know the clear winner for the most-used format, but I’m very curious to see which one comes in second!
Axolotl + LLM Compressor: Fine-Tune, Prune, and Quantize with a Single Command
Sparse models like Sparse Llama have demonstrated that structured pruning, specifically 2:4 sparsity (two out of every four contiguous weight elements set to zero), can eliminate a substantial fraction of weights in key projection and feedforward layers without materially degrading task-specific performance.
These structured patterns are aligned with hardware-accelerated sparsity supported in NVIDIA’s Hopper architecture.
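To make the pattern concrete, here is a minimal sketch (not production pruning code) that imposes a 2:4 mask on a weight matrix by keeping the two largest-magnitude values in each contiguous group of four. Real pruning methods such as SparseGPT choose which weights to drop far more carefully; this only illustrates what the resulting pattern looks like.

```python
import torch

def apply_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4, zero the rest."""
    rows, cols = weight.shape            # assumes cols is a multiple of 4
    groups = weight.reshape(-1, 4)       # contiguous groups of 4 along the input dimension
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = apply_2_4_mask(w)
# Every group of 4 now has at most 2 nonzero entries.
assert ((w_sparse.reshape(-1, 4) != 0).sum(dim=1) <= 2).all()
```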
Axolotl can now leverage this sparsity-quantization pipeline.
Fine-tuning under sparsity constraints is achieved by applying deterministic pruning masks. The pruning modifier operates at the parameter level, applying static sparsity patterns that zero out specified tensor elements based on regex-defined layer names (e.g., q_proj.weight, k_proj.weight, etc.). During optimization, gradients are computed and applied only to the nonzero weights, which ensures the 2:4 sparsity pattern is preserved throughout training. The resulting model maintains its structured sparsity while adapting to the task-specific data distribution.
Once fine-tuning converges, the model is passed through LLM Compressor’s quantization module. Post-training quantization is performed with the oneshot API, which applies operator-aware quantization recipes to all layers except the final classification head. For instance, FP8_DYNAMIC quantization maps both weights and activations to an 8-bit floating-point representation, using static per-channel scales for the weights and dynamic per-token scales for the activations. This method does not require calibration data and avoids the complexity of histogram-based range estimation, making it suitable for automation in training pipelines. We saw how to use LLM Compressor and these techniques in this article:
The resulting model is both pruned and quantized, and it remains fully compatible with the Hugging Face transformers library.
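As a rough sketch of what the quantization step can look like with a recent LLM Compressor release (the checkpoint path and output directory below are placeholders, not the actual artifacts from the Axolotl integration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholder: the fine-tuned, 2:4-sparse checkpoint produced earlier in the pipeline.
MODEL_ID = "path/to/sparse-finetuned-model"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights and activations with dynamic activation scales; no calibration data needed.
# The lm_head is left in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("sparse-finetuned-model-FP8-Dynamic")
tokenizer.save_pretrained("sparse-finetuned-model-FP8-Dynamic")
```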
Empirical experiments on the TLDR summarization task show that the 2:4 sparse FP8 model achieves a 2.8x reduction in latency and a 4.9x reduction in memory footprint relative to the dense baseline.

The full workflow is scriptable via YAML-based configurations.
source: Axolotl meets LLM Compressor: Fast, sparse, open
Reading Recommendation for the Weekend
An excellent read for anyone looking to fully understand one of the most critical components for efficient LLM inference: The KV Cache
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
In this week’s issue of The Weekly Salt, I reviewed:
⭐ MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Optimizing Length Compression in Large Reasoning Models
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!