Hi Everyone,
I’ve launched the Summer Sale for The Kaitchup: get 30% off the yearly subscription, forever.
The same deal applies to my other newsletter, The Salt, which focuses more on scientific papers and research insights.
Both offers are valid until June 30.
It’s been a quiet week, and summer is approaching. I’d bet the major labs are putting the final touches on big releases before their researchers head off for the summer break.
In this edition of The Weekly Kaitchup:
Survey: Which Quantization Format Do You Use?
Axolotl + LLM Compressor: Fine-Tune, Prune, and Quantize with a Single Command
Survey: Which Quantization Format Do You Use?
I'm currently running a large-scale evaluation of publicly available quantized LLMs using a completely new benchmark. The first results will likely be published in July.
To guide this work, I’d like to know which quantization format The Kaitchup readers use most. I'll prioritize evaluation in that format.
MLX is another popular option, but I don’t currently have the hardware to support it.
I’ll publish the results of this survey in the next Weekly Kaitchup.
Note: I already know the clear winner for the most-used format, but I’m very curious to see which one comes in second!
Axolotl + LLM Compressor: Fine-Tune, Prune, and Quantize with a Single Command
Sparse models like Sparse Llama have demonstrated that structured pruning, specifically 2:4 sparsity (two out of every four contiguous weight elements set to zero), can eliminate a substantial fraction of weights in key projection and feedforward layers without materially degrading task-specific performance.
These structured patterns are aligned with hardware-accelerated sparsity supported in NVIDIA’s Hopper architecture.
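To make the pattern concrete, here is a minimal sketch (not production pruning code) that imposes a 2:4 mask on a weight matrix by keeping the two largest-magnitude values in each contiguous group of four. Real pruning methods such as SparseGPT choose which weights to drop far more carefully; this only illustrates what the resulting pattern looks like.

```python
import torch

def apply_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4, zero the rest."""
    rows, cols = weight.shape            # assumes cols is a multiple of 4
    groups = weight.reshape(-1, 4)       # contiguous groups of 4 along the input dimension
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = apply_2_4_mask(w)
# Every group of 4 now has at most 2 nonzero entries.
assert ((w_sparse.reshape(-1, 4) != 0).sum(dim=1) <= 2).all()
```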
Axolotl can now leverage this sparsity-quantization pipeline.
Fine-tuning under sparsity constraints is achieved by applying deterministic pruning masks. The pruning modifier operates at the parameter level, applying static sparsity patterns that zero out specified tensor elements based on regex-defined layer names (e.g., q_proj.weight, k_proj.weight, etc.). During optimization, gradients are computed and applied only to the nonzero weights, which ensures the 2:4 sparsity pattern is preserved throughout training. The resulting model maintains its structured sparsity while adapting to the task-specific data distribution.
Once fine-tuning converges, the model is passed through LLM Compressor’s quantization module. Post-training quantization is performed with the oneshot API, which applies operator-aware quantization recipes to all layers except the final classification head. For instance, FP8_DYNAMIC quantization maps both weights and activations to an 8-bit floating-point representation, using static per-channel scales for the weights and dynamic per-token scales for the activations. This method does not require calibration data and avoids the complexity of histogram-based range estimation, making it suitable for automation in training pipelines. We saw how to use LLM Compressor and these techniques in this article:
The resulting model is both pruned and quantized, and it remains fully compatible with the Hugging Face transformers library.
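As a rough sketch of what the quantization step can look like with a recent LLM Compressor release (the checkpoint path and output directory below are placeholders, not the actual artifacts from the Axolotl integration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholder: the fine-tuned, 2:4-sparse checkpoint produced earlier in the pipeline.
MODEL_ID = "path/to/sparse-finetuned-model"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights and activations with dynamic activation scales; no calibration data needed.
# The lm_head is left in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("sparse-finetuned-model-FP8-Dynamic")
tokenizer.save_pretrained("sparse-finetuned-model-FP8-Dynamic")
```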
Empirical experiments on the TLDR summarization task show that the 2:4 sparse FP8 model achieves a 2.8x reduction in latency and a 4.9x reduction in memory footprint relative to the dense baseline.

The full workflow is scriptable via YAML-based configurations.
source: Axolotl meets LLM Compressor: Fast, sparse, open
Reading Recommendation for the Weekend
An excellent read for anyone looking to fully understand one of the most critical components for efficient LLM inference: The KV Cache
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
In this week’s issue of The Weekly Salt, I reviewed:
⭐ MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Optimizing Length Compression in Large Reasoning Models
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!