Hi Everyone,
In this edition of The Weekly Kaitchup:
QuTLASS: Efficient Inference with 4-bit Models for Blackwell GPUs
Voxtral: “The Best ASR Model in the World”?
RAT: Recurrent Attention Transformer
QuTLASS: Efficient Inference with 4-bit Models for Blackwell GPUs
The FP4 wave is coming. QuTLASS will be, I think, one of the building blocks for the efficient and easy deployment of FP4 models.
QuTLASS is a new low-precision kernel library built on top of NVIDIA CUTLASS, designed to enable efficient 4-bit inference (MXFP4) for large language models (LLMs) running on the new NVIDIA Blackwell GPUs. It is made by IST-DASLab, the same people behind GPTQ and Marlin.
The main addition in QuTLASS is support for microscaling, a feature introduced in Blackwell hardware. Microscaling allows scale factors to be efficiently applied across small groups of elements, which enables accurate low-bit matrix multiplications without losing numerical stability.
More on the importance of the group size in this article:
QuTLASS takes advantage of this by providing both quantization routines and matmul kernels for Blackwell’s architecture.
It supports W4A4 quantization (4-bit weights and activations), and includes fused kernels that combine Hadamard rotation, quantization, and scaling in a single step. This is important because Hadamard transforms help spread out information across tensor elements, which improves the quality of quantization. The rotation size is matched to the microscaling group size (e.g., 32), and arbitrary rotation matrices can be used at runtime.
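To make the microscaling idea concrete, here is a minimal PyTorch sketch of group-wise FP4 fake quantization with a Hadamard rotation (group size 32, one shared power-of-two scale per group). This is only my illustration of the concept; the actual QuTLASS kernels fuse these steps on the GPU and keep the data in real FP4 formats:

```python
# Minimal sketch of MXFP4-style microscaling with a Hadamard rotation.
# Illustration only -- not the QuTLASS kernels.
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n = power of two)."""
    H = torch.tensor([[1.0]])
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / n**0.5

def mxfp4_fake_quantize(x: torch.Tensor, group: int = 32) -> torch.Tensor:
    """Quantize x in groups of `group` elements, one shared power-of-two scale per group."""
    H = hadamard(group)
    xg = x.reshape(-1, group) @ H                                    # rotate each group to spread outliers
    amax = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = 2.0 ** torch.ceil(torch.log2(amax / FP4_GRID[-1]))       # shared scale so |xg|/scale <= 6
    mag = (xg.abs() / scale).unsqueeze(-1)
    q = FP4_GRID[(mag - FP4_GRID).abs().argmin(dim=-1)] * xg.sign()  # round to nearest FP4 value
    return ((q * scale) @ H.T).reshape(x.shape)                      # dequantize and rotate back

x = torch.randn(4, 128)
print((x - mxfp4_fake_quantize(x)).abs().mean())                     # average round-trip error
```

The shared power-of-two scale per 32-element block is what the Blackwell microscaling formats provide in hardware; the Hadamard rotation is the trick that keeps each group's maximum from being dominated by a single outlier.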
QuTLASS includes several quantization methods and two types of matmul kernels:
A CUTLASS-backed MXFP4 kernel that supports block-scale reordering.
A custom prototype kernel optimized for small batch sizes that avoids the reordering step.
In benchmarks, it shows significant end-to-end inference speedups over standard BF16 inference in Transformers, especially on large LLMs (8B–14B). Speed gains increase with batch size and sequence length, peaking around 4x faster than BF16 on Blackwell GPUs.
QuTLASS is still an early release (v0.0.1) but already shows strong potential for efficient LLM inference using 4-bit compute on Blackwell GPUs.
GitHub: IST-DASLab/qutlass
Voxtral: “The Best ASR Model in the World”?
The Voxtral models are open-source speech understanding models available in two sizes: Voxtral Small (24B) and Voxtral Mini (3B).
They are built for advanced speech-to-text and audio understanding tasks and released under the Apache 2.0 license.
Technically, Voxtral integrates transcription, question answering, summarization, and function-calling in a single model, eliminating the need to chain multiple models. It supports long-form audio with a 32k-token context length (up to 30–40 minutes) and operates natively across multiple languages. The models are built on Mistral Small 3.1 and seem to retain advanced language modeling capabilities for downstream tasks.
Benchmark results show that Voxtral outperforms Whisper large-v3, GPT-4o mini, and Gemini 2.5 Flash in transcription tasks, including both short- and long-form English datasets (e.g., LibriSpeech, Switchboard, Earnings21/22) and multilingual benchmarks like Mozilla Common Voice and FLEURS.
In audio understanding, Voxtral demonstrates strong performance in answering questions from spoken content and translating speech.
Given these results, Mistral AI calls it the best ASR model in the world.
Note: A few years ago, I spent a significant amount of time evaluating automatic speech recognition (ASR) models. I can say from experience that this task is far from straightforward. One of the main challenges is that different ASR models apply different post-processing strategies. For example, some models might include punctuation or special characters in their output, while others might omit them entirely. These inconsistencies complicate direct comparisons.
The problem is that the commonly used evaluation metric, Word Error Rate (WER), as used by Mistral AI in this case, is highly sensitive to such differences. The presence or absence of even a single symbol or punctuation mark can disproportionately impact the score. This can either unfairly benefit or penalize a model depending on how the reference transcript is formatted. It is very strange to me that ASR researchers are still using WER for evaluation. In machine translation research, we had similar problems with BLEU. We (nearly) got rid of it and use neural metrics instead.
Ultimately, ensuring that all models are evaluated under exactly the same conditions while using WER is extremely difficult, if not impossible. Even small variations in formatting or preprocessing can skew the results, making it hard to draw truly fair comparisons.
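To see how fragile WER is, here is a toy example with a hand-rolled WER function (edit distance over whitespace-split words, which is all WER is). The transcripts are made up; the second hypothesis contains exactly the same words as the reference, yet casing and punctuation alone push the score to 0.4:

```python
# Toy illustration of WER's sensitivity to formatting (made-up transcripts).

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

reference = "hello how are you today"
print(wer(reference, "hello how are you today"))    # 0.0
print(wer(reference, "Hello, how are you today?"))  # 0.4 -- same words, different formatting
```

This is why transcripts are usually normalized (lowercasing, stripping punctuation) before scoring, and why results depend heavily on whose normalization script is used.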
RAT: Recurrent Attention Transformer
RAT is yet another hybrid architecture designed to improve Transformer efficiency on long sequences by combining lightweight recurrence with global attention. To process input, the model partitions sequences into fixed-length chunks. Within each chunk, it uses a simple gated RNN-style recurrence to capture local token dependencies. Across chunks, it applies standard softmax attention over the compressed chunk representations. By tuning the chunk size L, RAT interpolates between pure recurrence (for large L) and full attention (for L = 1).
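For intuition, here is a rough PyTorch sketch of that structure: a gated recurrence inside each fixed-length chunk, then causal softmax attention over one compressed state per chunk. This is my own simplification of the description above (all names and details are mine), not the authors' implementation:

```python
# Rough sketch of a RAT-style block: intra-chunk gated recurrence + inter-chunk attention.
# Simplified illustration only -- see the authors' repo for the real implementation.
import torch
import torch.nn as nn

class RATStyleBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, chunk: int = 16):
        super().__init__()
        self.chunk = chunk
        self.gate = nn.Linear(d_model, d_model)   # input-dependent gate for the recurrence
        self.inp = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        L = self.chunk
        assert T % L == 0, "pad the sequence to a multiple of the chunk size"
        xc = x.reshape(B, T // L, L, D)            # (batch, n_chunks, chunk_len, d_model)

        # 1) Lightweight gated recurrence *within* each chunk (local dependencies).
        g = torch.sigmoid(self.gate(xc))
        v = self.inp(xc)
        h = torch.zeros(B, T // L, D, device=x.device)
        states = []
        for t in range(L):                         # only `chunk` sequential steps
            h = g[:, :, t] * h + (1 - g[:, :, t]) * v[:, :, t]
            states.append(h)
        local = torch.stack(states, dim=2)         # recurrent state at every token

        # 2) Standard softmax attention *across* chunks over compressed representations.
        summaries = local[:, :, -1]                # one vector per chunk (its last state)
        mask = torch.triu(torch.ones(T // L, T // L, dtype=torch.bool,
                                     device=x.device), diagonal=1)   # causal over chunks
        glob, _ = self.attn(summaries, summaries, summaries, attn_mask=mask)

        # Broadcast each chunk's global context back to its tokens.
        return (local + glob.unsqueeze(2)).reshape(B, T, D)

block = RATStyleBlock()
print(block(torch.randn(2, 128, 256)).shape)       # torch.Size([2, 128, 256])
```

With a chunk size of 1 the recurrence disappears and you are back to token-level attention; with a chunk as long as the sequence, only the recurrence remains, which is the interpolation described above.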
Empirical results from the paper show that with a chunk size of 16, RAT speeds up training on sequences of 100k tokens by roughly 7x, and achieves around 9x faster generation for sequences of 4k tokens, while maintaining comparable accuracy to full-attention models.
Paper: RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
The authors also trained a 1.3B-parameter model from scratch, evaluating it across six short-context reasoning tasks, 14 long-context tasks (e.g., LongBench), and four supervised fine-tuning (SFT) benchmarks. RAT matched or even slightly exceeded the performance of standard Transformers, and a hybrid variant mixing RAT with local attention further improved results.
Always interesting to see new attempts at hybrid architectures. They also released their training code (I didn’t try it):
GitHub: CLAIRE-Labo/RAT
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
In The Weekly Salt, I reviewed:
Test-Time Scaling with Reflective Generative Model
⭐KV Cache Steering for Inducing Reasoning in Small Language Models
One Token to Fool LLM-as-a-Judge
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!