Hi Everyone,
In this edition of The Weekly Kaitchup:
Mistral Small 3.1: Now Multimodal
Clarifying TRL's Recent Changes
Projects in Progress @ The Kaitchup
Toolbox Update
LLMs on a Budget: Chapter 3
Other Highlights:
TULU 3.1 but with LoRA
Ultra-Efficient LLMs: MLA, vocabulary reduction, and low-bit quantization
"Fine-tunability" metric to benchmark LLM adaptability
I'm currently offering a 30% discount on the annual subscription to The Kaitchup and The Kaitchup Pro!
Mistral Small 3.1: Mistral Small 3 Becomes Multimodal
Mistral Small 3 is a solid open-source LLM (Apache 2.0) that performs well on benchmarks and runs efficiently, partly due to its unusually wide neural architecture. We took a closer look at it here:
This week, Mistral AI released an update, Mistral Small 3.1, which is even better and multimodal:
As we saw yesterday, adding multimodal support can sometimes boost language performance.
Does it hold true for Mistral Small 3.1?
Mistral AI has published benchmark results:
Version 3.1 outperforms other multimodal models on public benchmarks, but there’s no direct comparison with the earlier Mistral Small release.
I haven’t evaluated the models myself, but we can compare the numbers from their model cards. Here’s Mistral Small 3.1 Instruct:
Here is the previous version:
The new version doesn’t seem significantly better than the previous one for language tasks. It’s worth using only if you need vision support or strong multilingual capabilities.
Since it's a recent multimodal model, getting it to run now with common inference frameworks can be a bit tricky. Make sure you're using the latest versions of both Transformers and vLLM, as recommended by Mistral AI in the model card.
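For reference, running it offline with vLLM looks roughly like this. This is only a sketch: the model ID and the tokenizer_mode setting below are my assumptions, so double-check the exact arguments recommended in the model card.

# pip install -U vllm transformers
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",  # assumed repo name, check the model card
    tokenizer_mode="mistral",  # Mistral checkpoints ship their own tokenizer format
    max_model_len=8192,
)

messages = [{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}]
outputs = llm.chat(messages, SamplingParams(temperature=0.15, max_tokens=256))
print(outputs[0].outputs[0].text)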
I tried quantizing it, but no luck so far—I’ll give it another shot next week.
Note: I’m also still working on quantizing Gemma 3 using methods like AutoRound. AutoRound does support Gemma 3, but it can’t export to GPTQ yet (or, more precisely, it exports a GPTQ model that doesn’t seem to work); only its own “autoround” format works, which isn’t compatible with most frameworks. GPTQModel will support Gemma 3 (the multimodal version) once the Transformers implementation matures (right now there’s still no AutoModelForCausalLM support for Gemma 3). Once that’s ready, I plan to quantize it with GPTQModel using the AutoRound algorithm for better accuracy.
This could take a while. In the meantime, you can quantize the model using bitsandbytes. There are also several GGUF versions already floating around—though it’s unclear how well those actually perform.
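For reference, on-the-fly 4-bit loading with bitsandbytes looks roughly like this. It’s a sketch: the repo name and the AutoModelForImageTextToText class are assumptions, so adjust them to your Transformers version and to the checkpoint you actually use.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "google/gemma-3-27b-it"  # assumed repo name
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # quantized at load time, no export step
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)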
Clarifications on TRL's Recent Changes
Over the past few months, especially since Hugging Face started integrating GRPO, TRL has changed quite a bit. Many older notebooks that you can find online and that use TRL are now broken. What’s worse, some of the changes affect dataset preprocessing in ways that aren’t clearly disclosed to users.
Since I’ve written a lot of tutorials using TRL, I think it’s important to clarify how TRL behaves now.
Back in early 2024, things were simpler. You could preprocess the training data yourself, pass dataset_text_field='text' to SFTConfig, and that was mostly it. TRL didn’t modify your data much beyond tokenization. You could apply your own chat template and preprocess everything beforehand, confident that TRL wouldn’t interfere.
Now, though, TRL applies a bunch of internal processing—without being very explicit about it. As a result, it’s hard to tell exactly what TRL is doing to your data. That makes it tricky to ensure you're using the same prompt format at inference time that the model actually saw during training.
For example, doing this
import multiprocessing

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # any tokenizer with a chat template
ds_train = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:10000]")

def process(row):
    # Apply the chat template ourselves and store the result in a "text" column
    prompt_messages = tokenizer.apply_chat_template(row["messages"], tokenize=False)
    row["text"] = prompt_messages
    return row

ds_train = ds_train.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
will now throw an error when you try to instantiate the SFTTrainer with your dataset. TRL expects to handle the "messages" column on its own and completely ignores your "text" column, even if you explicitly set dataset_text_field='text' in the SFTConfig. I believe this is currently not intended and will be corrected in the coming months.
To work around this, you’ll need to remove all other columns from your dataset.
ds_train = ds_train.remove_columns(["messages", "prompt"])
Even if you do that, TRL currently shows this:
TRL gives the impression that it’s converting your data (specifically, the "text" column, since only this column remains) to ChatML, and then applying a chat template on top of something that’s already formatted with a chat template. 🤨
Luckily, these lines are mostly just misleading. You can safely ignore them if your dataset is already preprocessed into plain text. But if it’s not, you’re basically at TRL’s mercy. You won’t really have control over what it does now with your data—or what it might do in future updates.
TRL is clearly moving in a direction where it assumes you want it to handle chat formatting automatically, using a chat template included in the tokenizer. And since only instruct model tokenizers typically include chat templates, TRL is now (implicitly) assuming you’re fine-tuning an instruct model, i.e., a model that has already been post-trained, not a base model.
Fine-tuning base models with TRL—whether with no chat template or a custom one—is becoming more of a hassle than it should be.
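One way to keep some control over a base model is to attach a chat template to the tokenizer yourself before handing everything to TRL. Here is a minimal sketch: the model, the dataset slice, and the plain ChatML template are purely illustrative, and argument names (like processing_class) have been shifting between TRL versions.

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.1-8B"  # a base model, no chat template in its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# A plain ChatML template; consider adding <|im_start|>/<|im_end|> as special tokens
# (and resizing the embeddings) if you train with it seriously.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

ds_train = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:10000]")

trainer = SFTTrainer(
    model=model_id,
    args=SFTConfig(output_dir="./sft-base-chatml"),
    train_dataset=ds_train,        # TRL applies the chat template to the "messages" column
    processing_class=tokenizer,    # named "tokenizer" in older TRL versions
)
trainer.train()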
I’m currently updating the AI notebooks and toolboxes at The Kaitchup to support these recent TRL changes, but it might take one or two weeks to sort everything out (dozens of my notebooks use TRL).
Ongoing Projects at The Kaitchup
Toolbox Update: PyTorch 2.6, GRPO, and Fixing Issues
I'm updating the AI toolboxes to ensure compatibility with PyTorch 2.6. Some recently updated dependencies, like vLLM, now work better and run faster with this version. Transformers also seems to have issues with PyTorch versions older than 2.5.1.
I'm also integrating GRPO and addressing issues introduced by recent package changes—particularly with TRL, as mentioned earlier.
The Llama 3x toolbox update will be released first, likely next week, followed by the Qwen2.5 toolbox shortly after.
You can get the toolboxes here:
Note: These toolboxes are included with The Kaitchup Pro subscription (the highest tier).
Chapter 3 of LLMs on a Budget
Chapter 3 of The Kaitchup’s Book is currently under review. All sections and notebooks are nearly ready for release.
I’m planning to publish it next Friday or over the next weekend.
Note: The book is included with The Kaitchup Pro subscription (the highest tier).
Long-Term Projects
Most articles on The Kaitchup focus on reviewing existing methods and models to make AI more efficient. I’ve started working on deeper, more original content—projects that include new experimental results and aim to give more unique insights into the current state of local LLMs. These are closer to “research” articles, and naturally, they take longer to write. Here's a look at what I’m currently working on:
🔧 TULU 3.1… but with LoRA
TULU 3.1 is an open-source model released by AI2, based on Llama 3.1. Along with the model, they also shared their training pipeline and datasets. They fine-tuned the Llama 3.1 8B base model using a well-curated instruction dataset (SFT), followed by post-training with DPO and GRPO.
My goal is to see how close we can get to TULU 3.1’s performance using LoRA instead of full fine-tuning. While earlier work—including the original LoRA paper and many follow-ups—suggests that LoRA can match full fine-tuning, most comparisons use weak baselines with suboptimal hyperparameters or small task-specific datasets.
But what happens if we use TULU's well-optimized full fine-tuning recipe and a large dataset for general-purpose instruction fine-tuning? Can LoRA still compete?
If it can—even with a slight drop in performance—it would be a huge win for efficiency.
I've been working on this for the last couple of months, testing different hyperparameter combinations and fine-tuning adapters for long periods. So far, my LoRA fine-tuning is still significantly behind the performance AI2 got with full fine-tuning, despite using the same base model and data.
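To give an idea of the setup, here is the shape of a LoRA configuration for this kind of experiment (the values are illustrative, not the hyperparameters I actually settled on):

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                    # a relatively high rank, to give LoRA a fair chance against full fine-tuning
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear projections in Llama 3.1
    task_type="CAUSAL_LM",
)
# Passed to TRL's SFTTrainer through its peft_config argument.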
That said, I’m not done yet. I have several more ideas to try.
⚠️ Note: Don’t expect this article to come out soon. I want to take the time to confirm everything properly.
⚡ Super-Efficient LLMs with MLA, Vocabulary Reduction, and Low-Bit Quantization
We’ve explored many ways to make models more efficient in The Kaitchup, often without sacrificing accuracy—and sometimes even improving it. Some examples:
Swapping GQA for MLA
Using low-bit quantization with adapters to preserve accuracy
Reducing vocabulary size to shrink the model’s largest activation layers, which lowers memory usage, especially during fine-tuning (a quick back-of-the-envelope estimate follows this list)
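To make the third point concrete, here is a quick back-of-the-envelope estimate for a model shaped like Llama 3.1 8B (hidden size 4096, a 128,256-token vocabulary, untied embeddings); the reduced vocabulary size is a made-up example.

hidden_size = 4096
full_vocab = 128_256
reduced_vocab = 32_000  # hypothetical pruned vocabulary

def embedding_params(vocab_size, hidden_size, tied=False):
    # input embedding matrix + (optionally separate) LM head
    n = vocab_size * hidden_size
    return n if tied else 2 * n

saved = embedding_params(full_vocab, hidden_size) - embedding_params(reduced_vocab, hidden_size)
print(f"Parameters saved: {saved / 1e9:.2f}B")                    # ~0.79B parameters
print(f"Weight memory saved in bf16: {saved * 2 / 1e9:.2f} GB")   # ~1.58 GB, before optimizer states
# The logits activation (batch x sequence x vocab) shrinks in the same proportion during training.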
Each of these works well on its own. But what if we combine them?
That’s the goal here: to build a super-compact, fast model by combining all these methods, then testing whether the final model still holds up in performance.
I started this a few weeks ago, using Qwen2.5 72B Instruct as the base model. I chose this model because it’s still one of the best open models available, and it handles low-bit quantization really well.
This is mostly an engineering challenge, but I’m optimistic the results will be worth it.
📏 A "Fine-Tunability" Metric for LLMs
When new models are released, they're evaluated on public benchmarks—and almost always edge out their predecessors by a bit. But these benchmarks have limitations:
Results are often not comparable between reports, due to differences in frameworks, settings, or evaluation methods.
Widely-used benchmarks like MMLU or IFEval are increasingly overfitted and no longer reflect real-world model performance.
Benchmark data (or close variations) frequently leak into training sets of new models, inflating their scores artificially.
While new benchmarks are still useful for measuring accuracy on fixed tasks, I think we’re missing another important metric that matters even more: how easily a model can adapt to new data.
This is crucial for real-world use. Companies often have decent base models and internal datasets—but what they really need to know is:
Can the model quickly and efficiently learn from our data using current frameworks and methods?
For instance, in my experience, Qwen2.5 fine-tunes more easily and effectively than Llama models in many of my projects. But there’s no metric today, to the best of my knowledge, that captures this kind of "learnability."
That’s what I’ve started exploring: a way to measure what I’m calling fine-tunability. This is a long-term project involving significant research, but I’m very excited about it.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
In The Weekly Salt, I reviewed:
⭐Transformers without Normalization
⭐ Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (group subscriptions are 20% off, or 30% off for Pro subscribers):
Have a nice weekend!