The Weekly Kaitchup #77
Qwen2.5-VL - Mistral AI and Apache 2.0 - vLLM v1, Janus 7B, and DeepSeek-R1 1.58-bit
Hi Everyone,
In this edition of The Weekly Kaitchup:
Qwen2.5-VL!
Mistral AI Back to Apache 2.0?
vLLM v1, Janus 7B, and DeepSeek-R1 1.58-bit
GRPO on a Budget?
Book Update
My original plan was to publish the third chapter of my book, LLMs on a Budget, today. However, I have decided to postpone its release to next month (February).
This chapter focuses on popular and state-of-the-art quantization methods for LLMs. Initially, the chapter was meant to be a practical guide presenting quantization methods and their usage, one by one. However, I’ve now decided to expand it into a more comprehensive survey. I believe this approach will add significantly more value to the chapter, as it will compare quantization accuracy, speed, and cost across Llama 3.x and Qwen2.5 models. The survey will cover multiple quantization techniques, including GPTQModel, AWQ, VPTQ (maybe…), bitsandbytes, llama.cpp, AutoRound, and HQQ, tested at different bitwidths.
Running all these experiments will take time. If you're interested in tracking my progress, you can do so indirectly by checking the models I publish in the Kaitchup space on Hugging Face. I have already released new models quantized with GPTQModel, a promising successor to AutoGPTQ. However, GPTQModel comes with some caveats that further delayed my progress: quantizing a 70B model requires over 200 GB of CPU memory, 3-bit quantization is buggy, and compilation fails on Colab.
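If you want to try GPTQModel yourself, the basic workflow looks something like this. This is a minimal sketch following the library’s documented API; the model ID, calibration set, and bit width are illustrative choices, not the exact settings of my survey:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Illustrative model ID; remember that a 70B model needs over 200 GB of CPU RAM
model_id = "Qwen/Qwen2.5-7B-Instruct"

# A small calibration set; larger sets improve accuracy but slow down quantization
calibration = load_dataset(
    "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
).select(range(512))["text"]

# 4-bit with group size 128 (avoid 3-bit for now, it is buggy)
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration)
model.save("Qwen2.5-7B-Instruct-gptq-4bit")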
Stay tuned for more updates!
Qwen2.5-VL!
While everyone is still talking about DeepSeek-R1, you might have missed an exciting update: this week, the Qwen team released Qwen2.5-VL!
Qwen2-VL was already among the best vision-language models, and Qwen2.5-VL takes things even further. It can handle significantly more complex tasks and performs on par with top commercial models.
I won’t dive into all the details just yet; I’ll be publishing an article next week.
In the meantime, you can check out the models here:
Over the past two weeks, the performance gap between commercial and open models has narrowed significantly! China's AI advancements (e.g., by Alibaba (Qwen) and DeepSeek AI) are accelerating at an incredible pace. I’m eager to see how Meta, OpenAI, Google, and Anthropic will respond to this rapid progress!
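If you want a quick taste before the article, here is a minimal inference sketch based on the usage example in Qwen’s model card. It assumes a recent version of Transformers with Qwen2.5-VL support and the qwen-vl-utils package; the image URL is just a placeholder:

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/demo.jpg"},  # any image URL or local path
    {"type": "text", "text": "Describe this image."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])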
Mistral AI Back to Apache 2.0?
Yesterday, Mistral AI released Mistral Small 3, a not-so-small 24B-parameter model, available as base and instruct versions:
According to their own evaluation, this model outperforms Qwen2.5 32B and performs comparably to Llama 3.3 70B Instruct.
Mistral AI has also emphasized the model's inference throughput, claiming it is faster than other LLMs of similar size and three times faster than Llama 3.3 70B. I haven’t yet examined the architecture in detail, but they mention it has fewer layers than comparable models, which reduces the sequential compute per token and would explain the speed boost. I plan to review the model soon and will attempt to quantize it if its architecture is supported by AutoRound.
However, the biggest news from this release is that Mistral AI appears to be recommitting to releasing models under an Apache 2.0 license—which is fantastic! Mistral 7B and Mixtral were both excellent models, and I wrote extensively about them. But then, they started releasing most of their newer models under the MRL license, which restricts commercial use. Because of that, I (and many others) could no longer write tutorials demonstrating them.
And More: vLLM v1, Janus 7B, and DeepSeek-R1 1.58-bit
vLLM v1 Is Coming!
The vLLM team released vLLM 0.7.0. The most notable improvement is that torch.compile is now fully integrated into vLLM.
What’s more, this new version also ships vLLM v1, an entirely new engine that they claim is up to 1.7x faster.
vLLM V1: A Major Upgrade to vLLM's Core Architecture
This engine is not enabled by default. To use it, you just need to set this environment variable:
VLLM_USE_V1=1
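For example, from Python (a minimal sketch; the model ID is an arbitrary example, and the variable must be set before vLLM is imported):

import os
os.environ["VLLM_USE_V1"] = "1"  # enable the new v1 engine before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in two sentences."], sampling)
print(outputs[0].outputs[0].text)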
Janus Pro 7B by DeepSeek AI
DeepSeek AI has released Janus Pro 7B, a multimodal model that can process both text and images as input—like most vision-language models (VLMs)—but what sets it apart is its ability to generate both text and images, a feature that remains rare.
I haven’t tested it yet, but the demo looks impressive, especially for a model that’s only 15GB. It’s released under an MIT license.
You can check it out here:
DeepSeek-R1 1.58-bit by Unsloth
We know that larger models are easier to quantize. Even sub-2-bit quantization can perform well for very large models, as we saw with VPTQ and Llama 3.1 405B:
Unsloth released a low-bit version of DeepSeek-R1.
Run DeepSeek R1 Dynamic 1.58-bit
It seems that they simply used llama.cpp with an importance matrix (imatrix) to calibrate the quantization: the imatrix is computed on calibration text and tells the quantizer which weights matter most, so those are quantized more precisely.
Are these good models? Unsloth didn’t formally evaluate them, so we don’t know. I would assume that the distilled R1 version based on Llama 3.3 70B works better while consuming about the same amount of memory as the 1.58-bit R1.
I think the 2.51-bit version could be the most useful, but this should be confirmed with a proper evaluation.
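If you want to experiment anyway, llama.cpp and its Python bindings can run these GGUF files. Here is a minimal sketch with llama-cpp-python; the shard filename is a hypothetical example (check Unsloth’s model page for the exact names), and you will still need a machine with a lot of combined RAM and VRAM, since the 1.58-bit files reportedly total around 131 GB:

from llama_cpp import Llama

# Hypothetical shard name: point llama.cpp at the first shard of the GGUF,
# with all the shards downloaded into the same directory
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=20,  # offload as many layers as fit in VRAM; the rest stays in CPU RAM
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])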
Failing GRPO on a Budget, for Now
Last week, I mentioned that I was planning to publish an article explaining GRPO, the reinforcement learning method used to train DeepSeek-R1 and Qwen. As part of this, I experimented with the GRPO implementation in TRL. However, I found that the implementation is not yet mature and likely requires a few more days or weeks of work. Despite its official release in TRL this week, GRPO remains quite buggy for memory-efficient use cases.
I encountered several issues, including:
Buggy gradient checkpointing
Limited training logs, which only report the training loss for some reason
Excessive memory consumption when running vLLM (the model is loaded in float32, which is hardcoded)
Incompatibility between vLLM and LoRA adapters, preventing generation
Strict dataset column requirements, where unused columns must be removed
Evaluation step failures, potentially related to gradient checkpointing issues
Note: Some of these issues might have been corrected since. This GRPO implementation is regularly updated.
I have started opening issues in the TRL repository and will wait for fixes.
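For context, this is roughly what the documented usage looks like. This is a minimal sketch adapted from TRL’s GRPO documentation at the time of writing; the reward function is a toy example, and given the issues above, expect rough edges:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPO only needs prompts; as noted above, unused columns must be removed
dataset = load_dataset("trl-lib/tldr", split="train")
dataset = dataset.remove_columns([c for c in dataset.column_names if c != "prompt"])

# Toy reward: favors completions close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()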
Additionally, I faced similar problems with online DPO, which also doesn’t work with LoRA adapters or gradient checkpointing. Online DPO is a well-regarded method, considered superior to standard DPO, and has been available in TRL for several months. Given its popularity, I’m curious to know who is actively using TRL’s implementation and for what purposes. Since gradient checkpointing doesn’t work, even a tiny model would consume a lot of memory.
GPU Selection of the Week:
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
NVIDIA RTX 50XX GPUs are officially released but already sold out, as expected. I won’t track their prices until I can find them at a “reasonable” price.
RTX 4090 (24 GB): None at a reasonable price.
RTX 4080 SUPER (16 GB): None at a reasonable price.
RTX 4070 Ti SUPER (16 GB): INNO3D nVidia GeForce RTX 4070 Ti Super Twin X2 16GB GDDR6X
RTX 4060 Ti (16 GB): GeForce RTX 4060 Ti Jetstream
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I reviewed in The Weekly Salt:
Evolving Deeper LLM Thinking
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
⭐DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!