The Weekly Kaitchup #77
Qwen2.5-VL - Mistral AI and Apache 2.0 - vLLM v1, Janus 7B, and DeepSeek-R1 1.58-bit
Hi Everyone,
In this edition of The Weekly Kaitchup:
Qwen2.5-VL!
Mistral AI Back to Apache 2.0?
vLLM v1, Janus 7B, and DeepSeek-R1 1.58-bit
GRPO on a Budget?
Book Update
My original plan was to publish the third chapter of my book, LLMs on a Budget, today. However, I have decided to postpone its release to next month (February).
This chapter focuses on popular and state-of-the-art quantization methods for LLMs. Initially, the chapter was meant to be a practical guide presenting quantization methods and their usage, one by one. However, I’ve now decided to expand it into a more comprehensive survey. I believe this approach will add significantly more value to the chapter, as it will compare quantization accuracy, speed, and cost across Llama 3.x and Qwen2.5 models. The survey will cover multiple quantization techniques, including GPTQModel, AWQ, VPTQ (maybe…), bitsandbytes, llama.cpp, AutoRound, and HQQ, tested at different bitwidths.
Running all these experiments will take time. If you're interested in tracking my progress, you can do so indirectly by checking the models I publish in the Kaitchup space on Hugging Face. I have already released new models quantized with GPTQModel, a promising successor to AutoGPTQ. However, GPTQModel comes with some caveats that further delayed my progress: quantizing a 70B model requires over 200 GB of CPU memory, 3-bit quantization is buggy, and compilation fails on Colab.
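If you want to try GPTQModel yourself, the basic workflow looks something like this. This is a minimal sketch following the library’s documented API; the model ID, calibration set, and bit width are illustrative choices, not the exact settings of my survey:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Illustrative model ID; remember that a 70B model needs over 200 GB of CPU RAM
model_id = "Qwen/Qwen2.5-7B-Instruct"

# A small calibration set; larger sets improve accuracy but slow down quantization
calibration = load_dataset(
    "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
).select(range(512))["text"]

# 4-bit with group size 128 (avoid 3-bit for now, it is buggy)
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration)
model.save("Qwen2.5-7B-Instruct-gptq-4bit")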
Stay tuned for more updates!
Qwen2.5-VL!
While everyone is still talking about DeepSeek-R1, you might have missed an exciting update: this week, the Qwen team released Qwen2.5-VL!
Qwen2-VL was already among the best vision-language models, and Qwen2.5-VL takes things even further. It can handle significantly more complex tasks and performs on par with top commercial models.
I won’t dive into all the details just yet; I’ll be publishing an article next week.
In the meantime, you can check out the models here:
Over the past two weeks, the performance gap between commercial and open models has narrowed significantly! China's AI advancements (e.g., by Alibaba (Qwen) and DeepSeek AI) are accelerating at an incredible pace. I’m eager to see how Meta, OpenAI, Google, and Anthropic will respond to this rapid progress!
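If you want a quick taste before the article, here is a minimal inference sketch based on the usage example in Qwen’s model card. It assumes a recent version of Transformers with Qwen2.5-VL support and the qwen-vl-utils package; the image URL is just a placeholder:

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/demo.jpg"},  # any image URL or local path
    {"type": "text", "text": "Describe this image."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])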
Mistral AI Back to Apache 2.0?
Yesterday, Mistral AI released Mistral Small 3, a not-so-small 24B-parameter model, available as base and instruct versions:
According to their own evaluation, this model outperforms Qwen2.5 32B and performs comparably to Llama 3.3 70B Instruct.
Mistral AI has also emphasized the model's inference throughput, claiming it is faster than other LLMs of similar size and three times faster than Llama 3.3 70B. I haven’t yet examined the architecture in detail, but they mention it has fewer layers than comparable models, which reduces the sequential compute per token and would explain the speed boost. I plan to review the model soon and will attempt to quantize it if its architecture is supported by AutoRound.
However, the biggest news from this release is that Mistral AI appears to be recommitting to releasing models under an Apache 2.0 license—which is fantastic! Mistral 7B and Mixtral were both excellent models, and I wrote extensively about them. But then, they started releasing most of their newer models under the MRL license, which restricts commercial use. Because of that, I (and many others) could no longer write tutorials demonstrating them.
And More: vLLM v1, Janus 7B, and DeepSeek-R1 1.58-bit
vLLM v1 Is Coming!
The vLLM team released vLLM 0.7.0. The most notable improvement is that torch.compile is now fully integrated into vLLM.
What’s more, this new version also ships vLLM v1, an entirely new engine that they claim is up to 1.7x faster.
vLLM V1: A Major Upgrade to vLLM's Core Architecture
This engine is not enabled by default. To use it, you just need to set this environment variable:
VLLM_USE_V1=1
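For example, from Python (a minimal sketch; the model ID is an arbitrary example, and the variable must be set before vLLM is imported):

import os
os.environ["VLLM_USE_V1"] = "1"  # enable the new v1 engine before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in two sentences."], sampling)
print(outputs[0].outputs[0].text)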
Janus Pro 7B by DeepSeek AI
DeepSeek AI has released Janus Pro 7B, a multimodal model that can process both text and images as input—like most vision-language models (VLMs)—but what sets it apart is its ability to generate both text and images, a feature that remains rare.
I haven’t tested it yet, but the demo looks impressive, especially for a model that’s only 15GB. It’s released under an MIT license.
You can check it out here:
DeepSeek-R1 1.58-bit by Unsloth
We know that larger models are easier to quantize. Even sub-2-bit quantization can perform well for very large models, as we saw with VPTQ and Llama 3.1 405B:
Unsloth released a low-bit version of DeepSeek-R1.
Run DeepSeek R1 Dynamic 1.58-bit
It seems that they simply used llama.cpp with an importance matrix (imatrix) to calibrate the quantization: the imatrix is computed on calibration text and tells the quantizer which weights matter most, so those are quantized more precisely.
Are these good models? Unsloth didn’t formally evaluate them, so we don’t know. I would assume that the distilled R1 version based on Llama 3.3 70B works better while consuming about the same amount of memory as the 1.58-bit R1.
I think the 2.51-bit version could be the most useful, but this should be confirmed with a proper evaluation.
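If you want to experiment anyway, llama.cpp and its Python bindings can run these GGUF files. Here is a minimal sketch with llama-cpp-python; the shard filename is a hypothetical example (check Unsloth’s model page for the exact names), and you will still need a machine with a lot of combined RAM and VRAM, since the 1.58-bit files reportedly total around 131 GB:

from llama_cpp import Llama

# Hypothetical shard name: point llama.cpp at the first shard of the GGUF,
# with all the shards downloaded into the same directory
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=20,  # offload as many layers as fit in VRAM; the rest stays in CPU RAM
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])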
Failing GRPO on a Budget, for Now
Last week, I mentioned that I was planning to publish an article explaining GRPO, the reinforcement learning method used to train DeepSeek-R1 and Qwen. As part of this, I experimented with the GRPO implementation in TRL. However, I found that the implementation is not yet mature and likely requires a few more days or weeks of work. Despite its official release in TRL this week, GRPO remains quite buggy for memory-efficient use cases.
I encountered several issues, including:
Buggy gradient checkpointing
Limited training logs, which only report the training loss for some reason
Excessive memory consumption when running vLLM (the model is loaded in float32, which is hardcoded)
Incompatibility between vLLM and LoRA adapters, preventing generation
Strict dataset column requirements, where unused columns must be removed
Evaluation step failures, potentially related to gradient checkpointing issues
Note: Some of these issues might have been corrected since. This GRPO implementation is regularly updated.
I have started opening issues in the TRL repository and will wait for fixes.
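For context, this is roughly what the documented usage looks like. This is a minimal sketch adapted from TRL’s GRPO documentation at the time of writing; the reward function is a toy example, and given the issues above, expect rough edges:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPO only needs prompts; as noted above, unused columns must be removed
dataset = load_dataset("trl-lib/tldr", split="train")
dataset = dataset.remove_columns([c for c in dataset.column_names if c != "prompt"])

# Toy reward: favors completions close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()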
Additionally, I faced similar problems with online DPO, which also doesn’t work with LoRA adapters or gradient checkpointing. Online DPO is a well-regarded method, considered superior to standard DPO, and has been available in TRL for several months. Given its popularity, I’m curious to know who is actively using TRL’s implementation and for what purposes. Since gradient checkpointing doesn’t work, even a tiny model would consume a lot of memory.
GPU Selection of the Week:
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
NVIDIA RTX 50XX GPUs are officially released but already sold out, as expected. I won’t track their prices until I can find them at a “reasonable” price.
RTX 4090 (24 GB): None at a reasonable price.
RTX 4080 SUPER (16 GB): None at a reasonable price.
RTX 4070 Ti SUPER (16 GB): INNO3D nVidia GeForce RTX 4070 Ti Super Twin X2 16GB GDDR6X
RTX 4060 Ti (16 GB): GeForce RTX 4060 Ti Jetstream
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I reviewed in The Weekly Salt:
Evolving Deeper LLM Thinking
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
⭐DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!