Hi Everyone,
In this edition of The Weekly Kaitchup:
EuroLLM: A European LLM with Good Reasoning and Translation Accuracy
Unsloth’s Dynamic 4-bit Quantization
New VLMs: Google Released PaliGemma 2
Faster Bitsandbytes, Faster QLoRA
Toolbox update: The Qwen2.5 and Llama 3.1/3.2 toolboxes now include all you need for LoRA/QLoRA fine-tuning and DPO training with multi-GPU configurations (PyTorch FSDP).
Book update: The second chapter of The Kaitchup’s Book, titled “Prepare Your Training Dataset,” has been published. The upcoming chapter, focusing on quantization for LLMs, is scheduled for release in January. The book is currently available at a 30% discount:
EuroLLM: A European LLM with Good Reasoning and Translation Accuracy
EuroLLM is a family of open large language models developed collaboratively by leading European private and public research institutions, including Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, and Sorbonne Université.
Models are available in both instruct and base versions, with 1.5B and 9B parameter variants accessible here:
Hugging Face Collection: EuroLLM (Apache 2.0 license)
I find the name somewhat misleading, as it initially gave me the impression that the model only supports European languages, which isn’t the case.
The supported languages include: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
The models' performance is not particularly remarkable, likely because they were trained on a relatively modest dataset of 4 trillion tokens. They seem to struggle with world knowledge tasks, such as MMLU and MMLU-Pro, which is expected given the size of the pre-training data. However, according to MuSR evaluations, they appear to be good at reasoning tasks for models of this size. Additionally, they clearly stand out for their high accuracy in translation tasks (instruct version).
The high translation accuracy is not surprising to me, given that this collaboration brings together some of the world's leading machine translation experts. For instance, Unbabel, which dominated most WMT tasks this year, and the University of Edinburgh, a driving force behind many of the most advanced developments in machine translation over the past two decades, are key contributors to this effort.
I quantized the 9B models with AutoRound, using the same quantization recipe as usual.
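For reference, here is a minimal sketch of that kind of recipe (the repository ID and hyperparameters below are illustrative assumptions, not necessarily the exact settings I used; adjust them to your hardware):

```python
# Illustrative AutoRound recipe: 4-bit quantization with group size 128.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "utter-project/EuroLLM-9B-Instruct"  # the base version works the same way
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Run the AutoRound calibration/optimization and export the quantized model
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("EuroLLM-9B-Instruct-AutoRound-4bit", format="auto_gptq")
```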
I only evaluated the accuracy on English tasks. If you intend to use these models for other languages, I strongly recommend conducting thorough evaluations in those languages before deployment.
The models are here:
Hugging Face Collection: EuroLLM Quantized (Apache 2.0 license)
Unsloth’s Dynamic 4-bit Quantization
Unsloth has introduced its own quantization algorithm, built on top of bitsandbytes, making it compatible with most inference frameworks.
Details about this quantization method are limited since, to the best of my knowledge, Unsloth has only released quantized models without the code. According to Unsloth, their algorithm significantly outperforms standard bitsandbytes quantization, particularly for vision-language models such as Pixtral and Llama 3.2 Vision.
For example, here are the quantization errors reported for Pixtral:
I assume these errors were computed for standard bitsandbytes quantization. To address them, their method incorporates awareness of outliers and important activations, which somewhat reminds me of AWQ. Parameters that do not quantize well, such as outliers, are retained in 16-bit precision. Since these cases are rare, this approach increases the model size only marginally compared to more naive quantization techniques.
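Unsloth hasn't published the selection code, but you can get a feel for the general idea with bitsandbytes itself, which lets you keep chosen modules in 16-bit while quantizing the rest to 4-bit. This is only a static approximation of the concept, not Unsloth's method, and the module and model names below are purely illustrative:

```python
# Conceptual sketch only: quantize most modules to 4-bit but keep a few sensitive ones in 16-bit.
# This is NOT Unsloth's algorithm, which selects parameters based on measured quantization error.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Despite the name, this option also applies to 4-bit loading: listed modules stay in 16-bit.
    llm_int8_skip_modules=["lm_head"],  # illustrative choice of modules to keep unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",  # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```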
As far as I know, Unsloth does not plan to provide further documentation or comparisons with state-of-the-art methods like AQLM or AutoRound, which also support vision-language models (VLMs). Unsloth doesn’t publish papers.
Nevertheless, this method appears to be a solid alternative to the original bitsandbytes quantization for VLMs. It also has the advantage of being especially suitable for fine-tuning with QLoRA.
You can access models quantized with this method here:
Hugging Face Collection: Unsloth 4-bit Dynamic Quants
source: Unsloth - Dynamic 4-bit Quantization
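Because the 4-bit configuration is stored in the model repository, these checkpoints load like any other bitsandbytes model and plug directly into a QLoRA setup. A quick sketch with Transformers and PEFT (the repository ID and LoRA hyperparameters are assumptions; check the collection for the exact names):

```python
# Sketch: QLoRA fine-tuning on top of a pre-quantized Unsloth dynamic 4-bit checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"  # assumed repository ID
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # 4-bit config is read from the repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach LoRA adapters on top of the frozen 4-bit base model
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # ready to pass to your usual SFT trainer
```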
New VLMs: Google Released PaliGemma 2
Since the release of PaliGemma, many other vision-language models have come out, and my personal favorites are Qwen2-VL and MOLMo, both of which perform very well across a wide range of tasks.
Now, Google has introduced an updated version: PaliGemma 2.
Hugging Face Collection: PaliGemma 2 (gemma license; commercial use OK)
While it still uses SigLIP as the vision encoder, PaliGemma 2 integrates the very capable Gemma 2 as the text decoder. The model is available in three sizes (3B, 10B, and 28B parameters) and supports input image resolutions of 224x224, 448x448, and 896x896.
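Running it with Transformers follows the same pattern as the first PaliGemma. Here is a short sketch (the model ID and prompt format are assumptions based on the PaliGemma naming scheme, so double-check them against the collection):

```python
# Sketch: captioning an image with PaliGemma 2 (model ID assumed; pick the size/resolution you need).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"  # assumed: 3B pretrained checkpoint, 448x448 input
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "<image>caption en"  # pretrained checkpoints expect short task-style prompts
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```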
However, its performance on public benchmarks looks underwhelming. In their technical report, Google refrained from comparing PaliGemma 2 with state-of-the-art vision-language models like Qwen2-VL or Pixtral, instead opting for older models as baselines.
Google (and Hugging Face) claim the models are designed for straightforward fine-tuning on specific tasks. I plan to test this claim by fine-tuning the models for tasks they haven’t been explicitly trained on and will share results if they prove noteworthy.
For more information, Hugging Face published a blog post (which sounds a bit promotional). You can find it here:
Welcome PaliGemma 2 – New vision language models by Google
Faster Bitsandbytes, Faster QLoRA
Bitsandbytes is one of the most widely used quantization methods, known for its accuracy and excellent support for QLoRA fine-tuning. However, its primary drawback is its slow performance, as we saw in this article:
A new update makes bitsandbytes 15% faster for inference with 4-bit models.
If you're using H100 GPUs, the library has now improved support for 8-bit quantization, making 8-bit models much faster for inference. This is particularly beneficial for massive models like Llama 3.1 405B, which require multiple H100 GPUs to run efficiently, even when quantized to 8-bit.
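Nothing changes in how you load models; the speedups come from the updated kernels. For reference, this is the usual way to quantize on the fly with bitsandbytes (the model ID is just an example):

```python
# On-the-fly bitsandbytes quantization with Transformers (example model ID).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit NF4, the usual QLoRA configuration (the one that got faster for inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# 8-bit variant, which benefits from the improved kernels on H100 GPUs:
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```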
GPU Selection of the Week:
To track GPU prices, I use Amazon.com. If a GPU's price drops on Amazon, there is a good chance it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
RTX 4090 (24 GB): ASUS ROG Strix GeForce RTX™ 4090 BTF OC
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super Gaming OC
RTX 4070 Ti SUPER (16 GB): ZOTAC Gaming GeForce RTX 4070 Ti Super Trinity Black Edition
RTX 4060 Ti (16 GB): Asus Dual GeForce RTX™ 4060 Ti EVO OC Edition 16GB
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I explained how AI2 synthesized and curated many datasets for the post-training (SFT, DPO, and RLVR) of the very good TULU 3 models:
I also reviewed:
⭐Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
LongKey: Keyphrase Extraction for Long Documents
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!