Hi Everyone,
In this edition of The Weekly Kaitchup:
TGI and Unsloth Updates for Longer Context
Phi-4: Good Synthetic Data is All You Need
Molmo: The Recipe for a Good VLM
TGI and Unsloth Updates for Longer Context
Each week, I come across several new papers addressing the challenge of making LLMs more efficient with long sequences. This remains one of the most pressing issues with LLMs, as their underlying Transformer models have attention mechanisms with computational complexity that scales quadratically with sequence length.
Dozens of research papers propose different solutions to tackle this issue, including entirely new architectures designed to mitigate or eliminate the problem. Unfortunately, most of these ideas stay in the research phase and are never implemented in major frameworks, which means they rarely see practical use.
This week, however, Unsloth and TGI both introduced updates that make handling long contexts significantly easier, improving efficiency for both fine-tuning and inference.
Unsloth can now achieve 89K context fine-tuning on an 80GB GPU for 70B models—13 times longer than Transformers with FlashAttention!
This is made possible thanks to Cut Cross Entropy (CCE), which avoids materializing the full logits tensor by computing the matrix multiplications on the fly, drastically reducing VRAM usage. CCE also skips gradient computations for small probabilities, improving performance while maintaining high accuracy; on its own, this enables 1.85x longer contexts for 70B models and 3.6x longer contexts for 8B models. Combined with Unsloth's gradient checkpointing, which efficiently offloads activations to system RAM, these improvements multiply, allowing 70B models to handle 12-13x longer sequences.
source: Fine-tune Llama 3.3 with Unsloth
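If "avoiding materializing logits" sounds abstract, here is a minimal PyTorch sketch of the underlying idea. This is not Unsloth's or CCE's fused-kernel implementation, and the dimensions, chunk size, and toy tensors are purely illustrative; the point is that the [sequence, vocabulary] logits tensor is never built in full, only one chunk at a time.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    # Compute the LM loss without materializing the full
    # [seq_len, vocab_size] logits tensor: logits are produced one chunk
    # at a time and discarded once their loss contribution is accumulated.
    # (CCE goes much further, with fused kernels and gradient filtering.)
    total_loss, total_tokens = 0.0, 0
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]      # [chunk, hidden_dim]
        y = labels[start:start + chunk_size]      # [chunk]
        logits = h @ lm_head_weight.T             # only [chunk, vocab] in memory
        total_loss = total_loss + F.cross_entropy(logits, y, reduction="sum")
        total_tokens += y.numel()
    return total_loss / total_tokens

# Toy dimensions (a real Llama 3 LM head is 128,256 x hidden_dim)
seq_len, hidden_dim, vocab = 2048, 512, 32_000
hidden = torch.randn(seq_len, hidden_dim)
lm_head = torch.randn(vocab, hidden_dim)
labels = torch.randint(0, vocab, (seq_len,))
print(chunked_cross_entropy(hidden, lm_head, labels))
```

With this chunking, peak logits memory scales with the chunk size rather than the full sequence length, which is where the extra context headroom comes from.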
As for TGI, Hugging Face published some impressive benchmark results showing much more efficient inference for long context compared to vLLM.
The performance gains come from several key updates: custom kernels like flashinfer and flashdecoding make long prompts faster and scheduling more efficient. Optimized prefix caching speeds up query matching with very little delay (~6 microseconds). Moreover, their new chunking code helps manage compute resources better, saving VRAM and boosting speed.
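To make the "query matching" part a bit more concrete, here is a purely conceptual sketch of block-level prefix caching. This is my own simplification, not TGI's code; real engines hash fixed-size blocks of token ids and manage actual KV memory, whereas this toy version just stores block-aligned prefixes as dictionary keys.

```python
BLOCK_SIZE = 256        # illustrative block granularity
kv_store = {}           # block-aligned prefix (as a tuple) -> cached KV blocks

def block_prefixes(token_ids):
    """Yield the token-id prefix at every block boundary."""
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        yield tuple(token_ids[:end])

def register_prefix(token_ids, kv_blocks):
    """Store (placeholder) KV blocks under every block-aligned prefix."""
    for prefix in block_prefixes(token_ids):
        kv_store[prefix] = kv_blocks

def match_cached_prefix(token_ids):
    """Number of leading tokens whose KV blocks are already cached."""
    matched = 0
    for prefix in block_prefixes(token_ids):
        if prefix in kv_store:
            matched = len(prefix)
        else:
            break
    return matched

# Toy usage: the second request reuses the 512 tokens it shares with the first
req1 = list(range(1000))
register_prefix(req1, kv_blocks="kv-of-req1")
req2 = list(range(600)) + [42] * 400
print(match_cached_prefix(req2))  # 512
```

The reused prefix skips prefill entirely, which is why a near-instant lookup (~6 microseconds in TGI's case) can translate into large end-to-end savings on long, repetitive prompts.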
For very long contexts (100k+ tokens), they also cut memory use by no longer storing unnecessary logits. For example, the prefill logits for Llama 3.1-8B can take 25.6GB of memory, more than the model weights themselves (16GB). TGI now skips these logits by default, saving memory. If you need them, you can turn them back on with the --enable-prefill-logprobs flag, but this will reduce the number of tokens you can process.
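The arithmetic behind that 25.6GB figure is easy to check yourself, assuming 16-bit logits, Llama 3.1's 128,256-token vocabulary, and the 100k-token long-context scenario:

```python
# Back-of-the-envelope check of the logits memory quoted above (bf16 values assumed).
vocab_size = 128_256          # Llama 3.1 vocabulary
bytes_per_value = 2           # bf16 / fp16
prompt_tokens = 100_000       # the 100k+ long-context scenario

logits_gb = prompt_tokens * vocab_size * bytes_per_value / 1e9
weights_gb = 8.03e9 * bytes_per_value / 1e9   # ~8B parameters in bf16

print(f"Prefill logits: {logits_gb:.2f} GB")   # ~25.65 GB
print(f"Model weights:  {weights_gb:.2f} GB")  # ~16.06 GB
```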
In a nutshell, this looks similar to what Unsloth did for fine-tuning: fewer logits = longer sequences.
source: TGI v3 overview
Both of these updates will also benefit vision-language models (VLMs), which convert images into long sequences of image tokens.
Phi-4: Good Synthetic Data is All You Need
Microsoft has released Phi-4. For now, it is only available on Microsoft Azure, but they plan to push it to the Hugging Face Hub soon, probably next week.
They published a technical report describing the model and the training data:
Let’s discuss the most surprising part first: the benchmark results.
Phi-4, a 14B-parameter model, matches or surpasses much larger models like Llama 3.3 and Qwen2.5, which have 70B parameters, i.e., 5x more. Such impressive results have become characteristic of the Phi models, which are known for benchmark scores that often don’t show up in real-world use cases. However, this new release feels different. Unlike earlier technical reports, where data contamination was scarcely addressed, Phi-4 adopts a new and more rigorous decontamination strategy. Benchmark data is removed from the training set, mainly with an n-gram matching strategy, aligning closely with established practices (e.g., similar to the approach AI2 used for the TULU 3 models).
We decontaminate against the ARC-Easy, MBPP, phibench, CommonsenseQA, WinoGrande, mcphi, MedQA, MATH, AGIEval, PIQA, OpenBookQA, HellaSwag, GPQA, mt-bench, MMLUPro, GSM8k, HumanEval, arena hard, ARC-Challenge, and MMLU benchmarks. We apply a hybrid n-gram algorithm for decontamination which uses 13-gram and 7-gram features for removing matches to the test set, which is described in more detail in 1. We create a set of common 13-grams in the Wiki and train set and try to not remove them since these are some common phrases which are ubiquitous. Some examples include ’a i only b ii only c iii only d ii and iii’, ’a true true b false false c true false d false true’, ’logically equivalent b contradictory c neither logically equivalent nor contradictory but consistent d’, ’a (ii) and (iv) only b (i) and (iii) only c (i) (ii)’, ’b e b a b e c c b d c e d’.
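For reference, here is a minimal sketch of this kind of n-gram overlap check. It is my own simplified reconstruction, not Microsoft's code: the whitelist handling and the 7-gram rule are reduced to plain set operations, and the tokenization is just whitespace splitting.

```python
from typing import Iterable, Set

def ngrams(tokens, n):
    """All n-grams of a token list, as space-joined strings."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_ngrams(test_examples: Iterable[str], n: int) -> Set[str]:
    grams: Set[str] = set()
    for text in test_examples:
        grams |= ngrams(text.lower().split(), n)
    return grams

def is_contaminated(train_doc: str,
                    test_13grams: Set[str],
                    test_7grams: Set[str],
                    common_13grams: Set[str]) -> bool:
    """Flag a training document that shares a 13-gram with a test set,
    unless the overlap is only whitelisted common phrases; the 7-gram
    check is reduced here to a plain overlap test."""
    toks = train_doc.lower().split()
    if ngrams(toks, 13) & (test_13grams - common_13grams):
        return True
    return bool(ngrams(toks, 7) & test_7grams)

# Toy usage
test_set_texts = ["the quick brown fox jumps over the lazy dog near the old barn today"]
test_13 = build_test_ngrams(test_set_texts, 13)
test_7 = build_test_ngrams(test_set_texts, 7)
doc = "the quick brown fox jumps over the lazy dog near the old barn today, said the teacher"
print(is_contaminated(doc, test_13, test_7, common_13grams=set()))  # True
```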
It will be interesting to see the initial feedback from users to determine whether real-world applications align with the benchmark performance.
A more significant highlight in the technical report is the discussion of datasets. Microsoft provides an unusually detailed breakdown of their process for creating synthetic datasets for training—a section I highly recommend reading.
Although Microsoft still refers to Phi-4 as a "small model," its 14B parameters are quite substantial for consumer GPUs. Full fine-tuning of a model this size on a single GPU is impractical. I plan to quantize it to 4-bit, which should enable more efficient use on limited hardware. Given the model's size, 4-bit quantization is unlikely to noticeably degrade its accuracy, which I expect to remain very close to the original model's.
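Once the weights land on the Hub, loading the model in 4-bit with bitsandbytes should look roughly like this. Note that the repository name microsoft/phi-4 is my guess at the time of writing; check the Hub for the official one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-4"  # hypothetical repo name; adjust once the model is published

# Standard NF4 4-bit quantization config for bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```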
Molmo: The Recipe for a Good VLM
In October, AI2 introduced Molmo, a series of open multimodal vision-language models (VLMs) with 1B, 7B, and 72B parameters. The release coincided with other major launches, including Mistral AI’s Pixtral, Qwen2-VL, and later Llama 3.2 Vision. In my opinion, Molmo was forgotten way too quickly as many chose the disappointing Llama 3.2 Vision instead.
Molmo stands out as a fully open model that delivers exceptional performance:
With Molmo, you get better performance than Llama 3.2 90B, with a smaller model.
This week, AI2 published all the details of how they made the models and the training dataset (called PixMo):
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
I really recommend reading the paper if you are interested in understanding how VLMs work.
GPU Selection of the Week:
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
RTX 4090 (24 GB): ASUS TUF Gaming GeForce RTX™ 4090 OG
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2 16G
RTX 4070 Ti SUPER (16 GB): GIGABYTE GeForce RTX 4070 Ti Super WINDFORCE OC
RTX 4060 Ti (16 GB): Asus Dual GeForce RTX™ 4060 Ti EVO OC Edition 16GB
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
⭐Reverse Thinking Makes LLMs Stronger Reasoners
⭐Evaluating Language Models as Synthetic Data Generators
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!