Hi Everyone,
In this edition of The Weekly Kaitchup:
Judge Arena: A New Leaderboard for LLMs as Evaluators
An Open Recipe to Fine-Tune VLMs (like Qwen2-VL)
Orca AgentInstruct: A Synthetic Instruction Dataset by Microsoft for Instruct Fine-Tuning
For Black Friday, I’m offering a 30% discount on the yearly subscription to The Kaitchup:
Or opt for The Kaitchup Pro to get all the AI toolboxes, The Kaitchup’s book, yearly subscriptions to The Kaitchup and The Salt, and more.
Note: If you are already a paid subscriber and wish to upgrade, the remaining value of your current subscription will be automatically deducted from the cost of The Kaitchup Pro subscription. For example, if you have six months remaining on an annual subscription, 50% of the annual cost will be deducted by Substack. Additionally, if you’ve already purchased the book and subscribe to The Kaitchup Pro, you’ll receive a refund for the book purchase.
This week, I launched the Llama 3.1/3.2 Toolbox: a set of notebooks for fine-tuning, aligning, quantizing, running, and serving Llama 3 models. The toolbox also supports models built on Llama 3, including the highly impressive Tulu models released by AI2 this week.
The toolboxes only support single-GPU fine-tuning and preference optimization for now. Next month, I'll add multi-GPU support for DPO and SFT through PyTorch's FSDP.
Judge Arena: A New Leaderboard for LLMs as Evaluators
Two weeks ago, I shared a guide on using LLMs to evaluate the outputs of other LLMs (LLM-as-a-judge):
Selecting the right LLM to evaluate other LLMs can be challenging. A dedicated benchmark for assessing LLMs as evaluators can greatly assist in making an informed choice.
The Judge Arena by Hugging Face provides an interactive solution to this. Users participate by running evaluation models (judges) on test samples and voting for the evaluations they find most accurate. These community votes are aggregated to rank the models on a public leaderboard, which is regularly updated to reflect collective preferences.
How does it work?
The platform allows users to submit their own inputs or use generated ones. Each round involves two judges providing scores and critiques, which users review before casting their vote. To prevent bias, the names of the models are only revealed after voting. The selected judges include 18 generative LLMs from various organizations, representing both open-source and proprietary systems. The focus is on models that can score and critique effectively.
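If you haven't tried LLM-as-a-judge yourself, the core pattern behind what Judge Arena automates is simple. Here is a minimal sketch of a pairwise judge, assuming an OpenAI-compatible chat API; the prompt wording and model name are my own placeholders, not the platform's actual implementation:

```python
# Minimal sketch of pairwise LLM-as-a-judge evaluation.
# Assumes an OpenAI-compatible API with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """\
You are an impartial judge. Score each answer from 1 to 10 for correctness
and helpfulness, then briefly justify your scores.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply as:
Score A: <1-10>
Score B: <1-10>
Critique: <one short paragraph>"""

def judge(question: str, answer_a: str, answer_b: str,
          model: str = "gpt-4o-mini") -> str:
    # Fill the template and ask the judge model for scores and a critique.
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring
    )
    return response.choices[0].message.content
```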
The first results show competitive performance between open-source and proprietary models, with smaller models like Qwen 2.5 and Llama 3.1 performing well against larger counterparts. Early trends align with research suggesting that certain models, such as those in the Llama series, are particularly suited for evaluation tasks.
Since the benchmark relies on community votes, its rankings should become more reliable as more users participate.
More details here: Judge Arena: Benchmarking LLMs as Evaluators
An Open Recipe to Fine-Tune VLMs (like Qwen2-VL)
Another great contribution from Hugging Face this week is their fine-tuning recipe for vision-language models (VLMs). This tutorial demonstrates how to easily fine-tune VLMs on your chosen dataset using tools similar to those commonly used for fine-tuning LLMs.
The recipe employs QLoRA, enabling model quantization and the fine-tuning of an adapter on top of Qwen2-VL 7B. If you're familiar with fine-tuning LLMs, you'll find the process straightforward. The main difference lies in the data preprocessing steps, which are specific to vision-language tasks.
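For illustration, here is roughly what that QLoRA setup looks like with Transformers, PEFT, and bitsandbytes. This is a minimal sketch with placeholder hyperparameters, not the exact configuration from the Hugging Face recipe:

```python
# Sketch of a QLoRA setup for Qwen2-VL 7B (illustrative hyperparameters).
import torch
from transformers import (
    Qwen2VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2-VL-7B-Instruct"

# 4-bit NF4 quantization of the base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# LoRA adapter trained on top of the frozen, quantized backbone
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```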
While we’ve previously explored fine-tuning smaller models like Florence-2, which is lightweight and compatible with consumer hardware, fine-tuning larger models like Qwen2-VL 7B requires more powerful computational resources, even with QLoRA. For this task, Hugging Face used an A100 GPU.
Most of the memory usage is attributed to encoding, particularly the KV cache and related components, which handle the very long “multimodal” sequences. These sequences include numerous image tokens required to encode the images within the prompt.
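You can see where those long sequences come from by counting how many tokens a single image contributes after preprocessing. A small sketch, reusing the processor from above and a placeholder image file:

```python
# Sketch: count the tokens one image adds to the input sequence.
# "example.jpg" is a placeholder; `processor` is the Qwen2-VL processor above.
from PIL import Image

image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
# The image is expanded into many vision tokens, so input_ids can get very long.
print(inputs["input_ids"].shape)
```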
Despite the efficiency of Hugging Face's fine-tuning recipe, there are several areas where memory usage could potentially be reduced. They used a 32-bit (albeit fused) AdamW optimizer, which could be replaced with a lower-precision alternative. The model seems to use its default sequence length, which may not be necessary for all tasks. The training batch size per device was set to four, which could be lowered to one to save memory. Additionally, the absence of FlashAttention means that memory efficiency during attention operations wasn’t optimized.
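Concretely, these savings map to a few configuration knobs. The values below are illustrative, not the recipe's settings, and FlashAttention 2 requires the flash-attn package:

```python
# Sketch of memory-oriented training settings (illustrative values).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qwen2-vl-qlora",
    per_device_train_batch_size=1,   # down from 4 in the recipe
    gradient_accumulation_steps=8,   # keeps the effective batch size
    optim="paged_adamw_8bit",        # 8-bit optimizer instead of 32-bit AdamW
    bf16=True,
    gradient_checkpointing=True,     # trade compute for activation memory
    num_train_epochs=1,
    logging_steps=10,
)

# FlashAttention 2 is enabled at model load time (requires flash-attn):
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     model_id, attn_implementation="flash_attention_2", ...
# )
```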
It’s likely feasible to fine-tune Qwen2-VL 7B on a 48 GB GPU, or possibly even less, using Transformers. While QLoRA was employed in this recipe, its effectiveness in this context is debatable—it adds computational overhead and thus slows down fine-tuning while saving relatively little memory compared to the activations' memory usage.
Nonetheless, the Hugging Face recipe is a valuable resource and can be adapted for fine-tuning many vision-language models. Studying their approach can provide a strong foundation for optimizing your own projects and achieving similar results with less powerful hardware.
Moreover, if you want to fine-tune VLMs efficiently, Unsloth released a new version capable of fine-tuning Qwen2-VL 7B on a 16 GB GPU! It also supports Pixtral and Llama 3.2 Vision models.
Note: For this new release, Unsloth focused its announcement on Llama 3.2 Vision (the blog post is titled "Llama 3.2 Vision fine-tuning"), presumably to attract more readers and users, but Pixtral and especially Qwen2-VL are both better and cheaper to run.
Orca AgentInstruct: A Synthetic Instruction Dataset by Microsoft for Instruct Fine-Tuning
Microsoft released orca-agentinstruct-1M-v1.
This large dataset consists of approximately 1 million synthetic instruction pairs generated using the AgentInstruct framework, which leverages publicly available web content as seeds. The data spans a variety of tasks, including text editing, creative writing, coding, and reading comprehension. It is designed for instruction tuning of base LLMs.
By post-training Mistral 7B on a larger version of this dataset (comprising ~25 million instruction pairs), Microsoft observed notable performance improvements on benchmarks such as AGIEval, MMLU, GSM8K, BBH, and AlpacaEval compared to the original Mistral-7B-Instruct.
This dataset is structured as conversational pairs, with fields capturing the roles and content of each exchange. You can directly use it to fine-tune LLMs like Qwen2.5 and Llama 3.1/3.2 using the same recipes I presented in previous articles.
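Here is a minimal sketch for loading the dataset and parsing its conversations. I'm assuming the task-specific split names and the JSON-encoded messages field described on the dataset card; double-check the schema before training:

```python
# Sketch: load Orca AgentInstruct and convert it to chat-format examples.
# Split name and schema come from the dataset card; verify before use.
import json
from datasets import load_dataset

ds = load_dataset("microsoft/orca-agentinstruct-1M-v1", split="creative_content")

def to_chat(example):
    # In this dataset, "messages" is stored as a JSON string of
    # [{"role": ..., "content": ...}, ...]; parse it into a list of dicts.
    return {"messages": json.loads(example["messages"])}

ds = ds.map(to_chat)
print(ds[0]["messages"][0]["role"])  # e.g., "user"
```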
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range, e.g., the RTX 4060, to high-end, e.g., the RTX 4090.
While consumer GPUs have much less memory than GPUs dedicated to AI, they are more cost-effective, by far, for inference with small batches and fine-tuning LLMs with up to ~35B parameters using PEFT methods.
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
GPU Selection of the Week:
RTX 4090 (24 GB): ASUS TUF Gaming GeForce RTX™ 4090 OG OC
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2
RTX 4070 Ti SUPER (16 GB): MSI GeForce RTX 4070 Ti Super 16G Ventus 3X Black OC
RTX 4060 Ti (16 GB): MSI Gaming GeForce RTX 4060 Ti 16GB
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
⭐Balancing Pipeline Parallelism with Vocabulary Parallelism
DELIFT: Data Efficient Language model Instruction Fine Tuning
Counterfactual Generation from Language Models
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!