Hi Everyone,
In this edition of The Weekly Kaitchup:
Google Gemma 2: Larger and Better
Open LLM Leaderboard V2
MARS5: Small Text-to-speech Models for Accurate Voice Cloning
The Kaitchup has 4,146 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (80+) and more than 100 articles.
AI Notebooks and Articles Published this Week by The Kaitchup
#81 Easy Fine-tuning with Axolotl -- Example with Llama 3
#82 Smaller LLMs with AutoRound Low-bit Quantization
Google Gemma 2: Larger and Better
Google released Gemma 2: two new large language models with 9B and 27B parameters.
I particularly like the fact that they released a 27B version. This size is rare in the world of LLMs. Quantized to 4-bit, I expect the 27B version to occupy only about 15 GB of memory. If you have a high-end consumer GPU (24 GB of memory), that still leaves about 9 GB of memory for batch decoding and processing long sequences of tokens.
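For a quick sanity check on that figure, here is a back-of-the-envelope estimate (my own approximation: the exact footprint depends on the quantization method, group size, and framework overhead):

```python
# Rough memory estimate for a 4-bit quantized 27B model.
# My own approximation: real footprints vary with the quantization method
# (group size, stored scales/zero-points) and framework overhead.
n_params = 27e9        # 27B parameters
bits_per_param = 4.5   # ~4 bits per weight plus ~0.5 bit of quantization metadata (assumed)

weights_gb = n_params * bits_per_param / 8 / 1e9
print(f"Approximate weight memory: {weights_gb:.1f} GB")  # ~15.2 GB
print(f"Left on a 24 GB GPU: {24 - weights_gb:.1f} GB")   # ~8.8 GB
```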
The main differences from the first version of Gemma are:
Sliding window attention: reduces the computational cost of attention over long sequences.
Logit soft-capping: prevents logits from growing excessively by scaling them to a fixed range, which improves training (see the sketch after this list).
Knowledge distillation: leverages a larger teacher model to train the smaller 9B model.
Model merging: combines two or more LLMs into a single new model. Note: I don’t know what they did here, but it sounds interesting.
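The logit-scaling trick is the easiest to illustrate. Below is a minimal sketch of logit soft-capping as I understand it; the cap values I have seen reported for Gemma 2 are 50.0 for attention logits and 30.0 for final logits, but double-check them in the released model configuration:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squash logits into (-cap, +cap) with tanh instead of
    # letting them grow without bound.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([1.0, 10.0, 100.0])
print(soft_cap(x, cap=30.0))  # large values saturate near the cap (~29.9)
```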
Moreover, Gemma 2 has been trained on many more tokens: 13T for the 27B version and 8T for the 9B version.
As for performance on public benchmarks, the results are good but hold no surprises:
A comparison with Qwen2 is missing. I expect Qwen2 7B to be significantly better on these benchmarks than Gemma 2 9B while being smaller. Google has submitted Gemma 2 to the new Open LLM leaderboard. The results will be interesting.
You can find Gemma 2 here (the models without “pytorch” in their name are compatible with HF Transformers):
Hopefully, the HF implementation will be less buggy than it was for Gemma 1.
Blog post describing Gemma 2:
Gemma 2 is now available to researchers and developers
Open LLM Leaderboard V2
The Open LLM leaderboard by Hugging Face was one of the most popular leaderboards for LLMs. Many LLM makers relied on it to compare LLMs and claim better performance.
However, it became evident that the tasks used by the leaderboard had become too easy. The evaluation datasets are so popular that they might have contaminated the training data of recent LLMs, or biased LLM developers toward optimizing performance specifically for these datasets.
Hugging Face refreshed the Open LLM leaderboard by replacing the evaluation tasks with much more difficult tasks.
IFEval: A dataset testing models' ability to follow explicit instructions, focusing on formatting rather than content.
BBH (Big Bench Hard): A subset of 23 challenging tasks from BigBench, evaluating language models using objective metrics on tasks like multistep arithmetic and language understanding, correlating well with human preferences.
MATH: A compilation of high-school level competition problems.
GPQA (Graduate-Level Google-Proof Q&A Benchmark): A challenging dataset with questions crafted by PhD-level experts, ensuring difficulty and factual accuracy, with restricted access to minimize data contamination.
MuSR (Multistep Soft Reasoning): A dataset of algorithmically generated complex problems requiring models to integrate reasoning with long-range context parsing.
MMLU-PRO (Massive Multitask Language Understanding - Professional): An improved version of MMLU with higher difficulty, reducing noise and increasing quality by presenting models with more choices and expert-reviewed questions.
For now, Qwen2 72B is ranked first on this new leaderboard, largely outperforming all the other models, including Llama 3, on all these tasks except IFEval.
Behind the leaderboard, Hugging Face still uses EleutherAI’s LM Evaluation Harness, so you can reproduce the same results on your own computer.
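For example, with the harness installed (pip install lm-eval), an evaluation can be launched from Python. This is a sketch under a few assumptions: the task name below is what I believe recent harness versions use for the V2 IFEval benchmark, and the model name is just an example; check the task list of your installed version.

```python
import lm_eval

# Run one of the new leaderboard tasks with EleutherAI's LM Evaluation Harness.
# "leaderboard_ifeval" is the task name I believe recent harness versions use
# for the V2 IFEval benchmark; list the available tasks if yours differs.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2-9b-it,dtype=bfloat16",
    tasks=["leaderboard_ifeval"],
    batch_size=4,
)
print(results["results"])
```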
MARS5: Small Text-to-speech Models for Accurate Voice Cloning
The authors of these models, CAMB.AI, present them as follows on the project’s GitHub page:
With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime and more.
This sounds like a significant breakthrough, as text-to-speech (TTS) models usually require a lot of data, are much larger, or are not open.
MARS5 models are available on the Hugging Face Hub:
CAMB-AI/MARS5-TTS (AGPL-3.0 license)
You will also find a notebook in the HF hub repository for testing the models.
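Based on the usage example in their README, cloning a voice from a short reference clip looks roughly like this (a sketch adapted from their documentation; the file name and transcript are placeholders, and argument names may differ in the current version):

```python
import torch
import librosa

# Load MARS5 from the CAMB-AI repository via torch.hub, as shown in their README.
mars5, config_class = torch.hub.load("Camb-ai/mars5-tts", "mars5_english", trust_repo=True)

# A ~5-second reference clip of the voice to clone, resampled to the model's rate.
wav, sr = librosa.load("reference_voice.wav", sr=mars5.sr, mono=True)  # placeholder file
wav = torch.from_numpy(wav)

# Transcript of the reference clip, needed for the higher-quality "deep clone" mode.
ref_transcript = "transcript of the reference audio"

cfg = config_class(deep_clone=True, temperature=0.7, top_k=100)
ar_codes, audio = mars5.tts("Text to synthesize in the cloned voice.",
                            wav, ref_transcript, cfg=cfg)
```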
Their neural architecture is quite complex, with auto-regressive and non-auto-regressive components. The auto-regressive component is very similar to Mistral’s but much smaller.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in The Salt, I reviewed:
Tokenization Falling Short: The Curse of Tokenization
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
⭐Instruction Pre-Training: Language Models are Supervised Multitask Learners
A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a discount for group subscriptions):
Have a nice weekend!