Hi Everyone,
In this edition of The Weekly Kaitchup:
SmolLM: Tiny LLMs by Hugging Face
Q-GaLore: Train From Scratch 7B Parameter LLMs with a 16 GB GPU
FlashAttention-3: Is It Useful to You?
The 25% discount on the yearly subscription is still available until tomorrow.
That’s 50% cheaper than the monthly subscription (over a year).
If you are a free subscriber, consider upgrading to paid to access all the notebooks (80+) and more than 100 articles.
AI Notebooks and Articles Published this Week by The Kaitchup
Notebook: #87 Fine-tune Gemma 2 on Your Computer -- With Transformers and Unsloth
Notebook: #88 GPU Benchmarking for LoRA, QLoRA Fine-tuning, and Inference with and without 4-bit Quantization
SmolLM: Tiny LLMs by Hugging Face
Hugging Face released SmolLM: 135M, 360M, and 1.7B parameter LLMs.
Hugging Face Collection: SmolLM (Apache 2.0 license)
Instruct versions are also available for chat applications.
LLMs of these sizes can easily be fully fine-tuned on low-end consumer GPUs. For instance, fine-tuning the 1.7B version only requires a 12 GB GPU. According to Hugging Face and public benchmarks, these models perform better than other LLMs of similar size:
Apple’s OpenELM models are of similar sizes, but Hugging Face didn’t include them in the comparison. SmolLM-135M is even smaller than the smallest OpenELM (270M).
In an upcoming article, we will try to fine-tune the 135M and 360M versions to make a tiny chat model. We will see whether we can get better results than with OpenELM.
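If you want to try this yourself before the article, here is a minimal full fine-tuning sketch with Transformers. The hub ID HuggingFaceTB/SmolLM-1.7B and the example dataset are my assumptions; adapt them and the hyperparameters to your setup.

```python
# Minimal full fine-tuning sketch for SmolLM-1.7B with Hugging Face Transformers.
# Assumptions: the hub ID "HuggingFaceTB/SmolLM-1.7B" and the example dataset below.
# pip install transformers datasets accelerate bitsandbytes
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "HuggingFaceTB/SmolLM-1.7B"  # assumed hub ID; the 135M and 360M versions work the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Any instruction dataset with a "text" column works for a quick test.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
tokenized = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                        batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="smollm-1.7b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,   # trades compute for memory
    optim="adamw_bnb_8bit",        # 8-bit optimizer states to help stay within a 12 GB budget
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=25,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```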
Q-GaLore: Train LLMs from Scratch with a 16 GB GPU
A few months ago, I presented GaLore, a method that projects the gradients into low-rank subspaces to reduce the memory footprint of the optimizer states. With GaLore, full fine-tuning and pre-training from scratch of 7B parameter LLMs are possible with a 32 GB GPU (24 GB with layer-wise updates).
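As a reminder, GaLore itself is already integrated into recent versions of Transformers as a drop-in optimizer (it needs the galore-torch package). A minimal sketch, assuming a recent Transformers release; the model ID, dataset, and option names may differ in your version:

```python
# Full fine-tuning with GaLore: optimizer states are kept in low-rank subspaces.
# pip install transformers galore-torch datasets accelerate
# The model ID and dataset are placeholders; swap in your own.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-2-7b-hf"  # any 7B causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
tokenized = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                        batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="galore-7b",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    optim="galore_adamw_layerwise",        # layer-wise updates for the lowest memory footprint
    optim_target_modules=["attn", "mlp"],  # project gradients of attention and MLP layers
    learning_rate=1e-5,
    max_steps=100,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```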
A new variant introducing quantization into GaLore, Q-GaLore, is now available:
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
As we can see in the figure above, the main difference from GaLore is that Q-GaLore stores the projection matrices in a 4-bit data type (INT4), whereas GaLore keeps them in 16-bit. The model weights are also quantized to INT8 (8-bit).
Thanks to these quantizations, full fine-tuning and pre-training from scratch of 7B parameter LLMs become possible with a $500 GPU, such as a 16 GB RTX 4060 Ti.
With all these quantizations, we could expect a significant accuracy drop or even unstable training. However, Q-GaLore seems to be able to preserve accuracy:
The authors released a Q-GaLore implementation:
GitHub: VITA-Group/Q-GaLore
This implementation is not yet mature enough for me to try it. Once it’s integrated into Hugging Face Transformers, I’ll write about it.
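In the meantime, to make the idea more concrete, here is a toy, single-matrix sketch of the mechanism: project the gradient into a low-rank subspace, keep the optimizer states there, and simulate the INT4 projection and INT8 weights with fake quantization. This is only an illustration of the concept, not the authors’ implementation.

```python
# Toy illustration of the core idea behind (Q-)GaLore on a single weight matrix.
# NOT the VITA-Group implementation: quantization is simulated (fake-quantized),
# the gradient is random, and Adam bias correction is omitted for brevity.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (quantize then dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
m, n, rank, lr, beta1, beta2, eps = 1024, 1024, 64, 1e-3, 0.9, 0.999, 1e-8
W = torch.randn(m, n) * 0.02          # weight matrix (Q-GaLore keeps weights in INT8)
exp_avg = torch.zeros(rank, n)        # Adam states live in the low-rank space:
exp_avg_sq = torch.zeros(rank, n)     # (rank x n) instead of (m x n)

for step in range(1, 201):
    G = torch.randn(m, n)             # stand-in for a real gradient

    if (step - 1) % 100 == 0:         # refresh the projection only occasionally
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        P = fake_quantize(U[:, :rank], num_bits=4)  # INT4-simulated projection matrix

    R = P.T @ G                                      # project gradient: (rank x n)
    exp_avg = beta1 * exp_avg + (1 - beta1) * R      # Adam moments in the low-rank space
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * R**2
    update = exp_avg / (exp_avg_sq.sqrt() + eps)
    W -= lr * (P @ update)                           # project the update back to full size
    W = fake_quantize(W, num_bits=8)                 # simulate INT8 weights
```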
FlashAttention-3: Is It Useful to You?
FlashAttention is a technique that can significantly speed up attention computation, especially for long sequences of tokens. This speed-up is mainly achieved by better exploiting the SRAM, the small, very fast, and expensive on-chip memory of GPUs.
Yet, on Hopper GPUs like the H100, FlashAttention still underexploits the hardware: FlashAttention-2 only reaches about 35% of the H100’s theoretical maximum throughput.
FlashAttention-3’s primary purpose is to better exploit the Hopper architecture. This third version of FlashAttention reaches up to 75% of the H100’s peak throughput, significantly improving the efficiency of the attention computation, with up to a 2.0x speedup over FlashAttention-2 with FP16.
The authors of FlashAttention-3 published a report describing how they proceeded:
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Do you need FlashAttention-3?
If you use an H100 GPU, yes: FlashAttention-3 will significantly reduce fine-tuning and inference times. However, if you don’t have an H100, FlashAttention-3 should be about as fast as FlashAttention-2. The paper doesn’t report the performance of FlashAttention-3 on other GPUs, but I would assume nothing changes.
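For reference, this is how the attention backend is selected in Transformers today; flash_attention_2 is the backend currently integrated, and I’m assuming (not confirmed) that FlashAttention-3 will eventually be exposed in a similar way. The model ID below is just an example.

```python
# Loading a model with the FlashAttention-2 backend in Transformers.
# Requires an Ampere or newer GPU and the flash-attn package (pip install flash-attn).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # FlashAttention kernels require fp16 or bf16
    attn_implementation="flash_attention_2",  # raises an error if flash-attn is not installed
    device_map="auto",
)

prompt = "Explain FlashAttention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```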
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
Next week in The Salt, I will discuss perplexity and explain why perplexity can’t be used to compare the performance of two different LLMs.
This week in The Salt, I reviewed:
⭐On Leakage of Code Generation Evaluation Datasets
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Self-Recognition in Language Models
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Looking for more professional services around LLMs?
Have a look at The Kaitchup Pro:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!