Hi Everyone,
In this edition of The Weekly Kaitchup:
SmolLM: Tiny LLMs by Hugging Face
Q-GaLore: Train From Scratch 7B Parameter LLMs with a 16 GB GPU
FlashAttention-3: Is It Useful to You?
The 25% discount on the yearly subscription is still available until tomorrow.
That’s 50% cheaper than the monthly subscription (over a year).
If you are a free subscriber, consider upgrading to paid to access all the notebooks (80+) and more than 100 articles.
AI Notebooks and Articles Published this Week by The Kaitchup
Notebook: #87 Fine-tune Gemma 2 on Your Computer -- With Transformers and Unsloth
Notebook: #88 GPU Benchmarking for LoRA, QLoRA Fine-tuning, and Inference with and without 4-bit Quantization
SmolLM: Tiny LLMs by Hugging Face
Hugging Face released SmolLM: 135M, 360M, and 1.7B parameter LLMs.
Hugging Face Collection: SmolLM (Apache 2.0 license)
Instruct versions are also available for chat applications.
LLMs of these sizes can easily be fully fine-tuned on low-end consumer GPUs. For instance, fine-tuning the 1.7B version only requires a 12 GB GPU. According to Hugging Face and public benchmarks, these models perform better than other LLMs of similar size:
Apple’s OpenELM models are of similar sizes, but Hugging Face didn’t include them in the comparison. SmolLM-135M is even smaller than the smallest OpenELM (270M).
In an upcoming article, we will try to fine-tune the 135M and 360M versions to make a tiny chat model. We will see whether we can get better results than with OpenELM.
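If you want to try this yourself before the article, here is a minimal full fine-tuning sketch with Transformers. The hub ID HuggingFaceTB/SmolLM-1.7B and the example dataset are my assumptions; adapt them and the hyperparameters to your setup.

```python
# Minimal full fine-tuning sketch for SmolLM-1.7B with Hugging Face Transformers.
# Assumptions: the hub ID "HuggingFaceTB/SmolLM-1.7B" and the example dataset below.
# pip install transformers datasets accelerate bitsandbytes
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "HuggingFaceTB/SmolLM-1.7B"  # assumed hub ID; the 135M and 360M versions work the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Any instruction dataset with a "text" column works for a quick test.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
tokenized = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                        batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="smollm-1.7b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,   # trades compute for memory
    optim="adamw_bnb_8bit",        # 8-bit optimizer states to help stay within a 12 GB budget
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=25,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```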
Q-GaLore: Train LLMs from Scratch with a 16 GB GPU
A few months ago, I presented GaLore, a method that projects the gradients into low-rank subspaces to reduce the memory footprint of the optimizer states. With GaLore, full fine-tuning and pre-training from scratch of 7B parameter LLMs are possible with a 32 GB GPU (24 GB with layer-wise updates).
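As a reminder, GaLore itself is already integrated into recent versions of Transformers as a drop-in optimizer (it needs the galore-torch package). A minimal sketch, assuming a recent Transformers release; the model ID, dataset, and option names may differ in your version:

```python
# Full fine-tuning with GaLore: optimizer states are kept in low-rank subspaces.
# pip install transformers galore-torch datasets accelerate
# The model ID and dataset are placeholders; swap in your own.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-2-7b-hf"  # any 7B causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
tokenized = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                        batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="galore-7b",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    optim="galore_adamw_layerwise",        # layer-wise updates for the lowest memory footprint
    optim_target_modules=["attn", "mlp"],  # project gradients of attention and MLP layers
    learning_rate=1e-5,
    max_steps=100,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```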
A new variant introducing quantization into GaLore, Q-GaLore, is now available:
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
As we can see in the figure above, the main difference from GaLore is that Q-GaLore stores the projection matrices in a 4-bit data type (INT4), whereas GaLore keeps them in 16-bit. The model weights are also quantized to INT8 (8-bit).
Thanks to these quantizations, full fine-tuning and pre-training from scratch of 7B parameter LLMs become possible with a $500 GPU, such as a 16 GB RTX 4060 Ti.
With all these quantizations, we could expect a significant accuracy drop or even unstable training. However, Q-GaLore seems to be able to preserve accuracy:
The authors released a Q-GaLore implementation:
GitHub: VITA-Group/Q-GaLore
This implementation is not yet mature enough for me to try it. Once it’s integrated into Hugging Face Transformers, I’ll write about it.
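In the meantime, to make the idea more concrete, here is a toy, single-matrix sketch of the mechanism: project the gradient into a low-rank subspace, keep the optimizer states there, and simulate the INT4 projection and INT8 weights with fake quantization. This is only an illustration of the concept, not the authors’ implementation.

```python
# Toy illustration of the core idea behind (Q-)GaLore on a single weight matrix.
# NOT the VITA-Group implementation: quantization is simulated (fake-quantized),
# the gradient is random, and Adam bias correction is omitted for brevity.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (quantize then dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
m, n, rank, lr, beta1, beta2, eps = 1024, 1024, 64, 1e-3, 0.9, 0.999, 1e-8
W = torch.randn(m, n) * 0.02          # weight matrix (Q-GaLore keeps weights in INT8)
exp_avg = torch.zeros(rank, n)        # Adam states live in the low-rank space:
exp_avg_sq = torch.zeros(rank, n)     # (rank x n) instead of (m x n)

for step in range(1, 201):
    G = torch.randn(m, n)             # stand-in for a real gradient

    if (step - 1) % 100 == 0:         # refresh the projection only occasionally
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        P = fake_quantize(U[:, :rank], num_bits=4)  # INT4-simulated projection matrix

    R = P.T @ G                                      # project gradient: (rank x n)
    exp_avg = beta1 * exp_avg + (1 - beta1) * R      # Adam moments in the low-rank space
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * R**2
    update = exp_avg / (exp_avg_sq.sqrt() + eps)
    W -= lr * (P @ update)                           # project the update back to full size
    W = fake_quantize(W, num_bits=8)                 # simulate INT8 weights
```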
FlashAttention-3: Is It Useful to You?
FlashAttention is a technique that can significantly speed up attention computation, especially for long sequences of tokens. This speed-up is mainly achieved by better exploiting the SRAM, the small, very fast, and expensive on-chip memory of GPUs.
Yet, on Hopper GPUs like the H100, FlashAttention still underexploits the hardware: FlashAttention-2 only reaches about 35% of the H100’s theoretical maximum throughput.
FlashAttention-3’s primary purpose is to better exploit the Hopper architecture. This third version of FlashAttention reaches up to 75% of the H100’s peak throughput, significantly improving the efficiency of the attention computation, with up to a 2.0x speedup over FlashAttention-2 with FP16.
The authors of FlashAttention-3 published a report describing how they proceeded:
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Do you need FlashAttention-3?
If you use an H100 GPU, yes: FlashAttention-3 will significantly reduce fine-tuning and inference times. However, if you don’t have an H100, FlashAttention-3 should be about as fast as FlashAttention-2. The paper doesn’t report the performance of FlashAttention-3 on other GPUs, but I would assume nothing changes.
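For reference, this is how the attention backend is selected in Transformers today; flash_attention_2 is the backend currently integrated, and I’m assuming (not confirmed) that FlashAttention-3 will eventually be exposed in a similar way. The model ID below is just an example.

```python
# Loading a model with the FlashAttention-2 backend in Transformers.
# Requires an Ampere or newer GPU and the flash-attn package (pip install flash-attn).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # FlashAttention kernels require fp16 or bf16
    attn_implementation="flash_attention_2",  # raises an error if flash-attn is not installed
    device_map="auto",
)

prompt = "Explain FlashAttention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```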
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
Next week in The Salt, I will discuss perplexity and explain why perplexity can’t be used to compare the performance of two different LLMs.
This week in The Salt, I reviewed:
⭐On Leakage of Code Generation Evaluation Datasets
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Self-Recognition in Language Models
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Looking for more professional services around LLMs?
Have a look at The Kaitchup Pro:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!