Hi Everyone,
In this edition of The Weekly Kaitchup:
Don’t Do Gradient Accumulation for Small Batch Sizes!
Better Embeddings with Context
Aria: A New State-of-the-Art VLM, Again!
The first chapter of The Kaitchup’s book will be released on October 15th. If you purchased the book, you will receive the chapters at the email address you provided. You can still purchase the book at a 30% discount with the code "PRESALE2":
More info about the book:
You can also get the book for free by subscribing to The Kaitchup Pro:
Don’t Do Gradient Accumulation for Small Batch Sizes!
When fine-tuning LLMs locally, using large batch sizes is often impractical due to high GPU memory consumption. Instead, we simulate larger batch sizes through gradient accumulation. Rather than updating the model weights after each batch, gradient accumulation sums gradients over several smaller mini-batches and only updates the weights once a specified number of batches has been processed. This effectively mimics training with a larger batch size.
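To make the mechanism concrete, here is a minimal PyTorch sketch of gradient accumulation, with a toy model and random data standing in for the LLM and its dataloader:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the LLM, optimizer, and dataloader (illustrative only).
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(64)]  # batch_size=1

accumulation_steps = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale the loss so the accumulated gradient approximates the mean
    # over the "virtual" batch of 32 samples.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one weight update per 32 mini-batches
        optimizer.zero_grad()
```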
In theory, setting a batch size of 1 and accumulating gradients over 32 batches is equivalent to using a batch size of 32. However, I've observed that TRL and Transformers (and frameworks built on them, such as Unsloth) yield noticeably worse results with small batch sizes, even when gradient accumulation is used.
Below are some experiments with Llama 3.2 and SmolLM-135M using TRL and Transformers:
batch_size=1 with gradient_accumulation_steps=32 performs much worse than batch_size=32 with gradient_accumulation_steps=1, even though the two settings are mathematically equivalent.
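For reference, the two settings being compared look like this with TRL's SFTConfig (a sketch; the output directories are placeholders and all other hyperparameters are kept identical):

```python
from trl import SFTConfig

# "Virtual" batch of 32 via accumulation: noticeably worse in my runs.
config_accumulation = SFTConfig(
    output_dir="./llama3.2-accum",        # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
)

# Real batch of 32, no accumulation: the mathematically equivalent baseline.
config_full_batch = SFTConfig(
    output_dir="./llama3.2-batch32",      # placeholder
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)
```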
I confirmed the same behavior with Qwen2.5. The precision of the model's parameters doesn't matter: it happens with both bf16 and fp32 weights. I opened an issue in the TRL repository several days ago, but not much has happened since.
I also posted it on X and Reddit. So far, no one has been able to pin down the cause.
I'm now waiting for the Hugging Face team to investigate. Meanwhile, I strongly recommend avoiding gradient accumulation with a per-device batch size lower than 8.
Better Embeddings with Context
Dense document embeddings, such as those used for RAG, are key for neural retrieval but are often created without considering document context. This work by Cornell University suggests that including neighboring documents in embeddings can improve retrieval, especially for specific use cases:
Contextual Document Embeddings
The authors introduce a contrastive learning approach that factors in document neighbors and a new architecture that encodes this neighbor information directly.
This embedding model operates in two stages. In the first stage, a subset of the corpus is embedded using a "first-stage" model to gather key dataset information. In the second stage, queries and documents are embedded, leveraging the corpus information obtained from the first stage. Importantly, the first stage can be run offline, allowing only the second-stage model weights to be used during inference.
The model outperforms standard bi-encoders and sets new records on the MTEB benchmark without needing complex techniques, making the approach broadly useful for contrastive learning tasks.
They released one context model on the HF hub:
It is already supported by sentence-transformers, so you can plug it into a RAG pipeline right away.
The model card provides code examples to run both stages.
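To give a feel for how such a model slots into a retrieval pipeline, here is a generic sentence-transformers sketch; the model ID is a placeholder, and the released checkpoint additionally expects the first-stage corpus embeddings as an extra argument, as documented on its model card:

```python
from sentence_transformers import SentenceTransformer

# Placeholder model ID; replace it with the released contextual embedding model.
model = SentenceTransformer("org/contextual-embedding-model", trust_remote_code=True)

corpus = [
    "Gradient accumulation simulates large batches on small GPUs.",
    "Mixture-of-experts models activate only a subset of their parameters.",
    "Dense embeddings are the backbone of neural retrieval and RAG.",
]

# Embed the documents once. For the contextual model, this is also where the
# first-stage corpus embeddings would be computed (offline) and passed in.
doc_embeddings = model.encode(corpus)

# Embed the query and rank documents by cosine similarity.
query_embedding = model.encode("How can I fine-tune an LLM on a single GPU?")
scores = model.similarity(query_embedding, doc_embeddings)
print(corpus[scores.argmax().item()])
```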
Aria: A New State-of-the-Art VLM, Again!
The last 30 days have been very fruitful for VLMs: Qwen2-VL, Pixtral, Molmo, Llama 3.2 Vision, and now Aria!
Hugging Face Hub: rhymes-ai/Aria (Apache 2.0 license)
Aria is an open-source multimodal native mixture-of-experts (MoE) model capable of processing multiple input types such as text, code, image, and video.
Aria’s architecture uses a fine-grained MoE decoder, i.e., one with many small expert subnetworks, which activates 3.5B of its 24.9B total parameters per token for efficient speed and parameter use, and includes a lightweight visual encoder (438M parameters).
The model was trained with a 4-stage process: language pre-training, multimodal pre-training, multimodal long-context pre-training, and multimodal post-training.
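If you want to try it, loading the model with Transformers should look roughly like this (a sketch; Aria ships custom modeling code, so check the model card for the exact inference snippet and prompt format):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"

# The custom modeling code on the Hub requires trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 weights: roughly 2 bytes per parameter
    device_map="auto",            # all 24.9B parameters must be loaded
    trust_remote_code=True,
)
```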
They evaluated it on multimodal benchmarks:
They didn’t include Qwen2-VL in their evaluation, but Aria seems better than Pixtral, which is itself better than Qwen2-VL 7B (according to Mistral AI’s evaluation). However, note that Aria has twice as many parameters as Pixtral. While Aria only activates a portion of them for inference, we still need to load the entire model, i.e., Pixtral is much more memory-efficient.
Note: As usual, these benchmark numbers don’t inspire trust. Rather than computing the scores for the other models themselves, the authors copied scores already published elsewhere, without being able to check whether those scores were obtained with the same hyperparameters, prompts, etc. As far as I can tell, none of these scores are truly comparable.
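A quick back-of-the-envelope comparison of the weight memory alone (assuming bf16, i.e., 2 bytes per parameter, 24.9B total parameters for Aria, and roughly 12.4B for Pixtral; activations and the KV cache come on top of this):

```python
# Approximate memory needed just to hold the weights in bf16 (2 bytes/param).
for name, n_params in [("Aria (3.5B active)", 24.9e9), ("Pixtral 12B", 12.4e9)]:
    print(f"{name}: ~{n_params * 2 / 1e9:.0f} GB")
# Aria (3.5B active): ~50 GB
# Pixtral 12B: ~25 GB
```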
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range, e.g., RTX 4060, to high-end, e.g., RTX 4090.
While consumer GPUs have much less memory than GPUs dedicated to AI, they are more cost-effective, by far, for inference with small batches and fine-tuning LLMs with up to ~35B parameters using PEFT methods.
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
GPU Selection of the Week:
RTX 4090 (24 GB): ASUS TUF Gaming GeForce RTX™ 4090 OG OC
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2
RTX 4070 Ti SUPER (16 GB): MSI Gaming RTX 4070 Ti Super 16G AERO
RTX 4060 Ti (16 GB): ZOTAC Gaming GeForce RTX 4060 Ti 16GB AMP
This week, I’ve seen a notable price increase for the RTX 4090. I don’t expect prices to drop significantly until Black Friday. With NVIDIA soon ending production of new RTX 4090 units in preparation for the RTX 50xx release, it's likely the 4090 won’t become much cheaper for some time.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I reviewed AdEMAMix, a new optimizer designed to train large language models (LLMs) faster and more effectively than AdamW. In my experiments, AdEMAMix shows potential advantages for lengthy pre-training but may be too costly in terms of time and resources when tuning hyperparameters for shorter training runs. In contrast, AdamW often performs adequately with default hyperparameters, making it a more convenient choice for shorter training tasks.
This week, I briefly reviewed:
⭐Law of the Weakest Link: Cross Capabilities of Large Language Models
⭐Contextual Document Embeddings
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!
On the issue of gradient accumulation, I learned this from a friend late this spring:
For example, with `bs=1` and `gradient_accumulation_steps=2`, there are two sequences of `attention_mask`:
- seq1: [1,1,1,1,1,1,1,1,0,0]
- seq2: [1,1,0,0,0,0,0,0,0,0]
In theory, when computing the loss for backpropagation, seq1 should carry 80% of the weight and seq2 20% (proportional to their non-padded tokens), but HF's implementation is a straightforward (seq1_loss + seq2_loss)/2, which gives sequences of different actual lengths the same weight.
I don't know whether this explains the issue, but the bias may be mitigated when `gradient_accumulation_steps` is large, so the effect would be less pronounced.
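Here is a tiny numeric sketch of that bias with the two masks above (dummy per-token loss values; real trainers average the cross-entropy over the non-padded tokens):

```python
import torch

# Per-token losses for the two accumulated sequences (dummy values).
seq1_token_losses = torch.tensor([2.0] * 8)   # 8 real tokens out of 10
seq2_token_losses = torch.tensor([4.0] * 2)   # 2 real tokens out of 10

# Straightforward averaging: both sequences weigh the same regardless of length.
naive = (seq1_token_losses.mean() + seq2_token_losses.mean()) / 2      # 3.0

# Token-weighted average, equivalent to a true batch containing both sequences:
# seq1 contributes 80% (8/10 tokens), seq2 contributes 20% (2/10 tokens).
weighted = (seq1_token_losses.sum() + seq2_token_losses.sum()) / 10    # 2.4

print(naive.item(), weighted.item())
```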