Better Packing for Fine-Tuning LLMs with the First Fit Decreasing (FFD) Strategy
No more over-segmentation, no more cross-contamination
A few weeks ago, I published an article comparing different batching strategies: packing, standard padding, and padding-free. We found that packing was often unsuitable due to the segmentation of training samples and cross-contamination.
Padding-free became my preferred alternative to standard padding, as it significantly sped up training with minimal drawbacks.
With the release of TRL 0.19.0, a new default packing strategy was introduced: First Fit Decreasing, which performs even better than padding-free.
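If you want to try it, enabling packing in TRL only takes a flag in the training configuration. Here is a minimal sketch, assuming recent TRL parameter names (packing, packing_strategy, max_length) and an illustrative model and dataset; check the documentation of your TRL version for the exact arguments.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative dataset and model; swap in your own.
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="./sft-ffd-packing",
    packing=True,              # turn packing on
    # packing_strategy="ffd",  # explicit selection; recent TRL versions default to FFD
    max_length=1024,           # size of each packed sequence (the "box")
    per_device_train_batch_size=4,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # placeholder model name
    args=config,
    train_dataset=dataset,
)
trainer.train()
```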
Packing with FFD
We saw that packing is subject to cross-contamination and may spread samples over many sequences. The new FFD variant packs more intelligently: it is not subject to cross-contamination because it also borrows some of the mechanics of padding-free training.
Think of each training sample as a stick of varying length and of each packed sequence as a box that can hold sticks up to a fixed size (for example, 1,000 tokens). The algorithm does four simple things (a code sketch follows the list):
Trim anything too long. Any stick longer than the box size is cut down so nothing is bigger than the box.
Sort by length. Lay all remaining sticks on the table from longest to shortest.
Pack with “first-fit-decreasing.” Pick up the longest stick and drop it into the first box that still has room; if none have room, open a new box. Repeat for the next-longest stick, always using the first box that fits. This quickly fills every gap without much wasted space.
Glue and label. After all sticks are packed, the algorithm glues the sticks in each box end-to-end and writes a label that says where each new, packed sequence starts and ends so the model can read them back correctly.
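To make the four steps concrete, here is a toy, framework-free Python sketch; the names (pack_ffd, box_size, seq_lengths) are mine for illustration, not TRL's actual implementation.

```python
from typing import List

def pack_ffd(samples: List[List[int]], box_size: int = 1000):
    """Pack tokenized samples into fixed-size boxes with First Fit Decreasing."""
    # 1. Trim anything too long: no sample may exceed the box size.
    trimmed = [s[:box_size] for s in samples]
    # 2. Sort by length, longest first.
    trimmed.sort(key=len, reverse=True)

    boxes: List[List[List[int]]] = []   # each box is a list of samples
    for sample in trimmed:
        # 3. First fit: drop the sample into the first box with enough room left.
        for box in boxes:
            if sum(len(s) for s in box) + len(sample) <= box_size:
                box.append(sample)
                break
        else:
            boxes.append([sample])      # no box fits: open a new one

    packed = []
    for box in boxes:
        # 4. Glue and label: concatenate the samples and record their boundaries
        #    so the trainer knows where each original sample starts and ends.
        input_ids = [tok for s in box for tok in s]
        seq_lengths = [len(s) for s in box]
        packed.append({"input_ids": input_ids, "seq_lengths": seq_lengths})
    return packed

# Example: samples of 900, 700, 400, 300, and 150 tokens with box_size=1000
toy = [list(range(n)) for n in (900, 700, 400, 300, 150)]
for i, p in enumerate(pack_ffd(toy)):
    print(f"box {i}: sample lengths {p['seq_lengths']}, total {len(p['input_ids'])}")
```

On these example lengths, FFD produces the boxes [900], [700, 300], and [400, 150]: every sample stays whole inside a single box, which is exactly what avoids spreading a training sample over several sequences.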
It’s not entirely padding-free, and it still truncates samples longer than the box size, but it avoids scattering pieces of a long training sample across different sequences.
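The boundary labels from step 4 are what make this safe. As an illustration (not TRL's actual code), the per-sample lengths of one box can be turned into position IDs that restart at 0 for each sample; combined with a variable-length attention kernel, this keeps every sample attending only to itself, which is why there is no cross-contamination.

```python
# Illustrative only: build position_ids for one packed box from its sample lengths.
def positions_for_box(seq_lengths):
    position_ids = []
    for length in seq_lengths:
        position_ids.extend(range(length))   # positions restart at 0 for each sample
    return position_ids

print(positions_for_box([4, 3, 2]))
# [0, 1, 2, 3, 0, 1, 2, 0, 1]
```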
I updated the article and accompanying notebook to add more experimental results with this new strategy. You can check it here:
Padding-Free vs. Packing: Fast and Efficient Fine-Tuning for LLMs Explained