Padding-Free vs. Packing: Fast and Efficient Fine-Tuning for LLMs Explained

Padding-free is faster and avoids cross-contamination

Benjamin Marie
May 28, 2025
Image generated with ChatGPT

Padding is a widely used technique during training to ensure that all sequences within a batch have the same length. While effective, padding can lead to significant waste in both memory and computation.
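To make the waste concrete, here is a minimal sketch using the Hugging Face tokenizer API (the checkpoint and sentences are placeholders): every sequence in the batch is padded to the length of the longest one, and the pad positions contribute nothing to learning while still consuming memory and compute.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint (gated repo); any causal LM tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama models define no pad token by default

batch = [
    "Short example.",
    "A noticeably longer example that decides the padded length of the whole batch.",
]

# Each sequence is padded to the longest sequence in the batch.
encoded = tokenizer(batch, padding="longest", return_tensors="pt")
print(encoded["input_ids"].shape)    # (2, length of the longest sequence)
print(encoded["attention_mask"][0])  # trailing zeros mark the wasted pad positions
```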

Related: Padding Large Language Models (Benjamin Marie, August 11, 2023)

A popular alternative is packing, which aims to maximize sequence utilization and reduce compute costs. Many online tutorials showcase the efficiency of packing during supervised fine-tuning (SFT) of large language models (LLMs), often boasting a significant reduction in training time.
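For intuition, here is a toy sketch of the simplest packing scheme, what TRL calls "wrapped" packing: tokenized examples are concatenated into one stream (separated by EOS) and the stream is cut into fixed-length chunks, with no regard for example boundaries. The function name and toy token IDs below are mine, not TRL's implementation.

```python
def wrap_pack(tokenized_examples, max_seq_len, eos_token_id=0):
    """Naive 'wrapped' packing: concatenate all examples, then slice the
    stream into fixed-length chunks. Examples can be split across chunks,
    and unrelated examples end up sharing the same chunk."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_token_id])
    n_chunks = len(stream) // max_seq_len  # drop the incomplete tail for simplicity
    return [stream[i * max_seq_len:(i + 1) * max_seq_len] for i in range(n_chunks)]

# Toy token IDs standing in for three tokenized examples.
examples = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
print(wrap_pack(examples, max_seq_len=4))
# [[1, 2, 3, 0], [4, 5, 6, 7], [8, 0, 9, 10]]  # example 2 is split; 2 and 3 share a chunk
```

Without a block-diagonal attention mask, tokens of example 3 in that last chunk can attend to the tail of example 2: this is the cross-contamination mentioned above.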

However, packing is frequently used without a full understanding of what it does under the hood. Although it can significantly accelerate fine-tuning, achieving these gains without compromising model performance can be complex, and in some cases, not even feasible. Packing, in other words, comes with trade-offs.

Recently, Hugging Face introduced a third strategy for batching training examples with TRL: padding-free. This new approach avoids the inefficiencies of padding and sidesteps the compromises associated with packing.
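In practice, enabling it in TRL is close to a one-flag change. The sketch below is a minimal, hedged example: it assumes a recent TRL release where SFTConfig exposes a padding_free option and a working FlashAttention-2 install, since padding-free batching relies on variable-length attention kernels; the dataset and model names are placeholders, and exact argument names may differ between TRL versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; swap in your own SFT data.
train_dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="./llama-3.1-8b-padding-free",
    per_device_train_batch_size=4,
    padding_free=True,  # flatten the batch instead of padding (check your TRL version)
    model_init_kwargs={"attn_implementation": "flash_attention_2"},  # needed for variable-length attention
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # placeholder; gated repo, any causal LM works
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```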


In this article, I’ll explain how both packing (FFD and wrapped) and padding-free batching strategies work in the context of supervised fine-tuning. We’ll then run a series of experiments to assess their impact on learning dynamics. Finally, I’ll dive into the often-overlooked costs of packing and discuss why it’s best suited for specific use cases, such as continued pre-training, rather than general-purpose fine-tuning.

As we’ll see, the padding-free approach significantly accelerates training with only minimal downsides, making it a strong replacement for standard padding.

I used the following notebook to fine-tune Llama 3.1 8B with padding, packing, and padding-free batching:

Get the notebook (#167)

Padding vs. Packing vs. Padding-Free

Update (July 2, 2025): In TRL 0.19.0, Hugging Face introduced a new packing strategy called First Fit Decreasing (FFD). This method combines the strengths of the original packing approach with the efficiency benefits of padding-free batching. It is now the default packing strategy in TRL. I’ve updated this article to include FFD, which is now discussed at the end of this section.
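As a rough illustration of the idea behind FFD, not of TRL's actual implementation: sequences are sorted by length in decreasing order, and each one is placed into the first packed sequence ("bin") that still has enough room, opening a new bin when none does. Unlike wrapped packing, examples are never split.

```python
def ffd_pack(tokenized_examples, max_seq_len):
    """First Fit Decreasing: sort examples longest-first, then put each one
    into the first bin with enough remaining room; open a new bin otherwise.
    Examples are never split across bins."""
    bins, room = [], []
    for ids in sorted(tokenized_examples, key=len, reverse=True):
        for i, r in enumerate(room):
            if len(ids) <= r:
                bins[i].append(ids)
                room[i] -= len(ids)
                break
        else:
            bins.append([ids])
            room.append(max_seq_len - len(ids))
    return bins

# Toy examples of lengths 3, 5, 2, 3 packed into bins of capacity 8.
examples = [[1] * 3, [2] * 5, [3] * 2, [4] * 3]
print([[len(x) for x in b] for b in ffd_pack(examples, max_seq_len=8)])
# [[5, 3], [3, 2]]
```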
