Padding-Free vs. Packing: Fast and Efficient Fine-Tuning for LLMs Explained

Padding-free is faster and avoids cross-contamination

Benjamin Marie
May 28, 2025
Image generated with ChatGPT

Padding is a widely used technique during training to ensure that all sequences within a batch have the same length. While effective, padding can lead to significant waste in both memory and computation.
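To make the waste concrete, here is a minimal sketch using the Hugging Face tokenizer API (the checkpoint and sentences are placeholders): every sequence in the batch is padded to the length of the longest one, and the pad positions contribute nothing to learning while still consuming memory and compute.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint (gated repo); any causal LM tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama models define no pad token by default

batch = [
    "Short example.",
    "A noticeably longer example that decides the padded length of the whole batch.",
]

# Each sequence is padded to the longest sequence in the batch.
encoded = tokenizer(batch, padding="longest", return_tensors="pt")
print(encoded["input_ids"].shape)    # (2, length of the longest sequence)
print(encoded["attention_mask"][0])  # trailing zeros mark the wasted pad positions
```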

Related: Padding Large Language Models (Benjamin Marie, August 11, 2023)

A popular alternative is packing, which aims to maximize sequence utilization and reduce compute costs. Many online tutorials showcase the efficiency of packing during supervised fine-tuning (SFT) of large language models (LLMs), often boasting a significant reduction in training time.
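For intuition, here is a toy sketch of the simplest packing scheme, what TRL calls "wrapped" packing: tokenized examples are concatenated into one stream (separated by EOS) and the stream is cut into fixed-length chunks, with no regard for example boundaries. The function name and toy token IDs below are mine, not TRL's implementation.

```python
def wrap_pack(tokenized_examples, max_seq_len, eos_token_id=0):
    """Naive 'wrapped' packing: concatenate all examples, then slice the
    stream into fixed-length chunks. Examples can be split across chunks,
    and unrelated examples end up sharing the same chunk."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_token_id])
    n_chunks = len(stream) // max_seq_len  # drop the incomplete tail for simplicity
    return [stream[i * max_seq_len:(i + 1) * max_seq_len] for i in range(n_chunks)]

# Toy token IDs standing in for three tokenized examples.
examples = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
print(wrap_pack(examples, max_seq_len=4))
# [[1, 2, 3, 0], [4, 5, 6, 7], [8, 0, 9, 10]]  # example 2 is split; 2 and 3 share a chunk
```

Without a block-diagonal attention mask, tokens of example 3 in that last chunk can attend to the tail of example 2: this is the cross-contamination mentioned above.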

However, packing is frequently used without a full understanding of what it does under the hood. Although it can significantly accelerate fine-tuning, achieving these gains without compromising model performance can be complex, and in some cases, not even feasible. Packing, in other words, comes with trade-offs.

Recently, Hugging Face introduced a third strategy for batching training examples with TRL: padding-free. This new approach avoids the inefficiencies of padding and sidesteps the compromises associated with packing.
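In practice, enabling it in TRL is close to a one-flag change. The sketch below is a minimal, hedged example: it assumes a recent TRL release where SFTConfig exposes a padding_free option and a working FlashAttention-2 install, since padding-free batching relies on variable-length attention kernels; the dataset and model names are placeholders, and exact argument names may differ between TRL versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; swap in your own SFT data.
train_dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="./llama-3.1-8b-padding-free",
    per_device_train_batch_size=4,
    padding_free=True,  # flatten the batch instead of padding (check your TRL version)
    model_init_kwargs={"attn_implementation": "flash_attention_2"},  # needed for variable-length attention
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # placeholder; gated repo, any causal LM works
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```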


In this article, I’ll explain how both packing (FFD and wrapped) and padding-free batching strategies work in the context of supervised fine-tuning. We’ll then run a series of experiments to assess their impact on learning dynamics. Finally, I’ll dive into the often-overlooked costs of packing and discuss why it’s best suited for specific use cases, such as continued pre-training, rather than general-purpose fine-tuning.

As we’ll see, the padding-free approach significantly accelerates training with only minimal downsides, making it a strong replacement for standard padding.

I used the following notebook to fine-tune Llama 3.1 8B with padding, packing, and padding-free batching:

Get the notebook (#167)

Padding vs. Packing vs. Padding-Free

Update (July 2, 2025): In TRL 0.19.0, Hugging Face introduced a new packing strategy called First Fit Decreasing (FFD). This method combines the strengths of the original packing approach with the efficiency benefits of padding-free batching. It is now the default packing strategy in TRL. I’ve updated this article to include FFD, which is now discussed at the end of this section.
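As a rough illustration of the idea behind FFD, not of TRL's actual implementation: sequences are sorted by length in decreasing order, and each one is placed into the first packed sequence ("bin") that still has enough room, opening a new bin when none does. Unlike wrapped packing, examples are never split.

```python
def ffd_pack(tokenized_examples, max_seq_len):
    """First Fit Decreasing: sort examples longest-first, then put each one
    into the first bin with enough remaining room; open a new bin otherwise.
    Examples are never split across bins."""
    bins, room = [], []
    for ids in sorted(tokenized_examples, key=len, reverse=True):
        for i, r in enumerate(room):
            if len(ids) <= r:
                bins[i].append(ids)
                room[i] -= len(ids)
                break
        else:
            bins.append([ids])
            room.append(max_seq_len - len(ids))
    return bins

# Toy examples of lengths 3, 5, 2, 3 packed into bins of capacity 8.
examples = [[1] * 3, [2] * 5, [3] * 2, [4] * 3]
print([[len(x) for x in b] for b in ffd_pack(examples, max_seq_len=8)])
# [[5, 3], [3, 2]]
```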
