Padding is one of the most under-documented aspects of large language models (LLMs). Why? Simply because LLMs are usually pre-trained without padding.
Nonetheless, padding is necessary for fine-tuning LLMs on custom datasets. Failing to pad training examples correctly can trigger various kinds of unexpected behavior: a null or infinite loss during training, endless generation, or empty output during inference are all symptoms of incorrect padding.
In this article, I first explain what padding is and why it is necessary. Then, I show how you can find the correct padding strategy for an LLM pre-trained without padding. I propose two different solutions to add padding support to LLMs using Hugging Face’s Transformers.
Toward the end of the article, I also provide examples showing how to pad your training examples for Llama 2. You can find examples for Llama 3 in this article:
After reading this article, you should be able to figure out by yourself how to pad training examples for LLMs without reading their documentation or tutorials.
All the examples to learn how to pad LLMs are in this notebook:
Pad and batch
What is padding and why do we pad?
Let’s take one example that we wish to use for fine-tuning an LLM.
example = "You are not a chatbot."
We have to turn this example into a sequence of tokens. Libraries, such as Transformers, usually tokenize following these steps:
Segment the example into subwords according to a given vocabulary:
example = ["▁You", "▁are", "▁not", "▁a". "▁chat", "bot", "."]
Replace words by their index from the vocabulary to obtain a sequence of integers:
example = [887, 526, 451, 263, 13563, 7451, 29889]
Add special tokens to the sequence: BOS token, EOS token, UNK token, PAD token, etc.
example = [1, 887, 526, 451, 263, 13563, 7451, 29889]
Note: For this example, I use Llama 2’s tokenizer. We will see below in detail how to do it.
In this example, only the BOS (beginning of sequence) special token has been added. It has the ID 1.
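To make these steps concrete, here is a minimal sketch using Hugging Face’s Transformers (it assumes you have access to the Llama 2 tokenizer loaded in the case study below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Step 1: segment the example into subwords
print(tokenizer.tokenize("You are not a chatbot."))
# ['▁You', '▁are', '▁not', '▁a', '▁chat', 'bot', '.']

# Steps 2 and 3: map the subwords to their IDs and add the BOS token
print(tokenizer.encode("You are not a chatbot."))
# [1, 887, 526, 451, 263, 13563, 7451, 29889]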
An attention mask is also generated for each training example. This mask tells the model whether it should give attention to a token (1) or not (0). The attention mask for this example is simple since all the tokens should be considered.
#We have as many values as tokens.
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1]
The next step is to wrap everything into tensors with PyTorch. This wrapping is necessary to apply the matrix operations for which CUDA and GPUs are optimized.
{'input_ids': tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
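Here is a minimal sketch of this wrapping done “by hand” with PyTorch; in practice, passing return_tensors="pt" to the tokenizer does it for you, as we will see in the case study below.

import torch

# Wrap the token IDs and the attention mask into 2D tensors (batch of size 1)
input_ids = torch.tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1]])
print({'input_ids': input_ids, 'attention_mask': attention_mask})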
Now, let’s say that we have not one but two training examples. For the sake of simplicity, I’ll just duplicate the one I already have. The new tensors have one more row:
{'input_ids': tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 263, 13563, 7451, 29889]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]])}
Both examples have the same length (of course, since they are identical), so both tensors have the same dimensions: 2x8 (N x M).
Examples are put into tensors to create batches so that the neural network can update its weights after seeing N examples. Batching is critical for computational efficiency.
Now, let’s introduce a third example that is shorter:
example = "You are not."
After tokenization, we obtain:
example = [1, 887, 526, 451, 29889]
attention_mask = [1, 1, 1, 1, 1]
If you try to add it to our list of examples and create the tensors, you will get an error. But if we imagine that no error is raised, we would obtain:
{'input_ids': tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 29889]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])}
Can you see the problem here and why it is not possible to create such tensors?
We have one row of a different length. We can’t apply matrix operations on this.
In most datasets, examples don’t have the same length. We have to modify them to make sure that examples in the same batch have the same length.
This is why we need “padding”.
Padding token and padding side
You can see padding as extending a sequence up to a given length by repeating a dummy token.
This dummy token is a “pad token”.
For example, our first example above has a length of 8 tokens (including the BOS token). Let’s say that in our batch we won’t have sequences longer than 8 tokens. All the sequences must be 8 tokens long.
Our shorter, third example contains only 5 tokens. So we must add 3 pad tokens to make a sequence of 8 tokens.
example = "You are not. [PAD] [PAD] [PAD]"
In practice, we don’t manually add “[PAD]” tokens to the sequences. Most tokenizers would split “[PAD]” into subwords. The pad token is usually a special token defined inside the tokenizer and automatically added, if necessary, along with the other special tokens to the sequence.
If the pad token has the ID 32000 in the vocabulary, we would obtain:
example = [1, 887, 526, 451, 29889, 32000, 32000, 32000]
Now, we have a sequence with the expected length. But one problem remains: We also need to modify the attention mask.
Remember, the pad tokens are dummy tokens, we don’t want the LLM to give any attention to them. We only introduced these tokens to fill sequences and create correct tensors.
To indicate this to the model, we simply put a 0 at their positions in the attention mask so that the model ignores them.
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]
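In plain Python, right padding boils down to appending the pad ID to the token sequence and a 0 to the attention mask. A minimal sketch, using the hypothetical pad ID 32000 from this example:

# Right padding: append the pad ID and 0s until the target length is reached
def pad_right(input_ids, attention_mask, target_length, pad_id=32000):
    n_pad = target_length - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

ids, mask = pad_right([1, 887, 526, 451, 29889], [1, 1, 1, 1, 1], 8)
print(ids)   # [1, 887, 526, 451, 29889, 32000, 32000, 32000]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]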
Finally, we can create correct tensors with the padded examples:
{'input_ids': tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 29889, 32000, 32000, 32000]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0]])}
Note: Padding is performed when a sequence is too short given the maximum length. But in some cases, a sequence can be too long. In this situation, we have to truncate the sequence so that its size matches the maximum length.
Another important parameter of padding is the padding side. In the example above, I padded right. If the sequence ends with an EOS token, the pad tokens are added after it.
We can also pad left. In this situation, the tensors look like this:
{'input_ids': tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 263, 13563, 7451, 29889],
[32000, 32000, 32000, 1, 887, 526, 451, 29889]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1]])}
Pad tokens are added before the BOS token.
Which side to choose mainly depends on the LLM you want to use and your downstream tasks. That’s why it’s important to study the model and its tokenizer before making any decision. Below, we will see how to make this decision for Llama 2. You can do the same for other LLMs.
Adding padding support for causal LLM
As we saw, padding is (almost) always necessary for fine-tuning. Yet, many LLMs don’t support padding by default: they don’t have a special pad token in their vocabulary.
Here, I present two solutions to add a pad token.
The simple solution
This solution is the one that you will find in most tutorials.
It simply assigns an existing token as the pad token. For instance, you can declare that your pad token will be the EOS token. We would obtain tensors like this (right-padded, where 2 is the ID of the EOS token):
{'input_ids': tensor([[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 263, 13563, 7451, 29889],
[1, 887, 526, 451, 29889, 2, 2, 2]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0]])}
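In code, this solution is usually a one-liner. A minimal sketch, assuming the tokenizer is already loaded as in the case study below:

# Reuse the EOS token (ID 2 for Llama 2) as the pad token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # with this solution, right padding is safer (see below)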
The problem with this solution is that the LLM can get confused: most of the time, the EOS token now has a 0 in the attention mask, which encourages the LLM to ignore the original EOS token. This is not ideal, since the EOS token signals the LLM to stop generating.
Also, with this solution, we have to pad right. If you pad left, sequences would begin with an EOS token, which would stop generation early.
However, with recent implementations of LLMs, this is usually not a problem: the pad tokens are masked out by the attention mask before the model processes them, so even if they are EOS tokens, the model won’t see them.
In my opinion, a better alternative is to use the UNK token, or any other token that is not very important, as the pad token.
Meta, in its “Llama recipes”, also uses the UNK token. Llama 3.1, on the other hand, has a dedicated special token for padding.
The alternative solution: Create a pad token from scratch
Ideally, we want a pad token that is used only for padding.
If the vocabulary doesn’t contain a pad token, we have to create one from scratch. This is the solution recommended by Hugging Face for Llama 2.
With libraries such as transformers, it’s easy to extend a vocabulary.
If you want to create a pad token, you have to follow these steps:
add the pad token as a special token in the vocabulary of the LLM
resize the token embeddings
retrain the token embeddings (optional)
If you are on a budget and use LoRA for fine-tuning, you may want to skip the last step, since the token embeddings can weigh several hundred million parameters.
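Here is a minimal sketch of these steps with Transformers (the model name is the one used in the case study below, and “[PAD]” is the token string we choose for padding):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Add the pad token as a special token in the vocabulary
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# 2) Resize the token embeddings so that the new token ID gets an embedding row
model.resize_token_embeddings(len(tokenizer))
# 3) (Optional) retrain the token embeddings, e.g., by keeping them trainable during fine-tuning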
Case study: padding Llama 2 with Hugging Face’s Transformers
In this section, we will enable padding for Llama 2. To replicate each step, you will need access to Llama 2 on Hugging Face. I explained how to get Llama 2 in this article.
First, install the Transformers library:
pip install transformers
Then, we import transformers and load the tokenizer.
from transformers import AutoTokenizer
#The model we want to fine-tune
pretrained_model_dir = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
We define two training examples:
prompt1 = "You are not a chatbot."
prompt2 = "You are not."
If we put prompt1 twice in the same batch, everything goes well:
prompts = [prompt1, prompt1]
input = tokenizer(prompts, return_tensors="pt")
print(input)
Output:
{'input_ids': tensor([[ 1, 887, 526, 451, 263, 13563, 7451, 29889],
[ 1, 887, 526, 451, 263, 13563, 7451, 29889]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1]])}
But if you add prompt2, you will get an error as expected:
prompts = [prompt1, prompt1, prompt2]
input = tokenizer(prompts, return_tensors="pt")
print(input)
Output:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
It’s clear that the tokenizer didn’t pad the examples.
We can solve this problem by simply using the UNK token as a pad token, as follows:
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.unk_token
input = tokenizer(prompts, padding='max_length', max_length=20, return_tensors="pt")
print(input)
In this example, I asked the tokenizer to pad up to max_length. I set max_length to 20. If your example contains 10 tokens, the tokenizer will add 10 pad tokens.
Output:
{'input_ids': tensor([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 887, 526, 451, 263, 13563, 7451, 29889],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 887, 526, 451, 263, 13563, 7451, 29889],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 887, 526, 451, 29889]]), 'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])}
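Note that you don’t have to pad to a fixed max_length. As the error message above suggests, you can also pass padding=True (or equivalently padding="longest") so that the tokenizer pads only up to the longest sequence in the batch:

# Pads only up to the longest sequence in this batch (8 tokens here)
input = tokenizer(prompts, padding=True, return_tensors="pt")
print(input)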
The alternative is to create a pad token from scratch. With Hugging Face’s transformers, we can do this with the method “add_special_tokens”.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
input = tokenizer(prompts, padding='max_length', max_length=20, return_tensors="pt")
print(input)
Output:
{'input_ids': tensor([[32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 1, 887, 526, 451, 263, 13563, 7451, 29889],
[32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 1, 887, 526, 451, 263, 13563, 7451, 29889],
[32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
32000, 32000, 32000, 32000, 32000, 1, 887, 526, 451, 29889]]), 'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])}
Don’t forget to resize the token embeddings of Llama 2 after you add the pad token to its vocabulary. I explained how to do it in this article:
Conclusion
Once you understand it, padding is very straightforward.
Using an existing unused special token for padding, or creating a pad token from scratch, are safe solutions that will work for almost all causal LLMs. But you should always have a look at how the tokenizer works. At the very least, you should be aware of the special tokens it already supports. For instance, not all LLMs have a UNK token, and some LLMs have a pad token that is not explicitly defined as a pad token in the vocabulary.
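For a quick check, you can print the special tokens that a tokenizer already defines, for instance:

# Inspect the special tokens already defined by the tokenizer
print(tokenizer.special_tokens_map)
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)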
As usual, if you have any questions, please drop a comment.