7 Comments

On the issue of gradient accumulation, I learned this from a friend late this spring:

For example, take `bs=1` and `gradient_accumulation_steps=2`, with two sequences whose `attention_mask` values are:

- seq1: [1,1,1,1,1,1,1,1,0,0]

- seq2: [1,1,0,0,0,0,0,0,0,0]

In theory, when calculating the loss for backpropagation, seq1 should get 80% of the weight and seq2 should get 20%, since they contain 8 and 2 real tokens respectively. But the HF implementation simply computes (seq1_loss + seq2_loss)/2, which means that two sequences with different actual lengths are given the same weight.
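To make the arithmetic concrete, here is a minimal sketch in plain Python. The two masks are the ones above; the per-token loss values (1.0 for seq1, 3.0 for seq2) are made up purely for illustration, and nothing here calls HF code.

```python
# attention_mask values from the example above
mask1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
mask2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical per-token losses (assumed constant within each sequence)
tok_loss1 = [1.0] * 10
tok_loss2 = [3.0] * 10

def masked_sum(losses, mask):
    """Sum the losses of non-padding tokens only."""
    return sum(l for l, m in zip(losses, mask) if m)

n1, n2 = sum(mask1), sum(mask2)      # 8 and 2 real tokens
sum1 = masked_sum(tok_loss1, mask1)  # 8.0
sum2 = masked_sum(tok_loss2, mask2)  # 6.0

# Per-sequence mean, then a plain average over the two accumulation steps
seq1_loss, seq2_loss = sum1 / n1, sum2 / n2
naive_loss = (seq1_loss + seq2_loss) / 2

# What a single batch containing both sequences would compute:
# one mean over all 10 real tokens
token_weighted_loss = (sum1 + sum2) / (n1 + n2)

# Share of each sequence in the token-weighted mean
weights = [n1 / (n1 + n2), n2 / (n1 + n2)]

print(naive_loss, token_weighted_loss, weights)
```

With these made-up numbers the plain average gives 2.0 while the token-weighted mean gives 1.4, so the short sequence is over-represented in the first case.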

I don't know if this has anything to do with it, but the bias may be mitigated when `gradient_accumulation_steps` is large, so the effect is not so bad.


Interesting!

This is not so difficult to verify. I could create a dataset in which all the sequences have the same length and use it for training.


I talked to him about this again. He said that the HF implementation (total_loss/num_seqs), besides weighting every sequence equally regardless of how much padding it contains, also loses information about the sequence length (or the number of tokens). So he made a correction himself (total_loss/num_seqs * max_seq_len), which uses `max_seq_len` to reintroduce the "number of tokens" information so that the training results are closer to those of a pure large bs. In theory, it would be more accurate to count the actual number of tokens in each sequence rather than use `max_seq_len`, but it doesn't seem easy to record or pass that data across gradient accumulation steps.

So indeed if you experiment with a dataset of inherently equal length, you should expect to get the same results as a purely large bs.
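The idea of counting actual tokens can be sketched on a toy problem. This is only an illustration under stated assumptions, not the HF code or my friend's patch: the one-parameter model, the squared-error loss, the data, and all function names are hypothetical. It counts the non-padding tokens of the whole accumulation window up front, then compares the result with per-sequence averaging and with one large batch.

```python
# Toy model with a single weight w; per-token loss is (w * x - y) ** 2.
# Each micro-batch is (inputs, targets, attention_mask) with bs=1.

def token_grads(w, xs, ys, mask):
    """d(loss)/dw for every non-padding token."""
    return [2 * (w * x - y) * x for x, y, m in zip(xs, ys, mask) if m]

def large_batch_grad(w, batches):
    """Reference: one mean over all real tokens, as in a pure large batch."""
    grads = [g for xs, ys, m in batches for g in token_grads(w, xs, ys, m)]
    return sum(grads) / len(grads)

def per_sequence_avg_grad(w, batches):
    """Mean inside each micro-batch, then divided by the number of steps."""
    total = 0.0
    for xs, ys, m in batches:
        g = token_grads(w, xs, ys, m)
        total += (sum(g) / len(g)) / len(batches)
    return total

def token_normalized_grad(w, batches):
    """Sum inside each micro-batch, divided by the window's token count."""
    total_tokens = sum(sum(m) for _, _, m in batches)  # counted up front
    return sum(sum(token_grads(w, xs, ys, m)) / total_tokens
               for xs, ys, m in batches)

w = 0.5
unequal = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1]),
    ([5.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]),
]
equal = [
    ([1.0, 2.0], [1.0, 2.0], [1, 1]),
    ([3.0, 4.0], [3.0, 4.0], [1, 1]),
]

for name, data in [("unequal", unequal), ("equal", equal)]:
    print(name,
          large_batch_grad(w, data),
          per_sequence_avg_grad(w, data),
          token_normalized_grad(w, data))
```

On the unequal-length data the per-sequence average even has the opposite sign from the large-batch gradient, while the token-normalized version matches it. On the equal-length data all three agree, which is what the equal-length experiment should show. The catch is the one mentioned above: the token count for the whole window has to be known before the first backward pass.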


As always, looking forward to what you share!


Wow, HF accumulates the loss? I had assumed it accumulated the gradients.

This point seems very salient, then…


Not sure... I'm waiting for the professionals to get their hands on it. Daniel, the author of Unsloth, could reproduce the issue. He is working on it, but he is also very busy.


But I didn't confirm it afterwards; I'm just mentioning it here.
