Discussion about this post

Remixa:
On the issue of gradient accumulation, I learned this from a friend late this spring:

For example, with `bs=1` and `gradient_accumulation_steps=2`, suppose the two micro-batches have these `attention_mask`s:

- seq1: [1,1,1,1,1,1,1,1,0,0]

- seq2: [1,1,0,0,0,0,0,0,0,0]

In theory, when computing the loss for backpropagation, seq1 should get 80% of the weight and seq2 20% (proportional to their real token counts). But the HF implementation simply computes (seq1_loss + seq2_loss)/2, so two sequences with different actual lengths are given the same weight.

I don't know whether this is related, but the bias is probably mitigated when `gradient_accumulation_steps` is large, so the effect may not be too bad in practice. A sketch of the two weighting schemes follows.
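A minimal sketch of the difference in plain PyTorch (not Hugging Face's actual code; the per-token loss values are made up just to make the gap visible):

```python
import torch

# Toy numbers matching the example above: bs=1, gradient_accumulation_steps=2.
mask1 = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=torch.float)  # 8 real tokens
mask2 = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.float)  # 2 real tokens
loss1 = torch.full_like(mask1, 1.0)  # made-up per-token losses for seq1
loss2 = torch.full_like(mask2, 3.0)  # made-up per-token losses for seq2

# Equal weighting (what the comment describes): each micro-batch's mean
# loss counts the same, regardless of how many real tokens it contains.
mean1 = (loss1 * mask1).sum() / mask1.sum()
mean2 = (loss2 * mask2).sum() / mask2.sum()
equal_weighted = (mean1 + mean2) / 2  # (1.0 + 3.0) / 2 = 2.0

# Token weighting: divide the summed loss by the total number of real
# tokens (8 + 2), so seq1 contributes 80% and seq2 contributes 20%.
token_weighted = ((loss1 * mask1).sum() + (loss2 * mask2).sum()) / (mask1.sum() + mask2.sum())
# (8 * 1.0 + 2 * 3.0) / 10 = 1.4

print(equal_weighted.item(), token_weighted.item())  # 2.0 vs. 1.4
```

With these made-up numbers the equally weighted loss is 2.0 while the token-weighted loss is 1.4, which is the discrepancy described above.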

