Wow. I remember when minibatching, batch normalization, and gradient accumulation were offered as performance improvements to lessen the number of weight updates in backpropagation. They were carried forward because that's how it's always been done.
Now we wait to see how model performance differs after the Transformers change.
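
For anyone unfamiliar, here is a minimal, self-contained sketch of what gradient accumulation does, using an assumed toy PyTorch model and random data rather than the Transformers Trainer itself: gradients from several micro-batches are summed before a single optimizer step, which is what reduces the number of weight updates.

```python
# Minimal sketch of gradient accumulation (illustrative toy model and data,
# not the Transformers Trainer implementation).
import torch
from torch import nn

model = nn.Linear(16, 4)                      # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4                        # micro-batches per weight update
micro_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))
                 for _ in range(8)]           # fake data

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    # Scale so the summed gradients approximate one large batch; getting this
    # normalization wrong is the kind of subtle detail a fix can target.
    (loss / accumulation_steps).backward()

    # Weights are updated only once every `accumulation_steps` micro-batches,
    # which is how accumulation lessens the number of weight updates.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```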
Glad to see it fixed!