Hey Ben,
A bit new to fine-tuning, so a silly question: in this notebook, you are training on the prompt tokens too, since we aren't masking them. Is this correct? I've seen a few examples where people do mask the instructions and only propagate the loss on the response tokens. What's your take on which is preferable? Also, if we're not masking prompt tokens, isn't it the same as continued pre-training?
Very good question!
We don't have clear evidence on whether or not the prompt tokens should be masked. You will find some papers with positive results and others with negative results.
I believe this simply depends on the dataset used for fine-tuning and the format of the prompt itself.
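To make the two options concrete, here is a minimal sketch of how the labels differ with and without prompt masking. It is not the notebook's code: the helper `build_labels` and the token ids are hypothetical, and it assumes the common convention that label positions set to -100 are skipped by the loss (the default `ignore_index` of PyTorch's `CrossEntropyLoss`).

```python
# Label value assumed to be ignored by the loss
# (default ignore_index of torch.nn.CrossEntropyLoss).
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one prompt/response training example.

    Hypothetical helper for illustration, not part of any library.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        # Prompt positions contribute nothing to the loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        # Every token, prompt included, is a prediction target.
        labels = list(input_ids)
    return input_ids, labels


# Made-up token ids, for illustration only.
prompt_ids = [101, 7, 8, 9]
response_ids = [42, 43, 2]

_, masked = build_labels(prompt_ids, response_ids, mask_prompt=True)
_, unmasked = build_labels(prompt_ids, response_ids, mask_prompt=False)

print(masked)    # [-100, -100, -100, -100, 42, 43, 2]
print(unmasked)  # [101, 7, 8, 9, 42, 43, 2]
```

With `mask_prompt=False` the loss is the plain next-token objective over the whole sequence, which is the case the question compares to continued pre-training; the difference then lies in the data and its prompt format rather than in the loss itself.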