2 Comments
Matt:

Ack. Maybe I should downgrade TRL while we wait for them to correct it.
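For anyone tempted to pin anyway, a minimal sketch (the version number is a placeholder, not the actual last-good release; check the TRL changelog first):

```python
# Pin TRL to a known-good release before the behavior change
# (X.Y.Z is a placeholder; verify against the TRL release notes):
#   pip install "trl==X.Y.Z"
#
# Then confirm which version is actually installed at runtime:
import trl

print(trl.__version__)
```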

Benjamin Marie:

The issue with downgrading is that it breaks a lot of things, including GRPO.
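For context, GRPOTrainer only exists in recent TRL releases, so pinning an older version loses it entirely. A minimal sketch of the kind of code that would break, following the pattern of TRL's documented GRPOTrainer API (model name, dataset, and reward function are illustrative):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: prefer completions close to 20 characters (illustrative only).
def reward_len(completions, **kwargs):
    return [-abs(20 - len(c)) for c in completions]

# Prompt-only dataset in TRL's standard format.
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # any causal LM; name is illustrative
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```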

I'm also not sure that TRL's current behavior is unintended...

TRL might just be becoming a post-training framework for instruct models, i.e., moving away from its original purpose, which was fine-tuning base models.

Interestingly, Nanotron, which is a framework for pre-training, is now implementing SFT...

https://github.com/huggingface/nanotron/pull/295
