Discussion about this post

baconnier loic

There is something I don't understand.

You keep the SFT model as the reference, but in the Argilla blog they keep the original base model.

What is the best way to use DPO, please?

Reference:

« 

Finally, they describe with a few lines of code, how you can configure a DPOTrainer class and run the train. Here is what you will need:

model, the fine-tuned version of your model (the result from SFT);

model_ref, the non-fine-tuned version of the model that's being fine-tuned. Usually it’s the original checkpoint you used before SFT.

training_args, same TrainerArguments class object present in transformers library, containing a list of training parameters such as per_device_train_batch_size, max_steps, gradient_accumulation_steps, learning_rate, evaluation_strategy, output_dir, etc.

beta, temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. »

https://argilla.io/blog/mantisnlp-rlhf-part-3/
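
For reference, here is roughly what the configuration they describe looks like in code. This is only a sketch: the checkpoint names and the tiny preference dataset are placeholders, and the argument names follow older releases of Hugging Face TRL's DPOTrainer (newer versions move beta into a DPOConfig), so details may differ in your setup.

```python
# Rough sketch of the DPOTrainer setup described in the Argilla post.
# Checkpoint names and the dataset are placeholders, not from the post.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_checkpoint = "my-org/my-sft-model"    # placeholder: the result of SFT
base_checkpoint = "my-org/my-base-model"  # placeholder: the checkpoint used before SFT

model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)       # policy being optimized
model_ref = AutoModelForCausalLM.from_pretrained(base_checkpoint)  # frozen reference, as in the quote
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Toy preference data: each row has a prompt plus a preferred and a rejected answer.
preference_dataset = Dataset.from_dict({
    "prompt": ["What does DPO stand for?"],
    "chosen": ["Direct Preference Optimization."],
    "rejected": ["I am not sure."],
})

training_args = TrainingArguments(
    output_dir="dpo-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    max_steps=1000,
    remove_unused_columns=False,  # keep the raw prompt/chosen/rejected columns for DPOTrainer
)

trainer = DPOTrainer(
    model=model,
    ref_model=model_ref,
    args=training_args,
    beta=0.1,  # DPO temperature, typically 0.1 to 0.5
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

So my question is really just which checkpoint to pass as ref_model: the SFT model, as in your post, or the base checkpoint, as above.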

