Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)
Making a cheap Zephyr 7B
Hugging Face recently released Zephyr 7B, a chat model that outperforms Llama 2 70B. At the time of writing, Zephyr 7B is ranked first on the Open LLM Leaderboard.
How did they do it?
Zephyr 7B is based on Mistral 7B, fine-tuned with Direct Preference Optimization (DPO), a simple but effective alternative to Reinforcement Learning from Human Feedback (RLHF), the approach traditionally used to fine-tune instruct LLMs.
In this article, I first present DPO and highlight its advantages over RLHF. Then, we will see how to fine-tune Mistral 7B with DPO using Hugging Face's TRL library. I adapted the settings and hyperparameters Hugging Face used to train Zephyr 7B so that the training can run on consumer hardware.
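To give a sense of what this looks like in practice, here is a minimal sketch of a DPO training run with TRL. It is not the exact setup used later in this article: the dataset name is a placeholder, the hyperparameters are illustrative, and the `DPOTrainer` signature may differ across TRL versions. The rest of the article fills in these blanks with the actual dataset and the adapted hyperparameters.

```python
# Minimal DPO sketch with TRL (assumed API circa TRL 0.7.x; newer versions may differ).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# DPO expects a preference dataset with plain-text "prompt", "chosen", and "rejected" columns.
# "your-org/preference-pairs" is a placeholder; a real run maps its own data into this format.
dataset = load_dataset("your-org/preference-pairs", split="train")

training_args = TrainingArguments(
    output_dir="mistral-7b-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    max_steps=200,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model=None,    # if None, TRL keeps a frozen copy of the model as the reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,          # strength of the implicit KL penalty toward the reference model
    max_prompt_length=512,
    max_length=1024,
)
trainer.train()
```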
The notebook implementing DPO training for Mistral 7B is available here:
Last update: March 13th, 2024