Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)
Making a cheap Zephyr 7B
Hugging Face recently released Zephyr 7B, a chat model that outperforms Llama 2 70B. At the time of writing, Zephyr 7B is ranked first on the Open LLM Leaderboard.
How did they do it?
Zephyr 7B is based on Mistral 7B, fine-tuned with Direct Preference Optimization (DPO), a simple but effective alternative to Reinforcement Learning from Human Feedback (RLHF), the approach traditionally used to fine-tune instruct LLMs.
In this article, I first present DPO and highlight its advantages over RLHF. Then, we will see how to fine-tune Mistral 7B with DPO using Hugging Face's TRL library. I adapted the settings and hyperparameters Hugging Face used to train Zephyr 7B so that the training can run on consumer hardware.
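To give a sense of what this looks like in practice, here is a minimal sketch of a DPO training run with TRL. It is not the exact setup used later in this article: the dataset name is a placeholder, the hyperparameters are illustrative, and the `DPOTrainer` signature may differ across TRL versions. The rest of the article fills in these blanks with the actual dataset and the adapted hyperparameters.

```python
# Minimal DPO sketch with TRL (assumed API circa TRL 0.7.x; newer versions may differ).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# DPO expects a preference dataset with plain-text "prompt", "chosen", and "rejected" columns.
# "your-org/preference-pairs" is a placeholder; a real run maps its own data into this format.
dataset = load_dataset("your-org/preference-pairs", split="train")

training_args = TrainingArguments(
    output_dir="mistral-7b-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    max_steps=200,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model=None,    # if None, TRL keeps a frozen copy of the model as the reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,          # strength of the implicit KL penalty toward the reference model
    max_prompt_length=512,
    max_length=1024,
)
trainer.train()
```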
The notebook implementing DPO training for Mistral 7B is available here:
Last update: March 13th, 2024