The Kaitchup – AI on a Budget
Fine-tune Better Chat Models with Distilled Identity Preference Optimization (IPO)

Mistral 7B aligned with IPO

Benjamin Marie
Dec 07, 2023

To become chat models, pre-trained large language models (LLMs) are fine-tuned on large datasets of instructions/questions paired with expected answers. While this simple fine-tuning yields convincing chat models, their answers may still be incoherent, biased, unethical, and unsafe from a human perspective. This is why we usually perform an additional training step to better align the LLM with human preferences.

This alignment can be done with reinforcement learning from human feedback (RLHF). As demonstrated by OpenAI and the success of ChatGPT, RLHF can yield state-of-the-art chat models. However, RLHF is expensive to run: it requires large datasets annotated by humans and the training of several auxiliary models (a reference model and a reward model).

As a simpler and cheaper alternative to RLHF, direct preference optimization (DPO) has recently been applied with success to align LLMs, such as Hugging Face’s Zephyr and Intel’s Neural Chat.

Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)
Benjamin Marie, PhD · October 26, 2023

Zephyr 7B Beta: A Good Teacher Is All You Need
Benjamin Marie, PhD · November 6, 2023

In this article, based on work by Google DeepMind, we will see that while RLHF and DPO perform well at aligning LLMs, they are far from optimal given the datasets used for training. DeepMind also demonstrates why DPO is prone to overfitting. I'll explain, in plain English, how the alternative proposed by DeepMind, the identity preference optimization (IPO) objective, is simpler and better designed to learn from the training data than RLHF and DPO.
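
To make that concrete, here is a minimal sketch of the IPO objective in PyTorch, written from the paper's description; the function name and the default value of tau are illustrative, not taken from the article. Where DPO keeps widening the gap between the policy's and the reference model's log-likelihood ratios for the preferred and rejected answers, IPO regresses that gap toward a fixed target of 1/(2τ), which is what limits overfitting.

import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    # Log-likelihood ratios of the policy against the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO maximizes a log-sigmoid of this gap, so its optimum is an unbounded gap;
    # IPO instead regresses the gap toward the fixed target 1/(2*tau)
    gap = chosen_logratio - rejected_logratio
    return torch.mean((gap - 1.0 / (2.0 * tau)) ** 2)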


In the following sections, I show how to apply IPO with a training recipe close to the one Hugging Face used to train the Zephyr models.
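
To give an idea of what such a recipe can look like before diving into the details, below is a minimal sketch built on TRL's DPOTrainer, which also implements the IPO loss through its loss_type argument. It assumes a TRL release contemporary with this article, where beta, loss_type, and tokenizer are passed to the trainer directly (newer releases move these options into DPOConfig); the dataset name is a placeholder for any preference dataset with prompt, chosen, and rejected text columns, and the hyperparameters are only illustrative.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder: any preference dataset with "prompt", "chosen", and "rejected" text columns
train_dataset = load_dataset("my_org/my_preference_dataset", split="train")

training_args = TrainingArguments(
    output_dir="./mistral-7b-ipo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,  # the preference data collator needs the raw columns
)

trainer = DPOTrainer(
    model,
    ref_model=None,      # TRL builds the frozen reference model from a copy of `model`
    args=training_args,
    beta=0.1,            # plays the role of tau in the IPO objective
    loss_type="ipo",     # switch from the default DPO loss to the IPO loss
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()

Note that beta plays the role of τ here: a larger value shrinks the target margin 1/(2β) and keeps the fine-tuned policy closer to the reference model.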

I have also implemented a notebook demonstrating IPO training for Mistral 7B. You can find it here:

Get the notebook (#31)

Last update: April 8th, 2024
