ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

A much cheaper alignment method that performs as well as DPO

Benjamin Marie
Apr 08, 2024

[Illustration generated with DALL-E]

There are now many methods to align large language models (LLMs) with human preferences. Reinforcement learning with human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably cheaper than RLHF as they don’t need a reward model.

Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #3: Reinforcement Learning with Human Feedback (September 21, 2023)

Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO) (October 26, 2023)

Fine-tune Better Chat Models with Distilled Identity Preference Optimization (IPO) (December 7, 2023)

While DPO and IPO are cheaper, they still require training two different models: one model for the supervised fine-tuning (SFT) step, i.e., teaching the model to answer instructions, and then a second model aligned with human preferences, which uses the SFT model both for initialization and as a reference.

ORPO is yet another new method for LLM alignment, but this one doesn't even need the SFT model. With ORPO, the LLM jointly learns to answer instructions and to match human preferences.
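Concretely, the ORPO objective described in the original paper (Hong et al., 2024) adds a weighted odds-ratio term to the standard negative log-likelihood (SFT) loss computed on the chosen answer. A sketch of the objective, with λ denoting the weight of the odds-ratio term:

```latex
% ORPO objective (Hong et al., 2024): NLL on the chosen answer plus a weighted odds-ratio term
\mathcal{L}_{\mathrm{ORPO}}
  = \mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \mathcal{L}_{\mathrm{SFT}} + \lambda \, \mathcal{L}_{\mathrm{OR}} \right],
\qquad
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```

Here y_w is the chosen (preferred) answer, y_l the rejected one, and σ the sigmoid. The NLL term teaches the model to answer instructions while the odds-ratio term penalizes the rejected answer relative to the chosen one, so no separate SFT model or frozen reference model is needed.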


In this article, I explain ORPO and review its performance. I show how to use it to turn Mistral 7B into a chat model using consumer hardware. ORPO is indeed cheaper than DPO and IPO as it requires less GPU memory.

The notebook showing how to align Mistral 7B (or any other LLM) with ORPO is available here:

Get the notebook (#58)
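As a rough preview of what such a training run looks like, here is a minimal sketch, assuming TRL's ORPOTrainer and ORPOConfig with QLoRA so that Mistral 7B fits on a single consumer GPU. The model, dataset, and hyperparameters below are illustrative choices, not necessarily those used in the notebook, and argument names can vary slightly between TRL versions:

```python
# Minimal sketch: ORPO fine-tuning of Mistral 7B with QLoRA via TRL.
# Illustrative settings only; argument names may differ across TRL versions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

# 4-bit quantization so the 7B model fits in consumer GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)

# LoRA adapters: only these small matrices are trained
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Any preference dataset with "prompt", "chosen", and "rejected" text columns works;
# this one is only an example choice.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

def to_text(example):
    # Keep only the assistant answers as plain strings (a chat template
    # would normally be applied here for a proper chat model).
    example["chosen"] = example["chosen"][-1]["content"]
    example["rejected"] = example["rejected"][-1]["content"]
    return example

dataset = dataset.map(to_text)

training_args = ORPOConfig(
    output_dir="mistral-7b-orpo",
    beta=0.1,                        # weight of the odds-ratio term (lambda in the paper)
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,      # newer TRL versions take processing_class instead
    peft_config=peft_config,
)
trainer.train()
```

Because ORPO does not need a frozen reference model, only one copy of the model (plus the LoRA adapters) has to sit in GPU memory during training, which is where most of the savings over DPO and IPO come from.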

The notebook also includes an example of ORPO training with GaLore.

GaLore: Full Fine-tuning on Your GPU (April 4, 2024)
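For the GaLore variant, one option (a sketch, assuming a recent transformers release with built-in GaLore support and the galore-torch package installed) is to switch the optimizer fields of ORPOConfig, which inherits from transformers' TrainingArguments. GaLore replaces LoRA with full fine-tuning of a non-quantized model, so the quantization config and peft_config from the sketch above would be dropped:

```python
# Sketch only: request GaLore through the optimizer options that ORPOConfig
# inherits from TrainingArguments. Requires the galore-torch package; option
# names may differ across transformers versions.
training_args = ORPOConfig(
    output_dir="mistral-7b-orpo-galore",
    optim="galore_adamw_8bit",               # GaLore optimizer variant
    optim_target_modules=["attn", "mlp"],    # project gradients of these modules to low rank
    beta=0.1,
    learning_rate=8e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_length=1024,
    max_prompt_length=512,
)
```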

This post is for paid subscribers
