The Kaitchup – AI on a Budget

LLM Alignment: Searching for Optimal ORPO Hyperparameters

Higher learning rate and beta

Benjamin Marie
Dec 02, 2024

There are several approaches to aligning large language models (LLMs) with human preferences. Two weeks ago, I discussed DPO, one of the most widely used alignment methods:

DPO Full Training vs. LoRA: How Good is LoRA for DPO Training? (November 18, 2024)

DPO relies on a reference model that has already undergone supervised fine-tuning (SFT). However, the SFT process can be resource-intensive, particularly for large models.

By contrast, other preference optimization methods, such as ORPO, eliminate the need for a reference model. This not only reduces memory requirements but also, in theory, removes the need for a separate SFT stage. However, as we’ll explore, the absence of SFT in ORPO introduces its own challenges: since ORPO must simultaneously learn to generate answers to user prompts and to align with human preferences, hyperparameters such as the learning rate and beta become harder to calibrate. Striking the right balance between generating accurate answers and prioritizing human-preferred outputs can be difficult.
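
To make the role of beta concrete before the detailed explanation below, here is a minimal sketch (not the article’s code) of the ORPO objective as introduced in the ORPO paper: the loss combines the standard SFT negative log-likelihood on the chosen answer with a beta-weighted odds-ratio term that pushes the model’s odds of the chosen answer above those of the rejected one.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Minimal sketch of the ORPO objective (paper notation: L_SFT + lambda * L_OR).

    chosen_logps / rejected_logps: average per-token log-probabilities of the
    chosen and rejected completions under the policy being trained.
    beta plays the role of lambda: it weights the preference (odds-ratio) term
    against the plain SFT term.
    """
    # log odds(y) = log p(y) - log(1 - p(y)), computed in log space
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio loss: -log sigmoid of the log-odds difference
    ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # SFT loss: negative log-likelihood of the chosen answer
    sft_loss = -chosen_logps

    return (sft_loss + beta * ratio_loss).mean()
```

A small beta keeps training close to plain SFT on the chosen answers, while a larger beta emphasizes the preference signal; combined with the learning rate, this is exactly the balance the rest of the article tries to tune.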


In this article, I’ll first explain how ORPO works and what it learns: to understand the role of the beta hyperparameter, it is essential to first examine ORPO’s underlying mechanics. Next, I’ll discuss my process for tuning ORPO’s hyperparameters and validating them. This task is not straightforward because ORPO learns slowly, which makes it difficult to assess the impact of hyperparameters during short training runs.

For the experiments, I used SmolLM2 1.7B with LoRA. The training was conducted on an RTX 3090 24 GB GPU via RunPod (referral link).
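
For reference, a minimal sketch of such a setup with Hugging Face TRL’s ORPOTrainer and a LoRA adapter could look like the following; the dataset and all hyperparameter values here are placeholders for illustration, not the configuration used in the experiments or in the notebook.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "HuggingFaceTB/SmolLM2-1.7B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ORPO needs a padding token

# Example preference dataset with prompt/chosen/rejected pairs; a chat
# template is assumed to be available on the tokenizer.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = ORPOConfig(
    output_dir="./smollm2-1.7b-orpo-lora",
    beta=0.1,             # weight of the odds-ratio (preference) term
    learning_rate=8e-6,   # placeholder; finding better values is the point of this article
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    max_length=1024,
    max_prompt_length=512,
    logging_steps=10,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    peft_config=peft_config,
)
trainer.train()
```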

The notebook demonstrating ORPO training with LoRA is available here:

Get the notebook (#126)

ORPO: Aligning LLMs without a Reference Model

This post is for paid subscribers
