There are several approaches to aligning large language models (LLMs) with human preferences. Two weeks ago, I discussed DPO, one of the most widely used alignment methods:
DPO relies on a reference model that has already undergone supervised fine-tuning (SFT). However, the SFT process can be resource-intensive, particularly for large models.
By contrast, other preference optimization methods, such as ORPO, eliminate the need for a reference model. This not only reduces memory requirements but also, in theory, removes the need for a separate SFT stage. However, as we'll see, skipping SFT introduces its own challenges. Because ORPO must simultaneously learn to generate answers to user prompts and to align with human preferences, hyperparameters such as the learning rate and beta become harder to calibrate: striking the right balance between producing accurate answers and favoring human-preferred outputs can be difficult.
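Concretely, the ORPO objective adds an odds-ratio preference term, weighted by beta (λ in the paper), to the standard SFT loss on the chosen answer. The snippet below is a minimal PyTorch sketch of that combination; the function name and tensor conventions are mine for illustration, not TRL's exact implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, beta=0.1):
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected responses under the policy model (both < 0).
    # chosen_nll: the usual SFT negative log-likelihood on the chosen answer.
    # Odds of a sequence: p / (1 - p), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: reward the model when the chosen answer's odds
    # exceed the rejected answer's odds.
    ratio = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Total loss: SFT loss minus the beta-weighted preference term.
    return chosen_nll - beta * ratio.mean()
```

Beta therefore controls how much the preference term pulls against plain language modeling on the chosen answers, which is why it interacts so strongly with the learning rate.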
In this article, I’ll first explain how ORPO works and what it learns; understanding the role of the beta hyperparameter requires a look at the method’s underlying mechanics. Then, I’ll describe my process for tuning and validating ORPO’s hyperparameters. This is not straightforward: ORPO learns slowly, which makes it difficult to assess the impact of a hyperparameter from short training runs.
For the experiments, I used SmolLM2 1.7B with LoRA. The training was conducted on an RTX 3090 24 GB GPU via RunPod (referral link).
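For reference, here is a minimal sketch of this kind of setup with TRL's ORPOTrainer and a LoRA adapter. The dataset, LoRA targets, and hyperparameter values are placeholders, not the exact configuration used in my experiments (that one is in the notebook linked below).

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "HuggingFaceTB/SmolLM2-1.7B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Preference dataset with "prompt", "chosen", and "rejected" fields (placeholder).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# LoRA adapter on the attention projections (illustrative choice).
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = ORPOConfig(
    output_dir="./smollm2-orpo-lora",
    beta=0.1,                      # weight of the odds-ratio (preference) term
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,    # "tokenizer=" in older TRL versions
    peft_config=peft_config,
)
trainer.train()
```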
The notebook demonstrating ORPO training with LoRA is available here: