There are various methods to align LLMs with human preferences. Reinforcement learning from human feedback (RLHF) is often seen as too resource-intensive to apply systematically to newly fine-tuned models, and Direct Preference Optimization (DPO) has become one of the most popular alternatives for LLM alignment. I explain DPO in detail in this article:
Although DPO is significantly more cost-effective than RLHF, it still requires a reference model in addition to the "policy" model (i.e., the model being actively trained), unlike reference-free methods like ORPO. This means both models must be loaded into GPU memory simultaneously, which can be challenging for single-GPU configurations, especially with large models.
A more memory-efficient approach is to use LoRA for DPO training. Instead of training the entire model, we freeze its parameters and train a small adapter. This becomes even more efficient when the policy and reference models share the same base model: we load the base model once, then attach a frozen adapter for the reference model and a trainable adapter for the policy model, which significantly reduces memory requirements.
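To make this concrete, here is a minimal sketch of the shared-base-model setup with TRL and PEFT. The adapter path `./sft_adapter`, the base model name, the preference dataset, and the hyperparameters are placeholders, not the exact values used in this article. Note that `model_adapter_name`/`ref_adapter_name` live in `DPOConfig` in recent TRL releases (older versions take them as `DPOTrainer` arguments), and `processing_class` may be `tokenizer` in older releases.

```python
# Sketch: DPO with two LoRA adapters on a single shared base model (TRL + PEFT).
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_name = "Qwen/Qwen2.5-1.5B"   # base model (placeholder)
sft_adapter = "./sft_adapter"     # LoRA adapter produced by the SFT step (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base_name)

# Load the base model once.
model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Attach the SFT adapter twice: a trainable copy ("train") acts as the policy,
# and a frozen copy ("reference") is used to compute reference log-probabilities.
model = PeftModel.from_pretrained(model, sft_adapter, is_trainable=True, adapter_name="train")
model.load_adapter(sft_adapter, adapter_name="reference")

# Example preference dataset with "chosen"/"rejected" pairs.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="./dpo_lora",
    per_device_train_batch_size=2,
    learning_rate=5e-6,
    model_adapter_name="train",    # adapter that gets trained (the policy)
    ref_adapter_name="reference",  # frozen adapter used as the reference
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                # no second full model is loaded into memory
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

The key point is that `ref_model=None`: the reference log-probabilities come from the same base weights with the frozen adapter active, so only one copy of the base model sits in GPU memory.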
However, in my opinion, the effect of LoRA on DPO's performance remains understudied. While LoRA can closely approximate full training, how well it does so largely depends on the task.
In this article, I train an LLM, Qwen2.5, with DPO using LoRA and compare its learning curves and costs to those of full training. For full training, neither the reference nor the policy model uses adapters. I also provide a step-by-step guide on using adapters with both the reference and policy models.
The notebook with instructions on running DPO training with adapters is available here:
Full DPO Training for Qwen2.5
We need an instruct model that has already been fine-tuned on a conversational dataset. This is the supervised fine-tuning (SFT) step, where the model learns the target task. This SFT model serves both as the starting point for DPO training and as the reference model in DPO.
For SFT, we’ve previously covered the process for models like Qwen2.5 and Llama 3.2 in these articles:
The notebooks in these articles provide steps to obtain an SFT model.
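With an SFT checkpoint in hand, full DPO training loads it twice: once as the trainable policy and once as the frozen reference, so both full copies must fit in GPU memory. The sketch below illustrates this with TRL; the checkpoint path `./qwen2.5-sft`, the dataset, and the hyperparameters are placeholders (and `processing_class` may be `tokenizer` in older TRL releases).

```python
# Sketch: full (no-adapter) DPO training from an SFT checkpoint.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "./qwen2.5-sft"  # SFT model from the previous step (placeholder)

tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Two full copies of the SFT model: the policy (trained) and the reference (frozen).
policy = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16)
reference = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16)

# Example preference dataset with "chosen"/"rejected" pairs.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="./dpo_full",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,
    bf16=True,
)

trainer = DPOTrainer(
    model=policy,          # the "policy" model being trained
    ref_model=reference,   # frozen reference model
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```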