DPO Full Training vs. LoRA: How Good is LoRA for DPO Training?

One model, two adapters

Benjamin Marie
Nov 18, 2024

There are various methods to align LLMs with human preferences. Beyond reinforcement learning from human feedback (RLHF), which is often seen as too resource-intensive to apply consistently to newly fine-tuned models, Direct Preference Optimization (DPO) is one of the most popular alternatives for LLM alignment. I explain DPO in detail in this article:

Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO) (October 26, 2023)

Although DPO is significantly more cost-effective than RLHF, it still requires a reference model in addition to the "policy" model (i.e., the model being actively trained), unlike reference-free methods like ORPO. This means both models must be loaded into GPU memory simultaneously, which can be challenging for single-GPU configurations, especially with large models.
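
As a rough back-of-the-envelope illustration (assuming, for example, a 7B-parameter model stored in bfloat16): the policy weights alone take about 14 GB, the frozen reference another 14 GB, and full training also needs gradients and optimizer states for the policy, so the total quickly exceeds what a single consumer GPU can hold.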

A more memory-efficient approach would be to use LoRA for DPO training. Instead of training the entire model, we freeze its parameters and train a small adapter. This method becomes even more efficient if both the policy and reference models share the same base model; in that case, we load the base model once, then load a frozen adapter for the reference model and a trainable adapter for the policy model, significantly reducing memory requirements.
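
To make this concrete, here is a minimal sketch of the multi-adapter setup with TRL and PEFT. The model path, adapter names, dataset, and hyperparameters are placeholders, and depending on your TRL version the adapter-name options may be passed to DPOTrainer directly rather than through DPOConfig:

```python
# A minimal sketch of DPO with one frozen base model and two LoRA adapters:
# a trainable "policy" adapter and a frozen "reference" adapter.
# The model path, adapter path, dataset, and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_name = "Qwen/Qwen2.5-7B"        # base model
sft_adapter = "./qwen2.5-sft-lora"   # LoRA adapter produced during SFT

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Attach the SFT adapter twice under different names: "policy" is trainable,
# "reference" stays frozen and plays the role of the DPO reference model.
model = PeftModel.from_pretrained(
    base, sft_adapter, adapter_name="policy", is_trainable=True
)
model.load_adapter(sft_adapter, adapter_name="reference")

# A preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="qwen2.5-dpo-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
    model_adapter_name="policy",     # adapter that receives gradients
    ref_adapter_name="reference",    # frozen adapter used as the reference
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                  # no separate reference model is loaded
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Because only the adapter weights receive gradients, the optimizer states are tiny compared to full training, which is where most of the memory savings come from.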

However, in my opinion, the effect of LoRA on DPO's performance remains understudied. While LoRA can closely approximate full training, its performance largely depends on the task.

In this article, I train an LLM, Qwen2.5, with DPO using LoRA and compare its learning curves and costs to those of full training. For full training, neither the reference nor the policy model uses adapters. I also provide a step-by-step guide on using adapters with both the reference and policy models.

The notebook with instructions on running DPO training with adapters is available here:

Get the notebook (#122)

Full DPO Training for Qwen2.5

We need an instruct model that has already been fine-tuned on a conversational dataset. This is the supervised fine-tuning (SFT) step, where the model learns the specific task. This SFT model will serve as the starting point for DPO training and as the reference model during DPO.

For SFT, we’ve previously covered the process for models like Qwen2.5 and Llama 3.2 in these articles:

Qwen2.5 QLoRA, LoRA, and Full Fine-tuning on Your Computer (September 23, 2024)

Fine-Tuning Meta's Llama 3.2 1B & 3B Models on Budget GPUs (September 30, 2024)

The notebooks in these articles provide steps to obtain an SFT model.
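
As a point of comparison, a full-training setup loads the SFT checkpoint twice, once as the trainable policy and once as the frozen reference, with no PEFT configuration. A minimal sketch, with placeholder paths and hyperparameters:

```python
# Sketch of full DPO training (no adapters): the SFT checkpoint is loaded twice,
# once as the trainable policy and once as the frozen reference model.
# The checkpoint path, dataset, and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "./qwen2.5-sft"  # SFT model obtained with the notebooks above

tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
policy = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16)
reference = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=policy,
    ref_model=reference,             # a second full copy of the model in memory
    args=DPOConfig(
        output_dir="qwen2.5-dpo-full",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        bf16=True,
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```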
