There are several methods for post-training large language models (LLMs) to enhance their ability to follow instructions and align with human preferences. Among the most popular approaches are Reinforcement Learning from Human Feedback (RLHF), which typically relies on Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO), a simpler yet widely used alternative.
Group Relative Policy Optimization (GRPO), introduced a year ago with DeepSeekMath, has also emerged as a promising alternative. GRPO has demonstrated remarkable efficiency and has been successfully used to train state-of-the-art LLMs such as Qwen2.5 and DeepSeek-R1. It is now implemented in Hugging Face TRL and Unsloth.
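To give a concrete sense of what GRPO training looks like in code, here is a minimal sketch using TRL's `GRPOTrainer`. The model, dataset, reward function, and hyperparameters are illustrative placeholders rather than the configuration used later in this article, and the exact arguments may vary slightly across TRL versions.

```python
# Minimal GRPO training sketch with Hugging Face TRL.
# Assumptions: a recent TRL version exposing GRPOTrainer/GRPOConfig, and a
# dataset with a "prompt" column. The model, reward, and hyperparameters are
# toy placeholders chosen for illustration only.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: favor shorter completions. Real setups use task-specific rewards
# (correctness checks, format checks, etc.), discussed later in this article.
def length_reward(completions, **kwargs):
    return [-float(len(c)) / 100.0 for c in completions]

# Any dataset with a "prompt" column works; this one is just an example.
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="grpo-sketch",
    per_device_train_batch_size=8,   # must be divisible by num_generations
    num_generations=8,               # completions sampled per prompt (the "group")
    max_completion_length=256,
    learning_rate=1e-6,
    logging_steps=10,
    max_steps=100,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small model just for the sketch
    reward_funcs=length_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```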
In this article, we will:
- Explore how GRPO works and why it is a strong alternative to RLHF and DPO.
- Train a 14B model, Qwen2.5 Instruct, using GRPO with Unsloth on a single consumer GPU (a setup sketch follows this list).
- Discuss the critical role of reward functions.
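As a preview of the Unsloth side of that setup, here is a rough sketch of loading Qwen2.5 14B Instruct in 4-bit (QLoRA) with a LoRA adapter. The repository name, sequence length, and LoRA hyperparameters are assumptions for illustration; the notebook uses its own configuration.

```python
# Sketch: load Qwen2.5 14B Instruct with Unsloth in 4-bit (QLoRA) and attach a
# LoRA adapter. The repository name, sequence length, and LoRA hyperparameters
# below are illustrative assumptions, not the notebook's exact settings.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-14B-Instruct",  # assumed model repository
    max_seq_length=1024,
    load_in_4bit=True,      # QLoRA: 4-bit base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                   # LoRA rank
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # reduces memory on a single GPU
    random_state=42,
)

# The resulting model and tokenizer can then be passed to a GRPO trainer
# (e.g., TRL's GRPOTrainer, as in the earlier sketch) together with the
# reward functions discussed later.
```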
The following notebook provides a step-by-step guide on training an LLM with GRPO using QLoRA/LoRA.