GSPO vs GRPO: Reinforcement Learning for MoE Models
How Qwen’s GSPO Outperforms GRPO for Stable and Scalable MoE Training
While updating their largest Qwen3 model, Qwen3-235B-A32B, into separate instruct and thinking models, the Qwen team unveiled a new reinforcement learning (RL) method that appears to deliver notably better training stability, efficiency, and performance than GRPO, the RL technique popularized by the DeepSeek models, especially for MoE models.
This new technique, Group Sequence Policy Optimization (GSPO), has already been implemented in Hugging Face TRL.
In this article, I review GSPO. I'll highlight the key differences between GSPO and GRPO, and examine the motivations behind this evolution. To assess GSPO’s practical impact, I’ll also analyze the results reported by the Qwen team.
As we’ll see, enabling GSPO in Unsloth and TRL is remarkably simple, requiring only a single change to an existing GRPO training script.
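To make that "single change" concrete, here is a minimal sketch of a TRL GRPO training script with the one config line that switches the importance-sampling ratio from token level to sequence level, which is the core of GSPO. This is not the Qwen team's script: the `importance_sampling_level` argument reflects recent TRL releases and may differ in your version, and the reward function, model, and dataset names are illustrative placeholders.

```python
# Minimal sketch (assumed API): a TRL GRPO training script where switching to
# GSPO is a single config change. Argument names follow recent TRL releases
# and may differ in your version -- check GRPOConfig's documentation.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy dataset and reward: favor completions whose length is close to 200 characters.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="qwen2-0.5b-gspo",
    # The single change: compute importance-sampling weights per sequence (GSPO)
    # instead of per token (GRPO). Assumed available in recent TRL versions.
    importance_sampling_level="sequence",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Everything else in the script stays the same as a standard GRPO run, which is why adopting GSPO is mostly a matter of flipping this switch rather than rewriting your training pipeline.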
If you're unfamiliar with GRPO, read this previous article first to better understand how GSPO builds upon it.