GSPO vs GRPO: Reinforcement Learning for MoE Models
How Qwen’s GSPO Outperforms GRPO for Stable and Scalable MoE Training
While updating their largest Qwen3 model, Qwen3-235B-A32B, into separate instruct and thinking models, the Qwen team unveiled a new reinforcement learning (RL) method that appears to deliver notably better training stability, efficiency, and performance than GRPO, the RL technique popularized by the DeepSeek models, especially for MoE models.
This new technique, Group Sequence Policy Optimization (GSPO), has already been implemented in Hugging Face TRL.
In this article, I review GSPO. I'll highlight the key differences between GSPO and GRPO, and examine the motivations behind this evolution. To assess GSPO’s practical impact, I’ll also analyze the results reported by the Qwen team.
As we’ll see, enabling GSPO in Unsloth and TRL is remarkably simple, requiring only a single change to an existing GRPO training script.
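To make that "single change" concrete, here is a minimal sketch of a TRL GRPO training script with the one config line that switches the importance-sampling ratio from token level to sequence level, which is the core of GSPO. This is not the Qwen team's script: the `importance_sampling_level` argument reflects recent TRL releases and may differ in your version, and the reward function, model, and dataset names are illustrative placeholders.

```python
# Minimal sketch (assumed API): a TRL GRPO training script where switching to
# GSPO is a single config change. Argument names follow recent TRL releases
# and may differ in your version -- check GRPOConfig's documentation.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy dataset and reward: favor completions whose length is close to 200 characters.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="qwen2-0.5b-gspo",
    # The single change: compute importance-sampling weights per sequence (GSPO)
    # instead of per token (GRPO). Assumed available in recent TRL versions.
    importance_sampling_level="sequence",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Everything else in the script stays the same as a standard GRPO run, which is why adopting GSPO is mostly a matter of flipping this switch rather than rewriting your training pipeline.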
If you're unfamiliar with GRPO, read this previous article first to better understand how GSPO builds upon it.