The Kaitchup – AI on a Budget

GSPO vs GRPO: Reinforcement Learning for MoE Models

How Qwen’s GSPO Outperforms GRPO for Stable and Scalable MoE Training

Benjamin Marie
Aug 04, 2025

While updating their largest Qwen3 model, Qwen3-235B-A32B, into separate instruct and thinking models, the Qwen team unveiled a new reinforcement learning (RL) method that demonstrates notably better training stability, efficiency, and performance than GRPO, the RL technique popularized by the DeepSeek models, especially for MoE models.

This new technique, Group Sequence Policy Optimization (GSPO), has already been implemented in Hugging Face TRL.


In this article, I review GSPO. I'll highlight the key differences between GSPO and GRPO, and examine the motivations behind this evolution. To assess GSPO’s practical impact, I’ll also analyze the results reported by the Qwen team.

As we’ll see, enabling GSPO in Unsloth and TRL is remarkably simple, requiring only a single change to an existing GRPO training script, such as this one:

Get the notebook (#143)
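To give a sense of how small that change is, here is a minimal sketch of a GRPO-style TRL training script switched over to GSPO. It assumes a recent TRL release that exposes the importance_sampling_level argument on GRPOConfig (TRL's GSPO support switches the importance ratios from token level to sequence level); the model name, dataset, and reward function below are illustrative placeholders, not the exact setup used in the notebook.

```python
# Minimal sketch: switching an existing TRL GRPO script to GSPO.
# Assumes a TRL version whose GRPOConfig exposes `importance_sampling_level`.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Placeholder reward: favor completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen3-0.6B-GSPO",
    # The single change: "token" (default) keeps GRPO's token-level ratios,
    # "sequence" switches to GSPO's sequence-level ratios.
    importance_sampling_level="sequence",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Everything else in the script (reward functions, dataset, generation settings) stays as it was for GRPO; only the importance-sampling level changes.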

If you're unfamiliar with GRPO, read this previous article first to better understand how GSPO builds upon it.

GRPO: Train LLMs with DeepSeek-R1's Reinforcement Learning Method
Benjamin Marie · Feb 10

GSPO Explained

This post is for paid subscribers
