Qwen3 Instruct "Thinks": When Token Budgets Silently Skew Benchmark Scores

Replicating Qwen3-30B-A3B-Instruct-2507 shows accuracy riding on verbosity, not just GSPO

Benjamin Marie
Aug 18, 2025

A few weeks ago, the Qwen team shipped updated Qwen3 models with two goals:

  • Improve overall quality, and

  • Split the lineup into separate Instruct and Thinking variants.

This separation simplifies usage: the chat template is simpler, and each variant's behavior is predictable. Instruct models answer user prompts directly with human-readable, informative sequences of tokens. Thinking models first produce a reasoning trace, which then guides the final answer.
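For the original hybrid checkpoint, the switch between these two behaviors lives in the chat template. Here is a minimal sketch, assuming the `enable_thinking` flag documented on the Qwen3 model cards and the Hugging Face `transformers` tokenizer API; the 2507 checkpoints drop this switch and always behave one way:

```python
from transformers import AutoTokenizer

# Hybrid checkpoint: the chat template exposes an enable_thinking flag.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")

messages = [{"role": "user", "content": "What is 13 * 17?"}]

# Thinking "on": the template lets the model open a <think> block before answering.
prompt_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking "off": the template closes the reasoning block immediately,
# so the model answers directly.
prompt_no_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(prompt_no_think)
```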

The Qwen team started by updating the Qwen3 MoE family, the models that likely benefit most from GSPO reinforcement learning (as discussed in a previous article). I summarized their published MoE benchmarks, comparing the updated and original versions, as follows:

The gains are both impressive and puzzling. The Instruct version of Qwen3-30B-A3B-2507 improves by roughly 40 accuracy points over the previous non-thinking Qwen3 on AIME25, a math benchmark that demands strong reasoning. As far as I know, this would make it the first Instruct model to reach such a high score.

Is GSPO alone that good?

Things get stranger. The team also updated the Qwen3-4B dense models, architectures that shouldn’t, in theory, benefit as much from GSPO, and yet:

Source: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507

AIME25 gains remain outstanding even for a 4B-parameter dense model.

What’s going on?

Let’s dig in. In this article, we’ll replicate and analyze all Qwen3-30B-A3B variants on AIME25:

  • Qwen/Qwen3-30B-A3B (the original hybrid version)

    • Thinking “on”

    • Thinking “off”

  • Qwen/Qwen3-30B-A3B-Instruct-2507

  • Qwen/Qwen3-30B-A3B-Thinking-2507

First, we’ll verify the Qwen team’s reported scores using their recommended inference hyperparameters. Next, we’ll inspect the models’ generations to understand the gap between the hybrid and Instruct versions.
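As a rough sketch of the first step, here is what generating AIME25 answers with vLLM could look like, using the sampling values the Qwen3-30B-A3B-Instruct-2507 model card recommends (temperature 0.7, top_p 0.8, top_k 20). The prompt and the `max_tokens` value are placeholders; that output token budget is exactly the knob the rest of the article scrutinizes:

```python
from vllm import LLM, SamplingParams

# Sampling values follow the Instruct-2507 model card recommendations;
# max_tokens is the output token budget, the hidden knob examined here.
llm = LLM(model="Qwen/Qwen3-30B-A3B-Instruct-2507", max_model_len=32768)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=16384,  # generous budget; lowering it truncates very long answers
)

# Placeholder prompt: in practice, loop over the AIME25 problems.
prompts = ["Solve this AIME 2025 problem and give the final answer as an integer: ..."]

for out in llm.generate(prompts, sampling):
    completion = out.outputs[0]
    print(f"{len(completion.token_ids)} generated tokens")
    print(completion.text[:400])
```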

Although this piece focuses on Qwen3, the broader point is methodological: evaluations hide many knobs, and raw accuracy can be misleading, especially with test-time scaling. We will identify what the Qwen team didn’t report: the factor that explains the Instruct versions’ high scores while significantly degrading the user experience.
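That means it is worth logging, alongside accuracy, how many tokens each model spends per answer. A small, hypothetical helper (the completion lists are placeholders; in practice they come from the saved AIME25 generations) that compares average generation length between two sets of outputs:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

def mean_generated_tokens(completions: list[str]) -> float:
    """Average number of tokens per completion, measured with the model's tokenizer."""
    lengths = [len(tokenizer(c)["input_ids"]) for c in completions]
    return sum(lengths) / len(lengths)

# Placeholders: replace with the saved generations of each model.
hybrid_no_think = ["...answer 1...", "...answer 2..."]
instruct_2507 = ["...answer 1...", "...answer 2..."]

print("hybrid (thinking off):", mean_generated_tokens(hybrid_no_think))
print("Instruct-2507:        ", mean_generated_tokens(instruct_2507))
```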

You can replicate my experiments using this notebook:

Get the notebook (#178)
