The Kaitchup – AI on a Budget

Reasoning Budgets vs. Structured CoT: Controlling LLM Thinking Tokens

Evaluations of BNF grammars and reasoning budgets with Qwen3.6 27B

Benjamin Marie
May 25, 2026

LLMs now rely heavily on reasoning traces to improve accuracy. A reasoning trace is the intermediate text generated before the final answer, often delimited by tags such as <think>...</think>.

It may contain planning, decomposition, calculations, checks, failed attempts, and self-corrections. In practice, this trace is part of the model output: it consumes tokens, latency, KV cache, and money. It should not be treated as a guaranteed faithful explanation of the model’s internal computation, but it is the surface mechanism that many recent models use to spend more inference-time compute.

The length of these traces varies dramatically across model families. Some models, such as Gemma 4, tend to produce shorter traces, often below 20k tokens. Others, such as Qwen3.5 and Qwen3.6, can easily generate more than 100k tokens of reasoning before producing the final answer.

Related: Qwen3.6 27B vs Qwen3.5 27B vs Gemma 4 31B: Accuracy, Latency, Memory, and Token Efficiency Tested (Benjamin Marie, May 5)

This creates a practical serving problem. For many models, reasoning is almost binary: either disable thinking and save tokens, usually at the cost of accuracy, or enable thinking and pay the full cost. Some model families expose better controls. GPT-OSS-style models expose coarse reasoning levels such as low, medium, and high. Nemotron-style models can accept a reasoning budget directly in the chat template. But many strong open models still lack a clean, reliable, per-request reasoning budget.

For models without native budget controls, inference frameworks can still impose constraints at decoding time. Two methods are getting especially popular:

  1. Force a reasoning budget. If a budget b is set, the inference framework allows the model to generate at most b reasoning tokens. Once the budget is reached, the framework forces the reasoning block to close, usually by injecting or forcing a </think> tag, and then lets the model generate the final answer.

  2. Constrain the reasoning trace with a BNF grammar. A grammar defines the exact structure that the model is allowed to generate. During decoding, the framework masks out tokens that would violate the grammar. If the grammar says that the reasoning trace must contain exactly a few labeled lines, the model is forced to compress its reasoning into that shape.
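To make the two mechanisms concrete, here is a minimal toy sketch in plain Python. It is not the vLLM or llama.cpp implementation, and it is not the setup evaluated in this article: the names decode_with_budget, constrained_pick, and toy_model are hypothetical, tokens are plain strings, and the "model" is a stub. It only illustrates the control flow: the budget method overrides the model's next token with </think> once b reasoning tokens have been produced, and the grammar method discards every candidate token the grammar does not allow before picking one.

```python
from typing import Callable, Dict, List, Set

THINK_CLOSE = "</think>"
EOS = "<eos>"


def decode_with_budget(
    next_token: Callable[[List[str]], str],
    budget: int,
    max_new_tokens: int = 64,
) -> List[str]:
    """Toy decoding loop with a forced reasoning budget.

    Assumes the prompt already opened the reasoning block, so generation
    starts inside <think>. `next_token` is a stand-in for the model.
    """
    out: List[str] = []
    reasoning_tokens = 0
    thinking = True
    while len(out) < max_new_tokens:
        if thinking and reasoning_tokens >= budget:
            # Budget exhausted: ignore the model and force the block to close.
            tok = THINK_CLOSE
        else:
            tok = next_token(out)
        out.append(tok)
        if tok == EOS:
            break
        if thinking:
            if tok == THINK_CLOSE:
                thinking = False  # the final answer starts after this tag
            else:
                reasoning_tokens += 1
    return out


def constrained_pick(scores: Dict[str, float], allowed: Set[str]) -> str:
    """Toy grammar masking: keep only grammar-legal tokens, then take the best."""
    legal = {tok: s for tok, s in scores.items() if tok in allowed}
    if not legal:
        raise ValueError("grammar allows none of the candidate tokens")
    return max(legal, key=legal.get)


def toy_model(prefix: List[str]) -> str:
    """Hypothetical stub: reasons forever unless the block is closed for it."""
    if THINK_CLOSE not in prefix:
        return "step"
    return "42" if prefix[-1] == THINK_CLOSE else EOS


if __name__ == "__main__":
    print(decode_with_budget(toy_model, budget=3))
    # The model prefers "wait", but the toy grammar only allows a labeled line.
    print(constrained_pick({"wait": 0.9, "PLAN:": 0.1}, allowed={"PLAN:"}))
```

In both cases the model ends up continuing from a prefix it did not choose, which is why the accuracy impact has to be measured rather than assumed.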

Both methods are attractive because they reduce output length without retraining the model. But both also create reasoning traces that the model did not see during training. The question is not only whether they save tokens, but whether they preserve the behavior that made thinking useful in the first place.

In this article, I evaluate Qwen3.6 27B on three difficult benchmark families: coding tasks, hard math questions, and hard science multiple-choice questions. The goal is to answer three practical questions:

  • How much do these methods reduce generated tokens?

  • What is the impact on accuracy?

  • Does constrained thinking give accuracy between “thinking on” and “thinking off,” or can it be worse than both?

The article includes examples using vLLM and llama.cpp.
