Reasoning Budgets vs. Structured CoT: Controlling LLM Thinking Tokens
Evaluations of BNF grammars and reasoning budgets with Qwen3.6 27B
LLMs now rely heavily on reasoning traces to improve accuracy. A reasoning trace is the intermediate text generated before the final answer, often delimited by tags such as <think>...</think>.
It may contain planning, decomposition, calculations, checks, failed attempts, and self-corrections. In practice, this trace is part of the model output: it consumes tokens, latency, KV cache, and money. It should not be treated as a guaranteed faithful explanation of the model’s internal computation, but it is the surface mechanism that many recent models use to spend more inference-time compute.
The length of these traces varies dramatically across model families. Some models, such as Gemma 4, tend to produce shorter traces, often below 20k tokens. Others, such as Qwen3.5 and Qwen3.6, can easily generate more than 100k tokens of reasoning before producing the final answer.
This creates a practical serving problem. For many models, reasoning is almost binary: either disable thinking and save tokens, usually at the cost of accuracy, or enable thinking and pay the full cost. Some model families expose better controls. GPT-OSS-style models expose coarse reasoning levels such as low, medium, and high. Nemotron-style models can accept a reasoning budget directly in the chat template. But many strong open models still lack a clean, reliable, per-request reasoning budget.
For models without native budget controls, inference frameworks can still impose constraints at decoding time. Two methods are getting especially popular:
Force a reasoning budget. If a budget b is set, the inference framework allows the model to generate at most b reasoning tokens. Once the budget is reached, the framework forces the reasoning block to close, usually by injecting or forcing a </think> tag, and then lets the model generate the final answer.
Constrain the reasoning trace with a BNF grammar. A grammar defines the exact structure that the model is allowed to generate. During decoding, the framework masks out tokens that would violate the grammar. If the grammar says that the reasoning trace must contain exactly a few labeled lines, the model is forced to compress its reasoning into that shape.
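To make the two mechanisms concrete, here is a minimal toy sketch in plain Python. It does not use vLLM or llama.cpp. The names toy_model, toy_scores, generate_with_budget, generate_with_grammar, and GRAMMAR are all hypothetical, and the "grammar" is simplified to a fixed sequence of slots, whereas a real BNF engine tracks parser state. The sketch only illustrates where the framework intervenes in the decoding loop.

```python
from typing import Callable, Dict, List, Set

THINK_CLOSE = "</think>"
EOS = "<eos>"


def toy_model(context: List[str]) -> str:
    """Hypothetical model: 10 reasoning steps, then close, answer, stop."""
    if THINK_CLOSE in context:
        return EOS if context[-1] == "answer" else "answer"
    return THINK_CLOSE if len(context) >= 10 else "step"


def generate_with_budget(next_token: Callable[[List[str]], str],
                         budget: int, max_tokens: int = 64) -> List[str]:
    """Let the model think for at most `budget` tokens, then force the close tag."""
    out: List[str] = []
    reasoning_tokens = 0
    thinking = True
    while len(out) < max_tokens:
        if thinking and reasoning_tokens >= budget:
            out.append(THINK_CLOSE)  # injected by the framework, not chosen by the model
            thinking = False
            continue
        tok = next_token(out)
        out.append(tok)
        if tok == EOS:
            break
        if thinking:
            if tok == THINK_CLOSE:
                thinking = False  # the model closed its reasoning on its own
            else:
                reasoning_tokens += 1
    return out


def toy_scores(context: List[str]) -> Dict[str, float]:
    """Hypothetical next-token scores: the model would rather ramble ("hmm")."""
    scores = {"hmm": 3.0, "add": 2.0, "verify": 1.5,
              "PLAN:": 1.0, "CHECK:": 0.5, THINK_CLOSE: 0.1}
    if context and context[-1] == "CHECK:":
        scores["verify"] = 2.5
    return scores


WORDS: Set[str] = {"add", "verify"}
# Simplified stand-in for a grammar: one set of legal tokens per output position.
GRAMMAR: List[Set[str]] = [{"PLAN:"}, WORDS, {"CHECK:"}, WORDS, {THINK_CLOSE}]


def generate_with_grammar(score_fn: Callable[[List[str]], Dict[str, float]],
                          grammar: List[Set[str]]) -> List[str]:
    """Greedy decoding where tokens outside the current slot are masked out."""
    out: List[str] = []
    for allowed in grammar:
        legal = {t: s for t, s in score_fn(out).items() if t in allowed}
        out.append(max(legal, key=legal.get))
    return out


print(generate_with_budget(toy_model, budget=3))
print(generate_with_grammar(toy_scores, GRAMMAR))
```

In both cases the model ends up continuing from a prefix it did not choose. With the budget, its reasoning is cut after three steps although it wanted ten. With the grammar, its preferred token is never legal, so it is pushed onto lower-scoring ones.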
Both methods are attractive because they reduce output length without retraining the model. But both also create reasoning traces that the model did not see during training. The question is not only whether they save tokens, but whether they preserve the behavior that made thinking useful in the first place.
In this article, I evaluate Qwen3.6 27B on three difficult benchmark families: coding tasks, hard math questions, and hard science multiple-choice questions. The goal is to answer three practical questions:
How much do these methods reduce generated tokens?
What is the impact on accuracy?
Does constrained thinking give accuracy between “thinking on” and “thinking off,” or can it be worse than both?
I also include examples using vLLM and llama.cpp.
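The three questions above reduce to simple comparisons between three runs: thinking off, thinking on, and constrained thinking. The sketch below shows one way to compute them. RunStats, token_reduction, and classify_accuracy are hypothetical helper names, and the numbers are placeholders chosen only to exercise the code. They are not measurements from this evaluation.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    accuracy: float     # fraction of benchmark items answered correctly
    mean_tokens: float  # mean generated tokens per item (reasoning + answer)


def token_reduction(full: RunStats, constrained: RunStats) -> float:
    """Fraction of generated tokens saved relative to unconstrained thinking."""
    return 1.0 - constrained.mean_tokens / full.mean_tokens


def classify_accuracy(off: RunStats, on: RunStats, constrained: RunStats) -> str:
    """Place constrained accuracy relative to the thinking-off and thinking-on runs."""
    low, high = sorted((off.accuracy, on.accuracy))
    if constrained.accuracy < low:
        return "worse than both"
    if constrained.accuracy > high:
        return "better than both"
    return "between"


# Placeholder values for illustration only, NOT results from the article.
off = RunStats(accuracy=0.50, mean_tokens=1_000)
on = RunStats(accuracy=0.80, mean_tokens=40_000)
constrained = RunStats(accuracy=0.45, mean_tokens=4_000)

print(f"token reduction: {token_reduction(on, constrained):.0%}")
print(f"accuracy position: {classify_accuracy(off, on, constrained)}")
```

The "worse than both" branch is the case the third question asks about: a constrained trace can save most of the tokens and still land below the thinking-off baseline.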



