The Kaitchup – AI on a Budget

How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting

Is thinking disabled + multiple retries better and still more efficient than thinking enabled?

Benjamin Marie
Apr 27, 2026
∙ Paid

Ask the same model the same question twice, and you may get two different answers.

This comes from random sampling at inference time: instead of always choosing the single most likely next token, the model samples from several plausible ones. That randomness is what makes outputs flexible enough for creativity and often helps reasoning explore different solution paths instead of getting stuck on the same one every time.

So when the task has only one correct answer, like a hard math problem or a multiple-choice question, the model may answer correctly on some runs and incorrectly on others. That is why LLM providers usually run these benchmarks multiple times, for example 32 to 64 times for hard math problems like AIME, and report the average accuracy.

At first glance, this randomness may seem like a weakness for tasks with a single correct answer. But it can actually be turned into a strength.

In this article, I show how to use it to improve accuracy with simple techniques such as majority voting (maj@k) and pass@k. These methods run the model several times, then either select the most common answer or check whether at least one of the sampled answers is correct. Since repeated sampling has a cost, especially for reasoning models that generate many tokens, like Qwen3.5, we will also look at how to identify models that are cheap enough to outperform a stronger model through reranking or repeated attempts. If one model is K times cheaper than a moderately stronger one, it may be enough to run it fewer than K times to come out ahead overall.

These strategies are especially effective with quantized models and/or when thinking is disabled. Disabling thinking usually causes a drop in accuracy, while making inference much cheaper, so the savings can be reinvested into additional samples. In many cases, this lets a quantized model with thinking disabled outperform the same model with thinking enabled, while still remaining cheaper.

Why LLMs Give Different Answers

LLMs do not always pick the single most likely next token. Instead, depending on the inference hyperparameters, they often sample from a distribution of plausible next tokens. A small change early in generation can lead to a very different full answer.

Two settings mainly control this:

  • Temperature: controls randomness.
    Lower temperature makes outputs more deterministic and repetitive. Higher temperature makes outputs more varied, but also noisier.

  • Top-p: limits sampling to the smallest set of tokens whose total probability reaches p.
    Lower top-p makes the model choose from a narrower set of likely tokens. Higher top-p allows more diverse continuations.

So when temperature or top-p is increased, the model explores more possible paths. That is why answers may become less consistent.
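
To make the two settings concrete, here is a minimal sketch of temperature and top-p sampling over a toy list of logits. It is an illustration only, not how any particular inference engine implements sampling: the function name `sample_next_token` and the convention of treating temperature 0 as greedy decoding are assumptions made for this example.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id from raw logits using temperature and top-p."""
    if temperature <= 0:
        # Assumption for this sketch: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales the logits before the softmax:
    # lower values sharpen the distribution, higher values flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With logits `[2.0, 1.0, 0.1]`, a top-p of 0.5 keeps only the first token, so every call returns the same id. Raising temperature and top-p lets the other tokens through, which is where run-to-run variation comes from.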

But that same diversity is what makes pass@k and maj@k useful: multiple samples can reveal either one correct solution or a stable consensus.

Pass@k vs. Maj@k

Pass@k asks: if we generate k answers, does at least one of them get the problem right?

With k = 4, imagine a coding task where the model generates 4 programs:

  1. wrong

  2. wrong

  3. correct

  4. wrong

Then pass@4 = 1 for that problem, because at least one of the 4 programs passed.

This works well for coding because code can be checked automatically with tests. You do not need the model’s first answer to be perfect. You just need one working solution among several samples.
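
As a rough sketch, pass@k for one problem is just an "any" over the test results of the k samples. The second function below is a common way to estimate pass@k when you have drawn n ≥ k samples per problem: it computes the chance that a random subset of k of those samples contains at least one passing sample. This article does not state which computation is behind its numbers, so treat the estimator and the function names as assumptions.

```python
from math import comb


def pass_at_k_single(results):
    """pass@k for one problem: 1 if any of the k samples passed, else 0."""
    return 1 if any(results) else 0


def pass_at_k_estimate(n, c, k):
    """Estimate pass@k from n samples of which c passed (requires k <= n).

    This is the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) counts the subsets made only of failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(samples_per_problem, k):
    """Average the per-problem estimate over a benchmark, as a percentage.

    samples_per_problem is a list of lists of booleans (one list per problem).
    """
    scores = [
        pass_at_k_estimate(len(results), sum(results), k)
        for results in samples_per_problem
    ]
    return 100.0 * sum(scores) / len(scores)
```

For the example above, `pass_at_k_single([False, False, True, False])` returns 1, while `pass_at_k_estimate(4, 1, 1)` gives 0.25, the fraction of samples that passed on their own.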

Maj@k asks: if we generate k answers, and take the majority answer, is that final answer correct?

With k = 4, suppose a math model outputs:

  1. 42

  2. 42

  3. 41

  4. 42

The majority answer is 42, so maj@4 succeeds if 42 is correct.
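
The voting step itself is only a few lines. The sketch below assumes the answers have already been extracted and normalized into comparable strings (so that "42" and "42.0" do not count as different votes), and it treats a tie for first place as a failure. Both choices are assumptions for this example, not something the benchmarks prescribe.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None if the top answers tie."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no single answer wins the vote
    return counts[0][0]


def maj_at_k(answers, reference):
    """maj@k for one problem: 1 if the voted answer matches the reference."""
    return 1 if majority_vote(answers) == reference else 0
```

With the four outputs above, `majority_vote(["42", "42", "41", "42"])` returns `"42"`. With only two samples that disagree, such as `["42", "41"]`, it returns `None`.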

Maj@k is useful for MCQ and math, where you may not have a direct verifier but the same correct answer is likely to appear repeatedly. In short:

  • pass@k uses multiple tries

  • maj@k uses multiple votes

Use pass@k when the task can be checked automatically.
This is why it fits coding best: you can sample several programs, run tests, and keep any one that works. The same idea also applies to tasks like SQL generation, code repair, regex writing, or script generation, where correctness can be verified by execution or test cases.

Use maj@k when the task cannot be verified easily, but the same correct answer is likely to appear repeatedly.
This is why it fits MCQ, math, logic puzzles, short factual QA, and structured reasoning tasks. You sample several times and trust the answer that appears most often.

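Note: With majority voting, maj@2 doesn’t make sense: two samples either agree, in which case the vote changes nothing, or disagree, in which case there is no majority. k must be at least 3 to be effective.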

Stronger LLMs with Thinking Disabled: Leveraging Maj@k and Pass@k

When thinking is enabled, the model produces a long reasoning trace, often thousands of tokens, which it uses to arrive at a more accurate answer.

With thinking disabled, accuracy is usually lower, but the model no longer generates that reasoning trace, making inference much cheaper.

Let’s compare how long Qwen3.5 27B (NVFP4) takes to answer the same prompt with thinking enabled versus disabled.

Simulation at 50 tokens/second. I used the same generation speed for both settings, but since the run with thinking enabled (blue bars) generates much longer sequences, its throughput will be lower in practice.

So, on average, with thinking disabled, the model can answer five coding problems in the same amount of time it takes to answer one with thinking enabled.

That means if pass@5 without thinking is higher than pass@1 with thinking, it’s better to disable thinking and generate multiple answers instead.

And yes, that’s exactly what we observe:

(pass@1 is almost the same for both; even with thinking disabled, Qwen3.5 27B thinks a lot when seeing coding problems)

Pass@1 is nearly identical across both settings: 73.7 on LiveCodeBench.

But with thinking disabled, pass@2 jumps to 81.0, while still costing far less than a single run with thinking enabled.
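
The decision rule behind this comparison can be written down in a few lines. The sketch below looks for the smallest k at which the cheaper configuration beats the baseline score while its total cost stays below the baseline's. The function name and the cost units are hypothetical. The scores are the LiveCodeBench figures quoted above, and the 5:1 cost ratio assumes that cost is proportional to generation time.

```python
def cheapest_winning_k(baseline_score, baseline_cost,
                       cheap_scores, cheap_cost_per_sample):
    """Smallest k whose score beats the baseline at a lower total cost.

    cheap_scores maps k to pass@k (or maj@k) for the cheaper configuration.
    Returns None if no k qualifies.
    """
    for k in sorted(cheap_scores):
        total_cost = k * cheap_cost_per_sample
        if cheap_scores[k] > baseline_score and total_cost < baseline_cost:
            return k
    return None


# Qwen3.5 27B on LiveCodeBench, using the figures quoted in this article.
# Costs are in arbitrary units: one thinking-enabled answer is assumed to
# cost as much as five thinking-disabled answers.
best_k = cheapest_winning_k(
    baseline_score=73.7,                # pass@1, thinking enabled
    baseline_cost=5.0,
    cheap_scores={1: 73.7, 2: 81.0},    # pass@k, thinking disabled
    cheap_cost_per_sample=1.0,
)
print(best_k)  # 2
```

Here two samples are enough: pass@2 without thinking beats pass@1 with thinking at two fifths of the cost.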

The gap is even more striking on instruction-following benchmarks, where enabling thinking is nearly 20× more expensive.

Pass@8 with thinking disabled reaches 71.4, compared with 66.7 for pass@1 with thinking enabled, while still being significantly cheaper to run.

We see the same pattern with other models, including NVIDIA’s Nemotron 3 Super.

On LiveCodeBench, disabling thinking makes the model roughly 10× cheaper per problem:

But it is also much weaker. Even so, pass@4 is enough to beat the model’s pass@1 performance with thinking enabled.

In other words, even after four attempts, disabling thinking is still cheaper overall and, on average, more likely to produce code that passes your tests.

For math tasks such as AIME, where we can leverage majority voting, disabling thinking does not reduce cost nearly as much. Still, it is interesting to see that with 32 retries and majority voting, the model without reasoning significantly outperforms the single-run accuracy of the same model with thinking enabled.

Of course, this is not always true. On MMLU-Pro, for example, which is a multiple-choice benchmark, retrying 8 times and using majority voting improves accuracy, but it still falls well short of the model’s performance with thinking enabled.

Quantized vs Original
