How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting
Is disabling thinking and retrying several times more accurate, and still cheaper, than enabling thinking?
Ask the same model the same question twice, and you may get two different answers.
This comes from random sampling at inference time: instead of always choosing the single most likely next token, the model samples from several plausible ones. That randomness is what makes outputs flexible enough for creativity and often helps reasoning explore different solution paths instead of getting stuck on the same one every time.
So when the task has only one correct answer, like a hard math problem or a multiple-choice question, the model may answer correctly on some runs and incorrectly on others. That is why LLM providers usually run these benchmarks multiple times, for example 32 to 64 times for hard math problems like AIME, and report the average accuracy.
At first glance, this randomness may seem like a weakness for tasks with a single correct answer. But it can actually be turned into a strength.
In this article, I show how to use it to improve accuracy with simple techniques such as majority voting (maj@k) and pass@k. These methods run the model several times, then either select the most common answer or check whether at least one of the sampled answers is correct. Since repeated sampling has a cost, especially for reasoning models that generate many tokens, like Qwen3.5, we will also look at how to identify models that are cheap enough to outperform a stronger model through reranking or repeated attempts. If one model is K times cheaper than a moderately stronger one, it may be enough to run it fewer than K times to come out ahead overall.
These strategies are especially effective with quantized models and/or when thinking is disabled. Disabling thinking usually causes a drop in accuracy, while making inference much cheaper, so the savings can be reinvested into additional samples. In many cases, this lets a quantized model with thinking disabled outperform the same model with thinking enabled, while still remaining cheaper.
Why LLMs Give Different Answers
LLMs do not always pick the single most likely next token. Instead, depending on the inference hyperparameters, they often sample from a distribution of plausible next tokens. A small change early in generation can lead to a very different full answer.
Two settings mainly control this:
Temperature: controls randomness.
Lower temperature makes outputs more deterministic and repetitive. Higher temperature makes outputs more varied, but also noisier.
Top-p: limits sampling to the smallest set of tokens whose total probability reaches p.
Lower top-p makes the model choose from a narrower set of likely tokens. Higher top-p allows more diverse continuations.
So when temperature or top-p is increased, the model explores more possible paths. That is why answers may become less consistent.
But that same diversity is what makes pass@k and maj@k useful: multiple samples can reveal either one correct solution or a stable consensus.
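To make the two settings concrete, here is a minimal sketch of how temperature and top-p could be applied to a single next-token choice. It is a toy illustration in pure Python, not the implementation of any particular inference engine: the function name `sample_next_token` and the dictionary of logits are assumptions made for the example.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token from a {token: logit} dict using temperature and top-p."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always take the top token.
        return max(logits, key=logits.get)

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break

    tokens, kept_probs = zip(*kept)
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(tokens, weights=kept_probs, k=1)[0]
```

With logits such as `{"A": 2.0, "B": 1.0, "C": 0.1}`, a low top-p leaves only "A" in the candidate set, while a high temperature flattens the distribution so that all three tokens show up across repeated calls.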
Pass@k vs. Maj@k
Pass@k asks: if we generate k answers, does at least one of them get the problem right?
With k = 4, imagine a coding task where the model generates 4 programs:
wrong
wrong
correct
wrong
Then pass@4 = 1 for that problem, because one of the 4 passed.
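When pass@k is reported on a benchmark, a common approach is to draw n samples per problem, count the c correct ones, and compute the probability that a random subset of k samples contains at least one correct answer. The sketch below uses that standard estimator, which was popularized by code benchmarks such as HumanEval. The function name `pass_at_k` is chosen for this example, and the numbers in this article were not necessarily computed this way.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of one problem, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random subset of k samples contains no correct one.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong

# The example above: 4 samples, 1 correct.
print(pass_at_k(4, 1, 4))  # 1.0
print(pass_at_k(4, 1, 1))  # 0.25
```

The benchmark score is then the average of this value over all problems.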
This works well for coding because code can be checked automatically with tests. You do not need the model’s first answer to be perfect. You just need one working solution among several samples.
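In practice, this amounts to a simple loop: sample, verify, and stop at the first candidate that passes. Here is a minimal sketch. The callables `generate` and `passes_tests` are hypothetical placeholders for a model call and a test runner, and the toy stand-ins below simply replay the four-program example.

```python
from typing import Callable, Optional

def first_passing(generate: Callable[[], str],
                  passes_tests: Callable[[str], bool],
                  k: int) -> Optional[str]:
    """Draw up to k candidates and return the first one that passes, else None."""
    for _ in range(k):
        candidate = generate()
        if passes_tests(candidate):
            return candidate  # stop early: later samples would only add cost
    return None

# Toy stand-ins mirroring the four-program example above.
outputs = iter(["wrong", "wrong", "correct", "wrong"])
result = first_passing(lambda: next(outputs), lambda code: code == "correct", k=4)
print(result)  # correct
```

Stopping early means the average cost is often lower than k full samples, since the loop ends as soon as one candidate passes.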
Maj@k asks: if we generate k answers, and take the majority answer, is that final answer correct?
With k = 4, suppose a math model outputs:
42
42
41
42
The majority answer is 42, so maj@4 succeeds if 42 is correct.
This is useful for MCQ and math, where you may not have a direct verifier, but the correct answer is likely to appear repeatedly across samples.
In short: pass@k uses multiple tries, while maj@k uses multiple votes.
Use pass@k when the task can be checked automatically.
This is why it fits coding best: you can sample several programs, run tests, and keep any one that works. The same idea also applies to tasks like SQL generation, code repair, regex writing, or script generation, where correctness can be verified by execution or test cases.
Use maj@k when the task cannot be verified easily, but the same correct answer is likely to appear repeatedly.
This is why it fits MCQ, math, logic puzzles, short factual QA, and structured reasoning tasks. You sample several times and trust the answer that appears most often.
Note: With majority voting, maj@2 doesn’t make sense: if the two answers disagree, there is no majority to pick. Use k of at least 3.
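Majority voting as described in this section can be sketched in a few lines. This is a minimal illustration, assuming the final answers have already been extracted from the model outputs and normalized to comparable strings. The function name `majority_vote` and the choice to return `None` on a tie are assumptions made for the example. Strictly speaking, it picks the most common answer, even if that answer has fewer than half of the votes.

```python
from collections import Counter

def majority_vote(answers):
    """Return (answer, votes) for the most common answer, or (None, votes) on a tie."""
    if not answers:
        raise ValueError("need at least one answer")
    ranked = Counter(answers).most_common()
    top_answer, top_votes = ranked[0]
    # A tie for first place means there is no majority to trust.
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, top_votes
    return top_answer, top_votes

print(majority_vote(["42", "42", "41", "42"]))  # ('42', 3)
print(majority_vote(["42", "41"]))              # (None, 1)
```

The second call shows why two samples are not enough: a single disagreement already produces a tie.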
Stronger LLMs with Thinking Disabled: Leveraging Maj@k and Pass@k
When thinking is enabled, the model produces a long reasoning trace, often thousands of tokens, which it uses to arrive at a more accurate answer.
With thinking disabled, accuracy is usually lower, but the model no longer generates that reasoning trace, making inference much cheaper.
Let’s compare how long Qwen3.5 27B (NVFP4) takes to answer the same prompt with thinking enabled versus disabled.

So, on average, with thinking disabled, the model can answer five coding problems in the same amount of time it takes to answer one with thinking enabled.
That means if pass@5 without thinking is higher than pass@1 with thinking, it’s better to disable thinking and generate multiple answers instead.
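The break-even reasoning can be written down as simple arithmetic. The sketch below assumes that cost is proportional to generation time and that each sample costs the same, which ignores effects such as batching or prompt caching. The function names and the `cost_ratio` parameter (how many times cheaper one sample without thinking is than one run with thinking) are introduced for this example.

```python
import math

def relative_cost(k, cost_ratio):
    """Cost of k cheap samples relative to one expensive run (1.0 = same cost)."""
    if k < 1 or cost_ratio <= 0:
        raise ValueError("need k >= 1 and cost_ratio > 0")
    return k / cost_ratio

def max_samples_within_budget(cost_ratio):
    """Largest k whose total cost does not exceed one expensive run."""
    return math.floor(cost_ratio)

# With the roughly 5x ratio measured above:
print(relative_cost(5, 5.0))           # 1.0
print(relative_cost(2, 5.0))           # 0.4
print(max_samples_within_budget(5.0))  # 5
```

Any k at or below this budget that also beats the thinking-enabled pass@1 is a net win on both cost and accuracy.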
And yes, that’s exactly what we observe:

Pass@1 is nearly identical across both settings: 73.7 on LiveCodeBench.
But with thinking disabled, pass@2 jumps to 81.0, while still costing far less than a single run with thinking enabled.
The gap is even more striking on instruction-following benchmarks, where enabling thinking is nearly 20× more expensive.
Pass@8 with thinking disabled reaches 71.4, compared with 66.7 for pass@1 with thinking enabled, while still being significantly cheaper to run.
We see the same pattern with other models, including NVIDIA’s Nemotron 3 Super.
On LiveCodeBench, disabling thinking makes the model roughly 10× cheaper per problem, but also much weaker. Even so, pass@4 is enough to beat the model’s pass@1 performance with thinking enabled.
In other words, four attempts with thinking disabled are still cheaper overall than one attempt with thinking enabled and, on average, more likely to produce code that passes your tests.
For math tasks such as AIME, where we can leverage majority voting, disabling thinking does not reduce cost nearly as much. Still, it is interesting to see that with 32 samples and majority voting, the model with thinking disabled significantly outperforms the single-run accuracy of the same model with thinking enabled.
Of course, this is not always true. On MMLU-Pro, for example, which is a multiple-choice benchmark, retrying 8 times and using majority voting improves accuracy, but it still falls well short of the model’s performance with thinking enabled.






