The Kaitchup – AI on a Budget

The Fastest and Cheapest 120B LLM?

Mistral Small 4, Nemotron 3 Super, and Qwen3.5 122B in NVFP4

Benjamin Marie
Apr 01, 2026

We have 3 excellent ~120B-parameter LLMs:

  • mistralai/Mistral-Small-4-119B-2603

  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

  • Qwen/Qwen3.5-122B-A10B

They are all MoEs, but they use very different architectures.

As in my earlier article comparing 30B-parameter MoEs, this article reviews how these three models differ.

We will compare their memory footprint and accuracy across a range of tasks. Since there are no clear public comparisons of their quantized formats in terms of accuracy and speed, I will focus in particular on their NVFP4 versions, with both weights and activations quantized, to see what kind of performance they can deliver on Blackwell GPUs.


Key Findings: Although Qwen3.5 122B is, on average, the most accurate model, the picture changes somewhat after quantization. In NVFP4, Nemotron 3 Super and Qwen3.5 122B perform much more similarly, likely because NVFP4 was partially used during Nemotron’s training. But focusing only on accuracy would miss the most important point: Nemotron 3 Super operates at a completely different level of efficiency. It is far more token-efficient, achieving higher accuracy with shorter reasoning traces, uses a smaller KV cache, and decodes significantly faster. In practice, that makes Nemotron 3 Super much cheaper to run.

Mistral Small 4 trails behind on both accuracy and efficiency, but it still has a few advantages up its sleeve.

Acknowledgments

This article would not have been possible without the compute sponsorship generously provided by Verda, whose B200 GPUs I used throughout this work.

Verda provides access to high-end GPUs such as the B200 and B300, with GB300 support coming soon, as well as smaller GPUs such as the RTX Pro 6000 and RTX 6000 Ada, which are among the most affordable per hour on the market.

Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.

Please check them out here.

The Models and How They Differ

Nemotron 3 Super NVFP4

The model uses a Mamba2-Transformer hybrid LatentMoE architecture with 120B total parameters and 12B active parameters:

  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

The BF16 checkpoint is 247 GB, and the NVFP4 checkpoint is 80.4 GB. That is 166.6 GB less, or 67.4% smaller. Latent projections, QKV and attention projections, and embeddings remain in BF16 or MXFP8.

Nemotron 3 Super is particularly optimized for Multi-Token Prediction (MTP), in which the target model includes native prediction heads for future tokens. This NVFP4 checkpoint still supports it very well, as we will see in the next sections.

Note: MTP doesn’t impact accuracy. The only negative impact is that it consumes slightly more memory and slows inference down if configured very poorly.

Mistral Small 4 NVFP4

Mistral AI released an official NVFP4 checkpoint made with LLM Compressor:

  • mistralai/Mistral-Small-4-119B-2603-NVFP4

Note: I now use AutoRound more often than LLM Compressor for NVFP4 quantization. I suspect AutoRound is better at it thanks to its superior rounding algorithm, but I haven't confirmed this.
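
For context, here is roughly what an NVFP4 recipe with LLM Compressor looks like. This is a minimal sketch rather than Mistral AI's actual recipe: the ignore list, calibration dataset, and sample count are assumptions, and quantizing a 119B model this way requires a lot of GPU memory.

# Minimal NVFP4 quantization sketch with LLM Compressor (not Mistral AI's actual recipe).
# The ignore list, calibration dataset, and sample count below are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Small-4-119B-2603"
SAVE_DIR = "Mistral-Small-4-119B-2603-NVFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 quantizes both weights and activations to FP4, so calibration data is needed
# to compute the activation scales.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],  # keep sensitive modules in higher precision
)

oneshot(
    model=model,
    dataset="open_platypus",  # any small calibration set works here
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)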

Mistral Small 4 is a multimodal MoE model with 128 experts, 4 active experts per token, 119B total parameters, 6.5B active parameters per token, and a 256k context window. Among the three models discussed here, it has the fewest active parameters per token, partly because it activates fewer experts at each step than Nemotron 3 Super and Qwen3.5 122B.

It is also the only model in this comparison that uses Transformer attention across all layers, and it adopts MLA (Multi-head Latent Attention) to compress the KV cache.

Its BF16 checkpoint is 242 GB, while the NVFP4 checkpoint is 70.8 GB, a reduction of 171.2 GB, or 70.7%. The vision stack, embeddings, attention-heavy paths, and output head are excluded from the NVFP4 group, which means the quantization is applied mostly to the experts.

The model does not include MTP layers for speculative decoding, but Mistral AI released a companion Eagle model to address that:

  • mistralai/Mistral-Small-4-119B-2603-eagle

Qwen3.5 122B NVFP4

The model has 122B total parameters, 10B active parameters, 48 layers, and a hybrid layout: 12 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)), with 262,144-token native context. Qwen didn’t release an official NVFP4 checkpoint, so I chose one among the many made by the community:

  • txn545/Qwen3.5-122B-A10B-NVFP4

The original BF16 checkpoint is 250 GB, and the txn545 NVFP4 checkpoint is 82.9 GB. That is 167.1 GB less, or 66.8% smaller. Like the two other models, it quantizes only the experts.
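
These on-disk sizes can be verified directly from the Hub metadata, without downloading anything. A small sketch using huggingface_hub, with the Qwen3.5 repo IDs listed above:

# Sum the size of a checkpoint's safetensors shards from Hugging Face Hub metadata.
from huggingface_hub import HfApi

api = HfApi()
repos = [
    ("BF16", "Qwen/Qwen3.5-122B-A10B"),
    ("NVFP4", "txn545/Qwen3.5-122B-A10B-NVFP4"),
]

for label, repo_id in repos:
    info = api.model_info(repo_id, files_metadata=True)
    total_bytes = sum(
        f.size or 0 for f in info.siblings if f.rfilename.endswith(".safetensors")
    )
    print(f"{label:>5} {repo_id}: {total_bytes / 1e9:.1f} GB")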

How to Use Mistral Small 4, Nemotron 3 Super, and Qwen3.5 122B NVFP4 Checkpoints with vLLM

There is no single install path across all three checkpoints: Mistral currently points to a patched vLLM branch, Nemotron pins vllm==0.17.1, and Qwen3.5 points to a nightly vLLM build from the main branch.

Note: The following sections show the steps I ran to install vLLM and launch the models. Depending on when you read this, most of the issues I encountered may already be resolved, and a simple “uv pip install vllm” could be enough for all these models.

Mistral Small 4 NVFP4 with vLLM

Mistral’s model card points either to a custom Docker image or to a temporary vLLM branch, plus transformers from main. It also says mistral_common >= 1.10.0 should be installed. I ran:

git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
uv pip install git+https://github.com/huggingface/transformers.git

Note: The model card seems to have a small error in its installation instructions: Transformers is installed with uv while vLLM is installed without it, which would put them in different environments. In the commands above, both are installed with uv.

Then, run the model with:

vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 262144 \
  --attention-backend TRITON_MLA \
  --reasoning-parser mistral 
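
Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal query sketch, assuming the server runs locally; with the mistral reasoning parser, the reasoning trace is returned separately from the final answer:

# Minimal chat request against the local vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603-NVFP4",
    messages=[{"role": "user", "content": "Explain NVFP4 quantization in two sentences."}],
    max_tokens=1024,
)

message = response.choices[0].message
# The reasoning parser puts the trace in reasoning_content (None if absent).
print(getattr(message, "reasoning_content", None))
print(message.content)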

Nemotron 3 Super NVFP4 with vLLM

uv pip install vllm

And to run it:


vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --max-model-len 262144 \
  --trust-remote-code \
  --reasoning-parser nemotron_v3 \
  --speculative-config '{"method":"nemotron_h_mtp","num_speculative_tokens":5}'

For num_speculative_tokens, I used 5, as recommended by NVIDIA; I confirmed it is faster with 5 than with 3. The value should be tuned depending on the task, but I left it at 5.
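
To check the effect of different num_speculative_tokens values yourself, a quick way is to time a long generation through the endpoint, compute decode throughput, then restart the server with another value and compare. A rough sketch (prompt and max_tokens are arbitrary):

# Rough decode-throughput check: time one long generation and divide by completion tokens.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"

start = time.perf_counter()
response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_tokens=4096,
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tokens/s")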

Qwen3.5 122B NVFP4 with vLLM

uv pip install vllm

As I’m writing this, the NVFP4 versions of the Qwen3.5 MoEs (35B, 122B, and 397B) are buggy on the B200 (and possibly on the RTX 5090 and RTX Pro 6000; I didn’t confirm). The MoE backend automatically selected by vLLM is the wrong one, and the model generates gibberish. I had to pass --moe_backend flashinfer_cutlass:

vllm serve txn545/Qwen3.5-122B-A10B-NVFP4 \
  --max-model-len 262144 \
  --moe_backend flashinfer_cutlass \
  --reasoning-parser qwen3 \
  --language-model-only

The --language-model-only flag disables the vision tower entirely to save a bit more memory if you are not going to send images to the model.

For MTP, pass:

--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

However, I couldn’t get it to work for this checkpoint. It could just be a configuration issue. Other NVFP4 checkpoints (e.g., Sehyo/Qwen3.5-122B-A10B-NVFP4) support it, but it’s unstable. Single queries work, but when I send batches of many queries to simulate a high workload, vLLM crashes.
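
To simulate the high workload that triggers these crashes, I simply fire many requests at the server concurrently. Something along these lines (concurrency level and prompts are arbitrary):

# Send many concurrent requests to the local server to simulate a high workload.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL_ID = "txn545/Qwen3.5-122B-A10B-NVFP4"

async def one_request(i: int) -> int:
    response = await client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": f"Question {i}: summarize the rules of chess."}],
        max_tokens=1024,
    )
    return response.usage.completion_tokens

async def main(n: int = 64) -> None:
    results = await asyncio.gather(*(one_request(i) for i in range(n)), return_exceptions=True)
    failures = [r for r in results if isinstance(r, Exception)]
    print(f"{n - len(failures)} requests succeeded, {len(failures)} failed")

asyncio.run(main())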

Nemotron 3 Super: As Accurate as Qwen3.5 122B

With Thinking Enabled:

Qwen3.5 seems slightly stronger on multiple-choice benchmarks such as GPQA and MMLU-Pro/Redux, whereas Nemotron 3 Super is stronger on instruction-following tasks like IFBench and IFEval.

Mistral Small 4 lags behind overall, except on coding benchmarks such as LiveCodeBench. Even there, though, a closer look at pass@k tells a less favorable story: Qwen3.5 and Nemotron improve meaningfully with retries.
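
For reference, pass@k here is the usual unbiased estimator: with n generations per problem and c of them correct, pass@k = 1 − C(n−c, k) / C(n, k), averaged over problems. A minimal implementation:

# Unbiased pass@k estimator: probability that at least one of k samples,
# drawn from n generations of which c are correct, solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical numbers: 16 generations per problem, 6 correct.
print(pass_at_k(n=16, c=6, k=1))  # 0.375
print(pass_at_k(n=16, c=6, k=4))  # ~0.885, much higher with retries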

Note: On the model card, Mistral AI reports that Mistral Small 4 underperforms Qwen3.5 122B on LiveCodeBench. The NVFP4 conversion may have altered that result, at least for pass@1, or my run may simply have been unusually favorable.

With Thinking Disabled:

Note: Sorry about the inconsistent bar colors across the charts. I still need to fix that in my visualization app.

When “thinking” is disabled, the differences between the models become much clearer. Qwen3.5 122B stands out as the strongest overall by a wide margin, while Mistral Small 4 falls well behind.

That said, these results are of limited value without statistics on the number of generated tokens. In practice, disabling “thinking” is mainly useful for getting faster responses. And even with “thinking” turned off, the model may still attempt to reason internally, as we saw in previous articles.

Related: Disable “Thinking,” Still Get Thousands of Tokens: What Instruct LLMs Are Doing (The Kaitchup, Mar 2).
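
How “thinking” is toggled differs per model and goes through the chat template. For Qwen-style templates, vLLM's OpenAI-compatible server accepts chat_template_kwargs per request; I assume Qwen3.5 keeps Qwen3's enable_thinking switch, while Mistral and Nemotron use the mechanisms described on their model cards. A sketch for the Qwen3.5 endpoint:

# Per-request "thinking" toggle for a Qwen-style chat template served by vLLM.
# enable_thinking is the Qwen3 switch; I assume Qwen3.5 keeps the same kwarg.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="txn545/Qwen3.5-122B-A10B-NVFP4",
    messages=[{"role": "user", "content": "Translate to French: The cat sleeps on the sofa."}],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(response.choices[0].message.content)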

I also checked how the models perform in translation, with thinking disabled.

I had hoped Mistral Small 4 would be significantly better at translating into French (fr_fr), but it isn’t. For almost all the languages I tested, Qwen3.5 122B is the best.

Nemotron 3 Super: The Most Token-Efficient, Memory-Efficient, and Fastest

NVIDIA placed a strong emphasis on Nemotron 3 Super’s efficiency when it was released, but I think they actually undersold it.

Yes, the model is faster. But more importantly, it is far more token-efficient. It achieves significantly higher accuracy than Qwen3.5 122B for the same number of generated tokens.

That came as a surprise to me because, as we saw with “thinking” disabled, Nemotron 3 Super is not especially strong without reasoning traces. Once thinking is enabled, though, the picture changes completely: Nemotron 3 Super produces accurate answers with much shorter reasoning traces. Here is an example on AIME25:

I also found that it makes more gradual use of reasoning tokens. Take GPQA Diamond, for example: on this benchmark, both Mistral and Qwen3.5 tend to stop reasoning at around 32k tokens, whereas Nemotron 3 Super can make use of roughly 120k reasoning tokens. This limit is actually tunable through Nemotron’s prompt, though I did not experiment with that.

This efficiency matters a great deal. Although Qwen3.5 and Nemotron 3 Super reach similar levels of accuracy overall, Nemotron 3 Super does so while generating far fewer tokens.

But that’s not all.

If we combine this token efficiency with raw inference speed, the efficiency gap becomes even larger.

First, let’s look at how fast Nemotron 3 Super actually is:

This post is for paid subscribers
