The Kaitchup – AI on a Budget

Qwen3.5 27B Latency and Throughput: INT4 vs NVFP4 vs FP8 vs BF16

Tested on RTX Pro 6000, H100, and B200 GPUs

Benjamin Marie
Mar 26, 2026

On most benchmarks, quantized Qwen3.5 models perform very close to the original models. That holds across the common quantization formats, INT4, NVFP4, and FP8, when compared against the BF16 baseline, although the exact result depends on how the model is quantized.

Related: Qwen3.5 Quantization: Similar Accuracy, More Thinking — Best Models and Recipes (Mar 12)

Where these formats differ more clearly is memory usage. In most cases, the ranking from smallest to largest is:

  1. INT4

  2. NVFP4

  3. FP8

  4. BF16

This matters because lower memory consumption usually means you can fit a larger KV cache and serve more concurrent requests. With standard LLM architectures, where every layer uses full attention, model size is often the main constraint when you want to run the model on a single GPU.

However, newer models such as Qwen3.5 and Nemotron 3 Super use full attention in only a small fraction of layers. As a result, they produce a much smaller KV cache. Compared with older full-attention models such as Qwen3 30B A3B, models like Qwen3.5 35B A3B can support far more concurrent requests at maximum context length.
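To make that concrete, here is a minimal back-of-envelope sketch of KV-cache size per request. The layer counts, head counts, head dimension, and context length are illustrative assumptions, not the actual Qwen3.5 or Nemotron configurations.

# Back-of-envelope KV-cache sizing. All architecture numbers below are
# illustrative assumptions, not the real Qwen3.5 or Nemotron configs.

def kv_cache_gb(full_attn_layers, kv_heads, head_dim, context_len,
                bytes_per_elem=2):
    """GB of KV cache for ONE request: 2 (K and V) x layers x heads x dim x tokens."""
    per_token = 2 * full_attn_layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# Hypothetical "classic" model: every one of 48 layers uses full attention.
classic = kv_cache_gb(full_attn_layers=48, kv_heads=8, head_dim=128,
                      context_len=128_000)

# Hypothetical hybrid model: only 8 of 48 layers use full attention,
# the rest use linear attention with a small constant-size state.
hybrid = kv_cache_gb(full_attn_layers=8, kv_heads=8, head_dim=128,
                     context_len=128_000)

print(f"classic: {classic:.1f} GB per request at max context")
print(f"hybrid:  {hybrid:.1f} GB per request at max context")
# With, say, 100 GB of VRAM left after the weights, the hybrid model fits
# roughly 48/8 = 6x more concurrent max-context requests than the classic one.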


At first glance, this looks extremely promising. And once you reduce model size further through quantization, it looks even better.

Related: The KV-Cache of Small MoEs: Qwen3, Qwen3.5, GLM 4.7 Flash, and Nemotron 3 Nano Compared (Mar 18)

But that is only part of the story.

Even if Qwen3.5 35B allows 100+ concurrent requests (B200 GPU) on paper, users may still experience high latency in practice. In this setup, KV cache size is no longer the dominant bottleneck. Under heavy workloads, with many users and very long sequences, GPU memory bandwidth can still saturate quickly. In other words, fitting more requests is not the same as serving them efficiently.

This is why inference speed and latency matter just as much as memory efficiency.
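A rough roofline-style estimate illustrates the point. Every number below (bandwidth, weight bytes, KV bytes per token, request mix) is an assumption picked only to show the shape of the calculation, not a measured value for Qwen3.5 or the B200.

# Bandwidth-only ceiling on decode throughput. All numbers are assumptions.
bandwidth_gb_s  = 8_000      # ~8 TB/s-class HBM, roughly B200 territory
weights_gb      = 54         # bytes actually touched per step; for an MoE,
                             # only the active experts' weights count here
kv_per_token_b  = 32_768     # assumed KV bytes/token (matches the hybrid sketch above)
n_requests      = 100
ctx_per_request = 64_000     # assumed average live context length

# Each decode step streams the touched weights once, plus the KV cache of
# every live token in every request, through HBM.
kv_read_gb = n_requests * ctx_per_request * kv_per_token_b / 1e9
step_gb = weights_gb + kv_read_gb
tokens_per_s_per_user = bandwidth_gb_s / step_gb

print(f"KV read per decode step: ~{kv_read_gb:.0f} GB")
print(f"bandwidth-only ceiling: ~{tokens_per_s_per_user:.0f} tokens/s per user")

Even with this optimistic accounting, all users share the same memory pipe, which is why a model that fits 100 requests does not necessarily serve 100 requests quickly.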

Unfortunately, the performance of these newer hybrid-attention models under different formats is still not very well documented. And the picture is made even more complicated by the fact that performance depends not only on the format itself, but also on what exactly is being quantized.

During inference, quantized weights are often dequantized back to a higher precision for computation. Those extra operations are expensive, and if they are not implemented efficiently, a quantized model can actually run slower than the original.

Fortunately, optimized kernels have improved this substantially. The community has built increasingly efficient kernels for quantized inference, and newer GPUs are also designed to accelerate formats such as NVFP4 and FP8.
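To see where the overhead comes from, here is a toy sketch of a naive weight-only path. It is not vLLM's actual kernel: it uses per-channel INT8 as a stand-in for grouped INT4 and float32 as a stand-in for BF16, but it shows the extra passes over the weights that fused kernels avoid.

import torch

# Toy illustration of naive weight-only quantization overhead.
# float32 stands in for BF16; real INT4 also packs two weights per byte
# and quantizes per group, which this sketch ignores.

out_features, in_features = 4096, 4096
w = torch.randn(out_features, in_features)

# Per-output-channel symmetric quantization to int8 (a stand-in for INT4 groups).
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_q = (w / scale).round().clamp(-127, 127).to(torch.int8)

x = torch.randn(1, in_features)

# Unquantized path: a single matmul.
y_ref = x @ w.t()

# Naive quantized path: upcast and rescale the whole weight matrix, then matmul.
# These extra elementwise passes over the weights are exactly what fused
# kernels (Marlin, Machete, CUTLASS FP8/NVFP4 paths) are designed to avoid.
w_dq = w_q.float() * scale
y_q = x @ w_dq.t()

print("max abs error vs. unquantized:", (y_ref - y_q).abs().max().item())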


In this article, I compare the inference speed of Qwen3.5 27B in BF16, FP8, NVFP4, and INT4 on three GPUs: RTX Pro 6000, H100, and B200.

I focus on two scenarios: synchronous inference, with one request at a time, and saturated workloads, where the GPU is pushed to high utilization. I ran the benchmarks with vLLM and GuideLLM, which provides additional metrics such as time to first token, inter-token latency, and other measurements across different query rates (full results below).

Acknowledgments

This article would not have been possible without the compute sponsorship generously provided by Verda, whose RTX Pro 6000, H100, and B200 GPUs I used throughout this work.

They provide access to high-end GPUs such as the B200 and B300, with GB300 support coming soon, as well as smaller GPUs such as the RTX Pro 6000 and RTX 6000 Ada, which are among the most affordable per hour on the market.

Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.

Please check them out here.

Models, vLLM, GuideLLM, and GPUs

I benchmarked the models served by vLLM. You may get different results with SGLang. I used vLLM because I have much more experience with it and know that all the formats I wanted to test are supported.

vllm serve [model_id] --host 127.0.0.1 

I let vLLM select the attention backend and the kernels used to run each format.

Note: If you test LLMs in GPU clouds, outside of a container, don’t forget to set the host to 127.0.0.1; by default it is 0.0.0.0, which means anyone who can reach the machine can query your model if you don’t set an API key.
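As a quick sanity check of the endpoint, you can send one request through vLLM's OpenAI-compatible API. The sketch below assumes the default port (8000), a key passed with --api-key, and whatever model id you gave to vllm serve; adjust as needed.

# Quick sanity check against the local vLLM server.
# Assumes the default port (8000) and that the server was started with a key,
# e.g.: vllm serve [model_id] --host 127.0.0.1 --api-key $MY_KEY
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="MY_KEY",                     # must match the key passed to vllm serve
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",             # the model id passed to vllm serve
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)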

For benchmarking, I used GuideLLM.

GuideLLM is an open-source benchmarking tool for LLM deployments that focuses on inference behavior rather than generic HTTP load testing. It reports LLM-specific metrics such as time to first token (TTFT), inter-token latency, and output distributions, and it is built to run against OpenAI-compatible backends like vLLM, which makes it useful for comparing serving stacks, quantization settings, and hardware under the same request mix.

Benchmarking each model checkpoint took approximately one hour per GPU, so I tested only one configuration: 1,000 prompt tokens with 1,000 output tokens.

For the GPUs, I created instances with Verda. You can check how I set them up in this article:

Qwen3.5 Quantization: Similar Accuracy, More Thinking — Best Models and Recipes (Mar 12)

Here is the list of models I benchmarked:

  • Qwen/Qwen3.5-27B
    Unquantized baseline. Main model runs in BF16, with Mamba state-space components in FP32. Nothing is compressed.

  • Qwen/Qwen3.5-27B-FP8
    FP8 model with dynamic activation quantization and blockwise FP8 weights. Most eligible layers are FP8, but the output head, token embeddings, vision stack, MTP head, and parts of linear attention remain in higher precision.

  • cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4
    Weight-only INT4 AWQ on linear layers. No activation quantization. Vision modules, output head, MTP head, and the full linear-attention projection stack are left unquantized; embeddings and norms also stay higher precision.

  • cyankiwi/Qwen3.5-27B-AWQ-4bit
    Weight-only INT4 AWQ on linear layers. No activation quantization. Vision modules, output head, MTP head, and only part of linear attention are left unquantized, so this version quantizes more of linear attention than the BF16-INT4 hybrid. Embeddings and norms remain unquantized.

  • kaitchup/Qwen3.5-27B-autoround-NVFP4-linearattn-BF16
    NVFP4 on linear layers, covering both weights and input activations. Vision modules, output head, and all linear-attention modules are excluded, so linear attention stays in BF16. Embeddings and norms are not quantized. Note: On the H100, the activations are not quantized.

  • kaitchup/Qwen3.5-27B-autoround-NVFP4
    NVFP4 on linear layers, again quantizing both weights and input activations. Vision modules and output head are excluded, but linear attention is quantized in this version. Embeddings and norms remain unquantized. Note: NVFP4 uses a group size of 32 for quantization. On Hopper GPUs, like the H100, vLLM relies on the Marlin kernel, which can’t apply this group size for some of the layers in Qwen3.5 27B. No results will be provided for this model on the H100.

Single Query

For one-at-a-time requests, the pattern is clear: reducing weight traffic matters more than any extra compute the quantized path adds. That is why 4-bit variants outperform BF16 and, in most cases, FP8. At batch size one, these GPUs are not fully occupied, so the limiter is often how fast the model weights and KV cache can be pulled through memory, not peak tensor throughput.

That favors INT4 on the RTX Pro 6000 and H100. On the B200, the best single-query result comes from the NVFP4-based path rather than plain INT4. FP8 still improves on BF16 across the board, but it looks like a compromise format here: useful and broadly deployable, but not the best latency lever when 4-bit kernels are stable.
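For readers who want to spot-check single-query numbers without GuideLLM, here is a minimal sketch of a hand-rolled TTFT and inter-token-latency measurement over the same OpenAI-compatible endpoint. The port, key, and prompt are placeholders, and streamed chunks only approximate tokens; GuideLLM measures all of this more carefully.

# Single-request latency probe against the vLLM server, roughly mirroring
# what GuideLLM reports at a query rate of one. Port, key, and prompt are
# placeholder assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="MY_KEY")

start = time.perf_counter()
ttft = None
chunk_times = []

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Explain KV caching in a few paragraphs."}],
    max_tokens=1000,
    stream=True,
)
for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = now - start        # time to first token
        chunk_times.append(now)

# Each streamed chunk is roughly one token, so the gaps between chunks
# approximate inter-token latency.
gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
print(f"TTFT: {ttft:.3f} s")
print(f"mean inter-token latency: {1000 * sum(gaps) / len(gaps):.1f} ms")
print(f"decode speed after first token: ~{len(gaps) / (chunk_times[-1] - chunk_times[0]):.1f} tok/s")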

Throughput
