4-bit GLM-4.7 (358B) on a Single NVIDIA B300 with vLLM: AWQ vs NVFP4 vs INT4
Just give it enough tokens to think
GLM-4.7 is a sparse MoE model with a 358B total parameter count. I wanted to run it myself on a single NVIDIA B300 with vLLM and compare the main 4-bit variants (AWQ, NVFP4, and INT4-mixed/AutoRound) to see which ones actually work end-to-end, how fast they are, and how much accuracy they keep.
GLM-4.7 has a large total parameter count (358B), but only a subset of experts are active on each forward pass. That helps with compute, but it doesn’t solve the main practical issue for most users: the weights are still huge, and memory requirements are still the limiting factor.
In practice, running GLM-4.7 comfortably at higher precision typically means a multi-GPU node: think an 8x141 GB setup (e.g., H200s) for "no compromises" inference. Even an FP8 deployment is still generally in "node territory" (for example, an 8x80 GB class setup such as H100s), depending on your serving settings and context length.
I also wanted to test the B300, which comes with 288 GB of memory. That's not enough for an FP8 deployment of GLM-4.7, but it is enough for a 4-bit version, so the question becomes: which 4-bit format works cleanly with vLLM on a B300, how fast is it, and how much quality do you keep?
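To make the memory math concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only; KV cache, activations, and quantization scales/zero-points add real overhead on top, so the actual footprint is higher than these numbers:

```python
# Approximate weight memory for GLM-4.7 (358B total parameters).
# Weights only: KV cache, activations, and quantization metadata
# (scales, zero-points) are not included.
TOTAL_PARAMS = 358e9

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight storage in GB for a given precision."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

print(f"FP8  : {weight_gb(8):.0f} GB")  # ~358 GB, more than a 288 GB B300
print(f"4-bit: {weight_gb(4):.0f} GB")  # ~179 GB, fits with room for KV cache
```

This is why the B300's 288 GB rules out FP8 but leaves roughly 100 GB of headroom for the KV cache and runtime overhead with a 4-bit checkpoint.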
In this article, I’ll focus on three things: getting GLM-4.7 running correctly in vLLM, documenting the practical pain points of the B300 in a real setup, and comparing the main 4-bit variants (AWQ, NVFP4, and INT4-mixed/AutoRound) in terms of both speed and accuracy.
Concretely, we’ll cover:
How to serve GLM-4.7 with vLLM (and what tends to break with quantized versions, like the multi-token prediction).
Some issues I encountered when setting up the B300.
Which 4-bit formats actually run end-to-end, how fast they are, and how close they stay to the full model on reasoning benchmarks.
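As a starting point, a minimal serving invocation looks something like the sketch below. The model ID is a placeholder assumption, not a confirmed repo name; substitute the AWQ, NVFP4, or AutoRound checkpoint you actually downloaded, and expect to tune the context length and memory utilization for your own setup:

```shell
# Minimal sketch of serving a quantized GLM checkpoint with vLLM.
# The model ID is a placeholder: point it at the 4-bit repo you use.
# --max-model-len: reasoning traces get long, so keep plenty of room.
# --gpu-memory-utilization: leave headroom for runtime overhead.
vllm serve zai-org/GLM-4.7-AWQ \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

vLLM generally detects the quantization scheme from the checkpoint's config, so no explicit quantization flag is usually needed; the sections below cover what happens when a given 4-bit format does not load cleanly.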
Note: these experiments were done live in an open chat with The Kaitchup subscribers. To join future chats, it’s here:
Environment
I went with Prime Intellect (not sponsored) for this. They offer a wide diversity of GPUs sourced through other cloud providers, so you get access to GPUs from RunPod, Lambda, and others in a single place.


