Run GLM-4.7 Flash on One GPU: VRAM Math, Quantization Options, and Benchmark Results
And how good it is with "thinking" disabled
Z.ai’s flagship model, GLM-4.7, is among the strongest open-weight models available, but it’s also massive at 384B parameters. In practice, that means not even an NVIDIA B300 (288 GB) can hold the full model in FP8. A 4-bit version fits, but inference is still expensive.
Z.ai also released a smaller sibling: GLM-4.7 Flash, a 30B-A3B Mixture-of-Experts (MoE) model. It has 30B total parameters, but only ~3B are active per token during inference.
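To make those sizes concrete, here is a rough, weight-only VRAM estimate. This is a back-of-envelope sketch: the 10% overhead factor is my own assumption (layers kept in higher precision, quantization scales, and similar), and it ignores the KV cache and activations entirely.

```python
def weight_vram_gb(params_b: float, bits_per_param: float, overhead: float = 1.10) -> float:
    """Weight-only VRAM estimate in GB.

    The 10% overhead is a rough allowance for tensors kept in higher precision,
    quantization scales, etc. KV cache and activations are NOT included, so
    treat the result as a lower bound.
    """
    return params_b * 1e9 * bits_per_param / 8 * overhead / 1e9

for name, params_b in [("GLM-4.7 (384B)", 384), ("GLM-4.7 Flash (30B)", 30)]:
    for fmt, bits in [("BF16", 16), ("FP8", 8), ("INT4/NVFP4", 4)]:
        print(f"{name:<22} {fmt:<12} ~{weight_vram_gb(params_b, bits):5.0f} GB")
```

By this estimate, FP8 weights for the 384B model land around 420 GB, well above a B300’s 288 GB, while GLM-4.7 Flash needs roughly 66 GB in BF16 and under 20 GB at 4-bit.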
Despite being much smaller, GLM-4.7 Flash still includes features we usually only see in flagship models, such as multi-head latent attention (MLA) and multi-token prediction (MTP). MLA shrinks the KV cache and the memory bandwidth it consumes, while MTP boosts decoding throughput by predicting more than one token per forward pass.
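To see why the MLA part matters, here is a simplified comparison of per-sequence cache sizes for standard multi-head attention versus an MLA-style compressed latent. All dimensions below are illustrative placeholders, not GLM-4.7 Flash’s actual configuration, and the MLA figure omits the small decoupled rotary keys that real implementations also cache.

```python
def cache_gb(n_layers: int, tokens: int, values_per_token: int, bytes_per_value: int = 2) -> float:
    """Per-sequence attention-cache size in GB (bytes_per_value=2 for BF16/FP16)."""
    return n_layers * tokens * values_per_token * bytes_per_value / 1e9

layers, ctx = 48, 128_000      # illustrative layer count and context length
heads, head_dim = 32, 128      # standard MHA caches K and V for every head
latent_dim = 512               # MLA caches one compressed latent per token

mha_gb = cache_gb(layers, ctx, 2 * heads * head_dim)   # K + V across all heads
mla_gb = cache_gb(layers, ctx, latent_dim)              # shared latent only
print(f"MHA KV cache ~{mha_gb:.0f} GB vs. MLA latent cache ~{mla_gb:.1f} GB per sequence")
```

Even if the real numbers differ, the order-of-magnitude gap is why MLA matters for long contexts on a single GPU.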
And while a 30B model is small enough to run on a single workstation GPU like an RTX Pro 6000 (96 GB), the unquantized weights are still out of reach for most consumer cards.
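As a preview of the single-GPU setup, here is a minimal vLLM sketch. The model name is a placeholder for whichever (quantized) checkpoint you actually use; I’m only showing the two knobs that matter most for fitting into VRAM: `max_model_len` (a shorter context reserves less KV cache) and `gpu_memory_utilization`.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.7-Flash",   # placeholder: point this at the checkpoint you use
    max_model_len=8192,              # shorter context -> smaller KV-cache reservation
    gpu_memory_utilization=0.90,     # fraction of VRAM vLLM may claim
)

sampling = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts routing in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```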
In this article, we’ll cover which compressed versions fit on your GPU and how each option trades accuracy against the original model. I’ll also give a brief, practical explanation of MLA and MTP, including when to enable or disable them. Finally, we’ll compare multiple quantization approaches, including INT4 and NVFP4. And since benchmarks with “thinking” disabled are scarce, I ran my own tests in that mode as well.

