MiniMax M3 GGUF Quantization: From 852 GB to ~150 GB Without Breaking Accuracy
Benchmarks, token efficiency, and tensor-level analysis of low-bit M3 GGUFs.
MiniMax M3 is an excellent model, but with roughly 428B parameters, running it locally is very challenging. In BF16, the weights alone require around 852 GB of memory, so an unquantized setup realistically needs a large multi-GPU server, for example an 8×H200 machine.
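The memory figure above is simple arithmetic: parameter count times bytes per weight. Here is a minimal sketch of that estimate. It uses the rounded "428B" figure, so it lands a few GB off the quoted 852 GB. The 141 GB per H200 is my assumption about the GPU's memory size, and the helper name `weight_memory_gb` is purely illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 10**9 bytes), ignoring KV cache and activations."""
    return num_params * bits_per_weight / 8 / 1e9


# "Roughly 428B" is the rounded figure from the text, not an exact parameter count.
approx_params = 428e9
bf16_gb = weight_memory_gb(approx_params, 16)

# Assumption: 141 GB of memory per H200, so an 8-GPU node has 8 * 141 GB in total.
node_gb = 8 * 141

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"8xH200 memory: {node_gb} GB, headroom: ~{node_gb - bf16_gb:.0f} GB")
```

The headroom matters because the KV cache and activations also need memory, which this estimate deliberately leaves out.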
Quantization can dramatically reduce these requirements. However, as we saw in previous evaluations, MiniMax M2.5 degraded heavily once quantized, even at 4-bit. The natural question is whether M3 is more robust.
As we will see in this article, the answer is yes: M3 is much easier to quantize. My hypothesis is that this robustness comes from M3’s shared-expert MoE design, which M2.5 lacked. The shared expert provides a path that can be kept at high precision during quantization, while the routed experts, which account for about 97% of M3’s parameters, can be compressed much more aggressively.
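To see why protecting the non-routed tensors is cheap, consider the parameter-weighted average precision of a two-group scheme. The sketch below uses the 97% routed-expert share from the text. The bit widths are hypothetical values picked for illustration, not the settings of any GGUF evaluated here.

```python
def average_bits_per_weight(routed_fraction: float, routed_bits: float, other_bits: float) -> float:
    """Parameter-weighted average precision for a two-group mixed-precision scheme."""
    if not 0.0 <= routed_fraction <= 1.0:
        raise ValueError("routed_fraction must be between 0 and 1")
    return routed_fraction * routed_bits + (1.0 - routed_fraction) * other_bits


routed_fraction = 0.97  # from the text: routed experts are about 97% of the parameters

# Hypothetical bit widths, chosen only to illustrate the effect.
for routed_bits, other_bits in [(2.5, 8.0), (2.5, 16.0), (4.5, 8.0)]:
    avg = average_bits_per_weight(routed_fraction, routed_bits, other_bits)
    print(f"routed={routed_bits} bpw, rest={other_bits} bpw -> average {avg:.2f} bpw")
```

Under these assumptions, raising the non-routed 3% from 8 to 16 bits adds only about 0.24 bits per weight to the average, while the routed-expert precision dominates the final size.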
In this article, I evaluate several low-bit MiniMax M3 GGUFs, including Unsloth’s UD GGUFs and my own MoQ quantization.
The main result is that M3 can be compressed from 852 GB to around 150 GB while preserving most of its accuracy. I then analyze the tensor-level quantization choices behind each GGUF to explain why some variants remain strong, and why the smallest ones start to break.
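For a sense of scale, the two sizes above can be converted into a compression ratio and an approximate average precision. This is a rough sketch that treats the whole 852 GB as 16-bit weights and ignores GGUF metadata and any tensors stored in other formats, so the result is an estimate, not a measured value.

```python
def effective_bits_per_weight(original_gb: float, quantized_gb: float, original_bits: float = 16.0) -> float:
    """Approximate average bits per weight implied by a file-size reduction."""
    if original_gb <= 0 or quantized_gb <= 0:
        raise ValueError("sizes must be positive")
    return original_bits * quantized_gb / original_gb


bf16_gb = 852.0    # BF16 size quoted in the text
target_gb = 150.0  # approximate quantized size quoted in the text

ratio = bf16_gb / target_gb
bpw = effective_bits_per_weight(bf16_gb, target_gb)

print(f"compression ratio: {ratio:.2f}x")
print(f"effective precision: ~{bpw:.2f} bits per weight")
```

In other words, a ~150 GB file corresponds to an average of a little under 3 bits per weight across the whole model.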
Acknowledgments
This article would not have been possible without the compute sponsorship generously provided by Verda, whose B300 GPUs I used throughout this work.
Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.
You can check them out here.
Results: Which GGUF Can You Safely Use?
First, a note on the benchmarks I ran. For my GGUF evaluations, I usually use large subsets of MMLU Pro for world knowledge, Math 500 for math questions ranging from easy to difficult, LiveCodeBench for challenging coding tasks, and the full GPQA Diamond benchmark for difficult science questions.
However, running M3 with llama.cpp is very costly, especially when evaluating low-bit versions, which can generate many more tokens than the original model to answer the same questions. Running only 100 LiveCodeBench samples was already too expensive for this evaluation, at least $200 for a single GGUF evaluation, so I did not include this benchmark.
How should you read the following results?


