Mistral-NeMo: 4.1x Smaller with Quantized Minitron
How Pruning, Knowledge Distillation, and 4-Bit Quantization Can Make Advanced AI Models More Accessible and Cost-Effective
NVIDIA's Minitron compresses large language models (LLMs) by pruning their least important components, such as layers, attention heads, and embedding channels, and then retraining the smaller model through knowledge distillation. This approach significantly reduces model size while largely preserving accuracy.
NVIDIA released Minitron versions of Llama 3.1 and Mistral-NeMo, reducing their number of parameters from 8B to 4B and 12B to 8B, respectively.
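To make the retraining step more concrete, here is a minimal sketch of a logit-based knowledge distillation loss in PyTorch: the pruned student is trained to reproduce the teacher's output distribution. This illustrates the general technique rather than NVIDIA's exact training recipe; the function name and the temperature parameter are placeholders.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Forward KL divergence: the student learns to match the teacher's
    # token probabilities instead of only the one-hot training labels.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2
```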
Why is this important?
While Mistral-NeMo (12B) can’t run on a consumer GPU in 16-bit precision, its Minitron version can: a 24 GB GPU is enough. However, the same result could be achieved by simply quantizing Mistral-NeMo, since 4-bit quantization methods are now accurate enough.
But what if we could also quantize a Minitron model? Is quantization still accurate enough for a model that has been pruned with Minitron?
For instance, a 4-bit version of Mistral-NeMo-Minitron would run on an 8 GB GPU, significantly bringing down inference costs.
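As a rough sanity check on these memory figures, the back-of-the-envelope arithmetic below counts only the weights; the KV cache, activations, and quantization metadata (scales and zero points) add a few more GB on top.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    # Memory occupied by the weights alone, in GiB.
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(f"Mistral-NeMo 12B, 16-bit: {weight_memory_gb(12, 16):.1f} GB")  # ~22.4 GB
print(f"Minitron 8B, 16-bit:      {weight_memory_gb(8, 16):.1f} GB")   # ~14.9 GB
print(f"Minitron 8B, 4-bit:       {weight_memory_gb(8, 4):.1f} GB")    # ~3.7 GB
```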
In this article, I review the Minitron approach, explaining how LLMs can be compressed through pruning and knowledge distillation. I then show how to quantize these Minitron models to 4-bit precision with AutoRound. The last section presents the evaluation results.
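As a preview of the quantization step, the sketch below shows how a Minitron model could be quantized to 4-bit with AutoRound. The model ID, output directory, and hyperparameters (group size, symmetric quantization) are illustrative choices, and the exact AutoRound arguments may differ slightly between library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # example model ID
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weight-only quantization with a group size of 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Save the quantized checkpoint for later inference.
autoround.save_quantized("Mistral-NeMo-Minitron-8B-AutoRound-4bit")
```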
The findings indicate that Minitron models are strong candidates for 4-bit quantization, with only minimal accuracy loss. Notably, the 4-bit Mistral-NeMo-Minitron outperforms Llama 3.1 8B while using 10.1 GB less memory, making it capable of running on a 12 GB GPU, or an 8 GB GPU for short sequences or with a quantized KV cache.
The quantization and evaluation of all the models discussed in this article are implemented in this notebook: