The Kaitchup – AI on a Budget

Mistral-NeMo: 4.1x Smaller with Quantized Minitron

How Pruning, Knowledge Distillation, and 4-Bit Quantization Can Make Advanced AI Models More Accessible and Cost-Effective

Benjamin Marie
Aug 26, 2024

NVIDIA's Minitron compresses large language models (LLMs) by pruning the least important weights, followed by retraining through knowledge distillation. This approach significantly reduces model sizes while preserving their accuracy.
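
As a rough illustration of the distillation step (a minimal logits-only sketch, not NVIDIA's exact training recipe, which may include additional loss terms), the pruned student can be trained to match the teacher's output distribution with a temperature-scaled KL-divergence loss:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Soften both distributions with a temperature, then minimize
    # KL(teacher || student) so the pruned student mimics the full teacher.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)
```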

NVIDIA released Minitron versions of Llama 3.1 and Mistral-NeMo, reducing their number of parameters from 8B to 4B and 12B to 8B, respectively.

Why is this important?

While Mistral-NeMo can’t run on a consumer GPU in its original precision, its Minitron version can: a 24 GB GPU is enough. However, the same could be achieved by quantizing Mistral-NeMo itself, since 4-bit quantization methods are now accurate enough.
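
For example, here is a generic way to load Mistral-NeMo with 4-bit bitsandbytes quantization so it fits on a 24 GB GPU (a sketch only; the model ID mistralai/Mistral-Nemo-Instruct-2407 is assumed, and this is not necessarily the quantization method evaluated later in this article):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization via bitsandbytes; the 12B weights then take roughly 7 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",  # assumed Hugging Face model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```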

But what if we could also quantize a Minitron model? Is quantization still accurate enough for a model that has been pruned with Minitron?

For instance, a 4-bit version of Mistral-NeMo-Minitron would run on an 8 GB GPU, significantly bringing down inference costs.
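
A quick back-of-the-envelope estimate supports this (assuming roughly 8B parameters and about 10% overhead for quantization scales and layers kept in higher precision):

```python
# Approximate weight memory for a 4-bit Mistral-NeMo-Minitron (~8B parameters).
num_params = 8e9
bits_per_param = 4
overhead = 1.10  # assumed ~10% for group-wise scales and non-quantized layers

weight_gb = num_params * bits_per_param / 8 / 1e9 * overhead
print(f"~{weight_gb:.1f} GB of weights")  # ~4.4 GB, leaving room for the KV cache on an 8 GB GPU
```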

In this article, I review the Minitron approach, exploring how to compress LLMs through pruning and knowledge distillation. We will then discuss quantizing these Minitron models to 4-bit precision using AutoRound. The last section presents the evaluation results.
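
As a minimal sketch of the quantization step (the calibration settings used in the notebook may differ; the model ID nvidia/Mistral-NeMo-Minitron-8B-Base is assumed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # assumed Hugging Face model ID
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights with group size 128, a common AutoRound configuration.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("Mistral-NeMo-Minitron-8B-4bit-autoround")
```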

The findings indicate that Minitron models are strong candidates for 4-bit quantization, with only minimal accuracy loss. Notably, the 4-bit Mistral-NeMo-Minitron outperforms Llama 3.1 8B while using 10.1 GB less memory, making it capable of running on a 12 GB GPU, or an 8 GB GPU for short sequences or with a quantized KV cache.

The quantization and evaluation of all the models discussed in this article are implemented in this notebook:

Get the notebook (#98)
