Mistral Small 3: An Excellent 24B-Parameter Wide-Shallow LLM
Fine-tuning, quantization, and evaluation
For large language models (LLMs), depth is often the path to better performance: adding more layers typically yields larger gains than widening the hidden and intermediate dimensions.
However, deeper models come at a cost: slower inference. With Mistral Small 3, Mistral AI takes a different approach, favoring a wider model with a significantly larger intermediate size while keeping a layer count comparable to that of LLMs with half as many total parameters.
Despite this unconventional architecture, Small 3 performs on par with larger models such as Qwen2.5 32B on most benchmarks and comes close to Llama 3.3 70B.
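To see this width-versus-depth tradeoff concretely, you can inspect the model configurations published on the Hugging Face Hub. The following is a minimal sketch; the model IDs and the choice of comparison model are my assumptions, so verify the exact checkpoint names on the Hub.

```python
from transformers import AutoConfig

# Assumed Hugging Face model IDs; check the exact names on the Hub.
model_ids = [
    "mistralai/Mistral-Small-24B-Base-2501",  # Mistral Small 3 (base)
    "Qwen/Qwen2.5-32B",                       # a deeper model of comparable scale
]

for model_id in model_ids:
    config = AutoConfig.from_pretrained(model_id)
    # A "wide-shallow" design shows up as large hidden/intermediate sizes
    # relative to the number of layers.
    print(
        f"{model_id}: layers={config.num_hidden_layers}, "
        f"hidden={config.hidden_size}, intermediate={config.intermediate_size}"
    )
```

If the published descriptions are accurate, Small 3 should report noticeably fewer layers than Qwen2.5 32B while having a larger intermediate size.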
What’s in this article?
A deeper look at Mistral Small 3’s architecture.
How to use both the base and instruct models with Mistral AI’s recommended hyperparameters and prompts (see the sketch after this list).
Whether the model, being unusually wide, holds up under quantization: I reduce it to 4-bit and test its accuracy on IFEval and MMLU-PRO.
How to fine-tune the model.
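As a preview of the first two points, here is a minimal sketch that loads the instruct model with on-the-fly 4-bit (NF4) quantization via bitsandbytes and generates with a low sampling temperature. The model ID is an assumption to verify on the Hub, and the temperature of 0.15 follows what Mistral AI reportedly recommends for Small 3; check the model card before relying on either.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"  # assumed model ID

# On-the-fly 4-bit NF4 quantization with bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize the pros and cons of wide-shallow transformers."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# temperature=0.15 is Mistral AI's reported recommendation for Small 3;
# verify it in the official model card.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.15)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```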
I’ve prepared two notebooks for this article:
Quantization and Evaluation: Demonstrating the 4-bit quantization process and accuracy results.
Fine-Tuning Mistral Small 3: Covering full fine-tuning, LoRA, and QLoRA.
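As a taste of what the fine-tuning notebook covers, here is a minimal LoRA sketch using TRL’s SFTTrainer with PEFT. The model ID, dataset, and hyperparameters are illustrative assumptions, not the notebook’s exact settings.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "mistralai/Mistral-Small-24B-Base-2501"  # assumed model ID

# LoRA on all attention and MLP projections of the Mistral architecture.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Any instruction/chat dataset works; this one is just an example.
train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Note: a 24B model needs substantial GPU memory even with LoRA;
# multi-GPU or quantized (QLoRA) loading may be required.
trainer = SFTTrainer(
    model=model_id,
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="mistral-small-3-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_seq_length=1024,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
```

For QLoRA, the same setup applies with the base model loaded in 4-bit (for instance via the BitsAndBytesConfig shown earlier); full fine-tuning drops peft_config entirely but requires far more GPU memory.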