Mistral Small 3: An Excellent 24B-Parameter Wide-Shallow LLM
Fine-tuning, quantization, and evaluation
For large language models (LLMs), depth is often the path to better performance: adding more layers typically yields larger gains than widening the hidden and intermediate dimensions.
However, deeper models come at a cost: slower inference. With Mistral Small 3, Mistral AI takes a different approach, favoring a wider model with a significantly larger intermediate size while keeping a layer count comparable to that of LLMs with half as many total parameters.
Despite this unconventional architecture, Small 3 performs on par with larger models such as Qwen2.5 32B on most benchmarks and comes close to Llama 3.3 70B.
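To see this width-versus-depth tradeoff concretely, you can inspect the model configurations published on the Hugging Face Hub. The following is a minimal sketch; the model IDs and the choice of comparison model are my assumptions, so verify the exact checkpoint names on the Hub.

```python
from transformers import AutoConfig

# Assumed Hugging Face model IDs; check the exact names on the Hub.
model_ids = [
    "mistralai/Mistral-Small-24B-Base-2501",  # Mistral Small 3 (base)
    "Qwen/Qwen2.5-32B",                       # a deeper model of comparable scale
]

for model_id in model_ids:
    config = AutoConfig.from_pretrained(model_id)
    # A "wide-shallow" design shows up as large hidden/intermediate sizes
    # relative to the number of layers.
    print(
        f"{model_id}: layers={config.num_hidden_layers}, "
        f"hidden={config.hidden_size}, intermediate={config.intermediate_size}"
    )
```

If the published descriptions are accurate, Small 3 should report noticeably fewer layers than Qwen2.5 32B while having a larger intermediate size.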
What’s in this article?
A deeper look at Mistral Small 3’s architecture.
How to use both the base and instruct models with Mistral AI’s recommended hyperparameters and prompts (see the sketch after this list).
Whether the model, being unusually wide, holds up under quantization: I reduce it to 4-bit and test its accuracy on IFEval and MMLU-PRO.
How to fine-tune the model.
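As a preview of the first two points, here is a minimal sketch that loads the instruct model with on-the-fly 4-bit (NF4) quantization via bitsandbytes and generates with a low sampling temperature. The model ID is an assumption to verify on the Hub, and the temperature of 0.15 follows what Mistral AI reportedly recommends for Small 3; check the model card before relying on either.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"  # assumed model ID

# On-the-fly 4-bit NF4 quantization with bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize the pros and cons of wide-shallow transformers."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# temperature=0.15 is Mistral AI's reported recommendation for Small 3;
# verify it in the official model card.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.15)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```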
I’ve prepared two notebooks for this article:
Quantization and Evaluation: Demonstrating the 4-bit quantization process and accuracy results.
Fine-Tuning Mistral Small 3: Covering full fine-tuning, LoRA, and QLoRA.
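As a taste of what the fine-tuning notebook covers, here is a minimal LoRA sketch using TRL’s SFTTrainer with PEFT. The model ID, dataset, and hyperparameters are illustrative assumptions, not the notebook’s exact settings.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "mistralai/Mistral-Small-24B-Base-2501"  # assumed model ID

# LoRA on all attention and MLP projections of the Mistral architecture.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Any instruction/chat dataset works; this one is just an example.
train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Note: a 24B model needs substantial GPU memory even with LoRA;
# multi-GPU or quantized (QLoRA) loading may be required.
trainer = SFTTrainer(
    model=model_id,
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="mistral-small-3-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_seq_length=1024,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
```

For QLoRA, the same setup applies with the base model loaded in 4-bit (for instance via the BitsAndBytesConfig shown earlier); full fine-tuning drops peft_config entirely but requires far more GPU memory.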