Unsloth's Quantization-Aware Training (QAT) vs Post-Training Quantization (PTQ) for Small Models
Can a tiny LLM stay accurate under quantization thanks to QAT?
Quantization is a common way to shrink large language models (LLMs). In practice, it’s a form of compression that reduces parameter precision, typically from 16-bit (BF16/FP16) to lower-precision formats like 8-bit or 4-bit. Most deployments apply it through post-training quantization (PTQ): the model is trained in full precision and its weights are quantized afterward, with no further training.
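To make this concrete, here is a minimal sketch of what PTQ looks like with Hugging Face Transformers and bitsandbytes: the pretrained weights are simply quantized to 4-bit NF4 at load time, with no extra training step. The model ID is just an illustration, and this is not necessarily the exact PTQ path used in the experiments below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-270m-it"  # example checkpoint; any causal LM works the same way

# 4-bit NF4 post-training quantization: the BF16 weights are quantized
# when the model is loaded, with no additional training involved.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```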
On very large models, PTQ often preserves downstream accuracy remarkably well. But on smaller models, with only a few billion parameters or even fewer than a billion, PTQ can cause substantial accuracy degradation.
An alternative is quantization-aware training (QAT), which simulates quantization during training so the model learns to be robust to its effects. QAT is usually expensive, and on bigger models I rarely find the gains worth the cost. For small models, though, it can make a difference without requiring much extra compute.
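The core trick behind QAT is fake quantization: in the forward pass, weights (and sometimes activations) are quantized and immediately dequantized, so the training loss already reflects the rounding error, while a straight-through estimator lets gradients flow through the non-differentiable rounding. Here is a minimal sketch of the idea in plain PyTorch; it is not Unsloth’s actual implementation, just an illustration of the mechanism.

```python
import torch

def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT4 fake quantization with a straight-through estimator."""
    qmax = 7  # INT4 range is [-8, 7]
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -8, 7)  # quantize
    w_dq = w_q * scale                                # dequantize
    # Straight-through estimator: the forward pass sees the quantized weights,
    # the backward pass treats round() as the identity.
    return w + (w_dq - w).detach()

# During QAT, each linear layer would apply fake_quantize_int4(layer.weight)
# in its forward pass, so the model learns weights that survive rounding.
```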
Unsloth now supports QAT, letting us train models to be quantization-aware while adapting them to our task and data. Thanks to Unsloth’s efficiency, this is probably the most affordable way to fine-tune a model that remains robust under quantization. In this article, I put Unsloth’s QAT to the test on a deliberately hard setting: English→French translation with a very small model, Gemma 3 270M. In earlier work, I had good success fine-tuning this model for translation, but as we’ll see, introducing quantization through PTQ can make things fragile. Can QAT limit the damage?
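For context, the setup looks roughly like the snippet below. The `qat_scheme` argument and its values ("int4", "int8-int4") are assumptions based on my reading of Unsloth’s QAT documentation and may not match the notebook exactly; treat this as a sketch rather than a copy-paste recipe.

```python
from unsloth import FastLanguageModel

# Sketch of a QAT full fine-tune with Unsloth; `qat_scheme` and its values
# ("int4", "int8-int4") are assumptions, not verified against the notebook.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # example checkpoint
    max_seq_length=2048,
    load_in_4bit=False,     # train in full precision; robustness comes from QAT
    full_finetuning=True,   # the model is small enough for full fine-tuning
    qat_scheme="int4",      # or "int8-int4" for the second scheme tested here
)
# The model is then trained as usual (e.g., with TRL's SFTTrainer), and the
# fake-quantized weights are converted to real low-bit weights for deployment.
```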
I evaluate two QAT schemes available in Unsloth for this setup, INT4 and INT8-INT4, comparing their final accuracy and training cost against PTQ. I use full fine-tuning (not LoRA), since the model is already quite small.
Here’s the notebook I used to run these Unsloth QAT experiments:

