The Kaitchup – AI on a Budget

Quantizing Olmo 3: Most Efficient and Accurate Formats

GPTQ, FP8, NVFP4, SmoothQuant, and AWQ on Olmo 3 7B

Benjamin Marie
Dec 02, 2025

Olmo 3 is a standard 7B/32B decoder-only transformer. No MoE, no exotic attention, no flashy new architectural tricks. Most of the change is in the training pipeline: the data (Dolma 3), the mid-training mix (Dolmino), the long-context stage (Longmino), and the “thinking” stack (Dolci, with SFT, DPO, and RL).

In this article, I briefly go through what actually changed compared to Olmo 2, and what didn’t work.

Then I move on to quantization. I quantized Olmo 3 7B and 32B with several standard recipes that are practical to run on a single consumer GPU (see the sketch after the list):

  • gptq-w4a16-g128

  • fp8-dynamic

  • nvfp4

  • awq with custom mappings for Olmo 3

  • W8A8 (INT8) with SmoothQuant
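
For illustration, a minimal GPTQ W4A16 (group size 128) run with the llm-compressor library might look like the sketch below. The model ID, calibration dataset, and sample count are assumptions, not necessarily the settings used in the notebook, and import paths can differ slightly across llm-compressor versions.

```python
# Sketch: GPTQ W4A16 (group size 128) quantization with llm-compressor.
# The model ID and calibration settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "allenai/Olmo-3-7B"  # hypothetical Hub ID; check the actual model card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 4-bit weights, 16-bit activations; the W4A16 preset uses group size 128.
# The other formats in the list above swap in different modifiers
# (e.g., QuantizationModifier with scheme="FP8_DYNAMIC" or "NVFP4").
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot calibration, then save a vLLM-compatible compressed checkpoint.
oneshot(
    model=model,
    dataset="open_platypus",       # assumption: any small instruction set works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Olmo-3-7B-gptq-w4a16-g128", save_compressed=True)
tokenizer.save_pretrained("Olmo-3-7B-gptq-w4a16-g128")
```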

Finally, I look at how these variants behave on benchmarks: accuracy, pass@k, and token efficiency, plus some notes on hardware choices (RTX 5090 vs. RTX 6000) and what actually matters once you run long contexts with concurrent queries.
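
As a refresher, pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021): generate n ≥ k samples per task, count the c that pass, and estimate pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per task
    c: samples that pass
    k: attempt budget
    """
    if n - c < k:
        # Every size-k subset of the samples contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 generations, 12 correct -> estimated pass@8
print(round(pass_at_k(n=64, c=12, k=8), 3))
```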

I used the same quantization script I released last week:

Get the notebook (#190)

I released my quantized models here:

  • Quantized Olmo 3
