Hi Everyone,
In this edition of The Weekly Kaitchup:
Gemma 3: Quantization-Aware Training
Dream 7B: A Solid Diffusion-Based LLM at Last?
Gemma 3: Quantization-Aware Training
Gemma 3 by Google is currently one of the top-performing LLMs. It delivers strong results across a wide range of language and vision tasks. In this article, we explored how to fine-tune Gemma 3 effectively for your own applications:
Google has published the technical report describing Gemma 3 here:
The technical report referenced "Quantization Aware Training" (QAT) versions of the models, but they weren’t included in the initial release of Gemma 3. Google released these QAT models this week, making them officially available:
Google has only released GGUF versions of the QAT models. Hopefully, safetensors versions will also be released. According to the technical report, these were produced by “fine-tuning” the Gemma 3 checkpoints, likely by simulating 4-bit quantization. They don’t go into much detail about the QAT method itself, only noting that they used standard quantization schemes from llama.cpp, such as per-channel int4, per-block int4, and switched FP8.
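The report doesn’t spell out the recipe, but the usual way to implement QAT is to fake-quantize the weights in the forward pass while keeping them in full precision for the update, using a straight-through estimator so gradients can flow through the rounding. Here is a minimal sketch of that idea with symmetric per-block int4; the block size and scheme are illustrative assumptions, not Google’s exact setup:

```python
import torch

def fake_quant_int4(w: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Simulate per-block int4 quantization: quantize then dequantize,
    so the forward pass sees the rounding error while the stored weights stay in float.
    Assumes w.numel() is divisible by block_size."""
    orig_shape = w.shape
    w_blocks = w.reshape(-1, block_size)
    # symmetric int4: integer levels in [-8, 7]
    scale = w_blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (w_blocks / scale).round().clamp(-8, 7)
    w_dq = (q * scale).reshape(orig_shape)
    # straight-through estimator: gradients behave as if no rounding happened
    return w + (w_dq - w).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer that fake-quantizes its weights during fine-tuning."""
    def forward(self, x):
        return torch.nn.functional.linear(x, fake_quant_int4(self.weight), self.bias)
```

Training with layers like this lets the weights adapt to the quantization error, which is the whole point of QAT over quantizing after the fact.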
How well do these models perform?
We don’t really know. There are no evaluation results in the technical report (if they are there, I can’t find them!), and the model card doesn’t mention any QAT benchmarks either. My best guess is that performance is very close to the original bfloat16 models.
What I’m really curious about is whether these QAT models are noticeably better than just applying post-training quantization (PTQ) to the original bfloat16 checkpoints using llama.cpp.
Honestly, I’m skeptical. PTQ models are already very close to their full-precision counterparts, so even if QAT improves quantization accuracy, the gain can’t be large.
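For contrast, PTQ applies the same kind of rounding once, after training is finished, with no further weight updates. A minimal sketch, reusing the fake_quant_int4 helper from the sketch above:

```python
import torch

@torch.no_grad()
def ptq_int4(model: torch.nn.Module, block_size: int = 32):
    """Post-training quantization: round the weights once, no gradient updates.
    QAT instead keeps training so the weights can adapt to the rounding error."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight.copy_(fake_quant_int4(module.weight, block_size))
```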
I plan to dig into this more. If I find anything interesting, I’ll definitely write it up.
In the meantime, if you want a detailed explanation of how PTQ and QAT work, check out Chapter 3 of my book, LLMs on a Budget, which I published last week:
Dream 7B: A Solid Diffusion-Based LLM at Last?
Diffusion models generate images by gradually transforming random noise into coherent visuals over many iterative steps. Starting from pure noise, the model learns to denoise based on patterns it has seen during training, eventually producing detailed, realistic images. This approach remains a core technique in state-of-the-art image generation. However, applying diffusion to text generation has been much more challenging, with most attempts falling well short of the performance seen in standard autoregressive transformers, which generate text token by token in sequence.
That’s why Dream 7B is particularly interesting. It might represent a real breakthrough. Unlike standard LLMs, Dream 7B uses a diffusion-based approach but still manages to match the performance of strong autoregressive models of the same size, such as Qwen2.5 7B (which Dream 7B is based on, as explained below).
The authors even found that Dream 7B excels at planning tasks, such as Sudoku solving and trip planning, where it significantly outperforms other models of similar size.
Dream 7B uses a masked diffusion setup and was pretrained on a large, diverse dataset covering text, math, and code. Most of the data came from Dolma v1.7, OpenCoder, and DCLM-Baseline, totaling 580 billion tokens. Training ran for 256 hours on 96 NVIDIA H800 GPUs.
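The paper’s exact objective isn’t reproduced here, but masked-diffusion pretraining usually boils down to: sample a noise level, mask that fraction of the tokens, and train the model to recover the masked positions. A minimal sketch under that assumption (the model interface returning per-position logits is assumed):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    """Toy masked-diffusion training step.
    tokens: (batch, seq) token ids; model(x) is assumed to return (batch, seq, vocab) logits."""
    batch, seq = tokens.shape
    t = torch.rand(batch, 1)                         # noise level per sequence in (0, 1)
    mask = torch.rand(batch, seq) < t                # mask roughly a fraction t of the tokens
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)
    # cross-entropy only on the masked positions
    return F.cross_entropy(logits[mask], tokens[mask])
```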
Rather than training from scratch, they initialized Dream 7B with weights from an existing autoregressive model: Qwen2.5 7B. This gave the model a strong head start. There was an expected bump in loss when shifting from causal to full attention, but overall performance benefited from this reuse. One critical detail was tuning the learning rate: too high, and it wiped out valuable left-to-right knowledge from the AR model; too low, and it slowed down diffusion learning.
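To picture the attention switch they describe: an autoregressive model restricts each token to its left context with a causal mask, while the masked-diffusion setup lets every position attend to the whole sequence. A toy illustration with random tensors, not Dream’s actual code:

```python
import torch
import torch.nn.functional as F

seq_len, dim = 6, 8
q = k = v = torch.randn(1, 1, seq_len, dim)

# autoregressive (causal): each token attends only to itself and earlier tokens
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
out_causal = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)

# masked-diffusion (full): every token attends to the whole sequence
out_full = F.scaled_dot_product_attention(q, k, v)  # no mask = full attention
```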
Another clever improvement is context-adaptive token-level noise rescheduling. In simple terms, instead of applying the same noise level across an entire sentence, as standard diffusion methods do, Dream assigns a custom noise level to each masked token based on how much context surrounds it. This allows the model to focus more precisely on harder-to-predict tokens, making learning more efficient.
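Dream’s actual rescheduling formula isn’t reproduced here; this is only a toy illustration of the intuition, where a masked token surrounded by more unmasked neighbors gets a lower effective noise level (the window size and linear scaling are made-up choices):

```python
import torch

def per_token_noise(mask: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Toy per-token noise levels (not Dream's formula).
    mask: 1-D bool tensor, True where the token is masked."""
    unmasked = (~mask).float()
    # count unmasked tokens in a local window around each position
    kernel = torch.ones(1, 1, 2 * window + 1)
    context = torch.nn.functional.conv1d(
        unmasked[None, None, :], kernel, padding=window
    ).squeeze()
    # more surrounding context -> effectively less noise for that token
    noise = 1.0 - context / (2 * window + 1)
    return torch.where(mask, noise, torch.zeros_like(noise))

mask = torch.tensor([False, True, True, False, True, False])
print(per_token_noise(mask))  # masked tokens with more neighbors get lower noise
```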
I also really liked their demo video showing how the model generates text. It’s intuitive: the model often lays out the “easy” tokens first, like the "=" sign in the example below, which shows that it knows the overall structure of the answer early on. The remaining steps are more about choosing the right words to fill in that structure. It’s a very different (and pretty compelling) way to think about language generation.
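A generic way to reproduce that behavior is confidence-based unmasking: start from a fully masked sequence and, at each step, commit only the predictions the model is most sure about. The loop below is a toy sketch under that assumption, not Dream’s released decoding code (the model interface returning per-position logits is assumed):

```python
import torch

@torch.no_grad()
def diffusion_decode(model, length: int, steps: int, mask_id: int) -> torch.Tensor:
    """Toy decoding loop: unmask the most confident positions first.
    model(tokens) is assumed to return logits of shape (length, vocab)."""
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    per_step = max(1, length // steps)
    for _ in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # confidence and prediction per position
        conf[~masked] = -1.0                    # never overwrite committed tokens
        # the "easy" tokens (highest confidence) get committed early
        idx = conf.topk(min(per_step, int(masked.sum()))).indices
        tokens[idx] = pred[idx]
    return tokens
```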
It also illustrates the model’s planning capabilities very well.
The models haven’t been released yet, but the authors say they will be available here:
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
In The Weekly Salt, I reviewed:
⭐A Refined Analysis of Massive Activations in LLMs
Expanding RL with Verifiable Rewards Across Diverse Domains
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
Gemma 3 Technical Report
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!