Hi Everyone,
In this edition of The Weekly Kaitchup:
MoE-Quant: Quantize DeepSeek-V3/R1-Sized Models within 2 Hours
Llama-3.1-Nemotron-Ultra-253B: When a Llama 3.1 Model Outperforms Llama 4
Unpacking the Underwhelming Launch of Llama 4
MoE-Quant: Quantize DeepSeek-V3/R1-Sized Models within 2 Hours
IST-DASLab, the group behind the very good AQLM quantization method, has introduced new, highly efficient quantization techniques for MoE models like DeepSeek-R1.
The method used in MoE-Quant appears to be based on the standard GPTQ algorithm, as implemented in tools like AutoGPTQ or GPTQModel. The real challenge lies not in the algorithm itself but in scaling the quantization process to very large models, which would otherwise require days of GPU time.
MoE-Quant uses:
Fast Triton Kernel for GPTQ: Quantizing ~45,000 tensors is computationally heavy, so a custom Triton implementation is used to accelerate GPTQ, achieving around 10x speedup over the standard PyTorch implementation.
Expert Parallelism: MLP expert layers are sharded across devices to fit the required Hessians into GPU memory for GPTQ calibration. Each process handles only part of the expert layers and their corresponding Hessians.
Data Parallelism: Calibration data is split uniformly across processes to speed up forward passes during calibration.
With these optimizations, quantizing the DeepSeek-V3/R1 model takes just 2 hours on a machine with 8x H100 GPUs using 512 calibration sequences of length 4096.
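To make the calibration workflow more concrete, here is a minimal sketch of per-expert Hessian accumulation with the MLP experts sharded across ranks. This is not MoE-Quant's code: the round-robin sharding, the function names, and the input layout are my own simplifications.

```python
# Minimal sketch (not MoE-Quant's implementation): per-expert Hessian accumulation
# for GPTQ calibration, with experts sharded across ranks so each GPU only stores
# the Hessians of the experts it owns.
import torch

def experts_for_rank(num_experts: int, rank: int, world_size: int) -> list[int]:
    # Hypothetical round-robin assignment of experts to processes.
    return [e for e in range(num_experts) if e % world_size == rank]

def accumulate_hessian(H: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # GPTQ calibration accumulates H = 2 * X^T X over the activations
    # that reach a given linear layer (x: [num_tokens, hidden_dim]).
    x = x.float()
    return H + 2.0 * x.T @ x

def calibrate_rank(expert_activations, num_experts, hidden_dim, rank, world_size):
    """expert_activations[e]: batches of tokens routed to expert e, assumed already
    gathered to the rank that owns it (which is what expert parallelism provides)."""
    owned = experts_for_rank(num_experts, rank, world_size)
    hessians = {e: torch.zeros(hidden_dim, hidden_dim) for e in owned}
    for e in owned:
        for x in expert_activations[e]:
            hessians[e] = accumulate_hessian(hessians[e], x)
    # Each rank then runs the GPTQ weight update only for its own experts,
    # so the Hessians of all ~45,000 tensors never have to fit on one GPU.
    return hessians
```

The actual repository goes further, with a custom Triton kernel for the GPTQ updates and data-parallel splitting of the calibration batches; the sketch only shows why sharding the experts keeps the Hessians within GPU memory.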
It is available here (unknown license):
GitHub: IST-DASLab/MoE-Quant
Llama-3.1-Nemotron-Ultra-253B: When a Llama 3.1 Model Outperforms Llama 4
NVIDIA is very active in training new open LLMs. The Nemotron series has several members based on Llama 3: 8B, 49B, and now 253B versions, all with commercial use permitted.
The Ultra version offers significantly better performance than Llama 4 Maverick (an MoE with roughly 400B total parameters), while being smaller overall. Unlike MoE models, it is a dense model, meaning all of its parameters are active during inference.

The model is a decoder-only Transformer based on Llama-3.1-405B-Instruct. It was customized using Neural Architecture Search (NAS), which introduces a mix of non-standard block designs. These include skipping attention layers altogether, using variable expansion ratios in the feed-forward (FFN) layers, and fusing multiple FFNs when several attention blocks are skipped in a row. The result is a more diverse architecture that balances performance with resource efficiency.
Each block in the original model was replaced with multiple candidate versions offering different tradeoffs between computational cost and output quality. NAS was used to search across these to meet specific memory and throughput targets. To maintain model performance after these changes, the training process included a knowledge distillation phase over 65 billion tokens, followed by continual pretraining on another 88 billion tokens.
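As a toy illustration of what such block variants could look like, here is a short sketch with an optional attention skip and a per-block FFN expansion factor. The class and its parameters are hypothetical and deliberately simplified (standard LayerNorm and GELU rather than Llama's RMSNorm and SwiGLU); it only mirrors the idea described above, not NVIDIA's actual code.

```python
# Toy illustration (not NVIDIA's code) of the kind of block variants NAS can choose from:
# attention may be skipped entirely, and the FFN expansion factor can vary per block.
import torch
import torch.nn as nn

class VariableBlock(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int,
                 use_attention: bool = True, ffn_expansion: float = 4.0):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.attn_norm = nn.LayerNorm(hidden_dim)
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        inner = int(hidden_dim * ffn_expansion)   # variable FFN width per block
        self.ffn_norm = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, inner), nn.GELU(), nn.Linear(inner, hidden_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_attention:
            h = self.attn_norm(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))

# A NAS search would pick one variant per depth, e.g. skipping attention in some
# blocks and widening or narrowing the FFN elsewhere to hit a memory/throughput target.
blocks = nn.ModuleList([
    VariableBlock(1024, 16, use_attention=(i % 3 != 2), ffn_expansion=[2.0, 4.0][i % 2])
    for i in range(6)
])
```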
As with the other models in this series, reasoning can be turned off.
The model is fully supported by the main inference frameworks such as vLLM and Transformers.
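For readers who want to try it, a minimal Transformers snippet could look like the following. The model ID and the "detailed thinking off" system prompt reflect my reading of the model card, so verify them on the Hugging Face page before running anything; a 253B dense model also requires a multi-GPU node, not a consumer GPU.

```python
# Hedged example: loading the model with Transformers and disabling reasoning
# via the system prompt. The model ID and the prompt convention are taken from
# my reading of NVIDIA's model card; double-check them before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "detailed thinking off"},  # assumed reasoning toggle
    {"role": "user", "content": "Summarize GPTQ in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```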
Unpacking the Underwhelming Launch of Llama 4
“Llama” is undeniably one of the most recognized and influential brands in the LLM space. I can personally attest to this: my articles about Llama consistently outperform those on other models in terms of readership and engagement. They clearly generate a lot of interest and curiosity, and many of the Google Search keywords that lead readers to my articles contain the term “llama”.
From my perspective, the Llama models have always struck a good balance: they’re relatively easy to use, perform well, and are compact enough to be practical for many use cases, including resource-constrained deployments. Naturally, I was quite excited for the release of Llama 4.
But I have to admit, this release didn’t quite meet my expectations.
The timing of the release on a Saturday, the exclusive focus on large MoE models, and the lack of architectural updates or a technical report left me puzzled. The performance so far appears underwhelming, and I was hoping for more transparency and technical depth. It also raises questions: was the goal simply to appear competitive on benchmarks, or was this release rushed for other reasons?
It’s possible that internal challenges at Meta played a role. I can imagine the frustration of investing massive time and compute into model development, only to be caught in organizational bottlenecks such as legal reviews, IP concerns, and compliance issues (see their struggles with the EU), while smaller teams like the one behind DeepSeek move faster and manage to ship high-performing models. Meanwhile, the pressure builds.
The relatively quiet launch of Llama 4 suggests that Meta may be in a holding pattern, perhaps reevaluating its strategy. Compared to the energy around Llama 3, this release felt notably subdued. The absence of a technical paper reinforces that feeling, and the recent leadership changes within Meta’s research team might indicate broader shifts behind the scenes.
It’s clear that Meta has the resources to match or exceed offerings like DeepSeek R1.
With the right priorities, they could have easily produced a model of comparable size and quality to buy some time and maintain momentum. A Llama model, even slightly smaller but just as strong, would have generated excitement.
I’ve spent a few days working with the Llama 4 models, and unless there are significant implementation issues on the framework side (which is possible, e.g., FlashAttention-2 issues), the current performance does not seem to reflect what Meta is truly capable of. If no further updates or improvements follow soon, these models may struggle to maintain attention, especially with promising alternatives like Qwen3 on the horizon.
While Llama 4 may be underwhelming in its current form, it has already sparked significant effort from the open-source community to optimize and enhance its performance. I’m still planning to write about the Llama 4 models as soon as we see meaningful improvements or noteworthy developments that better showcase their potential.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I reviewed in The Weekly Salt:
⭐SmolVLM: Redefining small and efficient multimodal models
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Multi-Token Attention
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions, or 30% for Pro subscribers):
Have a nice weekend!