2026 Predictions: Much Faster Inference, Pre-Training with RL, and FP4 Everywhere
The Weekly Kaitchup #124
Hi everyone,
Wishing you a happy New Year and all the best for 2026! 🎉
Before we jump into predictions, here’s a quick recap of what stood out to me in 2025.
Biggest Surprise: DeepSeek (V3 → R1) and the “Reasoning” Playbook
DeepSeek-V3: MoE but Engineered for Efficiency
DeepSeek-V3’s technical report (yes, technically from 2024, published on December 27th, but in practice most people discovered and discussed it in 2025) is worth remembering because it introduced a set of ideas that many later models adopted. DeepSeek-V3 is “efficiently big”: 671B total parameters (685B counting the multi-token-prediction module), about 37B activated per token, trained on 14.8T tokens. Since then, several other very large models have followed a similar recipe with sparse MoE, for example GLM, MiniMax, or Kimi, largely coming out of China. NVIDIA also appears to be moving in a comparable direction, with very large Nemotron 3 MoE checkpoints expected over the coming months.
Open-weight models now perform close to commercial APIs, lagging only a few months behind in accuracy. Yet running them locally remains out of reach for most of us because of their size.
DeepSeek-R1: RLVR + GRPO goes mainstream
On Jan 20, 2025, DeepSeek released DeepSeek-R1, and later made the license MIT, explicitly allowing the community to reuse model weights and even API outputs for fine-tuning/distillation. That single policy choice poured gasoline on the ecosystem.
Technically, R1 is strongly associated with Reinforcement Learning with Verifiable Rewards (RLVR) and GRPO as the post-training workhorse.
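To make those two ingredients concrete, here is a toy sketch (my own illustration, not DeepSeek’s actual code): a “verifiable” reward that checks a final boxed answer against a reference, and GRPO’s group-relative advantage, which normalizes each completion’s reward against the other completions sampled for the same prompt instead of relying on a learned value model.

```python
import re
import torch

def verifiable_reward(completion: str, gold_answer: str) -> float:
    # Toy verifiable reward: 1.0 if the final \boxed{...} answer matches the
    # reference exactly, 0.0 otherwise. Real verifiers are more forgiving.
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    return 1.0 if match and match.group(1).strip() == gold_answer else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO's core trick: no value network. Each prompt gets a *group* of
    # sampled completions, and each completion's advantage is its reward
    # normalized by the group's mean and standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, four sampled completions, checked against the gold answer "42".
completions = ["... \\boxed{42}", "... \\boxed{41}", "no final answer", "... \\boxed{42}"]
rewards = torch.tensor([verifiable_reward(c, "42") for c in completions])
print(grpo_advantages(rewards))  # roughly tensor([ 0.87, -0.87, -0.87,  0.87])
```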
And yes, practically speaking, a huge amount of 2025 “small reasoners” energy was downstream of this release: distillation, synthetic traces, rewardable datasets, and tooling.
People also found that a little supervision, through simple supervised fine-tuning (SFT) on reasoning traces, can already teach a model to “reason”.
Tooling ripple effects
Two things that introduced me to RLVR:
Hugging Face TRL, and notably Unsloth, gained real, usable GRPO support (after some early growing pains); today it’s a first-class training path (a minimal TRL sketch follows below).
AI2 baked RLVR + GRPO directly into the OLMo 2 post-training recipe, which helped normalize the idea that “verifiable rewards + online RL” works.
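If you want to see what that training path looks like today, here is roughly the minimal GRPO setup with TRL, adapted from its documented quickstart. The toy prompts, the length-based reward, and the small Qwen checkpoint are placeholders; a real RLVR run would plug in a verifier-based reward like the one sketched earlier.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt-only dataset; a real run would use verifiable problems.
dataset = Dataset.from_dict({"prompt": ["Explain why the sky is blue."] * 64})

# Placeholder reward: prefer completions close to 200 characters.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

args = GRPOConfig(output_dir="grpo-demo", num_generations=4, max_completion_length=256)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example checkpoint, pick one that fits your GPU
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```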
My take: RLVR/GRPO is powerful, but the current stack is still brittle (reward definition, verifier quality, distribution shift, infra sensitivity). It works… until it doesn’t. I’m still skeptical about its long-term ergonomics in its current form. That said, we’re nowhere near the ceiling of what RLVR approaches can do.
The expected: FP4 (and “native 4-bit”) arrived for real
Once Blackwell-era hardware support became real, FP4 was quickly adopted.
NVIDIA introduced NVFP4 as a precision format purpose-built for Blackwell-class inference efficiency. I saw nearly 2x faster inference, without noticeable accuracy loss for large models.
OpenAI made a very loud statement by releasing gpt-oss-120b and gpt-oss-20b (Aug 5, 2025) “natively” quantized in MXFP4, explicitly positioning “open weights + efficient deployment” together.
Research momentum followed: papers (from various labs) report stable FP4 (pre-)training with performance comparable to FP8 baselines.
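For intuition about what these formats actually do, here is a toy, MXFP4-style sketch of my own (not NVIDIA’s or OpenAI’s kernels; NVFP4 differs by using blocks of 16 with FP8 scales): values are grouped into blocks of 32, each block shares a power-of-two scale, and every value snaps to the nearest representable FP4 (E2M1) point.

```python
import torch

# The non-negative values representable in FP4 E2M1 (sign handled separately).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_mxfp4(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Toy MXFP4-style fake quantization: quantize then dequantize."""
    w = w.reshape(-1, block)
    # Power-of-two scale per block so the block max lands at or below 6.0.
    scale = torch.exp2(torch.ceil(torch.log2(w.abs().amax(dim=1, keepdim=True) / 6.0 + 1e-12)))
    x = w / scale
    # Snap magnitudes to the nearest representable FP4 value, keep the sign.
    idx = (x.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (torch.sign(x) * FP4_GRID[idx] * scale).reshape(-1)

w = torch.randn(4096 * 32)
w_q = fake_quantize_mxfp4(w)
print("mean abs quantization error:", (w - w_q).abs().mean().item())
```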
My bet: if you’re a big model provider, you’re already serving some portion of traffic in 4-bit (or something close).
Biggest disappointment: Llama 4
I was excited for Llama 4 because Meta had been the major force in open-weight democratization, especially in 2024.
Factually, Llama 4 wasn’t “nothing.” Scout and Maverick shipped with genuinely ambitious engineering choices: massive context lengths (Scout advertised 10M, Maverick 1M, but with very poor performance in practice), MoE variants, and a bunch of tricks (chunked attention, etc.).
But the rollout and narrative felt messy. The models performed poorly, and Meta was caught benchmaxing, i.e., submitting checkpoints fine-tuned for specific leaderboards.
I wonder how much Llama 4 has cost, but it looks to me like a $100M+ disaster. The 1T+ parameter version (the “Behemoth”) was never released.
I didn’t actually run the models extensively. Based on what I saw in the release and early feedback, testing them didn’t feel like a good use of my time.
That said, calling the Llama “brand” dead is probably overstating it. What is clear is that Meta has been reshuffling teams and priorities, and from the outside there hasn’t been a consistent technical narrative or a publicly visible roadmap for what “Llama 5” is supposed to be.
2026 Predictions: The Year of Consolidation?
Test-time scaling will truly take off.
Wait! Wasn’t 2025 already the year of reasoning and test-time scaling?
In 2025, it became clear that “thinking” models often perform best on hard tasks. The trade-off is obvious: they routinely generate thousands of tokens, even for problems that don’t warrant it. That cost, in latency and compute, is exactly why classic instruct models, despite lower peak accuracy, stayed extremely popular. For most day-to-day tasks, nobody wants to wait minutes (or hours…) for an answer.
Example: take an 8B model quantized in NVFP4 with an 8-bit KV cache, served with vLLM on an RTX 5090. That’s about the fastest reasonably accessible setup you can build today, and you’re still looking at roughly 155 tokens/second per query.
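For reference, the serving side of that setup looks roughly like this with vLLM’s Python API; the model repo name is a placeholder, so point it at whatever FP4-quantized 8B checkpoint you actually use.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-org/My-8B-Instruct-NVFP4",  # placeholder repo name
    kv_cache_dtype="fp8",                 # 8-bit KV cache
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.6, max_tokens=20_000)  # room for a long reasoning trace
outputs = llm.generate(["<your hard problem here>"], params)
print(outputs[0].outputs[0].text)
```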
Even with naive math, a 20k-token reasoning trace takes a bit over 2 minutes to generate. In practice it’s worse: as the KV cache grows, memory bandwidth increasingly becomes the bottleneck, so throughput drops and the wait time stretches far beyond that estimate. And once you move to larger models, long “reasoning” traces quickly become impractical.
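The napkin math, assuming a constant 155 tokens/second (optimistic, as noted):

```python
# Back-of-the-envelope decoding time at a fixed throughput.
tokens_per_second = 155
for trace_length in (2_000, 20_000, 60_000):
    print(f"{trace_length:>6} tokens -> {trace_length / tokens_per_second / 60:.1f} min")
# ~0.2 min, ~2.2 min, ~6.5 min
```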
In 2026, the constraint will loosen dramatically: models will generate tokens much faster, not 10x, but orders of magnitude faster. It’s coming.
This won’t be driven primarily by clever decoding tricks like speculative decoding, or by a sudden breakthrough in hybrid neural architectures. It will come from specialized hardware that becomes cheaper and widely available.
NVIDIA knows it too. Their $20B investment into Groq says a lot.
Even if acceleration is mostly hardware-driven, it won’t make architectural work irrelevant. We still need better ways to generate, reuse, and cache context tensors. I expect real progress here this year: lighter KV caches, smarter eviction/compression, or even approaches that make parts of the KV cache effectively unnecessary in some settings.
And beyond caching: long reasoning traces are full of redundant, low-information tokens. We can do better than treating every token as equally worth storing forever.
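As a toy illustration of the kind of thing I mean (my own sketch, not any specific paper’s method): 8-bit quantization roughly halves the cache footprint compared to FP16, and a crude “attention sink + sliding window” eviction shows how the redundant middle of a long trace could simply be dropped.

```python
import torch

def quantize_kv_int8(kv: torch.Tensor):
    """Per-token absmax int8 quantization of a (seq_len, heads, head_dim) cache."""
    scales = kv.float().abs().amax(dim=(-2, -1), keepdim=True).clamp_min(1e-8) / 127.0
    q = (kv.float() / scales).round().clamp(-127, 127).to(torch.int8)
    return q, scales

def dequantize_kv(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return q.float() * scales

def evict_kv(q: torch.Tensor, scales: torch.Tensor, keep_first: int = 4, keep_last: int = 1024):
    """Toy eviction: keep the first few 'sink' tokens and the most recent window."""
    seq_len = q.shape[0]
    if seq_len <= keep_first + keep_last:
        return q, scales
    idx = torch.cat([torch.arange(keep_first), torch.arange(seq_len - keep_last, seq_len)])
    return q[idx], scales[idx]

kv = torch.randn(8192, 8, 128, dtype=torch.float16)  # 8k cached tokens for one layer
q, s = quantize_kv_int8(kv)
print(kv.nelement() * kv.element_size() / 1e6, "MB in fp16")
print(q.nelement() * q.element_size() / 1e6, "MB in int8")
print("max reconstruction error:", (dequantize_kv(q, s) - kv.float()).abs().max().item())
q_small, _ = evict_kv(q, s)
print("after eviction:", tuple(q_small.shape))  # (1028, 8, 128)
```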
The Kaitchup: Building the Next Chapter
Subscriber growth slowed sharply in 2025. Substack became the default platform for people writing about AI, which pushed competition and the noise floor higher than ever.
Revenue, however, increased significantly, mainly through The Kaitchup Pro, The Salt (my other newsletter), and book sales. Pro grew so fast that I closed it in early March to make sure I could support members properly.
What’s next
I’ll reopen the Kaitchup Pro soon with a simpler structure and new benefits I’ve been working on. This new format will be easier for me to support.
What I’m aiming for over the next months: roughly one additional article per week (though not every week), more frequent releases of high-quality quantized/pruned models, synthetic datasets, and more targeted fine-tunes.
In parallel, I’m building a product that will leverage the dataset I’ve built from running models and their quantized versions every day. I want most of this in place within the next 6 months, ideally by the end of Q1 2026.
I’ll also start doing regular Live Chats focused on specific, concrete tasks. Example: next week I plan to test GLM 4.7 and MiniMax-M2.1. This requires a big GPU (or several), so I’ll run it on a single B300 with quantized models, because even 288 GB of VRAM won’t be enough otherwise. Usually I would do this alone; this time, I’ll open a chat and document everything step by step for ~10 hours: environment setup, vLLM configuration, inference speed benchmarks across multiple 4-bit variants, and, if things go well, fine-tuning with Unsloth. I’ll answer questions as we go. Paid subscribers only. The first session is planned for next Thursday at 9:30 (Paris time), and you’ll be notified when it starts. No worries if you miss it: I’ll write a report about my experiments (what worked and what didn’t).
I’m also preparing a course to teach you how to run and adapt LLMs to your tasks and hardware. But this won’t be ready soon.
The overall goal is straightforward: make The Kaitchup one of the places people go when they want the most cost-effective way to run and fine-tune LLMs, without giving up accuracy.
Of course, that’s too much work for a single person. To ship faster, launch new types of content (and products), and raise the overall quality bar, I’ll hire people to help. I have a lot of ideas and a lot of data; I just need more hands to make everything clean and useful.
Special Projects (2026)
I want to do longer-running, concrete demos that better demonstrate the full potential of quantized models and how to fine-tune them.
I have several ideas, and I can already confirm one.
Confirmed: WMT26, “small model vs big model” translation challenge
I’m going to submit an official entry to WMT26, aiming to show that a <1B parameter model can outperform much larger ones for a specific translation direction when properly fine-tuned.
If you don’t know WMT: it’s a long-running annual machine translation evaluation with shared tasks, datasets, and rankings (historically with substantial human evaluation). I used to be part of the organizing committee, but I withdrew to avoid any conflict of interest for this year’s competition.
We’ll participate in one language pair, and I’ll document progress as we go. This should be genuinely fun. Last year, they also had a “compression” track; if they run it again this year, The Kaitchup will participate in that too.
We will start from here:
Candidate LLMs for fine-tuning:
LFM2
Gemma 3
and maybe Baguettotron
or… other, better small models released before April/May 2026, when the serious training will start
There are almost no state-of-the-art small translation models publicly available for free. I want to correct that.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!