Nanbeige4.1: Only 3B Parameters, but as Good as Qwen3 32B?
The Weekly Kaitchup #130
Hi everyone,
In this edition of The Weekly Kaitchup, we discuss:
Nanbeige4.1: Only 3B Parameters, but as Good as Qwen3 32B?
The Ring-2.5: A 1T Parameter Hybrid Model
I won’t cover GLM-5, also released this week, in this article. I’m preparing a deeper, dedicated write-up for next week.
Nanbeige4.1: Only 3B Parameters, but as Good as Qwen3 32B?
Nanbeige is another Chinese lab training language models, but it’s focused on compact models you can run locally.
Its earlier release, Nanbeige/Nanbeige4-3B-Thinking-2510 (from four months ago), was already impressively strong. Now, the 4.1 edition is posting benchmark scores so high that they’re hard to accept without scrutiny.
The model reportedly outperforms even Qwen3 32B, a model about 10x larger. Early community feedback seems to back this up: it’s genuinely very good. But is it as good as the benchmarks tell us?
Let’s unpack what they did and examine whether any specific choices, data, evaluation setup, or training procedures might have unusually inflated the benchmark scores.
They published their technical report here.
Nanbeige4.1-3B’s benchmark jump seems to come mainly from a carefully engineered post-training pipeline on top of their 3B base model. They first strengthened supervised fine-tuning by reshaping the data mix (more code, more difficult math/general problems), scaling context length up to 256K, and upgrading a solution-refinement + chain-of-thought reconstruction process so the model learns from higher-quality critique–revision outputs. Then they ran general RL in two complementary forms: a point-wise RL stage using a general reward model (to suppress repetition/formatting issues and improve standalone response quality), followed by pair-wise RL on comparison data (strong vs. weak model outputs) with debiasing tricks such as swap-consistency regularization to sharpen preference boundaries and improve alignment scores.
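To make the pair-wise stage more concrete, here is a minimal sketch of what a preference loss with swap-consistency regularization can look like. This is my own illustration of the general idea, not Nanbeige’s implementation: the judge model, the label format, and the penalty weight are all assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_loss_with_swap_consistency(judge, prompt, resp_a, resp_b,
                                         label_a_better, lam=0.1):
    """
    Illustrative pair-wise preference objective with a swap-consistency penalty.
    `judge` is a placeholder: any model mapping (prompt, first_response,
    second_response) to a single logit meaning "the first response is better".
    """
    # Score the same pair in both presentation orders.
    logit_ab = judge(prompt, resp_a, resp_b)   # high if A is judged better
    logit_ba = judge(prompt, resp_b, resp_a)   # low  if A is judged better

    # Standard pairwise preference loss (BCE on the comparison label).
    target = torch.full_like(logit_ab, float(label_a_better))
    pref_loss = F.binary_cross_entropy_with_logits(logit_ab, target)

    # Swap-consistency regularizer: the judgment should flip sign when the
    # two responses are swapped, i.e. logit_ab ≈ -logit_ba. Penalizing the
    # residual discourages position bias in the comparison.
    swap_penalty = (logit_ab + logit_ba).pow(2).mean()

    return pref_loss + lam * swap_penalty
```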
This is very careful work, but on its own it is not enough to explain the performance. I noticed that they present learning curves on the same benchmarks used for the final evaluation.
That’s not uncommon, but it suggests they validate their post-training directly on the benchmarks, i.e., select checkpoints, hyperparameters, etc., that lead to higher scores. I also ran CoDeC to check for direct benchmark training, but it didn’t flag any of the benchmarks. For example, GPQA came back at 40.9, well below the ~80.0 level you’d expect under heavy contamination.
Nonetheless, something is certain: the model is very good. I’ll quantize it to make it smaller and easier to run locally. Since it tends to generate a lot of tokens, we’ll want it to be as fast as possible.
There aren’t many quantized versions available yet. GGUF builds that run with LM Studio or llama.cpp are available here:
As usual, avoid going below Q4; since these GGUFs haven’t been evaluated, we don’t know how good they are.
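If you prefer scripting over LM Studio, here is a minimal example with llama-cpp-python; the file name is a placeholder for whichever Q4 (or higher) GGUF you download.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# The path is a placeholder: point it at the Q4_K_M (or higher) GGUF you downloaded.
llm = Llama(
    model_path="./Nanbeige4.1-3B-Q4_K_M.gguf",
    n_ctx=8192,       # raise this if you want longer reasoning traces
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Explain the difference between GQA and MLA in two sentences."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```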
The Ring-2.5: A 1T Parameter Hybrid Model
We have very few ~1T-parameter models, mainly because they’re extremely expensive to train and deploy. Kimi-2.5 is a notable example, one of the strongest open-weight models available today. Another, less widely known family of ~1T-parameter models is the “Ring” series, now in its 2.5 release.
Ring-2.5-1T is built for “deep thinking” and long-horizon agentic execution. It is centered on a hybrid attention scheme: a 1:7 mix of MLA (Multi-head Latent Attention; we discussed it for GLM 4.7 Flash) and Lightning Linear Attention, created by incrementally upgrading prior GQA-based layers. Some GQA layers are converted to Lightning Linear Attention to boost long-sequence decoding throughput, while the remaining ones are approximated into MLA with targeted tweaks like QK Norm and Partial RoPE to preserve expressiveness under a more KV-cache-friendly formulation. Although the activated parameter count increases (reported as 51B → 63B after the changes), the hybrid attention is positioned as the reason inference becomes more efficient for long contexts.
On generation efficiency, the model claims substantial long-context speedups: thanks to the high ratio of linear attention, it reports a >10x reduction in memory-access overhead and >3x higher generation throughput when generating beyond 32K tokens, which targets deep reasoning traces and long-horizon tasks. Context length is listed as 128K, extendable to 256K via YaRN, and the release references multiple tensor types for deployment (e.g., BF16 and FP8 variants).
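To see why the hybrid split matters at long contexts, here is a rough back-of-the-envelope calculation. The layer count and dimensions below are placeholders, not Ring-2.5’s actual configuration; the point is only that a full-attention KV cache grows with sequence length while a linear-attention layer keeps a fixed-size state.

```python
# Illustrative KV-cache arithmetic for a hybrid attention stack.
# All dimensions are made-up placeholders, NOT Ring-2.5's real configuration.

seq_len   = 32_768          # decoding beyond 32K tokens
n_layers  = 80              # hypothetical depth
mla_share = 1 / 8           # 1:7 mix -> 1 MLA layer for every 7 linear-attention layers
bytes_fp8 = 1

# An MLA layer caches a compressed latent per token (say 576 values per token).
mla_latent_dim = 576
mla_cache = mla_share * n_layers * seq_len * mla_latent_dim * bytes_fp8

# A linear-attention layer keeps a fixed-size state, independent of sequence length
# (say heads * d_k * d_v values).
lin_state_dim = 64 * 128 * 128
lin_cache = (1 - mla_share) * n_layers * lin_state_dim * bytes_fp8

# A plain GQA baseline caches K and V for every layer and every token
# (say 8 KV heads of dim 128 -> 2 * 8 * 128 values per token per layer).
gqa_cache = n_layers * seq_len * 2 * 8 * 128 * bytes_fp8

print(f"GQA-only cache: {gqa_cache / 1e9:.1f} GB")                  # ~5.4 GB
print(f"Hybrid cache:   {(mla_cache + lin_cache) / 1e9:.1f} GB")    # ~0.3 GB
```

Even with these made-up numbers, the hybrid cache comes out more than 10x smaller than the GQA baseline, which matches the order of magnitude the report claims.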
On reasoning and agentic performance, Ring-2.5-1T is presented as achieving state-of-the-art among open-source “thinking” models across hard reasoning (math/coding/logic benchmarks such as IMOAnswerBench, AIME 26, HMMT 25, LiveCodeBench, ARC-AGI-V2) and long-horizon execution (Gaia2-search, Tau2-bench, SWE-Bench Verified).
Yet, like Kimi-2.5, this model is not meant to run on your computer. Even the FP8 version would require over 1 TB of memory.
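A quick sanity check on that memory claim, with round numbers:

```python
# Rough weight-memory estimate for a ~1T-parameter model (round numbers, not exact).
params = 1.0e12           # ~1 trillion parameters
bytes_per_param_fp8 = 1   # FP8 stores one byte per weight

weights_gb = params * bytes_per_param_fp8 / 1e9
print(f"FP8 weights alone: ~{weights_gb:.0f} GB")  # ~1000 GB, i.e. ~1 TB
# On top of that, you still need memory for the KV cache / recurrent states and activations.
```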
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we review:
⭐On the Optimal Reasoning Length for RL-Trained Language Models
Context Compression via Explicit Information Transmission
Large Language Model Reasoning Failures
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!