This Week in Open Models: Tiny LFM2.5, Ornith-1.0, and GLM-5.2 REAP
The Weekly Kaitchup #148
Hi everyone,
In this edition of The Weekly Kaitchup, let’s discuss some of the open models released this week:
LFM2.5 230M: Tiny Models Can Follow Instructions
Ornith-1.0: Self-Scaffolding Coding Models
GLM-5.2 REAP: Cutting a 753B MoE Down to a More Deployable 504B
I’m also currently evaluating the M3 GGUFs I created with MoQ. This has taken more time than expected, as some of them generate more tokens than anticipated. The evaluation is almost finished, though, and you can expect the full analysis on Monday or Tuesday.
The models are available here:
I already confirmed they are all good, except for the MoQ-2.5 (not broken but much less accurate than the others).
Next week, I’ll be at the ACL 2026, one of the main AI conferences, in San Diego. Let me know if you’ll be there and would like to meet!
As I did for NeurIPS 2025, I’ll publish a report on the most interesting papers, presentations, and demos I come across.
LFM2.5 230M: Tiny Models Can Follow Instructions
Was LFM2.5-350M too large? Liquid AI just released LFM2.5-230M, and for a 230M-parameter model, it performs surprisingly well.
Two numbers that stand out are IFBench and IFEval. LFM2.5-230M is in the same ballpark as models far larger, especially regarding the IFeval score, which looks very close to what we would get with Llama 3.1 8B. That is roughly a 35× parameter gap.
Note on GPQA Diamond
GPQA Diamond is probably too difficult for models this small. It is a 4-choice multiple-choice benchmark, so a model that simply picks A, B, C, or D at random should get 25% accuracy. When results hover around that baseline, they mostly tell us that the model is not really solving the task. Scores below 25% can also happen in generative evaluations where the model is asked to output exactly one of A, B, C, or D. If it generates anything else, the answer is marked wrong. So at this scale, the GPQA number can look meaningful in a table while mostly reflecting answer-format compliance plus noise rather than scientific reasoning.
As for its architecture, LFM2.5-230M is not just a uniformly scaled-down 350M. Both models keep the same 1024 hidden size, 16 attention heads, 8 KV heads, and 65,536-token vocabulary.
The savings come from the blocks. LFM2.5-230M uses 14 layers instead of 16: 8 double-gated LIV convolution blocks + 6 GQA blocks, versus 10 + 6 for LFM2.5-350M. The bigger difference is the FFN/intermediate dimension: 2560 for the 230M model, versus 6656 for the 350M model. That is where most of the parameter reduction comes from.
In memory terms, the 230M model only consumes 459 MB and is 250 MB lighter than the 350M in bf16.
So now we have a genuinely interesting question: should we use the bf16 230M, or a quantized version of the 350M?
I think that a carefully quantized 350M could win on both quality and memory. But at this scale, quantization is less forgiving: 4-bit quantization can easily erase the quality advantage. I’ll try to find time to run the bf16 and quantized variants of these tiny LFM2.5 models and plot accuracy vs memory curves.
Ornith-1.0: Self-Scaffolding Coding Models
Ornith-1.0 is a family of open-source language models built specifically for agentic coding: settings where a model must not only write code, but operate inside a tool-using loop, inspect repositories, run commands, interpret failures, and iteratively repair its own solution.
Models: The Ornith-1.0
Blog Post: Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding
The family spans several deployment and capability tiers, including compact dense models and larger mixture-of-experts models. The announced lineup includes 9B Dense, 31B Dense (not released yet), 35B MoE, and 397B MoE variants, with released Hugging Face artifacts including FP8, and GGUF builds.
The models are post-trained on top of Gemma 4 and Qwen 3.5, then specialized for coding-agent behavior.
The central training idea is “self-scaffolding.”
In conventional reinforcement-learning setups for coding agents, the scaffold, i.e., the harness, orchestration logic, memory strategy, error-handling routine, or prompting structure that guides the rollout, is usually designed by humans and held fixed. Ornith-1.0 instead treats that scaffold as something the model can improve. During RL, the model first proposes or refines a task-specific scaffold, then uses that scaffold to generate a solution rollout. The reward from the rollout is assigned back to both the scaffold-generation step and the solution-generation step, so the model is trained not only to produce better code, but also to discover better ways of organizing its own problem-solving process.
This creates a feedback loop: scaffolds that lead to higher-reward coding trajectories are reinforced, while weaker scaffolds are discarded. Over many training iterations, the model learns task-category-specific strategies for agentic coding without depending entirely on hand-engineered harnesses. DeepReinforce frames this as the key distinction of Ornith-1.0: reinforcement learning is used not just to improve answers, but to improve the model’s internal orchestration of tool use and search.
The training setup also addresses an obvious risk of self-improving agents: reward hacking. If a model can shape the scaffold that drives its own rollout, it may learn shortcuts that satisfy a verifier without solving the task, such as reading hidden test artifacts, modifying verification scripts, or hard-coding expected outputs. They have implemented three guardrails: keeping the environment and tool boundary outside the model’s control, using deterministic monitors to block forbidden actions, and adding a frozen LLM judge as a veto layer on top of the verifier for cases where intent-level gaming is harder to specify mechanically.
The models look very strong on the benchmarks, and early community feedback appears to be positive.
One particularly interesting detail is that the only checkpoint based on Gemma 4 is missing from both the release and the evaluation results. Qwen3.5 is known to be strong at agentic coding, whereas this is an area where Gemma 4 underperforms. That makes the idea of an Ornith-1.0 model based on Gemma 4 especially exciting as it could correct one of its main weaknesses.
At the same time, its absence leaves room for interpretation. Was Ornith-1.0 31B performing significantly below the 35B model based on Qwen3.5? Was it much harder to train or align? Or is the release simply delayed?
GLM-5.2 REAP: Cutting a 753B MoE Down to a More Deployable 504B
0xSero’s GLM-5.2 REAP project is a practical attempt to make Z.ai’s massive GLM-5.2 mixture-of-experts model easier to serve.
Instead of quantizing the whole model more aggressively or merging experts together, REAP removes low-saliency routed experts from each MoE layer. The released 504B version keeps 168 of the original 256 routed experts per layer. In other words, it prunes 88 experts per layer, or about 34.4% of the routed expert pool.
The released checkpoint above is an NVFP4 version.
After pruning, Router-KD, or router-only knowledge distillation, was applied to recover behavior. The experts, attention layers, embeddings, and most of the network are frozen. Only the router gate matrices are trained, representing around 0.016% of the model’s parameters. The goal is to re-teach the router how to route tokens through the surviving 168 experts so that the pruned model better matches the unpruned teacher’s next-token distribution. This makes the recovery step far cheaper than full fine-tuning.
While it’s very effective at reducing the model size, as we saw last week, REAP has significant drawbacks:
it significantly degrades the model’s world knowledge, or more generally, its accuracy on tasks not used to evaluate the expert activations
and the resulting model is often generating more tokens which makes it less cost-effective at inference.
The published evaluation results are also limited and, in some cases, do not include the original model’s results as a baseline. As a result, it is hard to tell how close this model comes to the original. What we do know is that it scores around 70 on Terminal-Bench 2.1, which is very encouraging.
Running more careful evaluations for large models remains prohibitively expensive for the freelance AI community; for example for GLM 5.2 and some pruned/quantized variants, it would cost $20k or more to run the same type of evaluation and analysis I ran for Qwen3.6.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!







Great point about LFM2.5-230M's GPQA-Diamond score, we could've specified it in the text!
Any metrics on Ornith-35B Moe? I'm hoping if the 9B is competitive to qwen-3.5-35B that maybe the 35B MoE is competitive or better than Qwen-3.6. To wit, I still see the industry comparing a lot to Qwen-3.5, but Qwen-3.6 is much better in practice...3.5 feels not quite good enough for agentic coding (just below the bar) and 3.6 feels good enough (above the bar)...so 3.5 comparisons feel a lot less useful...we have to infer the delta between 3.5 and 3.6 and the apply to the new competitor.