This Week: GLM 4.7 Flash's Huge KV Cache and LFM2.5 Thinking
The Weekly Kaitchup #127
Hi everyone,
In this edition of The Weekly Kaitchup, we discuss:
LFM2.5 1.2B: Think Smaller
GLM 4.7 Flash: From Huge to Small KV Cache
LoRA or not LoRA for RLVR/GRPO?
LFM2.5 1.2B: Think Smaller
Liquid AI continues to publish open-weight models, with an emphasis on small checkpoints that are easy to deploy locally. As far as I’m aware, they’re also one of the few labs that regularly report CPU inference throughput.
About two weeks ago, they released Instruct-tuned and Japanese variants of their new LFM2.5 1.2B checkpoints. Community feedback has been very positive, and my own evaluation matches that: they’re very good models for their size.
And aside from Qwen3-1.7B, we don’t have many comparably strong reference points in this class.
This week, they also released “thinking” variants that generate many more tokens (i.e., a longer reasoning trace) to improve accuracy on harder prompts.
For a 1.2B model, and given how hard it is to train small models with RL, the reported numbers are impressive:

It outperforms Qwen3-1.7B on most benchmarks, with the main exception being MMLU Pro (and AIME25, which isn’t shown here). One note: Qwen3’s unusually high MMLU Pro results have always struck me as a bit suspicious. I’d love to see more explanation from the Qwen team (or, ideally, more detail on the training data) to make their numbers easier to interpret. What did Qwen do to get such high scores on MMLU-like tasks?
Just as importantly, when evaluating “thinking” models, you have to look at the reasoning (token) budget: how many tokens the model generates to reach a given level of accuracy.
The model is significantly more token-efficient on AIME-style tasks (hard math problems), where Qwen3-1.7B generates roughly twice as many tokens. So you end up with a larger model producing longer traces, which makes it substantially more expensive to run. Note: The “average” is a bit misleading here, since a large portion of these tokens comes from a single benchmark (AIME25). If you remove AIME25 from this average, LFM2.5 actually generates slightly more tokens than Qwen3-1.7B.
At the same time, since LFM2.5 tends to produce shorter reasoning traces on AIME problems, I’m curious whether we could push it to generate more tokens, and potentially trade that extra budget for higher accuracy on difficult prompts. I’m planning to run a few experiments to test this.
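If you want to try something similar, one rough way to force a longer trace is vLLM’s min_tokens sampling parameter, which suppresses EOS until a minimum number of tokens has been generated. This is only a sketch: the repo id below is an assumption, and whether the extra tokens actually buy accuracy is exactly what’s left to test.

from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2.5-1.2B-Thinking")  # assumed repo id; check the actual name on the Hub
params = SamplingParams(
    temperature=0.6,
    max_tokens=16384,
    min_tokens=4096,   # force the model to keep "thinking" for at least ~4k tokens
)
outputs = llm.generate(["Solve: if x + 2y = 7 and 2x - y = 4, find x and y."], params)
print(outputs[0].outputs[0].text)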
Their blog post is worth reading, too.
LFM2.5-1.2B-Thinking: On-Device Reasoning Under 1GB
They managed to eliminate “doom looping,” a common failure mode where small reasoning models get stuck in repetitive patterns instead of actually finishing. To do this, during preference alignment, they explicitly construct (chosen, rejected) pairs that penalize loops. They pick the best candidate (via an LLM judge) as the “chosen,” and set the “rejected” to either (a) the worst non-looping sample, or (b) any looping sample whenever a loop shows up. Then, during RLVR, they add an n-gram repetition penalty early in training to further suppress the behavior.
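For intuition, here is a rough sketch of what an n-gram repetition penalty can look like when added as an extra reward term. The function name, thresholds, and weighting are my own illustrative choices, not Liquid AI’s implementation.

from collections import Counter

def ngram_repetition_penalty(token_ids, n=4, tolerance=0.2, weight=1.0):
    """Return a non-positive reward term that grows as repeated n-grams dominate the trace."""
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())   # occurrences beyond the first
    ratio = repeated / len(ngrams)                   # fraction of the trace that is looping
    return -weight * max(0.0, ratio - tolerance)     # only penalize beyond a small tolerance

# e.g., early in RLVR training:
# reward = verifiable_reward(sample) + ngram_repetition_penalty(sample_token_ids)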
I also quantized the model here (4-bit and 8-bit, both vLLM-compatible):
For the first time, I released an MXFP4 version, along with GGUF models made with AutoRound’s new algorithm for mixed-precision quantization.
I’ll evaluate them next week.
GLM 4.7 Flash: From Huge to Small KV Cache
GLM 4.7 is, in my opinion, the best open-weight model right now, but it’s also extremely VRAM-hungry. As we saw, even a B300 doesn’t have enough memory to load it, not even when quantized to FP8.
So when Z.ai released a smaller variant this week, GLM 4.7 Flash, I was eager to try it. On paper, it looks like the ideal “practical” version: ~30B total parameters, with only ~3B active during inference. That makes a full eval feel straightforward and cheap, and at 4-bit it would largely fit on a 30 GB GPU. I was also keen to put out quantized builds quickly.
But everything went wrong.
GLM 4.7 Flash is doing something pretty fancy: it uses MLA. Finally, another model in the wild is using it! If you’re not familiar with MLA, I explained it here:
TL;DR
MLA (Multi-head Latent Attention) compresses the attention KV cache. Instead of storing full per-head keys/values for every token (which is what makes long-context inference so VRAM- and bandwidth-heavy), it stores a smaller latent representation per token and reconstructs the per-head K/V on the fly with lightweight projections. In theory, you get much smaller KV memory (so longer context and better throughput at long sequences) with minimal or no accuracy loss, as long as the latent size is chosen well, basically trading a bit of extra compute for a big reduction in memory/bandwidth.
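Here’s a minimal, self-contained sketch of that idea in PyTorch. It’s illustrative only: the dimensions are made up, and real MLA additionally keeps a small decoupled RoPE component and absorbs the up-projections into the query/output projections at inference time.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Toy MLA-style attention: cache one small latent per token instead of full per-head K/V."""
    def __init__(self, d_model=2048, n_heads=16, head_dim=128, d_latent=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)          # compress to the latent
        self.k_up = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # reconstruct K on the fly
        self.v_up = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # reconstruct V on the fly
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_down(x)                                   # [b, t, d_latent]: the only thing we cache
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)      # decoding: append to the cached latents
        k = self.k_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.head_dim)
        return self.o_proj(out), latent                            # return the latent as the KV cache

With these made-up dimensions, the cache stores 512 values per token instead of 2 * 16 * 128 = 4,096 for standard multi-head attention, roughly an 8x reduction.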
The promise is a smaller KV cache without giving up accuracy. And yet, on an RTX Pro 6000 (96 GB VRAM), the unpleasant surprise was that I could only fit ~20k tokens max in vLLM, about 10x fewer than what I’d expect from a “standard” 30B model.
That pretty clearly pointed to an MLA-related loading issue: vLLM wasn’t routing the model through its MLA code path, so the KV cache was being stored at full size and blew up.
Fortunately, someone tracked down a fix quickly. If you want to address the huge KV-cache memory consumption, apply this PR (not merged as of writing):
fix: Add glm4_moe_lite to MLA detection
On a side note, it’s a bit surprising this slipped past Z.ai before release, especially since they clearly tested vLLM support.
Once that was fixed, the obvious next step was quantization. But I ran into more issues almost immediately, mostly from the current incompatibilities between Transformers v5.0 and LLM Compressor. Rather than sink more time into chasing toolchain bugs, I postponed quantization and focused on LFM2.5 Thinking instead.
Still, don’t worry: a GLM 4.7 Flash quantization by The Kaitchup is coming early next week!
In the meantime, I did find a workaround today to make LLM Compressor usable with Transformers v5. If LLM Compressor crashes with:
ImportError: cannot import name 'TORCH_INIT_FUNCTIONS' from 'transformers.modeling_utils'
…you can fix it by adding the following lines before importing oneshot:
import torch.nn as nn
import transformers.modeling_utils as tmu

if not hasattr(tmu, "TORCH_INIT_FUNCTIONS"):
    # Minimal mapping used by llmcompressor.utils.dev.skip_weights_initialize
    tmu.TORCH_INIT_FUNCTIONS = {
        "uniform_": nn.init.uniform_,
        "normal_": nn.init.normal_,
        "trunc_normal_": nn.init.trunc_normal_,
        "constant_": nn.init.constant_,
        "xavier_uniform_": nn.init.xavier_uniform_,
        "xavier_normal_": nn.init.xavier_normal_,
        "kaiming_uniform_": nn.init.kaiming_uniform_,
        "kaiming_normal_": nn.init.kaiming_normal_,
        "uniform": nn.init.uniform,
        "normal": nn.init.normal,
        "xavier_uniform": nn.init.xavier_uniform,
        "xavier_normal": nn.init.xavier_normal,
        "kaiming_uniform": nn.init.kaiming_uniform,
        "kaiming_normal": nn.init.kaiming_normal,
    }

LoRA or not LoRA for RLVR/GRPO?
I still get this question a lot: Does LoRA work for RL, for example, with RLVR or GRPO-style methods?
The annoying part is that the literature (and people’s anecdotes) give genuinely mixed signals.
On one side, the Thinking Machines post (LoRA Without Regret) makes the case that LoRA can work even with very small ranks, down to rank 1. The argument is coherent: in principle, you don’t need many trainable parameters to steer behavior. The catch is that this assumes your whole RL setup is “clean”: reward signal, optimizer dynamics, hyperparameters, sampling temperature, numerical stability, etc. In practice, those imperfections matter a lot. Empirically, I haven’t found rank 1 to be reliably sufficient on real tasks. It can work for the “classic” math RL setups (e.g., Qwen + math-style reward), but in my own experiments, and even when trying popular Unsloth GRPO notebooks, rank 1 usually shows very little learning.
On the other side, there’s a paper I almost missed (published Dec 29, which is why I’m only writing about it now) that recommends avoiding LoRA for RLVR. They tried multiple parameter-efficient variants (DoRA, LoRA+, VeRA, etc.), and across the board, they underperform full-weight RLVR.
Evaluating Parameter Efficient Methods for RLVR (arXiv)
My take: LoRA for RLVR / GRPO-like RL is absolutely workable, and often the most practical option, but don’t do it with a tiny rank. Rank 1 is either too weak or only works on very specific tasks (and it’s not easy to predict which ones ahead of time). Small ranks underperform full-weight RL.
That said, even in this paper, LoRA doesn’t collapse: it can get surprisingly close to full-weight RL, and rank clearly matters. They report that higher ranks (e.g., 16 and 32) do better than low ranks, and r = 1 consistently lags behind. If you think about it, r=32 is already a fairly standard choice, but I wish they had pushed further (64 or 128 aren’t that exotic). It would’ve been really useful to see whether a sufficiently high rank can actually match full-weight RL in their setting, or whether there’s a hard ceiling.
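To make that concrete, here’s roughly what a LoRA setup for GRPO looks like with TRL and PEFT. This is a sketch, not a tested recipe: the model, dataset, toy reward function, and hyperparameters are placeholders you’d swap for your own.

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def toy_reward(completions, **kwargs):
    # Placeholder reward: prefer short completions (the dataset below is a summarization task).
    return [1.0 if len(c.split()) < 60 else 0.0 for c in completions]

peft_config = LoraConfig(
    r=32,                 # don't go down to r=1; 16-32 is a safer starting point
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=toy_reward,
    args=GRPOConfig(output_dir="grpo-lora", num_generations=8, max_completion_length=512),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # needs a "prompt" column
    peft_config=peft_config,
)
trainer.train()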
One practical recommendation: run a couple of quick control experiments to confirm your model actually learns under your LoRA setup (and to calibrate the rank you need). Don’t assume “it should work” just because it worked for Qwen3 with math.
Also, LoRA remains extremely practical operationally: you don’t need to reload or swap the full model weights for inference updates; you just swap adapters. With frameworks like vLLM, that workflow is now pretty straightforward.
The main caveats
LoRA doesn’t touch everything. Norms, the LM head, and token embeddings are typically frozen. Over long RL runs, it’s hard for me to believe that never updating those components doesn’t matter at all, especially once you’re pushing beyond “light steering” into genuine capability shifts.
MoE is trickier than dense. LoRA tends to behave worse on MoE models. In practice, people often freeze the router (and sometimes large parts of the experts) to avoid overfitting or destabilizing routing. You can fine-tune LoRA layers inside experts, but it tends to be more sensitive, and if you’re running longer training, you’ll want to be more deliberate about what’s trainable.
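If you’re in that MoE situation, the conservative starting point is to keep the adapters on the attention projections and leave the router and experts frozen. The module names below follow common attention naming and are assumptions about your specific architecture, not a fixed recipe.

from peft import LoraConfig

moe_peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    # Attention projections only: the router ("gate"/"router" modules) and expert MLPs stay frozen.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)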
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we review:
⭐Your Group-Relative Advantage Is Biased
Demystifying the Slash Pattern in Attention: The Role of RoPE
TranslateGemma Technical Report
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!







