LFM2.5 and Falcon H1R-7B: New Hybrid Models with Strong Benchmark Scores
The Weekly Kaitchup #125
Hi everyone,
In this edition of The Weekly Kaitchup, we discuss:
LFM2.5: Small, Fast, and Accurate
Falcon H1R-7B: The New Best Model Under 30B Parameters?
Qwen3-VL Embedding and Rerankers
LFM2.5: Small, Fast, and Accurate
LFM2.5 1.2B scores 14.0 on AIME25 and 38.9 on GPQA-Diamond. For a 1.2B-parameter model, that’s very high.
And it’s a hybrid model, so you get both speed and accuracy from a small model.
LiquidAI keeps developing its hybrid architecture, and NVIDIA is moving in the same direction. With Qwen3-Next, the Qwen team also signaled that their next generation may go hybrid. If this momentum holds, 2026 could be the year hybrid models become the default.
Even though it is already small, I wanted to quantize LFM2.5 to shrink it further. The catch is that at this scale, even 4-bit quantization can noticeably degrade quality. And while LFM models already have plenty of GGUF and MLX variants available, most aren’t properly benchmarked, so it’s hard to know ahead of time which ones are actually worth using.
I went ahead anyway and used LLM Compressor + AutoRound to produce multiple formats: GPTQ, AWQ, AutoRound, FP8, and NVFP4. I also tried to make an INT8 SmoothQuant version, but I couldn’t get the mappings right, and I was running out of time, so unless there’s strong demand for it, I’m not planning to revisit INT8 SmoothQuant.
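In case you want to reproduce one of these quants, here is a minimal LLM Compressor sketch for the FP8 variant. The model ID and the ignore pattern for the conv layers are placeholders; check the exact repo and module names before running it:

```python
# pip install llmcompressor transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "LiquidAI/LFM2.5-1.2B"  # placeholder: check the exact repo name on the Hub
SAVE_DIR = "LFM2.5-1.2B-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize only the Linear layers; keep lm_head and the conv layers in higher
# precision (the regex below is a guess, adjust it to the real module names)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*conv.*"],
)

# FP8-dynamic needs no calibration data, so oneshot runs without a dataset
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```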
I did not quantize the convolution layers. That’s a meaningful portion of the model that stays in higher precision. The main reason is that once you quantize the conv layers, the model won’t run on vLLM, which would make evaluation much harder. Note: I’ve also heard there’s a vLLM PR coming that should add conv quantization support. If/when that lands, it’ll be worth re-quantizing with conv layers included and comparing results against these current quantized models.
Anyway, my quantized models are here:
All of them run on vLLM v0.13. Also, make sure you upgrade “compressed-tensors” to the latest version, otherwise you may hit cryptic Pydantic errors.
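A minimal way to load one of them (the repo name below is a placeholder; point it at whichever quantized checkpoint you want to test):

```python
# pip install -U vllm compressed-tensors
from vllm import LLM, SamplingParams

# Placeholder repo name: replace with one of the quantized LFM2.5 checkpoints
llm = LLM(model="kaitchup/LFM2.5-1.2B-AWQ", max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["List three advantages of hybrid attention/SSM models."], params)
print(outputs[0].outputs[0].text)
```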
So, how good are they?
Pretty good. I ran a large suite of benchmarks and collected everything I needed. Here are a few highlights:
The main takeaway is what you’d expect: 4-bit introduces a significant but generally acceptable drop. If you’re choosing a 4-bit variant, AWQ is the one to pick based on what I measured.
Coding accuracy looks low, but you can leverage the model’s speed to generate multiple outputs, which roughly doubles or triples the accuracy. The models have good-looking pass@k curves:
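For reference, the pass@k numbers behind these curves use the standard unbiased estimator from the HumanEval paper. If you want to compute it from your own sampled generations, a minimal implementation looks like this:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: with n samples per problem, of which c pass the tests,
    estimate the probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 4 out of 20 generations pass the unit tests
print(pass_at_k(n=20, c=4, k=1))  # 0.20
print(pass_at_k(n=20, c=4, k=5))  # ~0.72
```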
I didn’t benchmark NVFP4 yet (and I probably won’t), but I’d expect it to trail for a model this small, as NVFP4 often isn’t great at low parameter counts. That said, if you’re on a Blackwell GPU and you want maximum throughput, NVFP4 could still be a pragmatic choice.
Because the conv layers are still in higher precision, the real size difference between 8-bit and 4-bit isn’t as dramatic as you might expect. So my default recommendation is:
Pick FP8 if your GPU supports fast FP8. Performance is very close to the original model, with good speed.
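If you’re not sure whether your GPU qualifies, here is a quick check. Native FP8 tensor cores start with Ada Lovelace and Hopper, i.e., compute capability 8.9 and above:

```python
import torch

major, minor = torch.cuda.get_device_capability()
# Native FP8 tensor cores start with Ada Lovelace (8.9) and Hopper (9.0)
if (major, minor) >= (8, 9):
    print("Fast FP8 is available on this GPU.")
else:
    print("No fast FP8 here: a 4-bit AWQ/GPTQ checkpoint is probably the better pick.")
```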
Do They Translate Well?
Translation is a useful proxy for judging how strong a model is in another language, and it’s generally preferable to using a “target-language” evaluation set that was created by machine-translating an English benchmark with other models.
The translation results make it very clear which languages LFM2.5 was likely optimized for. It translates very well into European languages and could be one of the best “base” models at this size for machine translation fine-tuning in that language family.
On the other hand, don’t use it for Indic languages. The quality is overall very low for the languages it doesn’t officially support.
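If you want to sanity-check translation quality on your own data, chrF with sacrebleu is a cheap first pass (the sentences below are dummy examples):

```python
# pip install sacrebleu
from sacrebleu.metrics import CHRF

# Dummy data: model outputs and reference translations, one segment each
hypotheses = ["Das Wetter ist heute schön.", "Ich habe den Bericht gestern geschickt."]
references = [["Das Wetter ist heute schön.", "Ich habe den Bericht gestern verschickt."]]
# sacrebleu expects a list of reference streams, hence the nested list

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references))  # corpus-level chrF2 score
```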
I haven’t had time yet to dig deeper into my other results, but I’ll share a full report of what I found, likely with comparisons against other small instruct models.
I’m also running evaluations of LFM2.5-1.2B-JP. I’m curious to know how good it is at translating into Japanese.
The Insane Numbers of Falcon H1R-7B
Falcon H1R-7B is a ~7B-parameter LM (the Hugging Face page rounds “model size” to ~8B) built on a hybrid Transformer + Mamba2 backbone, where attention and SSM mixing run in parallel inside each block. It’s designed to support very long context (common deployments default to max_model_len ≈ 262,144 tokens), and its chat format can emit an explicit reasoning section wrapped in the usual <think>…</think> before the final answer.
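A minimal way to try it with transformers, assuming the usual <think> convention described on the model card (the repo name is my guess, and you’ll need a recent transformers release for the Falcon-H1 architecture):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1R-7B"  # assumed repo name, check the Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=2048)
text = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Separate the reasoning trace from the final answer
reasoning, _, answer = text.partition("</think>")
print(answer.strip())
```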
Under the hood, H1R inherits its base from Falcon-H1-7B-Base, whose pretraining budget is reported at ~12T tokens, drawn from a multi-source corpus spanning filtered web text, multilingual Common-Crawl/curated sources, code corpora, and math-heavy datasets (e.g., Proof-Pile-2, FineMath, InfiMM-WebMath-40B, and others), plus substantial in-house synthetic “rewriting” of curated raw data.
H1R’s “reasoning” specialization was then done via a two-stage post-training recipe: a cold-start SFT phase on curated datasets containing long-form, step-by-step reasoning traces (math, coding, science, plus non-reasoning data like chat/tool-calling/safety), targeting very long generations (up to ~48k tokens), followed by GRPO-based RL where rewards favor correct chains and manage a token-budget constraint.
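The exact reward design isn’t spelled out here, but the general GRPO shape is easy to illustrate: score each sampled completion (correctness plus a budget term), then normalize rewards within the group sampled for the same prompt. A toy sketch with made-up reward shaping, not the actual Falcon-H1R recipe:

```python
import numpy as np

def shaped_reward(correct: bool, n_tokens: int, budget: int = 48_000, penalty: float = 0.5) -> float:
    """Toy reward: +1 for a correct chain, minus a penalty that grows once the
    generation exceeds the token budget (not the actual Falcon-H1R reward)."""
    overflow = max(0, n_tokens - budget) / budget
    return (1.0 if correct else 0.0) - penalty * overflow

def group_relative_advantages(rewards) -> np.ndarray:
    """GRPO advantage: reward minus the group mean, normalized by the group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 completions sampled for the same prompt
rewards = [shaped_reward(True, 9_000), shaped_reward(True, 60_000),
           shaped_reward(False, 5_000), shaped_reward(False, 3_000)]
print(group_relative_advantages(rewards))
```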
Nothing really new here. Right?
But the benchmark numbers are very high:
The model card lists AIME24 88.1, AIME25 83.1, AMO-Bench 36.3, LCB v6 68.6, GPQA-Diamond 61.3, MMLU-Pro 72.1, and IFBench 53.4 (percent). That’s better than Qwen3 32B on average, but it’s smaller and hybrid, so much cheaper to deploy.
That could be the best model under 30B parameters now.
I’ll have to run it through my evaluation pipelines to check it.
Reference: Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling
Qwen3-VL Embedding and Rerankers
Qwen released the VL versions of their Qwen3 embedding and reranker models. Expect them to be very good for multimodal RAG pipelines.
That said, if your data is text-only and you’re already using Qwen3 embeddings, stick with them. The multimodal embedding variants seem to underperform on text-only retrieval, even though Qwen3-VL models are stronger than Qwen3 on many language benchmarks.
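For text-only retrieval, the original Qwen3 embedding models remain the simplest option, e.g. with sentence-transformers (the query-side prompt name follows the model card and may differ for other checkpoints):

```python
# pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["How do hybrid attention/SSM models reduce inference cost?"]
documents = [
    "Hybrid models interleave attention with state-space layers to shrink the KV cache.",
    "The recipe for a perfect omelette starts with fresh eggs.",
]

# Qwen3 embedding models use an instruction-style prompt on the query side
query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

print(model.similarity(query_emb, doc_emb))  # cosine similarity, higher = more relevant
```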
I wrote about Qwen3 embedding and reranker here:
and how to set up a multimodal RAG pipeline here:
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, we review:
⭐Deep Delta Learning
Diversity or Precision? A Deep Dive into Next Token Prediction
KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!
Reader question: Do you know if Falcon H1R-7B works well with vLLM?
Answer: I didn’t try it yet, but the model card says it does.