Hi Everyone,
In this edition of The Weekly Kaitchup:
The Byte-Latent Transformer
AlphaEvolve
Unsloth’s Recipe for GRPO Applied to Base LLMs
The Byte-Latent Transformer
The AI research team at Meta frequently publishes papers, but the accompanying code and datasets are often delayed and released in batches later on.
One such batch just dropped this week:
Advancing AI systems through progress in perception, localization, and reasoning
Meta released several notable models, including a perception model and one designed for 3D object localization, especially relevant for robotics applications.
One model that stands out to me is the Dynamic Byte Latent Transformer (BLT), originally developed for a paper they published last year. It's particularly interesting for its approach to efficient, byte-level representation learning.
Byte Latent Transformer: Patches Scale Better Than Tokens
In other words, BLT is a tokenizer-free language model architecture that learns directly from raw byte sequences. Unlike standard LLMs, which rely on fixed tokenization schemes that introduce compression biases and language inequities, it dynamically segments input bytes into variable-length patches based on information entropy. These patches are encoded into latent representations and processed by a combination of lightweight local transformers and a global latent transformer, allowing for more efficient and adaptive compute allocation.
It appears to match the performance of token-based architectures at scale, while achieving up to 50% lower inference FLOPs compared to similarly sized models. Meta has released both 1B and 7B parameter versions.
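To illustrate the idea of entropy-based patching, here is a minimal sketch. This is not Meta's implementation: the placeholder entropy estimator below just uses the empirical distribution of the last few bytes, whereas in BLT a small byte-level language model predicts the next byte's distribution, and the resulting patches then feed the local/global transformers. The effect is the same in spirit: predictable stretches become long patches, surprising regions become short ones.

```python
import math
from collections import Counter

def next_byte_entropy(context: bytes, window: int = 8) -> float:
    # Placeholder entropy estimator: empirical distribution of the last
    # `window` bytes. In BLT, a small byte-level LM plays this role.
    recent = context[-window:] if context else b""
    if not recent:
        return 8.0  # maximum uncertainty for an empty context (8 bits/byte)
    counts = Counter(recent)
    total = len(recent)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_patches(data: bytes, threshold: float = 2.0) -> list[bytes]:
    # Start a new patch whenever the estimated next-byte entropy exceeds the
    # threshold, so compute is allocated adaptively rather than per token.
    patches, current = [], bytearray()
    for i, b in enumerate(data):
        if current and next_byte_entropy(data[:i]) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# Repeated bytes end up in one long patch; the varied region gets split up.
print(entropy_patches(b"aaaaaaaabcd efghhhhhhhh"))
```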
As is often the case with Meta’s research releases, these models come with non-commercial licenses.
Meta’s research teams consistently publish promising ideas, many of which improve model accuracy or efficiency, but for some reason, these advances rarely make their way into the Llama model family.
AlphaEvolve
Introduced by Google DeepMind, AlphaEvolve is a language-model-driven system for program synthesis and optimization in domains where solutions can be automatically evaluated. It combines evolutionary algorithms with LLM-based code generation, applying this framework to tasks in algorithm discovery, mathematics, and systems optimization.
The core loop operates as follows:
Task Specification: Users define an evaluation function h that maps a program to one or more scalar metrics, implemented as a Python function with a fixed I/O signature. Programs are annotated with # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END comments to mark the editable regions.
Prompt Construction: A prompt is built from prior high-performing programs sampled from a program database, their rendered evaluation results, and optional human-provided or LLM-generated meta-instructions. Prompts may include stochastic formatting and explicit context (e.g., equations, prior literature).
LLM Code Diff Generation: The prompt is passed to an ensemble of LLMs (Gemini 2.0 Flash for fast throughput, Gemini 2.0 Pro for higher-quality suggestions), which return edits in a SEARCH/REPLACE diff format. These diffs are applied to generate new candidate programs.
Evaluation and Selection: Candidate programs are executed with the user-defined evaluation function. Their results are logged, and the best programs are retained in the database for future prompt sampling.
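To make the loop concrete, here is a minimal, self-contained sketch. The EVOLVE-BLOCK markers, the SEARCH/REPLACE diff format, and the program database come from the paper; the toy task, the function names, and the propose_diff stub that stands in for the LLM ensemble are illustrative assumptions of mine.

```python
import random
import re

# Toy task: evolve the editable region so that candidate(3) gets close to 42.
SEED_PROGRAM = """\
def candidate(x):
    # EVOLVE-BLOCK-START
    return x  # only this region is rewritten by the model
    # EVOLVE-BLOCK-END
"""

def evaluate(program: str) -> float:
    # User-defined evaluation function h: runs the candidate and returns a scalar.
    namespace = {}
    try:
        exec(program, namespace)
        return -abs(namespace["candidate"](3) - 42)
    except Exception:
        return float("-inf")

def propose_diff(parent: str) -> str:
    # Stand-in for the LLM ensemble: returns an edit in SEARCH/REPLACE format.
    # A real system would prompt Gemini Flash/Pro with the parent program,
    # its evaluation results, and meta-instructions.
    new_expr = f"return x * {random.randint(1, 20)}"
    old_line = re.search(r"return .*", parent).group(0)
    return f"<<<<<<< SEARCH\n    {old_line}\n=======\n    {new_expr}\n>>>>>>> REPLACE"

def apply_diff(parent: str, diff: str) -> str:
    # Apply a single SEARCH/REPLACE block to produce a new candidate program.
    search = diff.split("<<<<<<< SEARCH\n")[1].split("\n=======")[0]
    replace = diff.split("=======\n")[1].split("\n>>>>>>> REPLACE")[0]
    return parent.replace(search, replace)

# Evolutionary loop: sample the best parent from the database, mutate it via a
# diff, evaluate the child, and keep the top programs for future prompts.
database = [(evaluate(SEED_PROGRAM), SEED_PROGRAM)]
for _ in range(50):
    _, parent = max(database, key=lambda item: item[0])
    child = apply_diff(parent, propose_diff(parent))
    database.append((evaluate(child), child))
    database = sorted(database, key=lambda item: item[0], reverse=True)[:5]

print(max(database, key=lambda item: item[0])[1])
```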
AlphaEvolve supports multiple abstraction levels: evolving raw strings, constructor functions, search algorithms, or co-evolving algorithms and intermediate solutions. This is a very complex but flexible framework.
Applications demonstrated include improving matrix multiplication algorithms (e.g., beating Strassen's algorithm for multiplying 4×4 complex-valued matrices), solving constructive math problems (e.g., the kissing number problem in 11 dimensions), and optimizing real-world engineering components such as scheduling heuristics and TPU kernels. Evaluation is entirely automatic, which is both a strength (enabling scale) and a limitation (excluding tasks that need manual feedback).
The paper is here:
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Unsloth’s Recipe for GRPO Applied to Base LLMs
Unsloth has released a new tutorial notebook demonstrating how to train a Qwen3 base model using GRPO. It's well-structured, beginner-friendly, and fully runnable on free Google Colab, making it an excellent hands-on resource for anyone looking to explore GRPO fine-tuning in practice.
The notebook particularly highlights the importance of running a short supervised fine-tuning (SFT) phase before applying GRPO. This initial SFT step helps the model internalize the chat template, which in turn makes GRPO training more effective. This aligns with observations from DeepSeek AI during the training of R1, showing that SFT is a crucial step for successful GRPO.
GRPO is effective, but it's worth noting that an increasing number of studies show that even a brief SFT phase can unlock basic reasoning capabilities in models. While the performance won't match that of GRPO, short SFT remains a lightweight and practical alternative for enabling reasoning behavior.
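For readers who want to see the two-stage recipe outside the notebook, here is a minimal sketch using Hugging Face TRL. The model name, datasets, step counts, and the reward function are illustrative placeholders, and the Unsloth notebook wraps the same steps (short SFT, then GRPO) with its own memory and speed optimizations rather than plain TRL.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

model_name = "Qwen/Qwen3-4B-Base"  # any Qwen3 base checkpoint

# Stage 1: brief SFT on a small chat dataset so the base model internalizes
# the chat template before reinforcement learning. Dataset is a placeholder.
sft_data = load_dataset("trl-lib/Capybara", split="train[:1000]")
sft_trainer = SFTTrainer(
    model=model_name,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="qwen3-sft", max_steps=100,
                   per_device_train_batch_size=2),
)
sft_trainer.train()
sft_trainer.save_model("qwen3-sft")

# Stage 2: GRPO with a toy verifiable reward (completion must contain a digit);
# a real setup would parse the answer and check it against the reference.
def format_reward(completions, **kwargs):
    return [1.0 if any(ch.isdigit() for ch in c) else 0.0 for c in completions]

grpo_data = load_dataset("openai/gsm8k", "main", split="train").rename_column("question", "prompt")
grpo_trainer = GRPOTrainer(
    model="qwen3-sft",  # start GRPO from the SFT checkpoint
    reward_funcs=format_reward,
    train_dataset=grpo_data,
    args=GRPOConfig(output_dir="qwen3-grpo", max_steps=200,
                    num_generations=4, per_device_train_batch_size=4),
)
grpo_trainer.train()
```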
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I published a deep dive about hybrid models, Nemotron-H:
I also reviewed in The Weekly Salt:
⭐Learning Dynamics in Continual Pre-Training for Large Language Models
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
DanceGRPO: Unleashing GRPO on Visual Generation
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
How to Subscribe?
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!