Hi Everyone,
In this edition of The Weekly Kaitchup:
Mixture of Block Attention (MoBA): 6.5x Speedup at 1M Input
Step-Video-T2V: Efficient Video Generation with Compressed VAE
PaliGemma 2 Instruct Mix
I’m working on a longer-than-usual article about multi-head latent attention (MLA). Next week, I’ll publish only on Monday, with the Weekly Kaitchup on Friday, unless, of course, something huge happens in AI (…the Qwen team seems nearly ready to release something amazing, again). The MLA article will follow the week after that.
I’m also considering activating Substack’s chat feature for The Kaitchup; Substack’s team has been nudging me to do it for a while. The Kaitchup now probably has enough subscribers to make it interesting. I’m not sure how to kick it off yet, but we’ll see!
Mixture of Block Attention (MoBA): 6.5x Speedup at 1M Input
LLMs need to process long sequences for tasks like reasoning and decision-making. The problem is that standard attention scales quadratically with sequence length, so computation becomes prohibitively expensive as inputs grow, making it hard to scale efficiently.
Researchers have tried different approaches, like sparse attention and linear approximations, but these often come with trade-offs: either they require major changes to existing models, or they don’t work well for complex reasoning. The challenge is finding a way to improve efficiency without overcomplicating things.
Mixture of Block Attention (MoBA) takes a practical approach by adapting Mixture of Experts (MoE) to attention mechanisms. Instead of attending to all tokens equally, each query selects only the most relevant blocks of context, reducing computation without losing performance. This makes it easier for models to handle longer inputs without dramatically increasing resource use.
The authors released their code here:
GitHub: MoonshotAI/MoBA
They have experimented with sequences of up to 10M tokens!
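The block-selection idea can be sketched in a few lines. Below is a minimal, single-head numpy toy, not the authors’ implementation: each query scores key blocks by their mean-pooled key vector, keeps the top-k blocks, and runs standard softmax attention only over the selected tokens. The real MoBA also enforces causal masking and uses an optimized kernel, both omitted here for clarity.

```python
import numpy as np

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Toy single-head MoBA-style block-sparse attention (numpy sketch)."""
    seq_len, d = k.shape
    n_blocks = seq_len // block_size
    # Mean-pool the keys in each block -> one representative vector per block
    block_keys = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    out = np.zeros_like(q)
    for i, qi in enumerate(q):
        # Score each block against the query and keep the top_k blocks
        scores = block_keys @ qi
        chosen = np.argsort(scores)[-top_k:]
        idx = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in chosen]
        )
        # Standard softmax attention, restricted to the selected tokens
        logits = k[idx] @ qi / np.sqrt(d)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ v[idx]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))
v = rng.normal(size=(8, 16))
out = moba_attention(q, k, v, block_size=4, top_k=1)
print(out.shape)  # (8, 16)
```

Note that when top_k equals the number of blocks, this reduces exactly to dense attention, which is why the authors can claim no performance loss on short contexts; the savings come from keeping top_k small relative to the number of blocks at long sequence lengths.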
Step-Video-T2V: Efficient Video Generation with Compressed VAE
LLMs have gotten really good at understanding and generating text, but they still struggle with capturing the real world, especially motion, space, and time.
Step-Video-T2V is a 30B-parameter text-to-video model that can generate high-quality, smooth-motion videos from text prompts in both English and Chinese. It’s built on a diffusion Transformer (DiT) and uses a compressed Video-VAE to keep training efficient without losing quality. The training process is layered, starting with text-to-image learning, then moving to video generation, fine-tuning, and optimization, so the model picks up on both visuals and motion dynamics.
However, as the authors point out, the model still struggles with rare concept combinations (like an elephant and a penguin in the same scene), training on long, high-resolution videos is expensive, and keeping generated videos physically accurate is tricky. Even with 30B parameters, some action sequences don’t turn out quite right.
The model is here:
As usual, the videos shown as generation examples are probably cherry-picked. Obtaining results that good might require significant prompt engineering, and some luck, as we saw with Pyramid Flow.
The technical report is here:
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
PaliGemma 2 Instruct Mix
Google released PaliGemma 2 Mix, a fine-tuned version for a mix of vision-language tasks like OCR, captioning, and answering questions about images.
The original PaliGemma 2 vision-language models (VLMs) are meant to be fine-tuned for specific tasks, but the Mix version gives a preview of how well they can perform across different use cases right out of the box. Note: yes, they are instruct models, and Google should have named them PaliGemma 2 Instruct to follow standard naming conventions.
Six versions are available: 3B, 10B, and 28B, each in two resolutions. I didn’t check carefully, but I assume the higher-resolution variants deliver better results while consuming much more memory during inference.
PaliGemma 2 Mix Models (Gemma license)
They can process both open-ended prompts and task-specific prefixes, but open-ended prompts tend to work better. For instance, you can ask the model to describe an image, detect objects, or extract text, and it will generate responses accordingly. When it comes to performance, different model sizes and resolutions make a difference: higher resolutions are better for detail-heavy tasks, and larger models tend to generate more accurate and descriptive outputs.
GPU Selection of the Week:
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
NVIDIA RTX 50xx GPUs are officially released but already sold out, as expected. I won’t track their prices until I can find them at a “reasonable” price.
Even the 40xx series is unaffordable now.
RTX 4090 (24 GB): None at a reasonable price.
RTX 4080 SUPER (16 GB): None at a reasonable price.
RTX 4070 Ti SUPER (16 GB): None at a reasonable price.
RTX 4060 Ti (16 GB): INNO3D nVidia GeForce RTX 4060 Ti TWIN X2 OC 16GB
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
In The Weekly Salt, I reviewed:
⭐TransMLA: Multi-Head Latent Attention Is All You Need
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Support The Kaitchup by becoming a Pro subscriber:
What You'll Get
Priority Support – Fast, dedicated assistance whenever you need it to fine-tune or optimize your LLM/VLM. I answer all your questions!
Lifetime Access to All the AI Toolboxes – Repositories containing Jupyter notebooks optimized for LLMs and providing implementation examples of AI applications.
Full Access to The Salt – Dive deeper into exclusive research content. Already a paid subscriber to The Salt? You’ll be refunded for the unused time!
Early Access to Research – Be the first to access groundbreaking studies and models by The Kaitchup.
30% Discount for Group Subscriptions – Perfect for teams and collaborators.
The Kaitchup’s Book – A comprehensive guide to LLM fine-tuning. Already bought it? You’ll be fully refunded!
All Benefits from Regular Kaitchup Subscriptions – Everything you already love, plus more. Already a paid subscriber? You’ll be refunded for the unused time!
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions, or 30% for Pro subscribers):
Have a nice weekend!