The Kaitchup – AI on a Budget

GLM-5 Memory Requirements Explained: MLA + DeepSeek Sparse Attention (DSA)

How GLM-5 fits 200K context without terabytes of KV cache, and what GPUs you need.

Benjamin Marie
Feb 16, 2026

GLM-4.7 was released only two months ago, but Zhipu AI (Z.ai) has already followed up with a stronger successor: GLM-5.

One of the headline changes is the introduction of DeepSeek Sparse Attention (DSA), layered on top of Multi-Head Latent Attention (MLA), to further speed up long-context inference.
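
To make the mechanism concrete, here is a minimal PyTorch sketch of the core idea behind DSA as DeepSeek described it for V3.2-Exp: a cheap indexer scores past tokens, and attention is computed only over each query's top-k selections. This is a toy illustration, not Z.ai's kernel. In particular, the function name, the `index_scores` input, and the dense (T, T) score matrix are simplifications I chose for readability; a real implementation computes indexer scores with a small learned module and never materializes the full matrix.

```python
import torch

def topk_sparse_attention(q, k, v, index_scores, top_k):
    """Toy top-k sparse attention for a single head.

    q, k, v: (T, d) tensors; index_scores: (T, T) relevance scores
    from a cheap indexer (passed in here; a small module in practice).
    """
    T, d = q.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    # Each query keeps only its top_k highest-scoring past tokens.
    idx_scores = index_scores.masked_fill(~causal, float("-inf"))
    topk_idx = idx_scores.topk(min(top_k, T), dim=-1).indices  # (T, top_k)
    # Standard causal attention logits...
    att = (q @ k.T) / d**0.5
    att = att.masked_fill(~causal, float("-inf"))
    # ...but every position the indexer did not select is dropped.
    sparse = torch.full_like(att, float("-inf"))
    sparse.scatter_(1, topk_idx, att.gather(1, topk_idx))
    return torch.softmax(sparse, dim=-1) @ v

T, d, top_k = 1024, 64, 128
q, k, v = (torch.randn(T, d) for _ in range(3))
out = topk_sparse_attention(q, k, v, index_scores=torch.randn(T, T), top_k=top_k)
print(out.shape)  # torch.Size([1024, 64])
```

The point of the selection step is that, at long context, each query only needs attention over top_k tokens instead of all of them, so the expensive part of attention grows roughly as O(T·k) rather than O(T²).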

GLM-5 is also substantially larger: 744B parameters, up from 355B for GLM-4.7.
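
To put that jump in perspective, here is a quick back-of-the-envelope on the weights alone, in decimal gigabytes, ignoring the KV cache, activations, and quantization metadata that come on top (and that we break down later in this article):

```python
# Weight-only memory for GLM-5's 744B parameters at common precisions.
params = 744e9
for fmt, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{fmt}: {params * bits / 8 / 1e9:,.0f} GB")
# BF16: 1,488 GB · FP8: 744 GB · INT4: 372 GB
```

Even at 4-bit, the weights alone exceed any single GPU, which is why the comparison of quantized variants matters.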

In this article, I’ll use this release as an opportunity to explain what DSA brings to the table and why it may become the default going forward. Then we’ll look at what’s new in GLM-5, what it takes to run it in practice, and what the hardware requirements look like. Finally, we’ll break down memory consumption and compare the available quantized variants.

If you want a refresher on how MLA works, I covered it here:

Run GLM-4.7 Flash on One GPU: VRAM Math, Quantization Options, and Benchmark Results

DeepSeek Sparse Attention to Speed Up Inference
