The Kaitchup – AI on a Budget

Running DeepSeek-R1-0528 with a Single 24 GB GPU

Is it worth it?

Benjamin Marie
Jun 02, 2025

DeepSeek’s open-source LLMs have dominated the open-model leaderboards, and each new release closes the gap to proprietary systems. The latest build, DeepSeek-R1-0528, competes with OpenAI o3 on several downstream tasks.

Power, however, comes at a cost: in full precision, the model's 671 billion parameters spill beyond even a single node of 8x H100 GPUs, so running it locally is out of reach for most users. Hosted inference endpoints are the usual fallback, but sustained usage quickly becomes expensive.
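A quick back-of-the-envelope check makes the scale concrete. The sketch below counts weights only (no KV cache or activations), and the byte-per-parameter figures are approximations: DeepSeek-R1's weights are released in FP8, and "~2-bit" stands in for the aggressive quantization discussed later.

```python
# Rough weight-memory estimate for DeepSeek-R1-0528 (671B parameters).
# Weights only; KV cache and activations add more on top.

PARAMS = 671e9
GIB = 1024**3

def weight_gib(bytes_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return PARAMS * bytes_per_param / GIB

print(f"FP8  (as released): {weight_gib(1.0):6.0f} GiB")   # more than 8x80 GB minus overhead
print(f"BF16              : {weight_gib(2.0):6.0f} GiB")
print(f"~2-bit quantized  : {weight_gib(0.25):6.0f} GiB")
```

Even at roughly 2 bits per weight, the experts alone exceed 24 GB of VRAM, which is why offloading them to CPU RAM is needed on top of quantization.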

The good news is that a mix of aggressive quantization and expert-layer offloading lets you squeeze DeepSeek-R1-0528 onto a single 24 GB GPU. In this article, we’ll:

  • check the quantized versions that are currently available;

  • walk through offloading expert layers into CPU RAM while keeping the attention modules on the GPU; and

  • measure the resulting memory footprint and tokens-per-second throughput.
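The split in the second step can be sketched as a Transformers-style `device_map` that pins attention, norms, and the router to GPU 0 and pushes the routed experts to CPU. The module names and layer counts below follow the typical DeepSeek-V3/R1 layout but are illustrative; check the actual model's `named_modules()` before relying on them.

```python
# Sketch: place attention and routing on the GPU, routed experts on CPU.
# Module names are assumptions based on the DeepSeek-V3/R1 architecture.

def build_device_map(num_layers: int, num_dense_layers: int = 3) -> dict:
    """Map attention/shared modules to GPU 0 and routed experts to CPU."""
    device_map = {
        "model.embed_tokens": 0,
        "model.norm": 0,
        "lm_head": 0,
    }
    for i in range(num_layers):
        device_map[f"model.layers.{i}.self_attn"] = 0
        device_map[f"model.layers.{i}.input_layernorm"] = 0
        device_map[f"model.layers.{i}.post_attention_layernorm"] = 0
        if i < num_dense_layers:
            # The first few layers use a dense MLP: small enough for the GPU.
            device_map[f"model.layers.{i}.mlp"] = 0
        else:
            # Routed experts hold most of the weights: offload to CPU RAM.
            device_map[f"model.layers.{i}.mlp.experts"] = "cpu"
            device_map[f"model.layers.{i}.mlp.shared_experts"] = 0
            device_map[f"model.layers.{i}.mlp.gate"] = 0
    return device_map

dmap = build_device_map(num_layers=61)
```

A map like this can be passed as `device_map=` when loading the model; the trade-off is that expert weights must cross the PCIe bus each forward pass, which is what the throughput measurements below quantify.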

The accompanying notebook shows DeepSeek-R1-0528 (with expert offloading) generating a fully playable Flappy Bird clone on a 24 GB GPU:
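Throughput in the notebook boils down to tokens generated over wall-clock time. A minimal helper, where `generate_fn` stands in for whatever generation call you use (e.g., `model.generate` plus a tokenizer) and returns the number of new tokens produced:

```python
import time

def tokens_per_second(generate_fn) -> float:
    """Time one generation call and return decoding throughput."""
    start = time.perf_counter()
    n_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy stand-in for a real model call, just to show the usage:
def fake_generate():
    time.sleep(0.05)   # pretend decoding takes 50 ms
    return 100         # pretend 100 tokens were produced

rate = tokens_per_second(fake_generate)
```

With expert offloading, expect this number to be dominated by CPU-to-GPU transfer time rather than raw GPU compute.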

Get the notebook (#168)

Which Quantized DeepSeek-R1-0528 Should You Use?
