DeepSeek’s open-source LLMs have dominated the open-model leaderboards, and each new release narrows the gap to proprietary systems. The latest build, DeepSeek-R1-0528, is competitive with OpenAI o3 on several downstream tasks.
Power, however, comes at a cost: in full precision, the model’s 671 billion parameters need more than a single node of 8x H100 GPUs, so running it locally is out of reach for most users. Hosted inference endpoints are the usual fallback, but sustained usage quickly becomes expensive.
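To see why, here is a rough back-of-the-envelope check; it assumes the model’s native FP8 weights (one byte per parameter) and ignores the KV cache and activation overhead:

```python
# Rough memory estimate for the un-quantized weights (FP8 = 1 byte per parameter);
# KV cache, activations, and framework overhead are ignored.
total_params = 671e9                      # DeepSeek-R1-0528 parameter count
fp8_weight_gb = total_params * 1 / 1e9    # ~671 GB of weights
node_vram_gb = 8 * 80                     # one node of 8x H100 (80 GB each) = 640 GB

print(f"FP8 weights: ~{fp8_weight_gb:.0f} GB vs. node VRAM: {node_vram_gb} GB")
# -> FP8 weights: ~671 GB vs. node VRAM: 640 GB
```

In BF16 the weights roughly double to about 1.3 TB, so the shortfall only grows.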
The good news is that a mix of aggressive quantization and expert-layer offloading lets you squeeze DeepSeek-R1-0528 onto a single 24 GB GPU. In this article, we’ll:
check the quantized versions that are currently available;
walk through offloading the expert layers to CPU RAM while keeping the attention modules on the GPU (see the sketch after this list); and
measure the resulting memory footprint and tokens-per-second throughput.
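Before diving in, here is a minimal sketch of the expert-offloading idea using Hugging Face transformers with an explicit accelerate device_map. Treat it as an illustration only: the module names follow the DeepSeek-V3/R1 layout shipped with the Hugging Face checkpoint, the repo id is a placeholder for whichever quantized variant you actually load, and the notebook’s setup may differ.

```python
# Sketch: keep attention (and the small, always-used modules) on the GPU, and offload
# the heavy FFN/expert blocks to CPU RAM via an explicit accelerate device_map.
# Module names assume the DeepSeek-V3/R1 layout of the Hugging Face checkpoint.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-R1-0528"   # placeholder; in practice a quantized variant

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Small, always-resident pieces live on the GPU (device 0).
device_map = {"model.embed_tokens": 0, "model.norm": 0, "lm_head": 0}

for i in range(config.num_hidden_layers):
    # Attention and the layer norms stay on the GPU...
    for name in ("self_attn", "input_layernorm", "post_attention_layernorm"):
        device_map[f"model.layers.{i}.{name}"] = 0
    # ...while the FFN/expert block is offloaded: accelerate keeps its weights in
    # CPU RAM and streams them onto the GPU only when that layer actually runs.
    device_map[f"model.layers.{i}.mlp"] = "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
    trust_remote_code=True,
)
```

GGUF runtimes such as llama.cpp expose the same idea through tensor-placement options rather than a Python device map; either way, the point is that only the attention path has to stay resident in VRAM.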
The accompanying notebook shows DeepSeek-R1-0528 (with expert offloading) generating a fully playable Flappy Bird clone on a 24 GB GPU: