DeepSeek-V3 is one of the best open LLMs, outperforming most others across a wide range of tasks. With 671 billion parameters, you would expect it to require multiple GPU nodes and to run very slowly, even on expensive hardware. However, DeepSeek-V3 is actually much faster at inference than smaller dense models like Llama 3.3 (70B) and Qwen2.5 (72B).
So, how does DeepSeek-V3 manage to be so efficient despite being so large?
This article explains how DeepSeek-AI made this possible. Building on their earlier work with DeepSeek and DeepSeek-V2, they use a mixture-of-experts (MoE) architecture that combines many small routed experts with shared experts, along with multi-head latent attention (MLA). They also trained the model in FP8 precision, making it far more memory-efficient than other models of a similar size.
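To make the shared-plus-routed-experts idea concrete, here is a minimal, illustrative PyTorch sketch of such a layer. The class name, dimensions, and expert counts are toy values chosen for readability, not DeepSeek-V3's actual configuration, and the real router includes load-balancing machinery that is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy MoE block: one always-on shared expert plus a router that
    activates only top_k of the small routed experts for each token."""
    def __init__(self, d_model=64, d_ff=128, n_routed=8, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
            )
        self.shared_expert = make_expert()
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        shared_out = self.shared_expert(x)           # runs for every token
        scores = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weights, indices = scores.topk(self.top_k, dim=-1)
        routed_out = []
        for tok in range(x.size(0)):
            # Only top_k experts run per token; the remaining parameters stay idle,
            # which is why the compute per token stays far below the full parameter count.
            mix = sum(w * self.routed_experts[int(i)](x[tok])
                      for w, i in zip(weights[tok], indices[tok]))
            routed_out.append(mix)
        return shared_out + torch.stack(routed_out)

x = torch.randn(4, 64)                 # 4 tokens with a toy hidden size of 64
print(ToyMoELayer()(x).shape)          # torch.Size([4, 64])
```

The key point the sketch illustrates is that the layer's total parameter count grows with the number of routed experts, while the per-token compute depends only on the shared expert and the top_k selected experts.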
We’ll also examine the hardware needed to run DeepSeek-V3 and the cost of running a copy of the model in the cloud.
If you want to try it yourself, my notebook, linked below, includes multi-GPU vLLM inference code to get you started with DeepSeek-V3.
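For a rough idea of what that setup involves, here is a minimal multi-GPU vLLM sketch. The model ID is DeepSeek-V3's Hugging Face repository; the tensor_parallel_size, context length, and sampling settings are placeholder values that you will need to adapt to your hardware. See the notebook for the complete setup.

```python
from vllm import LLM, SamplingParams

# Shard the model across the GPUs of one node with tensor parallelism.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,    # adjust to the number of GPUs available
    trust_remote_code=True,    # DeepSeek-V3 ships custom model code
    max_model_len=4096,        # keep the KV cache small for a first test
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(
    ["Explain what a mixture-of-experts model is."], sampling_params
)
print(outputs[0].outputs[0].text)
```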