DeepSeek-R1: Reinforcement Learning with Simple and Verifiable Rewards
Qwen2.5 and Llama 3.x are good students
In a previous article, I wrote that we would see smaller LLMs trained by DeepSeek-V3 “in the coming months.” I was wrong; it only took a few weeks.
DeepSeek AI rapidly post-trained DeepSeek-V3 (the base version) with a straightforward reinforcement learning (RL) pipeline to create a new model, DeepSeek-R1. The resulting model achieves state-of-the-art results on reasoning benchmarks, particularly in math and code, outperforming even commercial models like GPT-4o.
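To give a flavor of what “simple and verifiable rewards” means here: the R1 report describes rule-based checks rather than a learned reward model, namely an accuracy reward on the final answer and a format reward on the reasoning tags. The sketch below only illustrates that idea; the function names and regular expressions are mine, not DeepSeek’s.

```python
# Simplified illustration of rule-based, verifiable rewards in the spirit of R1's
# post-training (accuracy check + format check). The exact reward functions used by
# DeepSeek are not public as code; this is only a sketch of the idea.
import re

def format_reward(completion: str) -> float:
    """Reward the model for wrapping its reasoning in <think>...</think> tags."""
    pattern = r"^<think>.*?</think>.*$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward the model when the final boxed answer matches the reference answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Example: a completion with reasoning tags and a boxed final answer.
completion = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
reward = format_reward(completion) + accuracy_reward(completion, "4")
print(reward)  # 2.0
```

Because these rewards can be computed deterministically from the model’s output, the RL stage does not depend on a separate reward model that could be gamed.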
With 685 billion parameters, R1 is prohibitively expensive to run yourself. However, DeepSeek AI offers an affordable API for accessing the model, and it has also released distilled R1 models based on Llama 3.1/3.3 and Qwen2.5. These distilled models are very impressive and can run on consumer hardware.
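If you only want to query R1, the API is compatible with the OpenAI client. The sketch below assumes the endpoint and model name listed in DeepSeek’s API documentation (https://api.deepseek.com and deepseek-reasoner); the key and prompt are placeholders.

```python
# Minimal sketch of calling DeepSeek-R1 through the hosted, OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",  # placeholder
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1 endpoint name per DeepSeek's docs
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

# The final answer; according to DeepSeek's docs, the chain of thought is also
# returned in a separate `reasoning_content` field of the message.
print(response.choices[0].message.content)
```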
In this article, we will explore the simple RL pipeline used to turn DeepSeek-V3 into R1 and review the distillation process used to train the Qwen2.5 and Llama 3 models. I also quantized some of the released models to 4-bit precision. Since they are based on Qwen2.5 and Llama 3, these models run with most inference frameworks. Additionally, we’ll check their reasoning capabilities and output quality.
The following notebook shows how I quantized the models and ran them with Transformers on a single GPU. It also compares the outputs of the original Llama 3.1/Qwen2.5 models with those of their R1-distilled counterparts.
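As a standalone illustration (independent of the notebook’s exact recipe), here is a minimal sketch that loads one of the distilled checkpoints in 4-bit with Transformers and bitsandbytes (NF4). The model ID, prompt, and generation settings are only examples.

```python
# Sketch: load a distilled R1 checkpoint in 4-bit (NF4) and generate with Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers automatically on the available GPU(s)
)

messages = [{"role": "user", "content": "What is 17 * 23? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

At 4-bit, the 7B and 8B distilled models should fit comfortably in the memory of a single consumer GPU, which is what makes them practical for local experimentation.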