Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #3: Reinforcement Learning from Human Feedback
The efficiency of DeepSpeed Chat in action
This article is the last in the series on training instruct LLMs with DeepSpeed Chat. If you missed the previous articles on supervised fine-tuning (SFT) and reward model training, you can find them here:
Note: Since RLHF reuses the models trained in Steps 1 and 2, I recommend reading those articles first.
This third step is much more complex. It uses reinforcement learning (RL) with Proximal Policy Optimization (PPO). In this article, I explain how it works, why it is complex, and how DeepSpeed Chat optimizes RLHF to make it affordable.
You can find the notebook for training the RLHF model with DeepSpeed Chat here:
Proximal Policy Optimization (PPO) for RLHF
Our goal is to improve the model fine-tuned in Step 1 so that its generated responses are better aligned with human preferences: less toxic, less biased, etc.
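To give an intuition of what PPO optimizes here, below is a minimal, simplified sketch of the reward signal used in this kind of RLHF pipeline. This is illustrative code under my own assumptions, not DeepSpeed Chat's implementation, and all function and variable names are mine: the reward model from Step 2 scores each generated response, while a per-token KL penalty against the frozen SFT model from Step 1 keeps the new policy from drifting too far away from it.

```python
import torch

def rlhf_rewards(reward_scores, policy_log_probs, ref_log_probs, kl_coef=0.1):
    """Build the per-token rewards that PPO maximizes in RLHF.

    reward_scores:    (batch,) scalar score of each full response,
                      given by the reward model trained in Step 2
    policy_log_probs: (batch, seq_len) log-probabilities of the generated
                      tokens under the actor (the model being trained)
    ref_log_probs:    (batch, seq_len) log-probabilities of the same tokens
                      under the frozen SFT model from Step 1
    """
    # KL penalty: negative when the actor assigns a higher probability than
    # the SFT model, which discourages drifting away from Step 1's behavior
    rewards = -kl_coef * (policy_log_probs - ref_log_probs)

    # The reward model's scalar score is credited to the last generated token;
    # PPO then propagates it back through the sequence via advantage estimation
    rewards[:, -1] += reward_scores
    return rewards
```

PPO then uses these rewards, together with a separate critic (value) model, to update the actor with clipped policy-gradient steps, which is where most of the complexity of this third stage comes from.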