Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #3: Reinforcement Learning from Human Feedback
The efficiency of DeepSpeed Chat in action
This article is the last in the series on training instruct LLMs with DeepSpeed Chat. If you missed the previous articles on supervised fine-tuning (SFT) and reward model training, you can find them here:
Note: Since RLHF reuses the models trained in Steps 1 and 2, I recommend reading those articles first.
This third step is much more complex. It uses reinforcement learning (RL) with Proximal Policy Optimization (PPO). In this article, I explain how it works, why it is complex, and how DeepSpeed Chat optimizes RLHF to make it affordable.
You can find the notebook for training the RLHF model with DeepSpeed Chat here:
Proximal Policy Optimization (PPO) for RLHF
Our goal is to improve the model fine-tuned in Step 1 so that its generated responses are better aligned with human preferences: less toxic, less biased, etc.
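To give an intuition of what PPO optimizes here, below is a minimal, simplified sketch of the reward signal used in this kind of RLHF pipeline. This is illustrative code under my own assumptions, not DeepSpeed Chat's implementation, and all function and variable names are mine: the reward model from Step 2 scores each generated response, while a per-token KL penalty against the frozen SFT model from Step 1 keeps the new policy from drifting too far away from it.

```python
import torch

def rlhf_rewards(reward_scores, policy_log_probs, ref_log_probs, kl_coef=0.1):
    """Build the per-token rewards that PPO maximizes in RLHF.

    reward_scores:    (batch,) scalar score of each full response,
                      given by the reward model trained in Step 2
    policy_log_probs: (batch, seq_len) log-probabilities of the generated
                      tokens under the actor (the model being trained)
    ref_log_probs:    (batch, seq_len) log-probabilities of the same tokens
                      under the frozen SFT model from Step 1
    """
    # KL penalty: negative when the actor assigns a higher probability than
    # the SFT model, which discourages drifting away from Step 1's behavior
    rewards = -kl_coef * (policy_log_probs - ref_log_probs)

    # The reward model's scalar score is credited to the last generated token;
    # PPO then propagates it back through the sequence via advantage estimation
    rewards[:, -1] += reward_scores
    return rewards
```

PPO then uses these rewards, together with a separate critic (value) model, to update the actor with clipped policy-gradient steps, which is where most of the complexity of this third stage comes from.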