The Kaitchup – AI on a Budget

Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #3: Reinforcement Learning with Human Feedback

The efficiency of DeepSpeed Chat in action

Sep 21, 2023

This is the final article in the series on training instruct LLMs with DeepSpeed Chat. If you missed the previous articles on supervised fine-tuning (SFT) and training a reward model, you can find them here:

Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #1: Supervised Fine-tuning (Benjamin Marie, PhD, September 4, 2023)

Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #2: Training a Reward Model (Benjamin Marie, PhD, September 14, 2023)

Note: Since RLHF reuses the models trained in Steps 1 and 2, I recommend reading these articles first.

This third step is much more complex than the first two. It uses reinforcement learning (RL) with Proximal Policy Optimization (PPO). In this article, I will explain how it works, why it is complex, and how DeepSpeed Chat optimizes RLHF to make it affordable.

The notebook for training the RLHF model with DeepSpeed Chat is available here:

Get the notebook (#17)

Proximal Policy Optimization (PPO) for RLHF

Our goal is to improve the model fine-tuned in Step 1 so that the responses it generates are better aligned with human preferences: less toxic, less biased, etc.
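To make the idea behind PPO more concrete, here is a minimal sketch of the standard clipped surrogate loss from the original PPO formulation. It is written in plain Python for readability and is not DeepSpeed Chat's implementation. The function name `ppo_clipped_loss`, its arguments, and the default `clip_eps=0.2` are assumptions chosen for illustration. In an RLHF setting, the advantages would typically be derived from the scores of the reward model trained in Step 2, and the log-probabilities would come from the model fine-tuned in Step 1 before and after an update.

```python
import math


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss (to minimize) over sampled tokens.

    All names and defaults here are illustrative assumptions.
    new_logprobs: log-probabilities of the sampled tokens under the current policy
    old_logprobs: log-probabilities of the same tokens under the policy that sampled them
    advantages:   advantage estimates, one per token
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the current policy and the sampling policy
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps]
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the two objectives, negated to get a loss
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)
```

The clipping is what keeps each update "proximal": once the ratio moves outside the allowed range in the direction that would increase the objective, the loss stops rewarding further movement, so the policy cannot drift too far from the model that generated the responses in a single step.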
