Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #2: Training a Reward Model

A key step in training instruct LLMs

Benjamin Marie
Sep 14, 2023


This article is the second in the series on training instruct LLMs with DeepSpeed Chat. If you missed the first one on supervised fine-tuning (SFT), you can find it here:

Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #1: Supervised Fine-tuning (September 4, 2023)

Most of the chat models you can find online only went through SFT. Their performance is already quite good, but to fully train an instruct LLM, we have two more steps to perform:

  • Training a reward model

  • Reinforcement learning (with PPO)

If SFT is already good, why do we need two more steps?

In short, these two steps go a long way toward aligning the model with users. Instruct LLMs trained with them are notably less biased and less toxic than models trained only with SFT.


As we saw, SFT can be costly if you train the model for several epochs. Training the reward model is much more straightforward.

In this article, we will see what the reward model is and how to train it with DeepSpeed Chat.

Training the reward model is achievable with a much smaller base model than the one used for SFT. The main cost is in the training data, which typically requires human annotation.
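To make that concrete, here is a hypothetical example of what one human preference pair could look like. The exact schema and field names depend on the dataset you use; the prompt/chosen/rejected layout below is a common convention, not a required format:

```python
# Hypothetical preference pair for reward model training.
# Human annotators pick which of two responses to the same prompt is better.
preference_pair = {
    "prompt": "Human: How do I make my code run faster?\n\nAssistant:",
    "chosen": " Profile it first to find the hot spots, then optimize those.",
    "rejected": " Just rewrite everything in assembly.",
}
```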

You will find the training code in this notebook:

See the notebook (#15)
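As a conceptual preview (the notebook contains the actual DeepSpeed Chat code), here is a minimal PyTorch sketch of the pairwise ranking objective commonly used for reward models: the model scores the chosen and the rejected answers, and the loss pushes the chosen score higher. The base model name, pooling strategy, and hyperparameters below are illustrative assumptions, not the notebook's settings.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch only: a small backbone plus a scalar head, trained
# with a pairwise ranking loss. Model name and settings are assumptions.
model_name = "facebook/opt-350m"  # placeholder: a small base model suffices
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModel.from_pretrained(model_name)
reward_head = torch.nn.Linear(backbone.config.hidden_size, 1)

def reward(texts):
    """Score a batch of (prompt + response) strings with a scalar reward."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=512)
    hidden = backbone(**batch).last_hidden_state               # (B, T, H)
    # Summarize each sequence with its last non-padding token.
    last_idx = batch["attention_mask"].sum(dim=1) - 1           # (B,)
    pooled = hidden[torch.arange(hidden.size(0)), last_idx]     # (B, H)
    return reward_head(pooled).squeeze(-1)                      # (B,)

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(reward_head.parameters()), lr=1e-5
)

# One training step on a toy preference pair (same prompt, two answers).
chosen = ["Human: Hi\n\nAssistant: Hello! How can I help you today?"]
rejected = ["Human: Hi\n\nAssistant: Whatever."]
optimizer.zero_grad()
# The loss decreases when the chosen answer outscores the rejected one.
loss = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
loss.backward()
optimizer.step()
```

DeepSpeed Chat's second step implements this kind of objective at scale; the notebook shows how to launch it.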

What is the reward model for instruct LLMs?
