Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #2: Training a Reward Model
A key step in training instruct LLMs
This article is the second in the series on training instruct LLMs with DeepSpeed Chat. If you missed the first one on supervised fine-tuning (SFT), you can find it here:
Most of the chat models that you can find online only went through SFT. The performance of these models is already quite good, but to fully train an instruct LLM we have two more steps to perform:
Training a reward model
Reinforcement learning (with PPO)
If SFT is already good, why do we need to continue with two more steps?
In short, these two steps go a long way toward aligning the model with users' expectations. Notably, instruct LLMs are less biased and less toxic than models trained only with SFT.
As we saw, SFT can be costly if you train the model for several epochs. Training the reward model is much more straightforward.
In this article, we will see what a reward model is and how to train one with DeepSpeed Chat.
This is achievable with a much smaller base model than the one used for SFT. The main cost of training the reward model lies in the training data, which typically requires human input.
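To make that data requirement concrete, here is a minimal sketch, not taken from the article's notebook, that inspects a pairwise preference dataset of the kind used to train a reward model. It assumes the Hugging Face datasets library and the Dahoas/rm-static dataset, one of the preference datasets supported by DeepSpeed Chat; any dataset with the same prompt/chosen/rejected structure would work.

```python
# Minimal sketch (not from the article's notebook): inspect a pairwise
# preference dataset of the kind used to train a reward model.
# Assumes the Hugging Face "datasets" library and the Dahoas/rm-static
# dataset, one of the preference datasets supported by DeepSpeed Chat.
from datasets import load_dataset

ds = load_dataset("Dahoas/rm-static", split="train")
example = ds[0]

# Each example pairs a prompt with a human-preferred ("chosen") response
# and a human-rejected ("rejected") response. The reward model is trained
# to give the chosen response a higher score than the rejected one.
print(example["prompt"][:300])
print("CHOSEN:", example["chosen"][:300])
print("REJECTED:", example["rejected"][:300])
```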
You will find the training code in this notebook: