Train Instruct LLMs On Your GPU with DeepSpeed Chat — Step #1: Supervised Fine-tuning
Instruct LLMs on a budget
Instruct large language models (LLMs) have become extremely popular since OpenAI released ChatGPT. Many chat models mimicking ChatGPT's behavior (many of them are actually trained on ChatGPT's outputs) and fine-tuned for different domains are now available online.
OpenAI describes the procedure to train instruct LLMs in this paper:
Training language models to follow instructions with human feedback (Ouyang et al., 2022)
The paper summarizes the procedure with a figure describing a 3-step process:
Supervised fine-tuning (SFT): Typical fine-tuning performed on prompts (e.g., questions) paired with expected outputs (e.g., answers)
=> This step is detailed in this article
Reward model training (RM): A model is trained to compute a scalar reward for a prompt paired with an output. It learns from rankings of outputs; typically, the datasets for this task are limited to prompts paired with a correct ("chosen") output and an incorrect ("rejected") output.
=> This step is detailed in the following article:
Reinforcement learning (RL): Given a prompt, the model trained at step 1 generates an output that is scored by the reward model trained at step 2. The model is then optimized with reinforcement learning, using this reward, to improve its generations.
=> This step is detailed in the following article:
Most tutorials that you will find online only do supervised fine-tuning (SFT) to train chat models. The main reasons are that open datasets for training steps 2 and 3 are still rare and that SFT already yields reasonably good chat models.
In this series of articles, I’ll show you how to train your own instruct LLM, from step 1 to step 3, on your own computer.
This first article implementing step 1 is accessible by all subscribers but the next articles implementing step 2 and step 3 will be for paid subscribers only. The fine-tuned models will be accessible to everyone. If you are a free subscriber, consider becoming a paid subscriber to support my work. There is a 7-day free trial. It gives you access to all my notebooks and you can cancel anytime:
I won’t explain in detail what step 1 learns since I’ve already written several articles doing SFT. However, I’ll detail step 2 and step 3 in the next articles of this series. We will see why we need these steps and what they optimize.
DeepSpeed Chat to fine-tune instruct LLMs
I will use Microsoft’s DeepSpeed Chat. You could also do all three steps with Hugging Face’s TRL but since I have already written several tutorials for TRL, I think this is a good occasion to learn something new.
DeepSpeed Chat is presented in this arXiv paper (that I can’t recommend reading, as it clearly hasn’t been reviewed):
However, I recommend reading the documentation that you can find on the GitHub of the project (Apache 2.0 license): DeepSpeed Chat repository.
It’s very insightful and I used it extensively to write this article.
As base models, I will use Meta's OPT models (Zhang et al., 2022). It would also work with Llama 2, but it wouldn't be as affordable. Note: The main drawback of using OPT models is that their license forbids commercial use.
Prerequisites
The minimum configuration required to run DeepSpeed Chat mainly depends on:
The size of the base model you want to fine-tune with SFT
The maximum training time you can afford
For (1), OPT-1.3B can be fine-tuned with a GPU equipped with 10 GB of VRAM. But OPT-1.3B is rather small to get a good instruct LLM. I would recommend OPT-6.7B if you can afford a configuration that can load it (e.g., an A100). Note: Quantizing the model and then fine-tuning it is an alternative that you may use, but I leave it for future work. DeepSpeed Chat seems to support quantization (judging by the log produced during training, but I didn't check carefully).
For (2), Ouyang et al. (2022) did SFT for 16 epochs with a batch size of 32. If you want to do the same, it would take several days on consumer hardware. I chose to fine-tune for only 1 epoch with a batch size of 8. It took almost 6 hours.
Considering (1) and (2), you may think you could run DeepSpeed Chat on the free Google Colab… Unfortunately, you can’t. DeepSpeed Chat consumes a lot of CPU RAM. After loading the model, the CPU RAM consumption peaked at 21 GB, almost twice what free Google Colab offers, but still, it’s feasible on consumer hardware. Note: If you can find in the code how to load the safetensors version of the base model instead of the “.bin“ version, you may significantly reduce the CPU RAM consumption.
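For illustration, here is a minimal sketch (not DeepSpeed Chat's own loading code) of how the base model could be loaded from safetensors with Hugging Face transformers, assuming the repository provides safetensors weights; you would still have to adapt DeepSpeed Chat's loading code yourself, and I haven't measured the actual savings:
```python
import torch
from transformers import AutoModelForCausalLM

# Minimal sketch, not DeepSpeed Chat's code: request the safetensors weights
# (if the repository provides them) and a low-memory loading path, which may
# reduce the peak CPU RAM used when instantiating the base model.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    use_safetensors=True,     # prefer model.safetensors over pytorch_model.bin
    low_cpu_mem_usage=True,   # avoid materializing an extra full copy of the weights in RAM
    torch_dtype=torch.float16,
)
```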
To sum up, to reproduce my experiments, you will need:
A GPU with 10 GB of VRAM
24 GB of CPU RAM
For instance, Google Colab PRO with a T4 GPU and high CPU RAM would be enough (cost: around $0.2/hour).
As for the software, you will need to install DeepSpeed, clone the DeepSpeed Chat repository, and install it as a package:
pip install "deepspeed>=0.9.0"
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt
pip install -e .  # installs DeepSpeed Chat itself as a package (needed to run the training scripts)
Dataset for Supervised Fine-tuning
We can find numerous open datasets online that are suitable for SFT. They usually have at least two columns:
instruction/prompt
output or response
These datasets are often called "instruction datasets".
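For example, a row of such a dataset looks like this (a made-up example, not taken from any specific dataset):
```python
# A made-up example of one row of an instruction dataset
example = {
    "instruction": "Explain in one sentence why the sky is blue.",
    "output": "The sky looks blue because air molecules scatter the shorter (blue) wavelengths of sunlight more strongly than the longer (red) ones.",
}
```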
The reason why you can find so many online is that most of them are automatically generated by existing instruct LLMs, such as ChatGPT. Their generation is cheap and easy. For instance, Alpaca, which contains 52k examples, was generated by Stanford for less than $500.
Here is a list of very popular ones that you can use for commercial purposes:
OpenAssistant Conversations Dataset (OASST1) (84.4k training examples)
OpenOrca (4.2M training examples)
openassistant-guanaco (9.8k training examples)
databricks-dolly-15k (15k training examples)
I used the following datasets (also used for demonstration by the DeepSpeed Chat team):
Dahoas/rm-static
76.5k rows
Split selected from Anthropic's Helpful and Harmless (HH-RLHF) dataset for training step 2 (reward model)
Columns: prompt (string), response (string), chosen (string), and rejected (string)
Dahoas/full-hh-rlhf
112k rows
Anthropic's Helpful and Harmless (HH-RLHF) dataset, reformatted
Columns: prompt (string), response (string), chosen (string), and rejected (string)
Dahoas/synthetic-instruct-gptj-pairwise
33.1k rows
Generated with GPT-J
Columns: prompt (string), chosen (string), and rejected (string)
yitingxie/rlhf-reward-datasets
76.3k rows
No information
Columns: prompt (string), chosen (string), and rejected (string)
Note that these datasets have different numbers of columns with some of them containing information that is irrelevant for SFT, but relevant for the next steps.
SFT only uses the columns “prompt“ and “chosen“. There are two datasets, rm-static and full-hh-rlhf, with a significant intersection (the former is an extract of the latter). This is not a problem. In practice, this is equivalent to giving more weight to the duplicated rows.
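To make this concrete, here is a rough sketch of how one SFT training example can be built from these two columns (an illustration of the idea; the exact formatting is handled by DeepSpeed Chat's data utilities):
```python
# Sketch: in these datasets, "prompt" already contains the dialogue formatting
# ("\n\nHuman: ...\n\nAssistant:") and "chosen" holds the preferred answer.
# SFT simply trains the model on the concatenation of the two.
row = {
    "prompt": "\n\nHuman: How does a telescope work?\n\nAssistant:",
    "chosen": " A telescope gathers light with a lens or a mirror and magnifies the image.",
}
sft_text = row["prompt"] + row["chosen"]
print(sft_text)
```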
An interesting feature of DeepSpeed Chat is that we can split the datasets so that different examples are used to train the different steps. For instance, we can indicate that for step 1 we want to use only 20% of the examples. This 20% won’t be used for training the next steps.
I used only 20% of the examples for SFT, which is a total of 59.58k examples.
Supervised Fine-tuning with DeepSpeed Chat
Now that we have selected datasets, we can fine-tune OPT-1.3B (or OPT-6.7B, if you have a big GPU). We are on a budget here, so I will assume that we can’t afford to fully fine-tune the model. Instead, we will fine-tune LoRA adapters.
LoRA adds low-rank tensors on top of each layer while keeping the base model's parameters frozen. We only fine-tune the parameters of the added tensors. LoRA fine-tuning converges faster while reaching performance similar to standard fine-tuning, provided we find good LoRA hyperparameters.
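If you have never looked at how LoRA works under the hood, here is a minimal PyTorch sketch of the idea (an illustration only, not DeepSpeed Chat's implementation): the pretrained weight stays frozen and only the two small low-rank matrices are trained.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-augmented linear layer (illustration only)."""
    def __init__(self, base: nn.Linear, r: int = 128, alpha: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Only these two small matrices are trained
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(2048, 2048), r=128)
y = layer(torch.randn(1, 2048))  # same output shape as the original layer
```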
The notebook to reproduce my experiment is here:
Important Hyperparameters
All the following hyperparameters are given as arguments to DeepSpeed Chat.
Data split
We will use 20% of the training data for SFT while preserving 40% for step 2, and another 40% for step 3.
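To make the split concrete, here is a small sketch of the arithmetic behind a 2,4,4 ratio (the value passed to --data_split below); this is an illustration, not DeepSpeed Chat's exact splitting code:
```python
# A "2,4,4" split ratio gives 2/(2+4+4) = 20% of the examples to step 1 (SFT),
# 40% to step 2 (reward model), and 40% to step 3 (RLHF).
split_ratio = [2, 4, 4]
n_examples = 298_000  # approximate total over the four datasets (76.5k + 112k + 33.1k + 76.3k)

fractions = [r / sum(split_ratio) for r in split_ratio]
start = 0
for step, frac in enumerate(fractions, start=1):
    end = start + int(frac * n_examples)
    print(f"step {step}: {end - start} examples")  # step 1 gets ~59.6k examples
    start = end
```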
Number of epochs
I fine-tuned for 1 epoch, which is less than recommended. OpenAI fine-tuned for 16 epochs, deliberately overfitting the training data. They observed that SFT already reaches its best validation loss after one epoch, but that fine-tuning for many more epochs yields a model that humans judge to be better.
Batch size
I used a small batch size of 8 to update the model's weights in smaller, more frequent steps. Since I fine-tune for only 1 epoch, I prefer the model to learn at a smaller pace.
LoRA hyperparameters
LoRA modules are attached to all the decoder layers (via the --lora_module_name argument shown below). I chose a dimension (rank) of 128, as in the DeepSpeed Chat documentation. Note that I fine-tuned only the LoRA parameters and froze everything else.
Learning rate and weight decay
I obtained better results with a weight decay of 0. For the learning rate, I only tried 1e-3.
Running DeepSpeed Chat
SFT on the selected datasets is just one command line, with many arguments.
cd training/step1_supervised_finetuning/
deepspeed --num_gpus 1 main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path facebook/opt-1.3b \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--max_seq_len 512 \
--learning_rate 1e-3 \
--weight_decay 0. \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 100 \
--seed 1234 \
--gradient_checkpointing \
--zero_stage 0 \
--lora_dim 128 \
--only_optimize_lora \
--lora_module_name decoder.layers. \
--deepspeed \
--output_dir results
Note: Let me know in the comments if some arguments are unclear (the documentation is misleading for some of them, especially those related to LoRA).
If everything goes well, after almost 6 hours (using a T4 GPU) SFT will finish and the last lines of the training log should look like this:
Model Parameters: 1.429 B, Latency: 2.62s, TFLOPs: 13.03, Samples/sec: 3.05, Time/seq 0.33s, Batch Size: 8, Sequence Length: 512
[2023-09-04 12:39:44,748] [INFO] [logging.py:96:log_dist] [Rank 0] step=7360, skipped=73, lr=[2.4944575557050984e-07], mom=[(0.9, 0.95)]
[2023-09-04 12:39:44,769] [INFO] [timer.py:260:stop] epoch=0/micro_step=7360/global_step=7360, RunningAvgSamplesPerSec=3.0241130365529925, CurrSamplesPerSec=3.441302669205886, MemAllocated=4.31GB, MaxMemAllocated=6.98GB
Model Parameters: 1.429 B, Latency: 2.33s, TFLOPs: 14.70, Samples/sec: 3.44, Time/seq 0.29s, Batch Size: 8, Sequence Length: 512
***** Evaluating perplexity, Epoch 1/1 *****
ppl: 2.040890693664551
saving the final model ...
[2023-09-04 12:48:22,459] [INFO] [launch.py:347:main] Process 1060 exits successfully.
Note: Interestingly, I forgot to pass “--only_optimize_lora“ during my first attempt. It kept the base model parameters unfrozen and reached a significantly higher perplexity (2.7). Double-check your script arguments before launching since the training log doesn’t mention anything about LoRA even if you use it.
Note that it automatically merges the LoRA adapter into the base model after training.
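Merging simply folds the trained low-rank update back into the original weight matrices, so the saved checkpoint is a plain OPT model that needs no LoRA-specific code at inference time. A minimal sketch of the operation (illustration only, not DeepSpeed Chat's exact code):
```python
import torch

# Sketch of merging a LoRA adapter into the frozen base weight
out_features, in_features, r, alpha = 2048, 2048, 128, 1.0
W = torch.randn(out_features, in_features)    # frozen pretrained weight
lora_A = torch.randn(r, in_features) * 0.01   # trained LoRA factors
lora_B = torch.randn(out_features, r) * 0.01

W_merged = W + (alpha / r) * (lora_B @ lora_A)  # same shape as W, no extra parameters at inference
```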
You will find the model here on the Hugging Face Hub:
I will use it again for step 3.
DeepSpeed Chat fine-tuning feels fast. But I can’t say for sure whether it’s really faster than Hugging Face TRL on the hardware configuration I used. Maybe this would be an interesting comparison for another article.
Testing the model
DeepSpeed Chat also provides a script for comparing the fine-tuned model with another model. Here, I compare my fine-tuned model with the original OPT-1.3B.
python prompt_eval.py --model_name_or_path_baseline facebook/opt-1.3b \
--model_name_or_path_finetune kaitchup/OPT-1.3B-SFT-DSChatLoRA
The models (“Assistant“) generated the following responses to the “Human“ prompts. Note: You will find more examples in the notebook.
==========Baseline: Greedy=========
Human: How does a telescope work?
Assistant: It's a telescope.
Human: How does a telescope work?
Assistant: It's a telescope.
Human: How does a telescope work?
Assistant: It's a telescope.
Human: How does a telescope work?
Assistant: It's a telescope. […]
==========finetune: Greedy=========
Human: How does a telescope work?
Assistant: A telescope works by focusing light from a source such as the sun or a star onto a lens, which focuses the light onto a mirror. The mirror is then moved to focus the light onto a screen, which is then used to project the image onto a screen. The image is then magnified by a lens, which allows the image to be viewed by a viewer.<|endoftext|>
==========Baseline: Greedy=========
Human: Why do birds migrate south for the winter?
Assistant: Because they can't fly. I'm not sure if this is a joke or not. I'm not sure if this is a joke or not. I'm not sure if this is a joke or not. I'm not sure if this is a joke or not. I'm not sure if this is a joke or not. I'm not sure if this is a joke or not. I'm not sure if this is a
==========finetune: Greedy=========
Human: Why do birds migrate south for the winter?
Assistant: Birds migrate south for a variety of reasons, including the availability of food, the weather, and the availability of water. They also migrate south to avoid the cold winters in their native regions. Additionally, some birds migrate south to breed, while others migrate south to find a mate.<|endoftext|>
We can see that the SFT model is much better. Even though I didn't apply optimal training hyperparameters, the model answers simple prompts correctly. But for more complex tasks, we need to apply the next steps.
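If you want to try the fine-tuned model outside of DeepSpeed Chat's evaluation script, a minimal transformers sketch could look like this (the prompt follows the Human/Assistant convention of the training data; the generation settings are illustrative, not the ones used by prompt_eval.py):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "kaitchup/OPT-1.3B-SFT-DSChatLoRA"
# If the tokenizer files are not included in the fine-tuned repo, load them from facebook/opt-1.3b instead
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt = "Human: How does a telescope work?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```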
That’s all for SFT with DeepSpeed Chat. In my previous articles, we already saw how to do SFT with Hugging Face TRL, for instance, using QLoRA on Llama 2:
In the next article of this series, we will see how to train the reward model and RLHF:
Hey Benjamin! Thanks for this great series, loving the more practical content.
I think you missed out a command in the notebook to install the repository itself as a package.
```
!pip install deepspeed>=0.9.0
!git clone https://github.com/microsoft/DeepSpeedExamples.git
%cd DeepSpeedExamples/applications/DeepSpeed-Chat/
!pip install -r requirements.txt
!pip install -e . # This part is needed to run the code
```
Great series.
Btw that’s a huge LoRA rank compared to the LoRA paper (which uses 4), but I suppose it’s good if DeepSpeed recommends it...?