A Cheap Zephyr 7B Beta: Distilled DPO on Consumer Hardware
The recipe for training a Zephyr-like model without using A100 GPUs
Hugging Face’s Zephyr 7B Beta is a 7-billion-parameter chat model that outperforms much larger LLMs. In the previous issue of The Kaitchup, we saw what makes the model so good: knowledge distillation.
Hugging Face trained Zephyr with DPO to align it with human preferences. In another article, we also saw why DPO is much simpler than standard reinforcement learning from human feedback (RLHF) while performing just as well.
While Zephyr 7B Beta was relatively cheap to make thanks to DPO and distillation, Hugging Face still needed 16 A100 80 GB GPUs for a few hours to train it. Note: In the cloud, training at that scale would cost a few hundred dollars.
In this article, we will see how to turn Mistral 7B into a Zephyr 7B Beta on consumer hardware, using parameter-efficient fine-tuning combined with quantization (QLoRA). I also adapt the training data created and used by Hugging Face, along with the training hyperparameters, to speed up training and reduce memory consumption.
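To give an idea of what QLoRA looks like in practice, here is a minimal sketch using the standard Hugging Face stack (transformers, bitsandbytes, peft): the base Mistral 7B weights are loaded in 4-bit and frozen, and only a small LoRA adapter is trained, which is what makes DPO fit on a single consumer GPU. The LoRA hyperparameters and target modules below are illustrative assumptions, not the exact values of the recipe, which are detailed later and in the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig

model_name = "mistralai/Mistral-7B-v0.1"

# 4-bit NF4 quantization (the "Q" in QLoRA) shrinks the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # load everything on a single GPU
)

# Only the LoRA adapter (a few million parameters) is trained;
# these values are placeholders, not the recipe's final hyperparameters.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The quantized model and the LoRA configuration are then handed to the DPO trainer, so the GPU only has to hold the 4-bit base weights plus the adapter's gradients and optimizer states.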
My recipe for a cheap Zephyr 7B is implemented in this notebook: