Fine-tune a Mixture of Experts on Your Computer

Cheap supervised fine-tuning for MoEs with QLoRA

Benjamin Marie
Jan 25, 2024


A wave of mixture of experts (MoE) and merged models is currently surging on the Hugging Face Hub. In a previous article, we reviewed Phixtral, a merge of several Phi-2 LLMs, and made our own MoE:

Maixtchup: Make Your Own Mixture of Experts with Mergekit (Benjamin Marie, January 18, 2024)

While these new LLMs simply merge several existing LLMs, they often outperform, without any fine-tuning, the individual models they combine.


In this article, I first discuss the memory requirements for fine-tuning MoE/merged LLMs and then show how to fine-tune them with QLoRA. For demonstration, I use my Maixtchup model, but the same approach works for any other MoE model built from Llama 2 or Mistral 7B models.

I have implemented this fine-tuning in the following notebook:

Get the notebook (#40)

You can get Maixtchup from the Hugging Face Hub:

  • kaitchup/Maixtchup-4x7b
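
Before downloading the weights, we can check what we are working with by inspecting the model's configuration. The sketch below is not from the article's notebook; it assumes the merge exposes a Mixtral-style configuration, which is what mergekit's MoE merging typically produces.

```python
# Minimal sketch: inspect the merged model's configuration (no weights downloaded).
# Assumes a Mixtral-style MoE config, as produced by mergekit's MoE merging.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("kaitchup/Maixtchup-4x7b")
print(config.model_type)           # expected: "mixtral"
print(config.num_local_experts)    # expected: 4 experts
print(config.num_experts_per_tok)  # experts activated per token by the router
```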

How Much Memory Do We Need to Fine-tune 4xMistral 7B?

Maixtchup is a 24 billion parameter model. Fine-tuning it on consumer hardware is challenging.

One fp16 (16-bit) parameter occupies 2 bytes in memory, so 24 billion fp16 parameters occupy 48 GB. Even an expensive 48 GB GPU wouldn't be enough to load the entire model. To load it on a consumer GPU (e.g., with 24 GB of VRAM), we could offload 3 of the 4 experts to the CPU RAM, but then fine-tuning would become extremely slow.
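
As a back-of-the-envelope check, the arithmetic is simply the number of parameters times the bytes per parameter. A tiny sketch, not from the article:

```python
# Rough estimate of the memory needed just to hold the weights in fp16.
num_params = 24e9            # Maixtchup has ~24 billion parameters
bytes_per_fp16_param = 2     # fp16 = 16 bits = 2 bytes
print(f"{num_params * bytes_per_fp16_param / 1e9:.0f} GB")  # -> 48 GB, weights only
```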

Run Mixtral-8x7B on Consumer Hardware with Expert Offloading (Benjamin Marie, January 8, 2024)

The best alternative is to quantize the model to 4-bit and then fine-tune an adapter on top of it. This is the QLoRA method. If you don't know about QLoRA, we have used it several times in previous articles, and you can find a detailed explanation of how it works here:

QLoRa: Fine-Tune a Large Language Model on Your GPU (Benjamin Marie, May 30, 2023)

With 4-bit quantization, one parameter occupies only 0.5 bytes, which divides the size of the model in memory by 4. We need about 14 GB (not 12 GB, since not all the parameters are quantized) to load a quantized Maixtchup.
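
In practice, this 4-bit loading can be done with bitsandbytes through Transformers. The following is a sketch under common QLoRA settings (NF4, double quantization, bfloat16 compute); the exact configuration in the notebook may differ.

```python
# Sketch: load Maixtchup quantized to 4-bit with bitsandbytes, as QLoRA requires.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 during forward/backward
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "kaitchup/Maixtchup-4x7b",
    quantization_config=bnb_config,
    device_map={"": 0},  # put everything on GPU 0
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 14 GB
```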

So we need a GPU with at least 14 GB of VRAM, but we still don't know how much memory the fine-tuning itself will require. That depends on several hyperparameters, mainly the training batch size and the maximum sequence length of the training examples.

Since we will fine-tune the model on long training examples (see next section), I set the maximum sequence length to 1,024. I will also activate FlashAttention 2 to reduce the computational cost and memory consumption of training over long sequences of tokens.

Use FlashAttention-2 for Faster Fine-tuning and Inference (Benjamin Marie, November 16, 2023)
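
Concretely, enabling FlashAttention-2 comes down to one extra argument at load time. This sketch extends the 4-bit loading snippet above (it reuses bnb_config) and assumes the flash-attn package is installed and the GPU supports it.

```python
# Sketch: same 4-bit load as above, with FlashAttention-2 enabled.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "kaitchup/Maixtchup-4x7b",
    quantization_config=bnb_config,           # the 4-bit config defined earlier
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
```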

It remains to determine the memory cost for a given batch size. To find out, I ran fine-tuning with different batch sizes.

If you target a 24 GB GPU (RTX 3090/4080/4090), the maximum batch size is 3, but for training efficiency it is recommended to use a value divisible by 2, so I set it to 2.

If you target a 16 GB GPU (RTX 4060 Ti/4070 Ti Super), even a batch size of 1 consumes too much memory. You will need to reduce the maximum sequence length to 512; a batch size of 2 then works.
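
To make these numbers concrete, here is an illustrative training setup for the 24 GB case (batch size 2, sequences of 1,024 tokens). The LoRA hyperparameters, learning rate, and dataset below are placeholders, not necessarily the values used in the article's notebook.

```python
# Illustrative QLoRA fine-tuning setup for a 24 GB GPU (placeholder hyperparameters).
from peft import LoraConfig
from transformers import AutoTokenizer, TrainingArguments
from trl import SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("kaitchup/Maixtchup-4x7b")
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./maixtchup-qlora",
    per_device_train_batch_size=2,    # 3 fits in 24 GB, but 2 is preferred
    gradient_accumulation_steps=8,    # larger effective batch size without more VRAM
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,                  # the 4-bit model loaded above
    args=training_args,
    train_dataset=train_dataset,  # instruction dataset with a "text" column (next section)
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=1024,          # use 512 for a 16 GB GPU
)
trainer.train()
```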


Distilled Supervised Fine-tuning

This supervised fine-tuning (SFT) aims to make a chat model with Maixtchup as the base model. We want this chat model to be able to answer instructions, so we need an instruction dataset for training: instructions paired with correct answers.
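
To illustrate the expected format, here is a hypothetical example of turning instruction/answer pairs into the single text field consumed by the trainer sketched above. The sample pair and prompt template are made up for illustration, not taken from the article's dataset.

```python
# Hypothetical instruction dataset preparation (the pair and template are illustrative).
from datasets import Dataset

pairs = [
    {"instruction": "Explain what a mixture of experts is.",
     "answer": "A mixture of experts routes each token to a small subset of expert sub-networks..."},
]

def format_example(example):
    # Concatenate instruction and answer into one "text" field for supervised fine-tuning.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['answer']}"}

train_dataset = Dataset.from_list(pairs).map(format_example)
print(train_dataset[0]["text"])
```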
