Since the release of Mixtral-8x7B by Mistral AI, there has been renewed interest in mixture of experts (MoE) models. This architecture relies on expert sub-networks, of which only a few are selected and activated by a router network during inference.
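To make the routing idea concrete, here is a minimal, self-contained sketch of a sparse MoE layer in PyTorch: a linear router scores the experts and only the top-k of them process each token. This is an illustration, not Mixtral's actual code; the class name, layer sizes, and number of experts are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal sparse MoE layer: a linear router scores the experts
    and only the top-k experts are run for each token."""
    def __init__(self, hidden_size, num_experts, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Toy experts: small feed-forward blocks standing in for full expert FFNs
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, hidden_size)
        scores = self.router(x)                       # (tokens, num_experts)
        weights, indices = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKRouter(hidden_size=64, num_experts=8, k=2)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

In Mixtral-8x7B, each token is routed to 2 of the 8 experts at every MoE layer, which is the pattern the k=2 setting above mimics.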
MoEs are simple and flexible enough that it is easy to make a custom one. On the Hugging Face Hub, we can now find several trending LLMs that are custom MoEs, such as mlabonne/phixtral-4x2_8.
However, most of them are not traditional MoEs trained from scratch; they simply combine already fine-tuned LLMs as experts. Their creation was made easy by mergekit. For instance, the Phixtral LLMs were made with mergekit by combining several Phi-2 models.
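Before we walk through the process, the sketch below shows the general shape of a mergekit MoE merge: a YAML config listing a base model and the expert models with routing prompts, passed to the mergekit-moe command. The model names, prompts, and output directory here are placeholders for illustration, not the actual recipe used for Phixtral or Maixtchup.

```python
# Illustrative sketch of a mergekit MoE merge (placeholder models and prompts,
# not the article's actual Phixtral/Maixtchup configuration).
import pathlib
import subprocess
import textwrap

config = textwrap.dedent("""\
    base_model: mistralai/Mistral-7B-Instruct-v0.2
    gate_mode: hidden            # initialize router weights from prompt hidden states
    dtype: bfloat16
    experts:
      - source_model: mistralai/Mistral-7B-Instruct-v0.2
        positive_prompts: ["general chat", "assistant responses"]
      - source_model: HuggingFaceH4/zephyr-7b-beta
        positive_prompts: ["step-by-step reasoning", "detailed explanations"]
    """)

pathlib.Path("moe_config.yaml").write_text(config)

# mergekit must be installed (pip install mergekit); this writes the merged
# MoE checkpoint to ./my-moe.
subprocess.run(["mergekit-moe", "moe_config.yaml", "./my-moe"], check=True)
```

With gate_mode set to hidden, mergekit uses hidden-state representations of each expert's positive prompts to initialize the router weights, so each expert tends to be activated for the kind of input its prompts describe.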
In this article, we will see how Phixtral was created. We will apply the same process to create our own mixture of experts, Maixtchup, using several Mistral 7B models.
I have implemented a notebook reproducing the creation of Maixtchup. It is available here: