Creating Your Own Mixture of Experts Model with Mergekit
Chapter 1: Introduction to Mixture of Experts (MoE)
The recent launch of Mistral AI's Mixtral-8x7B has sparked a renewed enthusiasm for Mixture of Experts (MoE) models. This architecture leverages various expert sub-networks, activating only a subset through a routing mechanism during inference.
MoEs are remarkably straightforward and adaptable, making it easy to design a personalized MoE. The Hugging Face Hub now features various trending LLMs that incorporate custom MoEs, such as mlabonne/phixtral-4x2_8. However, many of these models aren't traditional MoEs created from the ground up; they typically combine pre-fine-tuned LLMs as experts. This simplification was facilitated by Mergekit (LGPL-3.0 license). For example, Phixtral LLMs were developed using Mergekit by integrating multiple Phi-2 models.
In this article, we will explore the creation of Phixtral and apply similar techniques to build our own mixture of experts, named Maixtchup, utilizing several Mistral 7B models.
Section 1.1: Understanding Phixtral
To grasp the high-level structure of a model, I prefer to visualize it. For instance, here’s how to display the architecture for mlabonne/phixtral-4x2_8 (MIT license):
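from transformers import AutoModelForCausalLM

# Load Phixtral in 4-bit (requires bitsandbytes); the custom MoE code
# ships with the repository, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "mlabonne/phixtral-4x2_8",
    torch_dtype="auto",
    load_in_4bit=True,
    trust_remote_code=True,
)

# Print the module tree to inspect the architecture.
print(model)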
This code will output a detailed model structure, showcasing components such as the PhiModel and the various expert sub-networks.
Subsection 1.1.1: Model Configuration
The output reveals that this model is a PhiModel, built on the Phi-2 architecture. Notably, it incorporates four expert sub-networks, each with its own MLP modules. Phi-2 has 2.7 billion parameters, yet phixtral-4x2_8 totals 7.8 billion rather than 4 × 2.7 = 10.8 billion: only the MLP modules are duplicated per expert, while the self-attention modules are shared.
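The arithmetic behind that figure can be sketched roughly as follows. This is an estimate, not the author's calculation: the Phi-2 dimensions used here (hidden size 2560, MLP intermediate size 10240, 32 layers, about 2.78 billion parameters in total) are assumptions based on the public Phi-2 configuration, and `moe_param_count` is a hypothetical helper.

```python
def moe_param_count(dense_total, mlp_per_layer, num_layers, num_experts, hidden_size):
    """Estimate the size of an MoE that duplicates only the MLP blocks."""
    mlp_total = mlp_per_layer * num_layers
    shared = dense_total - mlp_total   # attention, embeddings, norms, LM head
    experts = num_experts * mlp_total  # one MLP copy per expert in every layer
    # One linear gate per layer mapping hidden_size -> num_experts (no bias assumed).
    router = num_layers * hidden_size * num_experts
    return shared + experts + router


# Assumed Phi-2 dimensions (not stated in this article).
hidden, intermediate, layers = 2560, 10240, 32
# Two linear layers with biases: hidden -> intermediate -> hidden.
phi2_mlp = 2 * hidden * intermediate + intermediate + hidden
total = moe_param_count(2_780_000_000, phi2_mlp, layers, num_experts=4, hidden_size=hidden)
print(f"{total / 1e9:.1f}B parameters")
```

Under these assumptions the estimate lands close to the 7.8 billion quoted above; the router adds only a few hundred thousand parameters, which is negligible next to the experts.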
However, the printed output doesn't show how the model was constructed: from scratch, with random weights, or from existing fine-tuned weights. This information can be found in the model's repository, specifically in the Mergekit configuration file:
base_model: cognitivecomputations/dolphin-2_6-phi-2
gate_mode: cheap_embed
experts:
  - source_model: cognitivecomputations/dolphin-2_6-phi-2
    positive_prompts: [""]
  - source_model: lxuechen/phi-2-dpo
    positive_prompts: [""]
  - source_model: Yhyu13/phi-2-sft-dpo-gpt4_en-ep1
    positive_prompts: [""]
  - source_model: mrm8488/phi-2-coder
    positive_prompts: [""]
This configuration indicates that the base model is cognitivecomputations/dolphin-2_6-phi-2, with various experts initialized from other fine-tuned models.
Section 1.2: Evaluating Performance
The performance of the models generated through this merging process is quite impressive. Although Phixtral has not undergone fine-tuning and the router network's weights are initialized randomly, it outperforms each of its individual experts.
Since the router network isn't fine-tuned, a cost-effective strategy is to decide which experts to activate based on the user's prompt at inference time. For example, in Phixtral, phi-2-coder is specialized for code generation, so we could route prompts that look like coding tasks to that expert.
This concept is thoroughly explored in Charles Goddard's blog post: Mixture of Experts for Clowns (at a Circus).
We can define prompts that help activate the right expert. In the configuration, "positive_prompts" is a list of example prompts, one list per expert, that Mergekit uses to set up the router network so that similar inputs are sent to that expert.
For instance, for mrm8488/phi-2-coder, we could specify:
positive_prompts: ["You are a useful coding assistant.", "Complete the following function:"]
During inference, if a user enters a prompt that closely resembles these positive prompts, the router network will favor "mrm8488/phi-2-coder".
In Phixtral, this is done with the raw token embeddings of the prompts, with the same gate reused for every layer, as specified by "gate_mode: cheap_embed" in the configuration. This approach is simple and cheap. A more accurate representation can be obtained by computing the hidden states of the prompts (Mergekit's "hidden" gate mode), but this requires loading the models into memory.
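To make the idea concrete, here is a minimal sketch of embedding-based routing. It is not Mergekit's implementation: the three-dimensional embedding table, the expert names, and the helper functions are all hypothetical, and a real router scores per-token hidden states inside the model rather than whole prompt strings. It only illustrates how averaged prompt embeddings can act as gate vectors.

```python
import math

# Hypothetical 3-dimensional token embeddings; a real model uses thousands of dimensions.
TOY_EMBEDDINGS = {
    "code": [1.0, 0.0, 0.0],
    "function": [0.9, 0.1, 0.0],
    "math": [0.0, 1.0, 0.0],
    "solve": [0.1, 0.9, 0.0],
    "hello": [0.0, 0.0, 1.0],
}
DIM = 3


def mean_vector(vectors):
    """Element-wise mean of a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed_prompt(prompt):
    """Average the raw embeddings of the known tokens in a prompt."""
    tokens = prompt.lower().split()
    return mean_vector([TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS])


def normalize(vector):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def build_gates(positive_prompts):
    """One gate vector per expert: the normalized mean of its prompt embeddings."""
    return {
        expert: normalize(mean_vector([embed_prompt(p) for p in prompts]))
        for expert, prompts in positive_prompts.items()
    }


def route(prompt, gates, top_k=2):
    """Score each expert by dot product with the prompt embedding; keep the top_k."""
    query = embed_prompt(prompt)
    scores = {e: sum(q * g for q, g in zip(query, gate)) for e, gate in gates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


gates = build_gates({
    "coder": ["code function"],
    "math": ["solve math"],
    "chat": ["hello"],
})
print(route("solve math", gates, top_k=1))
```

With these toy vectors, a math-flavored prompt scores highest against the "math" gate. Note that an empty positive prompt yields a zero gate vector in this sketch, so it carries no routing signal.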
Chapter 2: Building Your Own MoE with Mergekit
Mistral 7B models are good candidates for building an MoE: they are small enough that several of them can be merged without exceeding what standard consumer GPUs can handle. For this tutorial, we will merge four Mistral 7B models.
Here's a configuration example for merging:
base_model: mistralai/Mistral-7B-Instruct-v0.2
dtype: float16
gate_mode: cheap_embed
experts:
  - source_model: HuggingFaceH4/zephyr-7b-beta
    positive_prompts: ["You are a helpful general-purpose assistant."]
  - source_model: mistralai/Mistral-7B-Instruct-v0.2
    positive_prompts: ["You are a helpful assistant."]
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts: ["You are a useful coding assistant."]
  - source_model: meta-math/MetaMath-Mistral-7B
    positive_prompts: ["You are skilled in mathematics."]
I've selected mistralai/Mistral-7B-Instruct-v0.2 as the base model based on its strong performance in general tasks.
To set up Mergekit, first clone the repository and install it along with its dependencies:
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit && pip install -e .
pip install --upgrade transformers
Ensure you're using the latest version of Transformers to support MoEs with the Mixtral architecture. The merging process itself is swift, though downloading the models may take some time.
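The merge itself is launched through Mergekit's `mergekit-moe` command, which takes the configuration file and an output directory. The sketch below wraps that call in Python. The file name `config.yaml`, the output directory, and the two helper functions are assumptions chosen for illustration, not names from the original notebook.

```python
import shutil
import subprocess


def build_merge_command(config_path, output_dir):
    """Return the mergekit-moe invocation as an argument list."""
    return ["mergekit-moe", config_path, output_dir]


def run_merge(config_path, output_dir):
    """Run the merge, failing early if mergekit-moe is not on PATH."""
    if shutil.which("mergekit-moe") is None:
        raise RuntimeError("mergekit-moe not found; install Mergekit first")
    subprocess.run(build_merge_command(config_path, output_dir), check=True)


# Hypothetical paths: adjust to where the config was saved and where the model should go.
command = build_merge_command("config.yaml", "./maixtchup-4x7b")
print(" ".join(command))
```

Calling `run_merge("config.yaml", "./maixtchup-4x7b")` would then download the expert models if needed and write the merged checkpoint to the output directory.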
The final model is saved as kaitchup/Maixtchup-4x7b and contains a total of 24 billion parameters.
To test the model, use the following code:
import torch
import transformers

model_id = "kaitchup/Maixtchup-4x7b"

# Build a text-generation pipeline; the tokenizer is loaded from the same
# repository. 4-bit loading requires bitsandbytes.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

# Format the conversation with the model's chat template before generating.
messages = [{"role": "user", "content": "Do you know how to cook pasta?"}]
prompt = pipeline.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipeline(
    prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95
)
print(outputs[0]["generated_text"])
This will generate a coherent response, demonstrating the capabilities of the Maixtchup model.
Conclusion
Creating MoEs has become more accessible than ever. While Mergekit currently supports a limited range of model architectures, its growing popularity suggests future expansions. In this article, we explored the merging of several models and utilized the resulting model for inference. Although our new MoE generates impressive outputs, remember that it has not undergone fine-tuning. To enhance performance, consider fine-tuning it with QLoRA.
For those interested in reproducing the Maixtchup creation process, a notebook is available, utilizing the same methodology discussed in this article.