- Overview
- Key Features
- Architecture
- Installation
- Quick Start
- Configuration
- Training
- Advanced Usage
- How It Works
- Performance & Efficiency
- Troubleshooting
- Citation
- Contributing
MoE-LoRA transforms standard decoder-only language models (like Mistral 7B) into efficient Mixture-of-Experts (MoE) models (similar to Mixtral 8x7B) using Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation).
Instead of training billions of parameters from scratch, MoE-LoRA injects trainable LoRA adapters into the Feed-Forward Network (FFN) layers, creating multiple "expert" pathways while keeping the base model frozen. This approach dramatically reduces training costs while enabling MoE capabilities.
Mixture-of-Experts models route each token to a subset of specialized "expert" networks, allowing the model to be larger while keeping computational costs manageable. Only a few experts process each token, providing efficiency at scale.
Low-Rank Adaptation (LoRA) freezes pre-trained model weights and injects trainable low-rank matrices into each layer, reducing the number of trainable parameters by orders of magnitude while maintaining model quality.
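To make the LoRA idea concrete, here is a minimal pure-Python sketch of a single adapted projection. It is illustrative only and is not the repository's implementation: the helper names (`matvec`, `lora_linear`, `lora_param_count`) and the toy sizes are assumptions made for this example, and the 4096 x 14336 shape is chosen to resemble a 7B-class FFN projection.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_linear(x, W, A, B, scale=1.0):
    """Frozen projection W @ x plus the low-rank update scale * B @ (A @ x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

random.seed(0)
d_in, d_out, rank = 6, 4, 2  # toy sizes, hypothetical
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]                                 # trainable, zero-initialised
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
assert lora_linear(x, W, A, B) == matvec(W, x)

# One 4096 x 14336 projection: 147456 adapter parameters vs 58720256 frozen ones (about 0.25%).
full = 4096 * 14336
adapter = lora_param_count(4096, 14336, rank=8)
print(adapter, full, f"{adapter / full:.4%}")
```

Zero-initialising `B` is a common LoRA convention; whether this project initialises its adapters the same way is not stated here.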
By combining MoE and LoRA, you can:
- Convert existing models to MoE architecture without full retraining
- Train with minimal GPU memory using quantization
- Achieve parameter efficiency (train <1% of total parameters)
- Experiment with different expert configurations rapidly
- Deploy expert systems on consumer hardware
- Parameter Efficient: Train only LoRA adapters (~0.1-1% of model parameters)
- Memory Efficient: Supports 4-bit and 8-bit quantization via bitsandbytes
- Flexible Architecture: Configure number of experts, routing strategy, and expert rank
- Compatible: Works with any Mistral-based or LLaMA-based model
- Router Learning: Trainable gating network with optional auxiliary loss
- Production Ready: Includes training scripts for OpenAssistant and Wikipedia datasets
┌─────────────────────────────────────────┐
│           Input Embeddings              │
└──────────────┬──────────────────────────┘
               │
      ┌────────▼────────┐
      │ Self-Attention  │ (Frozen)
      └────────┬────────┘
               │
      ┌────────▼────────────────────────────┐
      │         MoE-LoRA Block              │
      │  ┌──────────────────────────────┐   │
      │  │   Router (Gating Network)    │   │
      │  └──────┬───────────────────────┘   │
      │         │                           │
      │    ┌────▼──────┐  Top-K Selection   │
      │    │ Expert 1  │  (LoRA Adapters)   │
      │    │ Expert 2  │                    │
      │    │   ...     │                    │
      │    │ Expert N  │                    │
      │    └───────────┘                    │
      │         │                           │
      │    Weighted Sum                     │
      └────────┬────────────────────────────┘
               │
         ┌─────▼──────┐
         │   Output   │
         └────────────┘
Each LoraExpert wraps the frozen FFN with three LoRA adapters:
- gate_lora: Low-rank adaptation for gate projection
- up_lora: Low-rank adaptation for up projection
- down_lora: Low-rank adaptation for down projection
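As a rough sketch of how three adapters could combine with a frozen gated FFN, the pure-Python example below applies a LoRA delta to each of the gate, up, and down projections of a Mistral/LLaMA-style FFN, `down(silu(gate(x)) * up(x))`. The function names, the dictionary layout, and the toy weights are assumptions for illustration; the actual LoraExpert class may organize this differently.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def silu(z):
    """SiLU activation used by Mistral/LLaMA-style FFNs."""
    return z / (1.0 + math.exp(-z))

def adapted(x, W, lora, scale):
    """Frozen projection W @ x plus the LoRA delta scale * B @ (A @ x)."""
    A, B = lora
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(matvec(W, x), delta)]

def lora_expert_forward(x, mlp, expert, scale=1.0):
    """Gated FFN with each of the three projections LoRA-adapted."""
    gate = adapted(x, mlp["gate_proj"], expert["gate_lora"], scale)
    up = adapted(x, mlp["up_proj"], expert["up_lora"], scale)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return adapted(hidden, mlp["down_proj"], expert["down_lora"], scale)

# Toy frozen FFN: hidden size 2, intermediate size 3 (illustrative values).
mlp = {
    "gate_proj": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]],
    "up_proj": [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]],
    "down_proj": [[0.3, -0.5, 0.2], [0.1, 0.4, -0.6]],
}
# Rank-1 adapters stored as (A, B) pairs: A is rank x d_in, B is d_out x rank.
expert = {
    "gate_lora": ([[0.1, 0.2]], [[0.3], [-0.1], [0.2]]),
    "up_lora": ([[-0.2, 0.1]], [[0.1], [0.4], [-0.3]]),
    "down_lora": ([[0.2, -0.1, 0.3]], [[0.5], [-0.2]]),
}

x = [1.0, -2.0]
y_adapted = lora_expert_forward(x, mlp, expert)
y_frozen = lora_expert_forward(x, mlp, expert, scale=0.0)  # scale 0 disables the adapters
print(y_adapted, y_frozen)
```

Because the base weights are shared and only the small `(A, B)` pairs differ, several experts can reuse one frozen FFN, which is what keeps the per-expert cost low.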
- Python 3.8+
- CUDA-capable GPU (recommended: 16GB+ VRAM)
- PyTorch 2.0+
# Clone the repository
git clone https://github.com/maidacundo/MoE-LoRA.git
cd MoE-LoRA/
# Install dependencies
pip install -r requirements.txt
# Login to required services
wandb login # For experiment tracking
huggingface-cli login # For model downloads
transformers>=4.38.0
datasets
accelerate
evaluate
wandb
bitsandbytes
peft
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from lora_moe import LoraMoeConfig, LoraMoeModel
import torch
# Configure 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
# Configure MoE-LoRA
moe_config = LoraMoeConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
moe_config.experts_rank = 8 # LoRA rank (higher = more capacity)
moe_config.experts_scale = 1.0 # LoRA scaling factor
moe_config.num_experts_per_tok = 2 # Experts active per token
moe_config.num_local_experts = 8 # Total number of experts
moe_config.output_router_logits = True # Enable router loss
# Wrap model with MoE-LoRA
moe_model = LoraMoeModel(base_model, moe_config)
# Freeze base model, train only LoRA experts
moe_model.make_experts_trainable()
# Use like any Hugging Face model
# (input_ids and labels are token-ID tensors prepared by your tokenizer or data loader)
outputs = moe_model(input_ids=input_ids, labels=labels)
loss = outputs.loss
accelerate launch train_openassistant.py
accelerate launch train_wikipedia.py
Both scripts use the configurations in training/training_config.py.
| Parameter | Type | Default | Description |
|---|---|---|---|
| experts_rank | int | 8 | Rank of LoRA projection matrices (controls expert capacity) |
| experts_scale | float | 1.0 | Scaling factor applied to LoRA outputs |
| num_experts_per_tok | int | 2 | Number of experts activated per token (top-k routing) |
| num_local_experts | int | 8 | Total number of expert modules |
| output_router_logits | bool | False | Whether to return routing weights (needed for auxiliary loss) |
| router_aux_loss_coef | float | 0.001 | Weight of load-balancing auxiliary loss |
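To illustrate what router_aux_loss_coef weights, here is a pure-Python sketch of a load-balancing loss in the style of Switch Transformers: the number of experts times the sum, over experts, of (fraction of routed slots) x (mean router probability). This is an assumption about the general shape of such a loss, not a copy of the project's `load_balancing_loss_func`; the function name and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, num_experts, top_k):
    """Switch-style auxiliary loss; equals 1.0 when routing is perfectly balanced."""
    probs = [softmax(row) for row in router_logits]
    num_tokens = len(probs)
    counts = [0] * num_experts
    for p in probs:
        top = sorted(range(num_experts), key=lambda i: p[i], reverse=True)[:top_k]
        for i in top:
            counts[i] += 1
    # Fraction of routed (token, slot) pairs that went to each expert.
    frac_tokens = [c / (num_tokens * top_k) for c in counts]
    # Mean router probability assigned to each expert.
    mean_prob = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * q for f, q in zip(frac_tokens, mean_prob))

# Four tokens, four experts: each token prefers a different expert (balanced) ...
balanced_logits = [[2.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# ... versus every token preferring expert 0 (collapsed routing).
collapsed_logits = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced = load_balancing_loss(balanced_logits, num_experts=4, top_k=1)
collapsed = load_balancing_loss(collapsed_logits, num_experts=4, top_k=1)

router_aux_loss_coef = 0.001  # default from the table above
task_loss = 2.5               # placeholder value for illustration
total_loss = task_loss + router_aux_loss_coef * collapsed
print(balanced, collapsed, total_loss)
```

The collapsed case scores higher than the balanced one, so minimizing the weighted term nudges the router toward spreading tokens across experts.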
Edit training/training_config.py to customize training:
@dataclass
class TrainingConfig:
    # Dataset
    dataset: str = "openassistant"  # or "wikipedia"
    # LoRA MoE
    experts_rank: int = 8
    experts_scale: float = 1.0
    num_experts_per_tok: int = 2
    num_local_experts: int = 8
    # Training
    num_epochs: int = 1
    train_batch_size: int = 1
    learning_rate: float = 1e-4
    context_length: int = 64
    # Model
    base_model_id: str = "mistralai/Mistral-7B-v0.1"
    quantize: bool = True
    mixed_precision: str = "fp16"
    # Logging
    project_name: str = "lora_moe"
    run_name: str = "experiment_1"
The training pipeline uses Hugging Face Accelerate for distributed training and mixed precision:
from training import train, TrainingConfig
# Create custom config
config = TrainingConfig(
    dataset="openassistant",
    experts_rank=16,
    num_local_experts=4,
    learning_rate=2e-4,
)
# Launch training
train(config)
accelerate config # Configure distributed setup
accelerate launch --num_processes=2 train_openassistant.py
Apply MoE-LoRA to specific transformer layers:
# Only wrap layers 10-19 (range(10, 20) excludes 20)
moe_model = LoraMoeModel(
    base_model,
    moe_config,
    layer_ids=list(range(10, 20)),
)
Extend the LoraExpert class to create specialized expert architectures:
from lora_moe.peft_experts import LoraExpert
class CustomExpert(LoraExpert):
    def __init__(self, config):
        super().__init__(config)
        # Add custom layers

    def forward(self, hidden_states, mlp):
        # Custom expert logic goes here; this placeholder defers to the parent class
        return super().forward(hidden_states, mlp)
# Generate text
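# tokenizer is assumed to be the base model's tokenizer, e.g.
# AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = moe_model.generate(
    input_ids,
    max_length=100,
    do_sample=True,  # temperature and top_p only take effect when sampling
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0]))
MoE-LoRA wraps each transformer decoder layer's FFN with a LoraMoeBlock containing: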
- Router: Learns to assign tokens to experts using a noisy top-k gating mechanism
- LoRA Experts: Multiple low-rank adapter modules that process token representations
For each token:
- Router computes logits for all experts
- Top-k experts are selected based on highest routing weights
- Token representation is processed by selected experts
- Expert outputs are combined via weighted sum
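The per-token steps above can be sketched in pure Python as follows. This is a simplified illustration, not the project's LoraMoeBlock: it omits the noise term of noisy top-k gating, it assumes the weights of the selected experts are renormalized to sum to 1 (as in Mixtral-style routing), and the function names and toy experts are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return [(expert_index, weight)] for the top-k experts, weights renormalized."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(hidden, router_logits, experts, top_k):
    """Run only the selected experts and combine their outputs by weighted sum."""
    out = [0.0] * len(hidden)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](hidden)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Four toy "experts": expert i scales its input by (i + 1).
experts = [lambda h, s=i + 1: [s * v for v in h] for i in range(4)]

hidden = [1.0, 2.0]
router_logits = [0.0, 2.0, -1.0, 1.0]  # the router favours experts 1 and 3
selected = route_token(router_logits, top_k=2)
output = moe_forward(hidden, router_logits, experts, top_k=2)
print(selected, output)
```

Experts 0 and 2 are never called for this token, which is where the compute saving of sparse activation comes from.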
Only LoRA adapter parameters and router weights are trained:
Total parameters: ~7B
Trainable parameters: ~50M (0.7%)
The auxiliary load-balancing loss encourages even expert utilization:
aux_loss = load_balancing_loss_func(router_logits, num_experts, top_k)
total_loss = task_loss + router_aux_loss_coef * aux_loss
| Configuration | VRAM (Training) | VRAM (Inference) |
|---|---|---|
| 7B base, 8 experts, rank 8, 4-bit | ~12 GB | ~6 GB |
| 7B base, 8 experts, rank 16, 4-bit | ~14 GB | ~7 GB |
| 7B base, 16 experts, rank 8, 4-bit | ~16 GB | ~8 GB |
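As a back-of-the-envelope check on figures like these, the sketch below estimates the memory taken by the frozen base weights alone at different precisions. It is an approximation under stated assumptions: the parameter count is taken as the rounded "~7B" quoted earlier, and the helper name is hypothetical. Activations, LoRA adapters, optimizer state, quantization constants, and the KV cache are not included, so real usage is higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

base_params = 7e9  # rounded "~7B"; the exact parameter count is an assumption

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(base_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights come to roughly 3.3 GiB, which leaves the remainder of the inference figures in the table to runtime overhead.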
- Sparse Activation: Only 2/8 experts active per token (25% of expert capacity)
- Efficient Routing: Block-sparse operations avoid padding overhead
- Gradient Efficiency: Only ~1% of parameters receive gradients
- Reduce experts_rank (e.g., 4 or 8)
- Reduce num_local_experts (e.g., 4 instead of 8)
- Enable gradient checkpointing
- Reduce train_batch_size
- Use deeper quantization (4-bit instead of 8-bit)
- Increase router_aux_loss_coef (e.g., 0.01)
- Verify output_router_logits=True in config
- Check that router weights are being updated
- Ensure CUDA is available: torch.cuda.is_available()
- Use torch.compile() for PyTorch 2.0+
- Enable Flash Attention 2 if available
- Reduce context_length for faster iterations
pip install --upgrade transformers accelerate peft
If you use this code in your research, please cite:
@software{moe_lora_2024,
  author = {maidacundo},
  title = {MoE-LoRA: Mixture-of-Experts Adaptation using Parameter Efficient Fine-tuning},
  year = {2024},
  url = {https://github.com/maidacundo/MoE-LoRA}
}
- LoRA: LoRA: Low-Rank Adaptation of Large Language Models
- Mixtral: Mixtral of Experts
- Switch Transformers: Switch Transformers: Scaling to Trillion Parameter Models
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
pip install -r requirements.txt
pip install pytest black flake8 # Dev dependencies
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- Built on Hugging Face Transformers
- Inspired by Mixtral and LoRA
- Uses bitsandbytes for quantization