Faster-MoA is a research framework for accelerating tree-based Mixture-of-Agents (MoA) inference with large language models (LLMs). It organises multiple LLM "agents" into a hierarchical tree—leaf proposers generate diverse candidate responses, middle aggregators synthesise them, and a final agent produces the answer—while an embedding-based early-exit mechanism dynamically skips redundant computation when sufficient consensus and confidence are reached.
The repository ships two complementary implementations that target different serving stacks:
| Sub-project | Serving backend | Key idea |
|---|---|---|
| Faster-MoA-Alg-vllm | vLLM | Hierarchical tree orchestration with parallel clusters and Q-metric early exit |
| Faster-MoA-PD | SGLang (prefill/decode disaggregation) | Dependency-aware prompt splicing with PD disaggregation and optional dynamic early exit |
- Tree-structured multi-agent pipeline — proposer layers → aggregator layers, with configurable depth, width, and model assignments per agent.
- Embedding-based early exit (Q-metric) — combines response confidence (token log-probabilities) and pairwise embedding similarity to decide, at each agent completion, whether the remaining agents in a cluster can be skipped.
- Heterogeneous model support — mix models of different sizes (e.g. Qwen3-VL 4B / 8B / 32B) within the same tree.
- Comprehensive benchmark suite — built-in support for GSM8K, MATH-500, AIME 2025, HMMT Feb 2025, GPQA Diamond, MMLU, MMLU-ProX-Lite, IFBench, HumanEval+, and BBH.
To reproduce our results, please note that:
- The `Faster-MoA-PD` implementation reproduces the category #2, #3, and #4 results in Fig. 4 of the paper, while the `Faster-MoA-Alg-vllm` implementation reproduces the category #1 and #2 results. The two sets of results are normalised to category #2 and then compared in Fig. 4.
```
┌─────────────────────────────────────────┐
│            Proposer Layer 1             │
│  ┌───────────┬───────────┬───────────┐  │
│  │ Cluster 1 │ Cluster 2 │ Cluster 3 │  │
│  │  Agent 1  │  Agent 1  │  Agent 1  │  │
│  │  Agent 2  │  Agent 2  │  Agent 2  │  │
│  │  Agent 3  │  Agent 3  │  Agent 3  │  │
│  └─────┬─────┴─────┬─────┴─────┬─────┘  │
└────────┼───────────┼───────────┼────────┘
         │ early     │ exit      │
         ▼ check     ▼ check     ▼
┌─────────────────────────────────────────┐
│            Proposer Layer 2             │
│          ┌───────────────────┐          │
│          │     Cluster 1     │          │
│          │  Agent 1 Agent 2  │          │
│          │      Agent 3      │          │
│          └─────────┬─────────┘          │
└────────────────────┼────────────────────┘
                     │ early exit check
                     ▼
┌─────────────────────────────────────────┐
│               Aggregator                │
│          ┌───────────────────┐          │
│          │      Agent 1      │          │
│          └─────────┬─────────┘          │
└────────────────────┼────────────────────┘
                     │
                     ▼
               Final Answer
```
Within each cluster, agents run in parallel. After every agent completes, the Q-metric is evaluated:
```
Q = C^γ × B^(1 − γ)                  γ = 0.5
C = RMS of per-agent confidence scores (from token log-probs)
P = confidence-weighted average pairwise Frobenius-Cosine similarity
B = 1 − |P − τ| / max(τ, 1 − τ)      τ ≈ 0.7
```
A uniform random draw from [0, 1) that falls below Q triggers an early exit, skipping the remaining agents in the cluster.
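The Q-metric decision above can be sketched in a few lines of Python. This is an illustrative sketch, not the repository's actual API — the function names and input shapes here are our assumptions, and the real implementation lives in `utils.py`:

```python
import math
import random

# Illustrative sketch of the Q-metric early-exit decision (gamma = 0.5, tau = 0.7).
# Function names and input shapes are assumptions; see utils.py for the real code.

def q_metric(confidences, pairwise_sims, gamma=0.5, tau=0.7):
    """confidences: per-agent confidence scores in [0, 1];
    pairwise_sims: (weight, similarity) pairs, one per agent pair."""
    # C: root-mean-square of per-agent confidence scores
    c = math.sqrt(sum(x * x for x in confidences) / len(confidences))
    # P: confidence-weighted average pairwise embedding similarity
    total_w = sum(w for w, _ in pairwise_sims)
    p = sum(w * s for w, s in pairwise_sims) / total_w
    # B: closeness of P to the target similarity tau, normalised to [0, 1]
    b = 1.0 - abs(p - tau) / max(tau, 1.0 - tau)
    return (c ** gamma) * (b ** (1.0 - gamma))

def should_exit(q, rng=random):
    # Early exit is stochastic: skip the rest of the cluster with probability Q.
    return rng.random() < q

q = q_metric([0.9, 0.8, 0.85], [(0.85, 0.72), (0.87, 0.68), (0.82, 0.70)])
print(round(q, 3))
```

High per-agent confidence (C → 1) and inter-agent similarity near the target τ (B → 1) together push Q toward 1, making an early exit increasingly likely.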
`Faster-MoA-Alg-vllm` is a self-contained MoA system built on vLLM. A single script launches one vLLM server per model, and a FastAPI gateway (`moa_api_tree.py`) exposes the full pipeline as a `/v1/completions` endpoint compatible with lm-evaluation-harness.
Quick start (2+ GPUs):
```bash
cd Faster-MoA-Alg-vllm
pip install vllm sentence-transformers

# 1. Launch vLLM servers (auto-allocates GPUs)
python start_server.py --base-port 8010 --dtype bfloat16

# 2. Start the MoA API
python moa_api_tree.py --port 9000

# 3. Query
curl -s http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"moa","prompt":"Solve: What is 2+2?"}'
```

See `Faster-MoA-Alg-vllm/README.md` for the full guide.
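For scripted use, the gateway can be queried from Python as well. This is a minimal sketch that assumes only the OpenAI-style `/v1/completions` request shape shown in the `curl` example; the helper names are ours, not part of the repository:

```python
import json
import urllib.request

# Minimal client sketch for the MoA gateway. Assumes the OpenAI-style
# /v1/completions request shape from the curl example; helper names are ours.

def build_completion_request(prompt, model="moa"):
    return {"model": model, "prompt": prompt}

def query_moa(prompt, url="http://localhost:9000/v1/completions"):
    payload = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:  # requires a running gateway
        return json.load(resp)

# query_moa("Solve: What is 2+2?") returns the gateway's JSON response
print(build_completion_request("Solve: What is 2+2?"))
```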
`Faster-MoA-PD` is an experiment framework built on SGLang PD (prefill/decode) disaggregation. Each model runs as a prefill/decode server pair, and a custom shell router performs dependency-aware prompt splicing — downstream agents begin prefilling before upstream agents finish decoding.
The framework compares two execution modes side-by-side:
- Baseline — blocking layer-by-layer orchestration.
- Proposed — pipelined dependency splicing with optional dynamic early exit.
Quick start (2+ GPUs, single-model):
```bash
cd Faster-MoA-PD
pip install "transformers==4.57.0" sglang==0.5.3 sglang-router datasets nixl

# Launch servers + run experiment
CONFIG_PATH=./cfgs/.isolation/real_dataset_tb_configs.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_all_pd_w_experiment.sh
```

See `Faster-MoA-PD/README.md` for heterogeneous multi-model and DMV (Dynamic Majority Voting) setups.
Both sub-projects default to the Qwen3-VL family:
| Model | Parameters | Role |
|---|---|---|
| `Qwen/Qwen3-VL-4B-Instruct` | 4 B | Proposer (leaf agent) |
| `Qwen/Qwen3-VL-8B-Instruct` | 8 B | Proposer / Aggregator |
| `Qwen/Qwen3-VL-32B-Instruct` | 32 B | Aggregator / Synthesiser |
| `Qwen/Qwen3-Embedding-0.6B` | 0.6 B | Embedding model for early-exit Q-metric |
Other HuggingFace-hosted models can be used by updating the JSON configuration files.
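For illustration only, a hypothetical tree configuration might look like the snippet below. Every field name here is invented; the authoritative schemas are whatever `configs_tree.json` (vLLM) and the files under `Faster-MoA-PD/cfgs/` actually define:

```json
{
  "layers": [
    {
      "role": "proposer",
      "clusters": [
        {"agents": [
          {"model": "Qwen/Qwen3-VL-4B-Instruct"},
          {"model": "Qwen/Qwen3-VL-8B-Instruct"}
        ]}
      ]
    },
    {
      "role": "aggregator",
      "clusters": [
        {"agents": [{"model": "Qwen/Qwen3-VL-32B-Instruct"}]}
      ]
    }
  ],
  "early_exit": {"enabled": true, "gamma": 0.5, "tau": 0.7}
}
```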
GSM8K · MATH-500 · AIME 2025 · HMMT Feb 2025 · GPQA Diamond · MMLU · MMLU-ProX-Lite · IFBench · HumanEval+ · BBH (chain-of-thought, zero-shot)
| Setup | GPUs |
|---|---|
| Single-model (vLLM or PD) | 2 |
| Heterogeneous 3-model (PD) | 6 (3 prefill + 3 decode) |
GPU allocation is automatic; the launchers query nvidia-smi and assign models by available memory.
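The memory-based assignment can be sketched as follows. Parsing `nvidia-smi --query-gpu=… --format=csv` output is standard, but the greedy assignment policy shown here is an illustrative assumption, not necessarily the launchers' exact logic:

```python
import subprocess

# Illustrative sketch: rank GPUs by free memory and greedily place the
# largest models on the freest GPUs. The exact policy in start_server.py
# may differ; only the nvidia-smi query itself is standard.

def query_free_memory_mib(smi_output=None):
    """Return {gpu_index: free_MiB}. Pass smi_output to parse canned text."""
    if smi_output is None:
        smi_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index,memory.free",
             "--format=csv,noheader,nounits"], text=True)
    free = {}
    for line in smi_output.strip().splitlines():
        idx, mib = (x.strip() for x in line.split(","))
        free[int(idx)] = int(mib)
    return free

def assign_models(models_by_size_desc, free_mib):
    """Greedy: biggest model goes to the GPU with the most free memory."""
    order = sorted(free_mib, key=free_mib.get, reverse=True)
    return dict(zip(models_by_size_desc, order))

sample = "0, 81000\n1, 40000\n2, 79000\n"
free = query_free_memory_mib(sample)
print(assign_models(["Qwen3-VL-32B", "Qwen3-VL-8B", "Qwen3-VL-4B"], free))
```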
```
Faster-MoA/
├── README.md                       ← you are here
├── LICENSE                         (Apache 2.0)
├── Faster-MoA-Alg-vllm/            vLLM-based MoA implementation
│   ├── configs_tree.json           Hierarchical agent/model config
│   ├── start_server.py             GPU allocation & vLLM server launcher
│   ├── moa_api_tree.py             FastAPI MoA gateway
│   ├── moa_chat_embedding_tree_earlyexit_confidence.py
│   │                               Core async tree orchestrator
│   ├── utils.py                    Embeddings, Q-metric, answer extraction
│   └── run_moa_eval_tree.sh        lm-eval benchmark harness
│
└── Faster-MoA-PD/                  SGLang PD disaggregation implementation
    ├── launch_server.py            SGLang server entry point
    ├── shell_router.py             Dependency-aware shell router
    ├── shell_router_dmv.py         DMV (dynamic early-exit) variant
    ├── tb_real_dataset_agent_tree_structure.py
    │                               Experiment runner (baseline + proposed)
    ├── tb_real_dataset_agent_tree_structure_dmv.py
    │                               DMV experiment runner
    ├── cfgs/                       JSON configs (homo / hetero / hetero_dmv)
    ├── src/
    │   ├── sglang_ext/             Patched SGLang runtime
    │   └── tb_utils/utils.py       Shared embedding & Q-metric utilities
    ├── run_servers_*.sh            Launch scripts for various topologies
    └── assets/                     Diagrams
```
This project is licensed under the Apache License 2.0.
