
Faster-MoA

Python 3.10+ · License: Apache 2.0 · arXiv

Faster-MoA is a research framework for accelerating tree-based Mixture-of-Agents (MoA) inference with large language models (LLMs). It organises multiple LLM "agents" into a hierarchical tree—leaf proposers generate diverse candidate responses, middle aggregators synthesise them, and a final agent produces the answer—while an embedding-based early-exit mechanism dynamically skips redundant computation when sufficient consensus and confidence are reached.

Overall architecture diagram

The repository ships two complementary implementations that target different serving stacks:

Sub-project           Serving backend                          Key idea
Faster-MoA-Alg-vllm   vLLM                                     Hierarchical tree orchestration with parallel clusters and Q-metric early exit
Faster-MoA-PD         SGLang (prefill/decode disaggregation)   Dependency-aware prompt splicing with PD disaggregation and optional dynamic early exit

Key Features

  • Tree-structured multi-agent pipeline — proposer layers → aggregator layers, with configurable depth, width, and model assignments per agent.
  • Embedding-based early exit (Q-metric) — combines response confidence (token log-probabilities) and pairwise embedding similarity to decide, at each agent completion, whether the remaining agents in a cluster can be skipped.
  • Heterogeneous model support — mix models of different sizes (e.g. Qwen3-VL 4B / 8B / 32B) within the same tree.
  • Comprehensive benchmark suite — built-in support for GSM8K, MATH-500, AIME 2025, HMMT Feb 2025, GPQA Diamond, MMLU, MMLU-ProX-Lite, IFBench, HumanEval+, and BBH.

Special Note

To reproduce our results, please note:

  1. The Faster-MoA-PD implementation reproduces the category #2, #3, and #4 results in Fig. 4 of the paper, while the Faster-MoA-Alg-vllm implementation reproduces the category #1 and #2 results. The two sets of results are normalized to category #2 and then compared in Fig. 4.

Architecture Overview

                    ┌─────────────────────────────────────────┐
                    │            Proposer Layer 1             │
                    │  ┌───────────┬───────────┬───────────┐  │
                    │  │ Cluster 1 │ Cluster 2 │ Cluster 3 │  │
                    │  │ Agent 1   │ Agent 1   │ Agent 1   │  │
                    │  │ Agent 2   │ Agent 2   │ Agent 2   │  │
                    │  │ Agent 3   │ Agent 3   │ Agent 3   │  │
                    │  └─────┬─────┴─────┬─────┴─────┬─────┘  │
                    └────────┼───────────┼───────────┼────────┘
                             │  early    │  exit     │
                             ▼  check    ▼  check    ▼
                    ┌─────────────────────────────────────────┐
                    │            Proposer Layer 2             │
                    │         ┌───────────────────┐           │
                    │         │     Cluster 1     │           │
                    │         │ Agent 1  Agent 2  │           │
                    │         │     Agent 3       │           │
                    │         └─────────┬─────────┘           │
                    └───────────────────┼─────────────────────┘
                                        │  early exit check
                                        ▼
                    ┌─────────────────────────────────────────┐
                    │              Aggregator                 │
                    │         ┌───────────────────┐           │
                    │         │     Agent 1       │           │
                    │         └─────────┬─────────┘           │
                    └───────────────────┼─────────────────────┘
                                        │
                                        ▼
                                  Final Answer

Within each cluster, agents run in parallel. After every agent completes, the Q-metric is evaluated:

Q = C^γ  ×  B^(1 − γ)        γ = 0.5

where

C = RMS of per-agent confidence scores (from token log-probs)
P = confidence-weighted average pairwise Frobenius–Cosine similarity
B = 1 − |P − τ| / max(τ, 1 − τ)        τ ≈ 0.7

A uniform random draw below Q triggers an early exit, skipping the remaining agents in the cluster.
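The decision rule above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names are hypothetical, and P is computed here as a plain mean of the pairwise similarities rather than the confidence-weighted average the paper describes (the real logic lives in utils.py).

```python
import math
import random

def q_metric(confidences, pairwise_sims, gamma=0.5, tau=0.7):
    """Illustrative Q-metric sketch; signature and weighting are simplified."""
    # C: root-mean-square of per-agent confidence scores
    C = math.sqrt(sum(c * c for c in confidences) / len(confidences))
    # P: mean pairwise embedding similarity (the paper weights by confidence)
    P = sum(pairwise_sims) / len(pairwise_sims)
    # B: peaks at 1 when P equals the target similarity tau
    B = 1.0 - abs(P - tau) / max(tau, 1.0 - tau)
    return C ** gamma * B ** (1.0 - gamma)

def should_early_exit(confidences, pairwise_sims, rng=random.random):
    # A uniform draw below Q skips the cluster's remaining agents.
    return rng() < q_metric(confidences, pairwise_sims)
```

Note that when P sits exactly at τ, B = 1 and the exit probability collapses to C^γ, so high-confidence consensus exits early with high probability.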


Sub-projects

Faster-MoA-Alg-vllm

A self-contained MoA system built on vLLM. A single script launches one vLLM server per model, and a FastAPI gateway (moa_api_tree.py) exposes the full pipeline as a /v1/completions endpoint compatible with lm-evaluation-harness.

Quick start (2+ GPUs):

cd Faster-MoA-Alg-vllm
pip install vllm sentence-transformers

# 1. Launch vLLM servers (auto-allocates GPUs)
python start_server.py --base-port 8010 --dtype bfloat16

# 2. Start the MoA API
python moa_api_tree.py --port 9000

# 3. Query
curl -s http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"moa","prompt":"Solve: What is 2+2?"}'

See Faster-MoA-Alg-vllm/README.md for the full guide.

Faster-MoA-PD

An experiment framework built on SGLang PD disaggregation. Each model runs as a prefill/decode server pair, and a custom shell router handles dependency-aware prompt splicing—downstream agents begin prefilling before upstream agents finish decoding.

The framework compares two execution modes side-by-side:

  1. Baseline — blocking layer-by-layer orchestration.
  2. Proposed — pipelined dependency splicing with optional dynamic early exit.
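The difference between the two modes can be sketched with asyncio streams. Everything here is a toy illustration under assumed names (`decode_stream`, `baseline`, `pipelined`); the real orchestration is done by the shell router against SGLang PD servers.

```python
import asyncio

async def decode_stream(name, steps=3, delay=0.01):
    """Toy upstream agent that emits one token per decode step."""
    for i in range(steps):
        await asyncio.sleep(delay)
        yield f"{name}-{i}"

async def baseline(template, name):
    """Blocking mode: wait for the full upstream response,
    then build the downstream prompt in one shot."""
    tokens = [tok async for tok in decode_stream(name)]
    return " ".join([template] + tokens)

async def pipelined(template, name):
    """Pipelined mode: the static prefix is prefill-ready immediately,
    and upstream tokens are spliced in as they arrive."""
    parts = [template]  # downstream prefill could start on this at t=0
    async for tok in decode_stream(name):
        parts.append(tok)
    return " ".join(parts)

# Both modes yield the same final prompt; the pipelined path simply
# exposes it incrementally so downstream prefill can overlap decoding.
same_prompt = (asyncio.run(baseline("Aggregate:", "agentA"))
               == asyncio.run(pipelined("Aggregate:", "agentA")))
```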

Quick start (2+ GPUs, single-model):

cd Faster-MoA-PD
pip install "transformers==4.57.0" sglang==0.5.3 sglang-router datasets nixl

# Launch servers + run experiment
CONFIG_PATH=./cfgs/.isolation/real_dataset_tb_configs.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_all_pd_w_experiment.sh

See Faster-MoA-PD/README.md for heterogeneous multi-model and DMV (Dynamic Majority Voting) setups.


Models

Both sub-projects default to the Qwen3-VL family:

Model                        Parameters  Role
Qwen/Qwen3-VL-4B-Instruct    4 B         Proposer (leaf agent)
Qwen/Qwen3-VL-8B-Instruct    8 B         Proposer / Aggregator
Qwen/Qwen3-VL-32B-Instruct   32 B        Aggregator / Synthesiser
Qwen/Qwen3-Embedding-0.6B    0.6 B       Embedding model for early-exit Q-metric

Other HuggingFace-hosted models can be used by updating the JSON configuration files.
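As an illustration, such a configuration might be built up as below. The field names (`layers`, `clusters`, `agents`, `aggregator`, `embedding_model`) are hypothetical; the authoritative schema is whatever configs_tree.json and the cfgs/ directory actually define.

```python
import json

# Illustrative tree configuration mirroring the default Qwen3-VL layout.
tree_cfg = {
    "layers": [
        {  # proposer layer 1: three clusters of three leaf agents
            "clusters": [
                {"agents": ["Qwen/Qwen3-VL-4B-Instruct"] * 3},
                {"agents": ["Qwen/Qwen3-VL-4B-Instruct"] * 3},
                {"agents": ["Qwen/Qwen3-VL-4B-Instruct"] * 3},
            ]
        },
        {  # proposer layer 2: one cluster of mid-size agents
            "clusters": [{"agents": ["Qwen/Qwen3-VL-8B-Instruct"] * 3}]
        },
    ],
    "aggregator": {"model": "Qwen/Qwen3-VL-32B-Instruct"},
    "embedding_model": "Qwen/Qwen3-Embedding-0.6B",  # early-exit Q-metric
}
config_json = json.dumps(tree_cfg, indent=2)
```

Swapping in another HuggingFace model is then a matter of replacing the model identifier strings.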


Supported Benchmarks

GSM8K · MATH-500 · AIME 2025 · HMMT Feb 2025 · GPQA Diamond · MMLU · MMLU-ProX-Lite · IFBench · HumanEval+ · BBH (chain-of-thought, zero-shot)


Hardware Requirements

Setup                        GPUs
Single-model (vLLM or PD)    2
Heterogeneous 3-model (PD)   6 (3 prefill + 3 decode)

GPU allocation is automatic; the launchers query nvidia-smi and assign models by available memory.
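A minimal sketch of that mechanism is below; the function names are illustrative, and the actual allocation logic lives in the launch scripts.

```python
import subprocess

def parse_free_memory(csv_text):
    """Parse `nvidia-smi --query-gpu=index,memory.free` CSV output
    into a {gpu_index: free_MiB} map."""
    gpus = {}
    for line in csv_text.strip().splitlines():
        idx, free = (field.strip() for field in line.split(","))
        gpus[int(idx)] = int(free)
    return gpus

def free_gpu_memory_mib():
    """Query free memory per GPU, mirroring what the launchers do."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_memory(out)

# e.g. place the largest model on the GPU with the most free memory:
# gpus = free_gpu_memory_mib(); best = max(gpus, key=gpus.get)
```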


Repository Structure

Faster-MoA/
├── README.md                      ← you are here
├── LICENSE                        (Apache 2.0)
├── Faster-MoA-Alg-vllm/          vLLM-based MoA implementation
│   ├── configs_tree.json          Hierarchical agent/model config
│   ├── start_server.py            GPU allocation & vLLM server launcher
│   ├── moa_api_tree.py            FastAPI MoA gateway
│   ├── moa_chat_embedding_tree_earlyexit_confidence.py
│   │                              Core async tree orchestrator
│   ├── utils.py                   Embeddings, Q-metric, answer extraction
│   └── run_moa_eval_tree.sh       lm-eval benchmark harness
│
└── Faster-MoA-PD/                 SGLang PD disaggregation implementation
    ├── launch_server.py           SGLang server entry point
    ├── shell_router.py            Dependency-aware shell router
    ├── shell_router_dmv.py        DMV (dynamic early-exit) variant
    ├── tb_real_dataset_agent_tree_structure.py
    │                              Experiment runner (baseline + proposed)
    ├── tb_real_dataset_agent_tree_structure_dmv.py
    │                              DMV experiment runner
    ├── cfgs/                      JSON configs (homo / hetero / hetero_dmv)
    ├── src/
    │   ├── sglang_ext/            Patched SGLang runtime
    │   └── tb_utils/utils.py      Shared embedding & Q-metric utilities
    ├── run_servers_*.sh           Launch scripts for various topologies
    └── assets/                    Diagrams

License

This project is licensed under the Apache License 2.0.
