
Faster-MoA

Python 3.10+ · License: Apache 2.0 · arXiv

Faster-MoA is a research framework for accelerating tree-based Mixture-of-Agents (MoA) inference with large language models (LLMs). It organises multiple LLM "agents" into a hierarchical tree—leaf proposers generate diverse candidate responses, middle aggregators synthesise them, and a final agent produces the answer—while an embedding-based early-exit mechanism dynamically skips redundant computation when sufficient consensus and confidence are reached.

Overall architecture diagram

The repository ships two complementary implementations that target different serving stacks:

Sub-project           Serving backend                          Key idea
Faster-MoA-Alg-vllm   vLLM                                     Hierarchical tree orchestration with parallel clusters and Q-metric early exit
Faster-MoA-PD         SGLang (prefill/decode disaggregation)   Dependency-aware prompt splicing with PD disaggregation and optional dynamic early exit

Key Features

  • Tree-structured multi-agent pipeline — proposer layers → aggregator layers, with configurable depth, width, and model assignments per agent.
  • Embedding-based early exit (Q-metric) — combines response confidence (token log-probabilities) and pairwise embedding similarity to decide, at each agent completion, whether the remaining agents in a cluster can be skipped.
  • Heterogeneous model support — mix models of different sizes (e.g. Qwen3-VL 4B / 8B / 32B) within the same tree.
  • Comprehensive benchmark suite — built-in support for GSM8K, MATH-500, AIME 2025, HMMT Feb 2025, GPQA Diamond, MMLU, MMLU-ProX-Lite, IFBench, HumanEval+, and BBH.

Special Note

To reproduce our results, please note:

  1. The Faster-MoA-PD implementation reproduces the category #2, #3, and #4 results in Fig. 4 of the paper, while the Faster-MoA-Alg-vllm implementation reproduces the category #1 and #2 results. The two sets of results are normalized to category #2 and then compared in Fig. 4.

Architecture Overview

                    ┌─────────────────────────────────────────┐
                    │            Proposer Layer 1             │
                    │  ┌───────────┬───────────┬───────────┐  │
                    │  │ Cluster 1 │ Cluster 2 │ Cluster 3 │  │
                    │  │ Agent 1   │ Agent 1   │ Agent 1   │  │
                    │  │ Agent 2   │ Agent 2   │ Agent 2   │  │
                    │  │ Agent 3   │ Agent 3   │ Agent 3   │  │
                    │  └─────┬─────┴─────┬─────┴─────┬─────┘  │
                    └────────┼───────────┼───────────┼────────┘
                             │  early    │  exit     │
                             ▼  check    ▼  check    ▼
                    ┌─────────────────────────────────────────┐
                    │            Proposer Layer 2             │
                    │         ┌───────────────────┐           │
                    │         │     Cluster 1     │           │
                    │         │ Agent 1  Agent 2  │           │
                    │         │     Agent 3       │           │
                    │         └─────────┬─────────┘           │
                    └───────────────────┼─────────────────────┘
                                        │  early exit check
                                        ▼
                    ┌─────────────────────────────────────────┐
                    │              Aggregator                 │
                    │         ┌───────────────────┐           │
                    │         │     Agent 1       │           │
                    │         └─────────┬─────────┘           │
                    └───────────────────┼─────────────────────┘
                                        │
                                        ▼
                                  Final Answer

Within each cluster, agents run in parallel. After every agent completes, the Q-metric is evaluated:

Q = C^γ  ×  B^(1 − γ)        γ = 0.5

where

C = RMS of per-agent confidence scores (from token log-probs)
P = confidence-weighted average pairwise Frobenius–Cosine similarity
B = 1 − |P − τ| / max(τ, 1 − τ)        τ ≈ 0.7

A uniform random draw below Q triggers an early exit, skipping the remaining agents in the cluster.
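The decision rule above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names are hypothetical, and P is computed here as a plain mean of the pairwise similarities rather than the confidence-weighted average the paper describes (the real logic lives in utils.py).

```python
import math
import random

def q_metric(confidences, pairwise_sims, gamma=0.5, tau=0.7):
    """Illustrative Q-metric sketch; signature and weighting are simplified."""
    # C: root-mean-square of per-agent confidence scores
    C = math.sqrt(sum(c * c for c in confidences) / len(confidences))
    # P: mean pairwise embedding similarity (the paper weights by confidence)
    P = sum(pairwise_sims) / len(pairwise_sims)
    # B: peaks at 1 when P equals the target similarity tau
    B = 1.0 - abs(P - tau) / max(tau, 1.0 - tau)
    return C ** gamma * B ** (1.0 - gamma)

def should_early_exit(confidences, pairwise_sims, rng=random.random):
    # A uniform draw below Q skips the cluster's remaining agents.
    return rng() < q_metric(confidences, pairwise_sims)
```

Note that when P sits exactly at τ, B = 1 and the exit probability collapses to C^γ, so high-confidence consensus exits early with high probability.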


Sub-projects

Faster-MoA-Alg-vllm

A self-contained MoA system built on vLLM. A single script launches one vLLM server per model, and a FastAPI gateway (moa_api_tree.py) exposes the full pipeline as a /v1/completions endpoint compatible with lm-evaluation-harness.

Quick start (2+ GPUs):

cd Faster-MoA-Alg-vllm
pip install vllm sentence-transformers

# 1. Launch vLLM servers (auto-allocates GPUs)
python start_server.py --base-port 8010 --dtype bfloat16

# 2. Start the MoA API
python moa_api_tree.py --port 9000

# 3. Query
curl -s http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"moa","prompt":"Solve: What is 2+2?"}'

See Faster-MoA-Alg-vllm/README.md for the full guide.

Faster-MoA-PD

An experiment framework built on SGLang PD disaggregation. Each model runs as a prefill/decode server pair, and a custom shell router handles dependency-aware prompt splicing—downstream agents begin prefilling before upstream agents finish decoding.

The framework compares two execution modes side-by-side:

  1. Baseline — blocking layer-by-layer orchestration.
  2. Proposed — pipelined dependency splicing with optional dynamic early exit.
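The difference between the two modes can be sketched with asyncio streams. Everything here is a toy illustration under assumed names (`decode_stream`, `baseline`, `pipelined`); the real orchestration is done by the shell router against SGLang PD servers.

```python
import asyncio

async def decode_stream(name, steps=3, delay=0.01):
    """Toy upstream agent that emits one token per decode step."""
    for i in range(steps):
        await asyncio.sleep(delay)
        yield f"{name}-{i}"

async def baseline(template, name):
    """Blocking mode: wait for the full upstream response,
    then build the downstream prompt in one shot."""
    tokens = [tok async for tok in decode_stream(name)]
    return " ".join([template] + tokens)

async def pipelined(template, name):
    """Pipelined mode: the static prefix is prefill-ready immediately,
    and upstream tokens are spliced in as they arrive."""
    parts = [template]  # downstream prefill could start on this at t=0
    async for tok in decode_stream(name):
        parts.append(tok)
    return " ".join(parts)

# Both modes yield the same final prompt; the pipelined path simply
# exposes it incrementally so downstream prefill can overlap decoding.
same_prompt = (asyncio.run(baseline("Aggregate:", "agentA"))
               == asyncio.run(pipelined("Aggregate:", "agentA")))
```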

Quick start (2+ GPUs, single-model):

cd Faster-MoA-PD
pip install "transformers==4.57.0" sglang==0.5.3 sglang-router datasets nixl

# Launch servers + run experiment
CONFIG_PATH=./cfgs/.isolation/real_dataset_tb_configs.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_all_pd_w_experiment.sh

See Faster-MoA-PD/README.md for heterogeneous multi-model and DMV (Dynamic Majority Voting) setups.


Models

Both sub-projects default to the Qwen3-VL family:

Model                        Parameters  Role
Qwen/Qwen3-VL-4B-Instruct    4 B         Proposer (leaf agent)
Qwen/Qwen3-VL-8B-Instruct    8 B         Proposer / Aggregator
Qwen/Qwen3-VL-32B-Instruct   32 B        Aggregator / Synthesiser
Qwen/Qwen3-Embedding-0.6B    0.6 B       Embedding model for early-exit Q-metric

Other HuggingFace-hosted models can be used by updating the JSON configuration files.
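As an illustration, such a configuration might be built up as below. The field names (`layers`, `clusters`, `agents`, `aggregator`, `embedding_model`) are hypothetical; the authoritative schema is whatever configs_tree.json and the cfgs/ directory actually define.

```python
import json

# Illustrative tree configuration mirroring the default Qwen3-VL layout.
tree_cfg = {
    "layers": [
        {  # proposer layer 1: three clusters of three leaf agents
            "clusters": [
                {"agents": ["Qwen/Qwen3-VL-4B-Instruct"] * 3},
                {"agents": ["Qwen/Qwen3-VL-4B-Instruct"] * 3},
                {"agents": ["Qwen/Qwen3-VL-4B-Instruct"] * 3},
            ]
        },
        {  # proposer layer 2: one cluster of mid-size agents
            "clusters": [{"agents": ["Qwen/Qwen3-VL-8B-Instruct"] * 3}]
        },
    ],
    "aggregator": {"model": "Qwen/Qwen3-VL-32B-Instruct"},
    "embedding_model": "Qwen/Qwen3-Embedding-0.6B",  # early-exit Q-metric
}
config_json = json.dumps(tree_cfg, indent=2)
```

Swapping in another HuggingFace model is then a matter of replacing the model identifier strings.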


Supported Benchmarks

GSM8K · MATH-500 · AIME 2025 · HMMT Feb 2025 · GPQA Diamond · MMLU · MMLU-ProX-Lite · IFBench · HumanEval+ · BBH (chain-of-thought, zero-shot)


Hardware Requirements

Setup                        GPUs
Single-model (vLLM or PD)    2
Heterogeneous 3-model (PD)   6 (3 prefill + 3 decode)

GPU allocation is automatic; the launchers query nvidia-smi and assign models by available memory.
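A minimal sketch of that mechanism is below; the function names are illustrative, and the actual allocation logic lives in the launch scripts.

```python
import subprocess

def parse_free_memory(csv_text):
    """Parse `nvidia-smi --query-gpu=index,memory.free` CSV output
    into a {gpu_index: free_MiB} map."""
    gpus = {}
    for line in csv_text.strip().splitlines():
        idx, free = (field.strip() for field in line.split(","))
        gpus[int(idx)] = int(free)
    return gpus

def free_gpu_memory_mib():
    """Query free memory per GPU, mirroring what the launchers do."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_memory(out)

# e.g. place the largest model on the GPU with the most free memory:
# gpus = free_gpu_memory_mib(); best = max(gpus, key=gpus.get)
```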


Repository Structure

Faster-MoA/
├── README.md                      ← you are here
├── LICENSE                        (Apache 2.0)
├── Faster-MoA-Alg-vllm/          vLLM-based MoA implementation
│   ├── configs_tree.json          Hierarchical agent/model config
│   ├── start_server.py            GPU allocation & vLLM server launcher
│   ├── moa_api_tree.py            FastAPI MoA gateway
│   ├── moa_chat_embedding_tree_earlyexit_confidence.py
│   │                              Core async tree orchestrator
│   ├── utils.py                   Embeddings, Q-metric, answer extraction
│   └── run_moa_eval_tree.sh       lm-eval benchmark harness
│
└── Faster-MoA-PD/                 SGLang PD disaggregation implementation
    ├── launch_server.py           SGLang server entry point
    ├── shell_router.py            Dependency-aware shell router
    ├── shell_router_dmv.py        DMV (dynamic early-exit) variant
    ├── tb_real_dataset_agent_tree_structure.py
    │                              Experiment runner (baseline + proposed)
    ├── tb_real_dataset_agent_tree_structure_dmv.py
    │                              DMV experiment runner
    ├── cfgs/                      JSON configs (homo / hetero / hetero_dmv)
    ├── src/
    │   ├── sglang_ext/            Patched SGLang runtime
    │   └── tb_utils/utils.py      Shared embedding & Q-metric utilities
    ├── run_servers_*.sh           Launch scripts for various topologies
    └── assets/                    Diagrams

License

This project is licensed under the Apache License 2.0.
