Multi-Agent Evaluation for Structured Relational Output
Comparing agentic orchestration frameworks for automated relational diagram generation.
The benchmark runs a matrix of inputs × strategies × models × repeats, scores
each generated Mermaid diagram against its ground truth, and records every
result — plus the runtime environment — in a SQLite database. The steps below
run the experiment from a clean checkout.
This is a high-level walkthrough. A detailed guide (troubleshooting, full CLI reference) will follow as the code stabilises.
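As a rough illustration of what "a matrix of inputs × strategies × models × repeats" means, the sketch below enumerates every cell with `itertools.product`. It is not MAESTRO's own code: `build_matrix`, the dictionary keys, and all the example identifiers other than `single_agent` are assumptions made for the example.

```python
from itertools import product


def build_matrix(inputs, strategies, models, repeats):
    """Enumerate every (input, strategy, model, repeat) cell of a benchmark run.

    Hypothetical helper: the real project may represent cells differently.
    """
    return [
        {"input": i, "strategy": s, "model": m, "repeat": r}
        for i, s, m, r in product(inputs, strategies, models, range(1, repeats + 1))
    ]


cells = build_matrix(
    inputs=["example_input_a", "example_input_b"],            # placeholder input IDs
    strategies=["single_agent", "hypothetical_multi_agent"],  # second name is made up
    models=["model_x"],                                       # placeholder model name
    repeats=3,
)
print(len(cells))  # 2 inputs * 2 strategies * 1 model * 3 repeats = 12
```

The cell count grows multiplicatively with each axis, which is why a single-cell smoke test is worth running before the full matrix.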
- Python 3.11
- API keys for the providers you intend to run — Anthropic, OpenAI, Mistral, Gemini, DeepSeek (see each provider's docs for obtaining a key)
- mmdc (mermaid-cli) for the structural-validity metric: optional locally (the metric is skipped if it is absent), bundled in the Docker image
- Docker (optional): only if you prefer the container path over a local install
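The "skipped if it is absent" behaviour for mmdc can be pictured with a check like the one below. This is a minimal sketch, not the project's implementation: `find_mermaid_cli`, `plan_metrics`, and the metric names are hypothetical. Only `shutil.which` is a real standard-library call.

```python
import shutil


def find_mermaid_cli(binary="mmdc"):
    """Return the full path to the Mermaid CLI, or None if it is not on PATH."""
    return shutil.which(binary)


def plan_metrics(binary="mmdc"):
    """Decide which metrics to compute; the metric names here are made up."""
    metrics = ["ground_truth_score"]
    if find_mermaid_cli(binary) is not None:
        # Only attempt the structural-validity metric when the CLI is installed.
        metrics.append("structural_validity")
    return metrics


print(plan_metrics())
```

Checking for the executable up front lets a run proceed without mermaid-cli instead of failing partway through scoring.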
The local install path is tested on macOS. The Docker path runs Linux inside the container, so it is platform-independent and is the recommended route on Windows.
git clone https://github.com/Colinho22/maestro.git
cd maestro
pip install -e . # or: pip install -e ".[dev]" for the test/lint tools
Or build the container, which bundles Python, mermaid-cli, and Chromium:
docker compose build
Copy the template and fill in the keys for the providers you will use:
cp .env.template .env
# edit .env; keys are read from the environment at run time
A single tier-1 cell confirms that the install, keys, and scoring pipeline work before committing to the full matrix:
python -m maestro.run --strategy single_agent --tier 1 --repeats 1
# Docker: docker compose run --rm maestro python -m maestro.run --strategy single_agent --tier 1 --repeats 1
Then run the full matrix:
python -m maestro.run
# Docker: docker compose run --rm maestro python -m maestro.run
Runs are resumable by default: already-completed cells are skipped, so an
interrupted run can be restarted with the same command. Results are written to
maestro.db (or ./out/maestro.db under Docker).
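One way to picture the resume behaviour is a lookup of completed cells before each run. The sketch below assumes a hypothetical `results` table with hypothetical column names. The real schema of maestro.db is not described here and may differ.

```python
import sqlite3


def pending_cells(conn, cells):
    """Return the cells that have no row in the (hypothetical) results table."""
    done = set(
        conn.execute(
            "SELECT input_id, strategy, model, repeat_index FROM results"
        )
    )
    return [cell for cell in cells if cell not in done]


# In-memory database standing in for maestro.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "input_id TEXT, strategy TEXT, model TEXT, repeat_index INTEGER, "
    "PRIMARY KEY (input_id, strategy, model, repeat_index))"
)
# Pretend the first repeat finished before the run was interrupted.
conn.execute(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    ("example_input_a", "single_agent", "model_x", 1),
)

cells = [
    ("example_input_a", "single_agent", "model_x", 1),
    ("example_input_a", "single_agent", "model_x", 2),
]
todo = pending_cells(conn, cells)
print(todo)
```

Because the set of completed cells is read from the database, re-running the same command only executes what is missing.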
python -m maestro.analysis
docker compose up # → http://localhost:8501
# Local (without Docker): streamlit run src/maestro/viz/app.py
Every invocation snapshots its runtime environment (OS, architecture, Python
version, library versions, git commit, and, under Docker, the image digest)
into the run_environments table, linked to each run. This lets a later
replication attempt diagnose diverging numbers against the exact stack that
produced the original data.
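A snapshot of this kind could be collected roughly as follows. This is a sketch under stated assumptions, not the project's code: the function names, the dictionary keys, and the two-column layout used for `run_environments` are hypothetical, and the Docker image digest is left out because how it is obtained is not described here.

```python
import json
import platform
import sqlite3
import subprocess
from importlib import metadata


def git_commit():
    """Return the current git commit hash, or None outside a repo or without git."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def snapshot_environment(packages=("pip",)):
    """Collect OS, architecture, Python version, library versions, and git commit."""
    libs = {}
    for name in packages:
        try:
            libs[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            libs[name] = None  # package not installed in this environment
    return {
        "os": platform.system(),
        "architecture": platform.machine(),
        "python_version": platform.python_version(),
        "library_versions": libs,
        "git_commit": git_commit(),
    }


def store_snapshot(conn, run_id, snap):
    """Store the snapshot as JSON; this column layout is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run_environments "
        "(run_id TEXT PRIMARY KEY, snapshot_json TEXT)"
    )
    conn.execute(
        "INSERT INTO run_environments VALUES (?, ?)", (run_id, json.dumps(snap))
    )


conn = sqlite3.connect(":memory:")
snap = snapshot_environment(packages=("surely-not-installed-pkg-xyz",))
store_snapshot(conn, "example-run-1", snap)
```

Storing the snapshot alongside each run means two result sets can be compared field by field when their numbers diverge.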
Setup is tested on macOS. Install the dev extras and run the test suite and linters from the project root:
pip install -e ".[dev]"
pytest
ruff check .
ruff format --check .
pre-commit hooks (ruff lint + format) are configured
in .pre-commit-config.yaml; enable them with pre-commit install.
If you use MAESTRO in your work, please cite it via the
CITATION.cff file (GitHub's "Cite this repository" button), or
see that file for the reference details.