25 changes: 25 additions & 0 deletions CITATION.cff
@@ -0,0 +1,25 @@
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "MAESTRO: Multi-Agent Evaluation for Structured Relational Output"
abstract: >-
  A benchmark comparing agentic orchestration frameworks for automated
  relational diagram generation, scoring generated Mermaid diagrams against
  structured ground truth across entity, relationship, container, and
  attachment dimensions.
type: software
authors:
  - family-names: Bolli
    given-names: Colin
    email: colin.bolli@stud.fhgr.ch
    affiliation: "FH Graubünden (University of Applied Sciences of the Grisons) FHGR"
license: MIT
repository-code: "https://github.com/Colinho22/maestro"
version: "0.1.0"
#date-released: "yyyy-mm-dd"
keywords:
- LLM evaluation
- agentic orchestration
- diagram generation
- Mermaid
- reproducibility
# doi: "10.xxxx/xxxxx"
118 changes: 115 additions & 3 deletions README.md
@@ -1,16 +1,128 @@
# MAESTRO

[![ci](https://github.com/Colinho22/maestro/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/Colinho22/maestro/actions/workflows/ci.yml) ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/Colinho22/maestro?utm_source=oss&utm_medium=github&utm_campaign=Colinho22%2Fmaestro&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) ![Python](https://img.shields.io/badge/python-3.11-blue.svg) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

**M**ulti-**A**gent **E**valuation for **S**tructured **R**elational **O**utput

Comparing agentic orchestration frameworks for automated relational diagram generation.

---

## Running the experiment

The benchmark runs a matrix of `inputs × strategies × models × repeats`, scores
each generated Mermaid diagram against its ground truth, and records every
result — plus the runtime environment — in a SQLite database. The steps below
run the experiment from a clean checkout.
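
As a rough illustration of what the `inputs × strategies × models × repeats` matrix means, the sketch below enumerates one cell per combination. All axis values except `single_agent` are hypothetical placeholders, and `build_matrix` is not a function from the MAESTRO code base.

```python
from itertools import product

# Hypothetical axis values for illustration; the real inputs, strategies,
# and models come from the project's own configuration.
inputs = ["tier1_example"]
strategies = ["single_agent", "multi_agent"]
models = ["model_a", "model_b"]
repeats = 3


def build_matrix(inputs, strategies, models, repeats):
    """Return one (input, strategy, model, repeat_index) tuple per cell."""
    return list(product(inputs, strategies, models, range(repeats)))


cells = build_matrix(inputs, strategies, models, repeats)
print(len(cells))  # 1 input * 2 strategies * 2 models * 3 repeats = 12
```

The number of cells grows multiplicatively with each axis, which is why a small validation run is worthwhile before the full matrix.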

> This is a high-level walkthrough. A detailed guide (troubleshooting, full CLI
> reference) will follow as the code stabilises.

### Prerequisites

- Python 3.11
- API keys for the providers you intend to run —
[Anthropic](https://docs.anthropic.com/en/api/overview),
[OpenAI](https://platform.openai.com/docs/api-reference/authentication),
[Mistral](https://docs.mistral.ai/getting-started/quickstarts/studio/activate-and-generate-api-key),
[Gemini](https://ai.google.dev/gemini-api/docs/api-key),
[DeepSeek](https://api-docs.deepseek.com/) (see each provider's docs for
obtaining a key)
- [`mmdc`](https://github.com/mermaid-js/mermaid-cli) (mermaid-cli) for the
structural-validity metric — optional locally (the metric is skipped if it is
absent), bundled in the Docker image
- Docker (optional) — only if you prefer the container path over a local install
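
To check whether `mmdc` is available before a local run, a small probe like the one below may help. It only looks for the executable on `PATH`; how MAESTRO itself detects `mmdc` is not shown here, and `find_mmdc` is a hypothetical helper.

```python
import shutil


def find_mmdc():
    """Return the path to the mmdc executable, or None if it is not on PATH."""
    return shutil.which("mmdc")


mmdc_path = find_mmdc()
if mmdc_path is None:
    print("mmdc not found: the structural-validity metric would be skipped")
else:
    print(f"mmdc found at {mmdc_path}")
```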

The local install path is tested on macOS. The Docker path runs Linux inside
the container, so it is platform-independent and is the recommended route on
Windows.

### 1. Clone and install

```bash
git clone https://github.com/Colinho22/maestro.git
cd maestro
pip install -e . # or: pip install -e ".[dev]" for the test/lint tools
```

Or build the container, which bundles Python, mermaid-cli, and Chromium:

```bash
docker compose build
```

### 2. Configure API keys

Copy the template and fill in the keys for the providers you will use:

```bash
cp .env.template .env
# edit .env — keys are read from the environment at run time
```
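
Before starting a run, it can be useful to see which provider keys are actually visible in the environment. The sketch below assumes conventional variable names such as `OPENAI_API_KEY`; the names MAESTRO really reads are the ones listed in `.env.template`, so treat this mapping as a placeholder.

```python
import os

# Assumed variable names, following each provider's usual convention.
# Check .env.template for the names the project actually reads.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


print("Providers with a key set:", configured_providers())
```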

### 3. Validate the setup with a small run

A single tier-1 cell confirms the install, keys, and scoring pipeline work
before committing to the full matrix:

```bash
python -m maestro.run --strategy single_agent --tier 1 --repeats 1
# Docker: docker compose run --rm maestro python -m maestro.run --strategy single_agent --tier 1 --repeats 1
```

### 4. Run the full matrix

```bash
python -m maestro.run
# Docker: docker compose run --rm maestro python -m maestro.run
```

Runs are resumable by default: already-completed cells are skipped, so an
interrupted run can be restarted with the same command. Results are written to
`maestro.db` (or `./out/maestro.db` under Docker).
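
The idea behind resumable runs can be sketched as follows: record each finished cell in SQLite and filter it out on the next start. The `demo_runs` table and both helper functions are hypothetical and only illustrate the skip-if-done pattern; the project's real schema and logic may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; not MAESTRO's actual tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS demo_runs (
    input_id TEXT, strategy TEXT, model TEXT, repeat_idx INTEGER,
    PRIMARY KEY (input_id, strategy, model, repeat_idx)
)
"""


def mark_done(conn, cell):
    """Record a finished cell; re-recording the same cell is a no-op."""
    conn.execute("INSERT OR IGNORE INTO demo_runs VALUES (?, ?, ?, ?)", cell)
    conn.commit()


def pending_cells(conn, cells):
    """Return the cells that have no completed row yet, in original order."""
    done = set(
        conn.execute("SELECT input_id, strategy, model, repeat_idx FROM demo_runs")
    )
    return [cell for cell in cells if cell not in done]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
cells = [
    ("in1", "single_agent", "model_a", 0),
    ("in1", "single_agent", "model_a", 1),
]
mark_done(conn, cells[0])  # simulate a cell that finished before an interruption
remaining = pending_cells(conn, cells)
print(remaining)  # only the second cell is left to run
```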

### 5. Analyse the results

```bash
python -m maestro.analysis
```
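
For a quick look at the raw database outside the analysis module, the schema-agnostic sketch below lists every table with its row count. `table_row_counts` is a hypothetical helper, and the demo builds a throwaway database rather than assuming anything about the columns in `maestro.db`.

```python
import os
import sqlite3
import tempfile


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Demo on a temporary database with a single placeholder table.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    demo = sqlite3.connect(demo_path)
    demo.execute("CREATE TABLE run_environments (id INTEGER PRIMARY KEY)")
    demo.execute("INSERT INTO run_environments DEFAULT VALUES")
    demo.commit()
    demo.close()
    counts = table_row_counts(demo_path)

print(counts)
```

To inspect real results, call `table_row_counts("maestro.db")` after a run. Note that `sqlite3.connect` creates an empty file if the path does not exist.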

### 6. Explore the results in the dashboard

```bash
docker compose up # → http://localhost:8501
# Local (without Docker): streamlit run src/maestro/viz/app.py
```

### Reproducibility audit trail

Every invocation snapshots its runtime environment — OS, architecture, Python
version, library versions, git commit, and (under Docker) the image digest —
into the `run_environments` table, linked to each run. This lets a later
replication attempt diagnose diverging numbers against the exact stack that
produced the original data.
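
A snapshot of this kind could be collected roughly as shown below, using only the standard library. The function names and dictionary keys are assumptions, not the project's actual column names, and the Docker image digest is left out because obtaining it depends on how the container is started.

```python
import platform
import subprocess
from importlib import metadata


def git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def library_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def snapshot_environment(libraries=("pip",)):
    """Collect OS, architecture, Python version, libraries, and git commit."""
    return {
        "os": platform.system(),
        "architecture": platform.machine(),
        "python_version": platform.python_version(),
        "libraries": library_versions(libraries),
        "git_commit": git_commit(),
    }


snapshot = snapshot_environment()
print(snapshot)
```

Storing such a dictionary alongside each run is what makes it possible to compare two environments field by field when results diverge.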

---

## Local development

Setup is tested on macOS. Install the dev extras and run the test suite and
linters from the project root:

```bash
pip install -e ".[dev]"
pytest
ruff check .
ruff format --check .
```

[pre-commit](https://pre-commit.com/) hooks (ruff lint + format) are configured
in `.pre-commit-config.yaml`; enable them with `pre-commit install`.

---

## Citing

If you use MAESTRO in your work, please cite it using GitHub's "Cite this
repository" button, or see the [`CITATION.cff`](CITATION.cff) file for the
reference details.