diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000..879b7ea
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,25 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it using these metadata."
+title: "MAESTRO: Multi-Agent Evaluation for Structured Relational Output"
+abstract: >-
+  A benchmark comparing agentic orchestration frameworks for automated
+  relational diagram generation, scoring generated Mermaid diagrams against
+  structured ground truth across entity, relationship, container, and
+  attachment dimensions.
+type: software
+authors:
+  - family-names: Bolli
+    given-names: Colin
+    email: colin.bolli@stud.fhgr.ch
+    affiliation: "FH Graubünden (University of Applied Sciences of the Grisons) FHGR"
+license: MIT
+repository-code: "https://github.com/Colinho22/maestro"
+version: "0.1.0"
+#date-released: "yyyy-mm-dd"
+keywords:
+  - LLM evaluation
+  - agentic orchestration
+  - diagram generation
+  - Mermaid
+  - reproducibility
+# doi: "10.xxxx/xxxxx"
diff --git a/README.md b/README.md
index 8fbb62b..198689f 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,128 @@
 # MAESTRO

-[![ci](https://github.com/Colinho22/maestro/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/Colinho22/maestro/actions/workflows/ci.yml) ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/Colinho22/maestro?utm_source=oss&utm_medium=github&utm_campaign=Colinho22%2Fmaestro&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews)
+[![ci](https://github.com/Colinho22/maestro/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/Colinho22/maestro/actions/workflows/ci.yml) ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/Colinho22/maestro?utm_source=oss&utm_medium=github&utm_campaign=Colinho22%2Fmaestro&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) ![Python](https://img.shields.io/badge/python-3.11-blue.svg) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

 **M**ulti-**A**gent **E**valuation for **S**tructured **R**elational **O**utput

 Comparing agentic orchestration frameworks for automated relational diagram generation.

-## Running tests
+---

-Install the dev extras and run pytest from the project root:
+## Running the experiment
+
+The benchmark runs a matrix of `inputs × strategies × models × repeats`, scores
+each generated Mermaid diagram against its ground truth, and records every
+result — plus the runtime environment — in a SQLite database. The steps below
+run the experiment from a clean checkout.
+
+> This is a high-level walkthrough. A detailed guide (troubleshooting, full CLI
+> reference) will follow as the code stabilises.
+
+### Prerequisites
+
+- Python 3.11
+- API keys for the providers you intend to run —
+  [Anthropic](https://docs.anthropic.com/en/api/overview),
+  [OpenAI](https://platform.openai.com/docs/api-reference/authentication),
+  [Mistral](https://docs.mistral.ai/getting-started/quickstarts/studio/activate-and-generate-api-key),
+  [Gemini](https://ai.google.dev/gemini-api/docs/api-key),
+  [DeepSeek](https://api-docs.deepseek.com/) (see each provider's docs for
+  obtaining a key)
+- [`mmdc`](https://github.com/mermaid-js/mermaid-cli) (mermaid-cli) for the
+  structural-validity metric — optional locally (the metric is skipped if it is
+  absent), bundled in the Docker image
+- Docker (optional) — only if you prefer the container path over a local install
+
+The local install path is tested on macOS. The Docker path runs Linux inside
+the container, so it is platform-independent and is the recommended route on
+Windows.
+
+### 1. Clone and install
+
+```bash
+git clone https://github.com/Colinho22/maestro.git
+cd maestro
+pip install -e .  # or: pip install -e ".[dev]" for the test/lint tools
+```
+
+Or build the container, which bundles Python, mermaid-cli, and Chromium:
+
+```bash
+docker compose build
+```
+
+### 2. Configure API keys
+
+Copy the template and fill in the keys for the providers you will use:
+
+```bash
+cp .env.template .env
+# edit .env — keys are read from the environment at run time
+```
+
+### 3. Validate the setup with a small run
+
+A single tier-1 cell confirms the install, keys, and scoring pipeline work
+before committing to the full matrix:
+
+```bash
+python -m maestro.run --strategy single_agent --tier 1 --repeats 1
+# Docker: docker compose run --rm maestro python -m maestro.run --strategy single_agent --tier 1 --repeats 1
+```
+
+### 4. Run the full matrix
+
+```bash
+python -m maestro.run
+# Docker: docker compose run --rm maestro python -m maestro.run
+```
+
+Runs are resumable by default: already-completed cells are skipped, so an
+interrupted run can be restarted with the same command. Results are written to
+`maestro.db` (or `./out/maestro.db` under Docker).
+
+### 5. Analyse the results
+
+```bash
+python -m maestro.analysis
+```
+
+### 6. Explore the results in the dashboard
+
+```bash
+docker compose up  # → http://localhost:8501
+# Local (without Docker): streamlit run src/maestro/viz/app.py
+```
+
+### Reproducibility audit trail
+
+Every invocation snapshots its runtime environment — OS, architecture, Python
+version, library versions, git commit, and (under Docker) the image digest —
+into the `run_environments` table, linked to each run. This lets a later
+replication attempt diagnose diverging numbers against the exact stack that
+produced the original data.
+
+---
+
+## Local development
+
+Setup is tested on macOS. Install the dev extras and run the test suite and
+linters from the project root:

 ```bash
 pip install -e ".[dev]"
 pytest
+ruff check .
+ruff format --check .
 ```
+
+[pre-commit](https://pre-commit.com/) hooks (ruff lint + format) are configured
+in `.pre-commit-config.yaml`; enable them with `pre-commit install`.
+
+---
+
+## Citing
+
+If you use MAESTRO in your work, please cite it via the
+[`CITATION.cff`](CITATION.cff) file (GitHub's "Cite this repository" button), or
+see that file for the reference details.
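
A note on the resumability claim in step 4 of the new README ("already-completed cells are skipped"): one common way to get that behaviour is to key every cell of the `inputs × strategies × models × repeats` matrix in SQLite and skip keys that are already present. The sketch below only illustrates that idea and is not taken from the MAESTRO code. The `runs` table, its column names, and the `run_cell` stub are hypothetical, and the real schema in `maestro.db` may look quite different.

```python
import itertools
import sqlite3

# Hypothetical schema for illustration; MAESTRO's actual tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    input_id   TEXT    NOT NULL,
    strategy   TEXT    NOT NULL,
    model      TEXT    NOT NULL,
    repeat_idx INTEGER NOT NULL,
    score      REAL    NOT NULL,
    PRIMARY KEY (input_id, strategy, model, repeat_idx)
)
"""


def run_cell(input_id, strategy, model, repeat_idx):
    """Placeholder for generating and scoring one diagram (hypothetical)."""
    return 0.0


def run_matrix(conn, inputs, strategies, models, repeats, run=run_cell):
    """Run every matrix cell not yet recorded; return how many cells ran."""
    conn.execute(SCHEMA)
    # Cells finished by an earlier (possibly interrupted) invocation.
    done = set(
        conn.execute("SELECT input_id, strategy, model, repeat_idx FROM runs")
    )
    executed = 0
    for cell in itertools.product(inputs, strategies, models, range(repeats)):
        if cell in done:
            continue  # already completed, so skip it on restart
        score = run(*cell)
        # Commit after each cell so an interruption loses at most one result.
        with conn:
            conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?, ?)", (*cell, score))
        executed += 1
    return executed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_matrix(conn, ["a", "b"], ["single_agent"], ["m1"], 2))
    print(run_matrix(conn, ["a", "b"], ["single_agent"], ["m1"], 2))
```

Under these assumptions, calling `run_matrix` a second time with the same arguments executes nothing, and raising `repeats` only runs the newly added cells. Committing per cell trades a little throughput for the guarantee that a restart never repeats finished work.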