Motivation
Once MAESTRO is ready to run the experimental matrix that produces the thesis data, the code should be frozen as a citable release. This has three benefits:
- The thesis can reference an exact, named code state — "the data in Table 3 was produced by MAESTRO v1.0.0" is something a reader can verify, unlike a bare commit SHA.
- The run_environments.git_commit column already records the SHA per row; pairing that with a human-readable tag closes the provenance loop. Anyone can cross-check the DB against the GitHub release.
- Post-run code cleanup, small improvements, and CodeRabbit refactors don't pollute the experimental codebase: they happen on main after v1.0.0, while the thesis points at the frozen tag.
Pre-flight checklist (before tagging)
Release procedure
# 1. Final verification on clean main
git checkout main && git pull
git status # must be clean
pytest -v # all green
docker build . # builds cleanly
# 2. Tag the commit
git tag -a v1.0.0 -m "Thesis experimental run — code freeze"
git push origin v1.0.0
# 3. Create the GitHub release with notes (gh CLI)
gh release create v1.0.0 \
--title "v1.0.0 — Thesis experimental run" \
--notes-file release-notes-v1.0.0.md
# 4. Run the experiment (containerised, on the tagged code)
docker run ... python -m maestro.run
# 5. After the run, attach the final maestro.db (or its SHA-256) as a release asset
sha256sum maestro.db > maestro.db.sha256
gh release upload v1.0.0 maestro.db.sha256
# Optional: gh release upload v1.0.0 maestro.db (if size + privacy allow)
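For anyone who later obtains maestro.db, the checksum published in step 5 can also be checked without sha256sum. The following is a minimal Python sketch, assuming the checksum file has the usual sha256sum layout (hex digest, whitespace, file name). The function names are illustrative and not part of MAESTRO.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(db_path: str, checksum_path: str) -> bool:
    """Compare a file against the digest stored in a sha256sum-style checksum file.

    Assumption: the first whitespace-separated field of the checksum file
    is the hex digest, as written by `sha256sum maestro.db`.
    """
    with open(checksum_path, encoding="utf-8") as fh:
        expected = fh.read().split()[0].lower()
    return sha256_of_file(db_path) == expected


if __name__ == "__main__":
    # File names taken from the release procedure above.
    print(matches_checksum_file("maestro.db", "maestro.db.sha256"))
```

On systems with GNU coreutils, sha256sum -c maestro.db.sha256 performs the same comparison.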
Release notes template (release-notes-v1.0.0.md)
MAESTRO v1.0.0 — Thesis experimental run
Code state that produced the experimental data reported in the MAESTRO thesis (Multi-Agent Evaluation for Structured Relational Output, FHGR FS26).
What this release contains
- Four orchestration strategies under test: SingleAgent, SOP, CrewAI, LangGraph
- Three control conditions: NullControl, CopyInputControl, GroundTruthEchoControl
- N-LLM providers: Provider A (model 1; model n), ...
- Full evaluation pipeline: structural validity (via mmdc), entity F1 (id/name/lemma), relationship F1 (relaxed/strict), error taxonomy
- Reproducibility instrumentation: per-invocation environment capture (OS, arch, Python, git commit, lib versions), per-call retry counts, control-condition sanity floors and ceiling
Verifying the data was produced by this code
Every row in maestro.db is provenance-stamped via the environment_id foreign key into run_environments. The git_commit column on that table must equal the commit this release tag points at:
sqlite3 maestro.db "SELECT DISTINCT git_commit FROM run_environments"
# returns: <SHA>
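git rev-parse "v1.0.0^{commit}"
# returns: <SHA>, which must match. The ^{commit} suffix is needed because v1.0.0 is an annotated tag: a plain "git rev-parse v1.0.0" prints the SHA of the tag object, not of the commit.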
git_dirty should be 0 for every row in the final dataset (rows with git_dirty=1 are from dev iterations, not the experimental run).
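The two checks above (a single git_commit equal to the tagged commit, and no dirty rows) can be combined into one script. This is a minimal sketch, assuming only the run_environments table with its git_commit and git_dirty columns as described here. The function name check_provenance is hypothetical, and the expected SHA is passed in by the caller, for example from the output of git rev-parse "v1.0.0^{commit}".

```python
import sqlite3


def check_provenance(conn: sqlite3.Connection, expected_commit: str) -> list[str]:
    """Return a list of provenance problems; an empty list means both checks pass.

    Assumes a run_environments table with git_commit and git_dirty columns,
    as described in the release notes. Nothing else about the schema is assumed.
    """
    commits = [
        row[0]
        for row in conn.execute("SELECT DISTINCT git_commit FROM run_environments")
    ]
    dirty_rows = conn.execute(
        "SELECT COUNT(*) FROM run_environments WHERE git_dirty != 0"
    ).fetchone()[0]

    problems = []
    if commits != [expected_commit]:
        problems.append(f"expected only {expected_commit}, found {commits}")
    if dirty_rows:
        problems.append(f"{dirty_rows} environment row(s) have git_dirty != 0")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_provenance.py <commit SHA the release tag points at>
    connection = sqlite3.connect("maestro.db")
    try:
        issues = check_provenance(connection, sys.argv[1])
    finally:
        connection.close()
    print("\n".join(issues) if issues else "provenance OK")
```

Returning a list of problems rather than raising on the first one lets both the commit check and the dirty-row check report in a single pass.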
How to reproduce
git clone https://github.com/Colinho22/maestro.git
cd maestro
git checkout v1.0.0
# Configure API keys in .env (see .env.template)
docker compose up runner
Scope
In scope:
- Pre-flight checklist verification
- Git tag + GitHub release creation
- Release notes written
- SHA-256 of the final maestro.db attached
Out of scope:
- Patches / cleanup after the data is gathered → those go to v1.0.1+ on main, separately. The v1.0.0 tag stays frozen.
- Attaching the raw maestro.db itself if it contains anything sensitive — start with the SHA-256 only, decide on the full upload separately.
If the run reveals a bug
Tag v1.0.0 stays frozen. Two paths:
- Bug is in the runner (data is salvageable, run was incomplete): fix on main, tag v1.0.1, rerun. Thesis cites v1.0.1.
- Bug is in the strategy or metric code (results are biased / wrong): same procedure. Fix, tag v1.0.1, rerun, cite v1.0.1.
Resist the temptation to retroactively edit v1.0.0. The whole point is the immutable reference.
Notes
- git_dirty=1 rows in the current maestro.db (the 8-row sop_based test run) are from dev iterations. The final thesis run produces a separate clean dataset where every row has git_dirty=0.
- Consider also tagging a v1.0.0-rc.1 candidate before the real run, to confirm Docker + the matrix produce sane output on a small subset (e.g. --repeats 1) before committing to the full multi-hour run.
Pre-flight checklist (before tagging)
- main is up to date, git status is clean (no uncommitted work; the git_dirty=1 warning we built in must be silent on the run)
- ruff check . + ruff format --check . both clean
- pytest -v green
- python -m maestro.run --dry-run produces the expected matrix shape
- docker build . succeeds and the container runs --dry-run correctly
- [...] data/, tier classifications confirmed against proposal §3)
- MODELS list reflects the final set of providers you intend to evaluate
- DEFAULT_REPEATS reflects the final statistical-power decision