chore: cut v1.0.0 release before final thesis experimental run #43

@Colinho22

Motivation

Once MAESTRO is ready to run the experimental matrix that produces the thesis data, the code should be frozen as a citable release, so that:

  1. The thesis can reference an exact, named code state: "the data in Table 3 was produced by MAESTRO v1.0.0" is a claim a reader can look up and verify far more easily than one that cites a bare commit SHA.
  2. The run_environments.git_commit column already records the SHA per row — pairing that with a human-readable tag closes the provenance loop. Anyone can cross-check the DB against the GitHub release.
  3. Post-run code cleanup, small improvements, and CodeRabbit refactors don't pollute the experimental codebase — they happen on main after v1.0.0, while the thesis points at the frozen tag.

Pre-flight checklist (before tagging)

  • All open feature/chore issues that affect experimental behaviour are closed
  • main is up to date, git status is clean (no uncommitted work — the git_dirty=1 warning we built in must be silent on the run)
  • ruff check . + ruff format --check . both clean
  • pytest -v green
  • python -m maestro.run --dry-run produces the expected matrix shape
  • docker build . succeeds and the container runs --dry-run correctly
  • Input corpus is finalised (data files committed under data/, tier classifications confirmed against proposal §3)
  • MODELS list reflects the final set of providers you intend to evaluate
  • DEFAULT_REPEATS reflects the final statistical-power decision
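
The mechanical items in the checklist above could be scripted roughly as follows. This is a minimal sketch, not part of MAESTRO: the names `CHECKS`, `shell_runner`, `worktree_is_clean` and `preflight` are hypothetical, and the sketch only looks at exit codes, so it does not confirm that the dry run produces the expected matrix shape or that the corpus, MODELS and DEFAULT_REPEATS are final. Those items still need a human check.

```python
import subprocess
from typing import Callable, Sequence

# Commands copied from the checklist. The labels are illustrative only.
CHECKS: list[tuple[str, list[str]]] = [
    ("ruff check", ["ruff", "check", "."]),
    ("ruff format", ["ruff", "format", "--check", "."]),
    ("pytest", ["pytest", "-v"]),
    ("dry run", ["python", "-m", "maestro.run", "--dry-run"]),
    ("docker build", ["docker", "build", "."]),
]

# A runner takes a command and returns (exit code, stdout).
Runner = Callable[[Sequence[str]], tuple[int, str]]


def shell_runner(cmd: Sequence[str]) -> tuple[int, str]:
    """Run a command for real and return its exit code and stdout."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout


def worktree_is_clean(porcelain_output: str) -> bool:
    """`git status --porcelain` prints nothing when the tree is clean."""
    return porcelain_output.strip() == ""


def preflight(runner: Runner = shell_runner) -> list[str]:
    """Return the labels of all failed checks (empty list means go)."""
    failures: list[str] = []
    code, out = runner(["git", "status", "--porcelain"])
    if code != 0 or not worktree_is_clean(out):
        failures.append("git status")
    for label, cmd in CHECKS:
        code, _ = runner(cmd)
        if code != 0:
            failures.append(label)
    return failures
```

Calling `preflight()` from the repository root runs the real commands; an empty list means every scripted check passed. The runner is injectable so the logic can be exercised without touching git or Docker.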

Release procedure

# 1. Final verification on clean main
git checkout main && git pull
git status                       # must be clean
pytest -v                        # all green
docker build .                   # builds cleanly

# 2. Tag the commit
git tag -a v1.0.0 -m "Thesis experimental run — code freeze"
git push origin v1.0.0

# 3. Create the GitHub release with notes (gh CLI)
gh release create v1.0.0 \
  --title "v1.0.0 — Thesis experimental run" \
  --notes-file release-notes-v1.0.0.md

# 4. Run the experiment (containerised, on the tagged code)
docker run ... python -m maestro.run

# 5. After the run, attach the final maestro.db (or its SHA-256) as a release asset
sha256sum maestro.db > maestro.db.sha256
gh release upload v1.0.0 maestro.db.sha256
# Optional: gh release upload v1.0.0 maestro.db  (if size + privacy allow)
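
On machines without `sha256sum` (it is not installed by default on every platform), step 5 can be approximated in Python. A minimal sketch under that assumption; the helper names `sha256_of_file` and `write_checksum_file` are hypothetical, and the output mimics the `<hex>  <filename>` line format that `sha256sum` writes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large database is never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(db_path) -> Path:
    """Write '<name>.sha256' next to the file, in sha256sum's line format."""
    db_path = Path(db_path)
    out = db_path.with_name(db_path.name + ".sha256")
    out.write_text(f"{sha256_of_file(db_path)}  {db_path.name}\n")
    return out
```

For example, `write_checksum_file("maestro.db")` would produce `maestro.db.sha256`, which can then be uploaded with the `gh release upload` command from step 5.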

Release notes template (release-notes-v1.0.0.md)

MAESTRO v1.0.0 — Thesis experimental run

Code state that produced the experimental data reported in the MAESTRO thesis (Multi-Agent Evaluation for Structured Relational Output, FHGR FS26).

What this release contains

  • Four orchestration strategies under test: SingleAgent, SOP, CrewAI, LangGraph
  • Three control conditions: NullControl, CopyInputControl, GroundTruthEchoControl
  • N-LLM providers: Provider A (model 1; model n), ...
  • Full evaluation pipeline: structural validity (via mmdc), entity F1 (id/name/lemma), relationship F1 (relaxed/strict), error taxonomy
  • Reproducibility instrumentation: per-invocation environment capture (OS, arch, Python, git commit, lib versions), per-call retry counts, control-condition sanity floors and ceiling

Verifying the data was produced by this code

Every row in maestro.db is provenance-stamped via the environment_id foreign key into run_environments. The git_commit column on that table must equal the commit this release tag points at:

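sqlite3 maestro.db "SELECT DISTINCT git_commit FROM run_environments"
# returns: <SHA>
# v1.0.0 is an annotated tag, so a plain `git rev-parse v1.0.0` would print the
# tag object's SHA; ^{commit} peels it to the commit the tag points at.
git rev-parse "v1.0.0^{commit}"
# returns: <SHA>  (must match)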

git_dirty should be 0 for every row in the final dataset (rows with git_dirty=1 are from dev iterations, not the experimental run).
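
The two checks described above (a single git_commit equal to the tagged commit, and git_dirty = 0 throughout) can also be automated. A minimal sketch, assuming `run_environments` stores `git_commit` as the full SHA text and `git_dirty` as an integer 0/1; the function name `check_provenance` is hypothetical, and the expected SHA is passed in by the caller (for instance the commit SHA the release tag resolves to).

```python
import sqlite3


def check_provenance(conn: sqlite3.Connection, tag_sha: str) -> list[str]:
    """Return a list of human-readable problems (empty list means OK)."""
    problems: list[str] = []

    # Every environment row must carry exactly the tagged commit.
    commits = sorted(
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT git_commit FROM run_environments"
        )
    )
    if commits != [tag_sha]:
        problems.append(f"expected only {tag_sha}, found {commits}")

    # IS NOT also counts NULLs, which a plain != comparison would skip.
    dirty = conn.execute(
        "SELECT COUNT(*) FROM run_environments WHERE git_dirty IS NOT 0"
    ).fetchone()[0]
    if dirty:
        problems.append(f"{dirty} environment row(s) have git_dirty != 0")

    return problems
```

A caller would open the database with `sqlite3.connect("maestro.db")` and treat any non-empty result as a reason not to cite the dataset against this release.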

How to reproduce

git clone https://github.com/Colinho22/maestro.git
cd maestro
git checkout v1.0.0
# Configure API keys in .env (see .env.template)
docker compose up runner

Scope

In scope:

  • Pre-flight checklist verification
  • Git tag + GitHub release creation
  • Release notes written
  • SHA-256 of the final maestro.db attached

Out of scope:

  • Patches / cleanup after the data is gathered → those go to v1.0.1+ on main, separately. The v1.0.0 tag stays frozen.
  • Attaching the raw maestro.db itself if it contains anything sensitive — start with the SHA-256 only, decide on the full upload separately.

If the run reveals a bug

Tag v1.0.0 stays frozen. Two paths:

  1. Bug is in the runner (data is salvageable, run was incomplete): fix on main, tag v1.0.1, rerun. Thesis cites v1.0.1.
  2. Bug is in the strategy or metric code (results are biased / wrong): same — fix, tag v1.0.1, rerun, cite v1.0.1.

Resist the temptation to retroactively edit v1.0.0. The whole point is the immutable reference.

Notes

  • git_dirty=1 rows in the current maestro.db (the 8-row sop_based test run) are from dev iterations. The final thesis run produces a separate clean dataset where every row has git_dirty=0.
  • Consider also tagging a v1.0.0-rc.1 candidate before the real run, to confirm Docker + the matrix produce sane output on a small subset (e.g. --repeats 1) before committing to the full multi-hour run.

Labels: chore (maintenance, dependencies and infra stuff), documentation (improvements or additions to documentation)