refactor(prompts): unify Mermaid output contract + metric/CLI fixes surfaced by smoke test by Colinho22 · Pull Request #61 · Colinho22/maestro

Colinho22 · 2026-06-12T18:23:08Z

Summary

Consolidates the Mermaid output contract into a single source of truth (closes #60), then fixes two issues a real smoke run against OpenAI + DeepSeek surfaced: a scoring-core extractor bug and a contract gap that made model output unscoreable. Three independent commits, all on this branch.

This lands before any scored v1.0.0 runs, so the intentional prompt change (below) doesn't mix pre-/post-refactor data.

What changed

1. Unify the Mermaid output contract (#60)

The output rules were hand-copied in 5+ places (SYSTEM_PROMPT in 4 providers, plus inlined rules in single.py and _extraction.py step 3) and had drifted — single-agent and the multi-step strategies were given different output instructions, a latent confound since orchestration strategy is the independent variable.

- New src/maestro/prompts.py: MERMAID_SYSTEM_IDENTITY, MERMAID_RULES, and render_rules(skill=None) (the append-only hook for the future "skills" condition). Dependency-free to avoid import cycles.
- SYSTEM_PROMPT is now defined once on LLMProvider and removed from the four subclasses (deepseek still inherits via OpenAIProvider).
- single.py and step 3 build their prompts from render_rules(); no inline rule blocks remain, which resolves the drift.
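For readers unfamiliar with the pattern, here is a minimal sketch of a single-source prompt contract. The three names MERMAID_SYSTEM_IDENTITY, MERMAID_RULES, and render_rules(skill=None) come from this PR; the rule wording, the tuple layout, the numbered rendering, the DeepSeekProvider class name, and the SINGLE_TEMPLATE name are all assumptions for illustration and are not the module's actual contents.

```python
# Hypothetical sketch of a shared prompt-contract module.
# The identity and rule strings are placeholders, not the real contract text.
MERMAID_SYSTEM_IDENTITY = "You produce Mermaid diagrams."  # placeholder wording

MERMAID_RULES = (
    "Output a Mermaid flowchart only.",          # placeholder rule
    "Wrap every node label in double quotes.",   # placeholder rule
)


def render_rules(skill=None):
    """Render the shared rules as a numbered block; append `skill` if given."""
    block = "\n".join(
        f"{i}. {rule}" for i, rule in enumerate(MERMAID_RULES, start=1)
    )
    if skill:
        # Append-only: the base rules are never edited, only extended.
        block = f"{block}\n\n{skill.strip()}"
    return block


class LLMProvider:
    # Defined once on the base class; subclasses inherit it.
    SYSTEM_PROMPT = MERMAID_SYSTEM_IDENTITY


class OpenAIProvider(LLMProvider):
    pass


class DeepSeekProvider(OpenAIProvider):  # hypothetical class name
    pass


# A strategy template built from the shared rules. The doubled braces survive
# the f-string as a literal {source} placeholder for a later .format() call.
# This only works while the rule text itself contains no braces.
SINGLE_TEMPLATE = f"{render_rules()}\n\nDiagram source:\n{{source}}"
prompt = SINGLE_TEMPLATE.format(source="<input goes here>")
```

Because every strategy renders the same block, a change to a rule reaches single-agent and step 3 at the same time, which is what removes the confound described above.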

Intentional behavior change: single-agent now receives the hierarchy/subgraph guidance it previously lacked, which will change its container scores. This is the point — it removes the confound — and it's landing before scored runs.

2. Pin dialect + require quoted labels (contract content)

A smoke run revealed models default to output that can't be scored:

- Wrong dialect: models emitted C4Container syntax for IT diagrams, but all 30 ground truths are flowchart and the extractor can't parse C4 → guaranteed 0 on containers. The contract now pins flowchart/graph.
- Unquoted labels: labels with \n, parentheses, or slashes broke Mermaid parsing → parses_valid=0. The contract now requires quoted node labels, and quoted edge labels only when an edge has one (an over-broad first version produced empty |""| labels and was corrected).

3. Fix metric extractor for inline-labeled edges (scoring-core bug)

extract_relationships / extract_attachments returned [] when both edge endpoints redeclared a node label inline (a["A"] --> b["B"]) — valid Mermaid that renders fine, but silently zeroed the relationship/attachment score on otherwise-correct diagrams. GPT does this routinely, so this would have corrupted the scored run. Fixed by collapsing inline labels to bare ids before the operator scan; 4 regression tests added.

4. CLI: comma-separated filters (`run.py`)

--example, --model, --strategy now accept comma lists, so one command can target a subset matrix (e.g. --model a,b --strategy single_agent,lang_graph). Release polish; useful for targeted runs.
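A rough sketch of how comma lists can drive a subset matrix follows. The function names, signatures, and sample values are hypothetical; the real run.py may parse and combine its filters differently.

```python
from itertools import product


def split_csv(value):
    """Parse "a, b" into ["a", "b"]; None means the flag was not given."""
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


def build_matrix_sketch(examples, models, strategies,
                        example=None, model=None, strategy=None):
    """Cross product of the three axes after membership filtering."""
    def keep(items, raw):
        wanted = split_csv(raw)
        if wanted is None:
            return list(items)  # no filter: keep the whole axis
        return [item for item in items if item in wanted]

    return list(product(keep(examples, example),
                        keep(models, model),
                        keep(strategies, strategy)))


cells = build_matrix_sketch(
    examples=["ex1", "ex2"],                              # hypothetical ids
    models=["a", "b", "c"],                               # hypothetical names
    strategies=["single_agent", "lang_graph", "other"],   # "other" is made up
    model="a,b",
    strategy="single_agent,lang_graph",
)
print(len(cells))  # 2 examples x 2 models x 2 strategies = 8
```

Note that this sketch silently drops values that match nothing, which is the gap the later review feedback on this PR addresses.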

Smoke-test evidence (2 inputs × 2 providers × 2 strategies × 2 repeats)

| Metric | Original prompt | Af

Summary by CodeRabbit

New Features
- CLI filtering accepts comma-separated values for strategies, models, and examples.
Bug Fixes
- More robust relationship/attachment extraction when nodes include inline labels.
Refactor
- Unified Mermaid prompt rule/identity used across providers and prompts; templates now reference the shared rules.
Tests
- Added comprehensive tests for prompts, extraction, and run-filter behavior.
Chores
- Linter config adjusted to allow long prompt lines; several BPMN data labels corrected.

…f truth Consolidate the system identity + output rules into maestro.prompts so every provider and strategy shares one byte-identical contract, resolving the drift between single-agent and multi-step step 3. Pin flowchart dialect and require quoted labels so model output stays scoreable and parseable. Closes #60.

extract_relationships/extract_attachments returned [] when both edge endpoints redeclared a node label inline (a["A"] --> b["B"]) — valid Mermaid that silently zeroed relationship/attachment scores. Collapse inline labels to bare ids before the operator scan; add regression tests.

Each filter previously took one value; comma lists let a single command target a subset matrix (e.g. --model a,b --strategy single_agent,lang_graph).

coderabbitai · 2026-06-12T18:23:19Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c1cafd33-05c5-4f80-88fe-457cbd81c5d2

📥 Commits

Reviewing files that changed from the base of the PR and between 4442de8 and adf8502.

📒 Files selected for processing (6)

data/05_bpmn_1.JSON
data/25_bpmn_3.JSON
src/maestro/prompts.py
src/maestro/run.py
tests/test_prompts.py
tests/test_run_filters.py

✅ Files skipped from review due to trivial changes (2)

data/05_bpmn_1.JSON
data/25_bpmn_3.JSON

🚧 Files skipped from review as they are similar to previous changes (2)

src/maestro/prompts.py
tests/test_prompts.py

📝 Walkthrough

This PR centralizes the Mermaid prompt contract in maestro.prompts, makes providers inherit a single SYSTEM_PROMPT, injects canonical rules into strategy templates, ensures edge parsing ignores inline node labels, and enables comma-separated CLI filters with accompanying tests and small data fixes.

Changes

Mermaid Output Contract Unification and Enhancement

| Layer / File(s) | Summary |
|---|---|
| Canonical prompts module and tests `src/maestro/prompts.py`, `pyproject.toml`, `tests/test_prompts.py` | Adds `MERMAID_SYSTEM_IDENTITY`, `MERMAID_RULES`, and `render_rules()`. Exempts the prompts file from line-length linting and snapshots/validates the contract in tests. |
| Provider SYSTEM_PROMPT centralization `src/maestro/providers/base.py`, `src/maestro/providers/anthropic.py`, `src/maestro/providers/gemini.py`, `src/maestro/providers/mistral.py`, `src/maestro/providers/openai.py` | Defines `LLMProvider.SYSTEM_PROMPT = MERMAID_SYSTEM_IDENTITY` and removes duplicated provider-level `SYSTEM_PROMPT` literals so providers inherit the shared identity. |
| Strategy prompt unification `src/maestro/strategies/single.py`, `src/maestro/strategies/_extraction.py` | Both single-agent and step-3 templates now source their rules from `render_rules()` and use escaped runtime placeholders to preserve later `.format()` substitution. |
| Extraction inline-label robustness `src/maestro/analysis/metrics.py`, `tests/analysis/test_extraction.py` | Adds `_strip_inline_labels()` and applies it before relationship/attachment regex matching so inline `id["Label"]` node declarations don’t break edge parsing; adds regression tests. |
| CLI comma-separated multi-value filtering `src/maestro/run.py`, `tests/test_run_filters.py` | Adds `_split_csv()` and updates `build_matrix` to accept membership-based comma-separated `--strategy`, `--model`, and `--example` filters with upfront validation and control-strategy handling; tests cover parsing and fail-fast semantics. |
| Data display-name fixes `data/05_bpmn_1.JSON`, `data/25_bpmn_3.JSON` | Populate empty `exclusiveGateway.name` fields with meaningful labels for sample BPMN files. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Colinho22/maestro#32: Adds regression tests for step-3 system-prompt passthrough that depend on the centralized provider identity.
Colinho22/maestro#35: Related changes around --model filtering and control-vs-real strategy behavior in run matrix construction.
Colinho22/maestro#13: Prior work touching the Mermaid prompt contract and strategy wiring.

Suggested labels

enhancement, bug

Poem

🐰 A rabbit’s prompt parade

Five prompts once scattered, now one shining seed,
Rules stitched together so diagrams succeed.
Labels trimmed and CSVs neatly split,
Tests hop along — the refactor is fit. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title directly summarizes the main change: unifying the Mermaid output contract and addressing metric/CLI fixes. Clear and specific. |
| Linked Issues check | ✅ Passed | All objectives from issue `#60` are met: prompts.py added with constants/helper, SYSTEM_PROMPT unified on base, strategies use render_rules(), tests verify contract. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: prompt unification, metric extractor fix for inline labels, CLI filter validation, and BPMN data corrections. No unrelated changes present. |
| Docstring Coverage (Src Only) | ✅ Passed | Checked docstrings (module/class/function; public names only) for changed src files in PR `#61` via AST: 44/44 public definitions had docstrings (100% coverage). |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/test_prompts.py (1)
97-106: ⚡ Quick win

test_identity_reaches_complete does not exercise the fallback path it describes.

On Line 104 and Line 105, the test only checks the provider attribute identity and never calls complete(..., system_prompt=None), so provider-boundary fallback behavior can regress undetected. Please invoke complete and assert the recorded effective system prompt equals MERMAID_SYSTEM_IDENTITY.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_prompts.py` around lines 97 - 106, The test
test_identity_reaches_complete only asserts provider.SYSTEM_PROMPT and never
exercises the fallback used by the completion API; call provider.complete(...,
system_prompt=None) (using recording_provider_factory to create the provider and
a minimal prompt) so the provider records the effective system prompt, then
assert that the recorded system prompt equals MERMAID_SYSTEM_IDENTITY (use the
RecordingProvider's recorded request/results structure your tests use to inspect
the effective system prompt). Ensure you pass system_prompt=None into the
complete call to trigger the fallback.
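As an illustration of what this nitpick asks for, here is a self-contained sketch of a test that goes through complete() instead of only reading the class attribute. The RecordingProvider below is a stand-in written for this example, not the project's fixture, and the `is not None` fallback expression and the `calls` record layout are assumptions.

```python
# Stand-in for the shared identity; the real text lives in maestro.prompts.
MERMAID_SYSTEM_IDENTITY = "placeholder identity"


class RecordingProvider:
    """Hypothetical provider that records the system prompt it actually used."""

    SYSTEM_PROMPT = MERMAID_SYSTEM_IDENTITY

    def __init__(self):
        self.calls = []

    def complete(self, prompt, system_prompt=None):
        # Assumed fallback: use the class-level identity when none is passed.
        effective = (
            system_prompt if system_prompt is not None else self.SYSTEM_PROMPT
        )
        self.calls.append({"prompt": prompt, "system_prompt": effective})
        return "flowchart LR"


def test_identity_reaches_complete():
    provider = RecordingProvider()
    provider.complete("draw the diagram", system_prompt=None)
    # The assertion is on what complete() recorded, so a broken fallback fails.
    assert provider.calls[-1]["system_prompt"] == MERMAID_SYSTEM_IDENTITY


def test_explicit_prompt_wins():
    provider = RecordingProvider()
    provider.complete("draw the diagram", system_prompt="override")
    assert provider.calls[-1]["system_prompt"] == "override"
```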

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/maestro/run.py`:
- Around line 327-338: The strategy validation is strict; make --model and
--example validation consistent by detecting any unknown entries and failing
fast: for model_names and example_ids mirror the Strategy check (compute a valid
set for registered models and for registered example IDs, build unknown = [x for
x in model_names/example_ids if x not in valid], print an error to sys.stderr
with the unknown items and sorted valid set, then sys.exit(2) if unknown).
Preserve the intentional control-only no-op for --model by only performing the
error exit for unknown models when at least one real strategy is selected (use
the existing strategy_names/Strategy check to decide); reference the variables
model_names, example_ids, and the registered-model and registered-example name
sets when implementing.

---

Nitpick comments:
In `@tests/test_prompts.py`:
- Around line 97-106: The test test_identity_reaches_complete only asserts
provider.SYSTEM_PROMPT and never exercises the fallback used by the completion
API; call provider.complete(..., system_prompt=None) (using
recording_provider_factory to create the provider and a minimal prompt) so the
provider records the effective system prompt, then assert that the recorded
system prompt equals MERMAID_SYSTEM_IDENTITY (use the RecordingProvider's
recorded request/results structure your tests use to inspect the effective
system prompt). Ensure you pass system_prompt=None into the complete call to
trigger the fallback.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7933d8b1-ba70-49ad-b006-cdbae7a0045d

📥 Commits

Reviewing files that changed from the base of the PR and between 1121d78 and 4442de8.

📒 Files selected for processing (13)

pyproject.toml
src/maestro/analysis/metrics.py
src/maestro/prompts.py
src/maestro/providers/anthropic.py
src/maestro/providers/base.py
src/maestro/providers/gemini.py
src/maestro/providers/mistral.py
src/maestro/providers/openai.py
src/maestro/run.py
src/maestro/strategies/_extraction.py
src/maestro/strategies/single.py
tests/analysis/test_extraction.py
tests/test_prompts.py

Address review feedback. --example had no validation and --model only errored when the filter dropped every model, so a typo in a comma list silently shrank the matrix. Reject any unknown --example/--strategy value (strict) and any unknown --model value when a real LLM strategy is selected — preserving the control-only no-op where --model is intentionally ignored. Add test_run_filters covering all paths. Also make test_fallback_identity_resolves_to_shared exercise the provider system-prompt fallback expression rather than only the class attribute.
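The validation rules in this commit message can be sketched as below. This is a simplified illustration, not the code in run.py: the function names, the argument layout, and the `control_strategies` parameter are hypothetical. Exit code 2 follows the review comment's suggestion.

```python
import sys


def find_unknown(requested, valid):
    """Return requested values that are not registered, preserving order."""
    return [value for value in (requested or []) if value not in valid]


def filter_errors(strategy_names, model_names, example_ids,
                  valid_strategies, valid_models, valid_examples,
                  control_strategies=frozenset()):
    """Collect one message per flag that contains an unknown value."""
    errors = []
    # --strategy and --example are always strict.
    for flag, requested, valid in (
        ("--strategy", strategy_names, valid_strategies),
        ("--example", example_ids, valid_examples),
    ):
        unknown = find_unknown(requested, valid)
        if unknown:
            errors.append(f"unknown {flag}: {unknown}; valid: {sorted(valid)}")

    # --model is only checked when at least one non-control strategy will run,
    # which keeps the control-only case a no-op.
    selected = strategy_names or sorted(valid_strategies)
    if any(name not in control_strategies for name in selected):
        unknown = find_unknown(model_names, valid_models)
        if unknown:
            errors.append(
                f"unknown --model: {unknown}; valid: {sorted(valid_models)}"
            )
    return errors


def check_or_exit(*args, **kwargs):
    """Fail fast: print every problem to stderr, then exit with status 2."""
    errors = filter_errors(*args, **kwargs)
    if errors:
        for message in errors:
            print(message, file=sys.stderr)
        sys.exit(2)
```

Returning the messages from `filter_errors` and exiting in a thin wrapper keeps the rules testable without catching SystemExit.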

Smoke runs showed models emitting parse-breaking labels: spaces inside edge pipes (-->| "x" |) and empty brackets (node[""]). Add rules pinning a flowchart LR header, bare-id for unlabelled nodes, and tight quoted edge labels with no empty labels. Update the rules snapshot.
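To make the three rules concrete, the following sketch flags the patterns named above in a model's output. It is a hypothetical lint helper written for illustration, not part of the PR; the regexes are approximations and the exact header check is an assumption based on the wording of this commit message.

```python
import re

# Whitespace between the pipes and the quoted edge label, e.g. -->| "x" |
_SPACED_EDGE_LABEL = re.compile(r'\|\s+"[^"]*"\s*\||\|"[^"]*"\s+\|')
# An empty bracket label on a node, e.g. node[""]
_EMPTY_NODE_LABEL = re.compile(r'\[""\]')
# An empty edge label, e.g. |""|
_EMPTY_EDGE_LABEL = re.compile(r'\|\s*""\s*\|')


def contract_violations(mermaid):
    """Return human-readable problems found in a Mermaid flowchart string."""
    lines = [line for line in mermaid.splitlines() if line.strip()]
    problems = []
    if not lines or lines[0].strip() != "flowchart LR":
        problems.append("header: expected 'flowchart LR' on the first line")
    for number, line in enumerate(lines, start=1):
        if _SPACED_EDGE_LABEL.search(line):
            problems.append(f"line {number}: spaces inside edge-label pipes")
        if _EMPTY_NODE_LABEL.search(line):
            problems.append(f"line {number}: empty node label, use a bare id")
        if _EMPTY_EDGE_LABEL.search(line):
            problems.append(f"line {number}: empty edge label, omit it instead")
    return problems


good = 'flowchart LR\n    a["A"] -->|"calls"| b\n'
bad = 'graph TD\n    a[""] -->| "x" | b\n'
print(contract_violations(good))
print(contract_violations(bad))
```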

Four exclusive-gateway nodes had name="" in the JSON but a meaningful label in the ground truth (gw_result->Result, gw_manager_decision->Manager Decision, xgw_approval_result->Vacation Approval, xgw_manual_result->Vacation Approved), so models could not produce a label the expected output required. Add the names to the input. Generic event labels (Start/End/Error) and BPMN notation (+) are left for a separate dataset audit.

Colinho22 added 3 commits June 12, 2026 20:19

feat(run): accept comma-separated --example/--model/--strategy filters

4442de8

Each filter previously took one value; comma lists let a single command target a subset matrix (e.g. --model a,b --strategy single_agent,lang_graph).

Colinho22 added the chore Maintenance, dependencies and infra stuff label Jun 12, 2026

Colinho22 self-assigned this Jun 12, 2026

Colinho22 added this to the 🧪 Experimental Artifact milestone Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

Comment thread src/maestro/run.py Outdated

Colinho22 added 3 commits June 12, 2026 20:40

Colinho22 merged commit 6f5d819 into main Jun 12, 2026
2 checks passed

Colinho22 deleted the refactor/unify-mermaid-contract branch June 12, 2026 19:55

Colinho22 merged 6 commits into main from refactor/unify-mermaid-contract
