
refactor(prompts): unify Mermaid output contract + metric/CLI fixes surfaced by smoke test #61

Merged
Colinho22 merged 6 commits into main from refactor/unify-mermaid-contract
Jun 12, 2026

Conversation

@Colinho22

@Colinho22 Colinho22 commented Jun 12, 2026

Owner

Summary

Consolidates the Mermaid output contract into a single source of truth (closes #60), then fixes two issues a real smoke run against OpenAI + DeepSeek surfaced: a scoring-core extractor bug and a contract gap that made model output unscoreable. Three independent commits, all on this branch.

This lands before any scored v1.0.0 runs, so the intentional prompt change (below) doesn't mix pre-/post-refactor data.

What changed

1. Unify the Mermaid output contract (#60)

The output rules were hand-copied in 5+ places (SYSTEM_PROMPT in 4 providers, plus inlined rules in single.py and _extraction.py step 3) and had drifted — single-agent and the multi-step strategies were given different output instructions, a latent confound since orchestration strategy is the independent variable.

  • New src/maestro/prompts.py: MERMAID_SYSTEM_IDENTITY, MERMAID_RULES, and render_rules(skill=None) (the append-only hook for the future "skills" condition). Dependency-free to avoid import cycles.
  • SYSTEM_PROMPT now defined once on LLMProvider; removed from the four subclasses (deepseek still inherits via OpenAIProvider).
  • single.py and step 3 build their prompts from render_rules() — no inline rule blocks remain. Drift resolved.
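A rough sketch of the shape described in the bullets above, not the actual module. The prompt and rule text are placeholders (the real wording is not shown here), and `DeepSeekProvider` is an assumed class name; only `MERMAID_SYSTEM_IDENTITY`, `MERMAID_RULES`, `render_rules(skill=None)`, `LLMProvider`, and `OpenAIProvider` come from the description.

```python
from typing import Optional

# Placeholder wording: the real identity and rule text are assumptions here.
MERMAID_SYSTEM_IDENTITY = "You generate Mermaid diagrams."
MERMAID_RULES = (
    "- Output a Mermaid flowchart only.\n"
    "- Quote every node label."
)


def render_rules(skill: Optional[str] = None) -> str:
    """Return the canonical rules, optionally followed by a skill block.

    Append-only: the base rules are never edited, only extended.
    """
    if skill is None:
        return MERMAID_RULES
    return f"{MERMAID_RULES}\n{skill}"


class LLMProvider:
    """Base class: the system prompt is defined exactly once, here."""

    SYSTEM_PROMPT = MERMAID_SYSTEM_IDENTITY


class OpenAIProvider(LLMProvider):
    """No SYSTEM_PROMPT override, so the base value is inherited."""


class DeepSeekProvider(OpenAIProvider):  # hypothetical class name
    """Inherits the same prompt through OpenAIProvider."""
```

Because no subclass redefines `SYSTEM_PROMPT`, every provider resolves to the same object, which is what makes the contract byte-identical across providers.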

Intentional behavior change: single-agent now receives the hierarchy/subgraph guidance it previously lacked, which will change its container scores. This is the point — it removes the confound — and it's landing before scored runs.

2. Pin dialect + require quoted labels (contract content)

A smoke run revealed models default to output that can't be scored:

  • Wrong dialect: models emitted C4Container syntax for IT diagrams, but all 30 ground truths are flowchart and the extractor can't parse C4 → guaranteed 0 on containers. The contract now pins flowchart/graph.
  • Unquoted labels: labels with \n, parentheses, or slashes broke Mermaid parsing → parses_valid=0. The contract now requires quoted node labels (and quoted edge labels only when an edge has one — an over-broad first version produced empty |""| labels and was corrected).
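To make the two failure modes concrete, here is a hypothetical lint that is not part of this PR. It flags a non-flowchart header, an empty `|""|` edge label, and an unquoted square-bracket node label. The regexes are deliberately narrow assumptions and do not cover other Mermaid node shapes.

```python
import re
from typing import List

# Hypothetical helper: flags the failure modes listed above.
_HEADER = re.compile(r"^\s*(flowchart|graph)\b")
_EMPTY_EDGE_LABEL = re.compile(r'\|\s*""\s*\|')
# A square-bracket label not wrapped in double quotes, e.g. a[API (v2)]
_UNQUOTED_NODE_LABEL = re.compile(r'\b\w+\[(?!")[^\]]*\]')


def contract_violations(diagram: str) -> List[str]:
    """Return a list of human-readable contract problems (empty if none)."""
    problems = []
    lines = [ln for ln in diagram.splitlines() if ln.strip()]
    if not lines or not _HEADER.match(lines[0]):
        problems.append("header is not flowchart/graph")
    if _EMPTY_EDGE_LABEL.search(diagram):
        problems.append("empty edge label")
    if _UNQUOTED_NODE_LABEL.search(diagram):
        problems.append("unquoted node label")
    return problems


good = 'flowchart LR\n    a["API (v2)"] -->|"calls"| b["DB"]'
bad = 'C4Container\n    a[API (v2)] -->|""| b["DB"]'
print(contract_violations(good))
print(contract_violations(bad))
```

A check like this could run on raw model output before scoring, so a dialect or quoting failure is reported explicitly instead of surfacing as a zero score.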

3. Fix metric extractor for inline-labeled edges (scoring-core bug)

extract_relationships / extract_attachments returned [] when both edge endpoints redeclared a node label inline (a["A"] --> b["B"]) — valid Mermaid that renders fine, but silently zeroed the relationship/attachment score on otherwise-correct diagrams. GPT does this routinely, so this would have corrupted the scored run. Fixed by collapsing inline labels to bare ids before the operator scan; 4 regression tests added.

4. CLI: comma-separated filters (run.py)

--example, --model, --strategy now accept comma lists, so one command can target a subset matrix (e.g. --model a,b --strategy single_agent,lang_graph). Release polish; useful for targeted runs.
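One way the comma-list filtering could work, sketched under assumptions: `split_csv` and `filter_matrix` are illustrative names, and the real `_split_csv()` and `build_matrix` in run.py may differ in signature and behavior.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def split_csv(value: Optional[str]) -> Optional[List[str]]:
    """Split 'a, b,,c' into ['a', 'b', 'c']; None means no filter was given."""
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


def filter_matrix(
    models: Sequence[str],
    strategies: Sequence[str],
    model_filter: Optional[str] = None,
    strategy_filter: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Keep only the (model, strategy) pairs whose members pass both filters."""
    wanted_models = split_csv(model_filter)
    wanted_strategies = split_csv(strategy_filter)
    return [
        (m, s)
        for m, s in product(models, strategies)
        if (wanted_models is None or m in wanted_models)
        and (wanted_strategies is None or s in wanted_strategies)
    ]


pairs = filter_matrix(
    ["a", "b", "c"],
    ["single_agent", "lang_graph", "other"],
    model_filter="a,b",
    strategy_filter="single_agent,lang_graph",
)
print(pairs)
```

Treating each filter as a membership test keeps the single-value case working unchanged, since a one-item list behaves like the old equality check.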

Smoke-test evidence (2 inputs × 2 providers × 2 strategies × 2 repeats)

| Metric | Original prompt | Af

Summary by CodeRabbit

  • New Features

    • CLI filtering accepts comma-separated values for strategies, models, and examples.
  • Bug Fixes

    • More robust relationship/attachment extraction when nodes include inline labels.
  • Refactor

    • Unified Mermaid prompt rule/identity used across providers and prompts; templates now reference the shared rules.
  • Tests

    • Added comprehensive tests for prompts, extraction, and run-filter behavior.
  • Chores

    • Linter config adjusted to allow long prompt lines; several BPMN data labels corrected.

…f truth

Consolidate the system identity + output rules into maestro.prompts so every
provider and strategy shares one byte-identical contract, resolving the drift
between single-agent and multi-step step 3. Pin flowchart dialect and require
quoted labels so model output stays scoreable and parseable. Closes #60.

extract_relationships/extract_attachments returned [] when both edge endpoints
redeclared a node label inline (a["A"] --> b["B"]), valid Mermaid that
silently zeroed relationship/attachment scores. Collapse inline labels to bare
ids before the operator scan; add regression tests.

Each filter previously took one value; comma lists let a single command target
a subset matrix (e.g. --model a,b --strategy single_agent,lang_graph).
@coderabbitai

coderabbitai Bot commented Jun 12, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c1cafd33-05c5-4f80-88fe-457cbd81c5d2

📥 Commits

Reviewing files that changed from the base of the PR and between 4442de8 and adf8502.

📒 Files selected for processing (6)
  • data/05_bpmn_1.JSON
  • data/25_bpmn_3.JSON
  • src/maestro/prompts.py
  • src/maestro/run.py
  • tests/test_prompts.py
  • tests/test_run_filters.py
✅ Files skipped from review due to trivial changes (2)
  • data/05_bpmn_1.JSON
  • data/25_bpmn_3.JSON
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/maestro/prompts.py
  • tests/test_prompts.py

Walkthrough

This PR centralizes the Mermaid prompt contract in maestro.prompts, makes providers inherit a single SYSTEM_PROMPT, injects canonical rules into strategy templates, ensures edge parsing ignores inline node labels, and enables comma-separated CLI filters with accompanying tests and small data fixes.

Changes

Mermaid Output Contract Unification and Enhancement

| Layer | File(s) | Summary |
| --- | --- | --- |
| Canonical prompts module and tests | src/maestro/prompts.py, pyproject.toml, tests/test_prompts.py | Adds MERMAID_SYSTEM_IDENTITY, MERMAID_RULES, and render_rules(). Exempts the prompts file from line-length linting and snapshots/validates the contract in tests. |
| Provider SYSTEM_PROMPT centralization | src/maestro/providers/base.py, src/maestro/providers/anthropic.py, src/maestro/providers/gemini.py, src/maestro/providers/mistral.py, src/maestro/providers/openai.py | Defines LLMProvider.SYSTEM_PROMPT = MERMAID_SYSTEM_IDENTITY and removes duplicated provider-level SYSTEM_PROMPT literals so providers inherit the shared identity. |
| Strategy prompt unification | src/maestro/strategies/single.py, src/maestro/strategies/_extraction.py | Both single-agent and step-3 templates now source their rules from render_rules() and use escaped runtime placeholders to preserve later .format() substitution. |
| Extraction inline-label robustness | src/maestro/analysis/metrics.py, tests/analysis/test_extraction.py | Adds _strip_inline_labels() and applies it before relationship/attachment regex matching so inline id["Label"] node declarations don’t break edge parsing; adds regression tests. |
| CLI comma-separated multi-value filtering | src/maestro/run.py, tests/test_run_filters.py | Adds _split_csv() and updates build_matrix to accept membership-based comma-separated --strategy, --model, and --example filters with upfront validation and control-strategy handling; tests cover parsing and fail-fast semantics. |
| Data display-name fixes | data/05_bpmn_1.JSON, data/25_bpmn_3.JSON | Populate empty exclusiveGateway.name fields with meaningful labels for sample BPMN files. |
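The "escaped runtime placeholders" in the strategy row can be illustrated with a small sketch. The rule text, template wording, and the `source` placeholder name are assumptions; the point is only the two-stage substitution, where shared rules are interpolated when the template is built and the runtime value is filled in later.

```python
RULES = '- Quote every node label, e.g. a["API"].'  # assumed rule text

# Build time: the f-string interpolates {RULES} immediately, while the doubled
# braces in {{source}} collapse to a literal {source} that survives in the text.
TEMPLATE = f"""Convert the input into a Mermaid flowchart.

Rules:
{RULES}

Input:
{{source}}
"""

# Run time: only the surviving {source} placeholder is substituted.
prompt = TEMPLATE.format(source="web -> db")
print(prompt)
```

One caveat with this pattern: any literal brace inside the interpolated rule text would be seen by the later `.format()` call, so such braces would need to be doubled in the rules themselves.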

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Colinho22/maestro#32: Adds regression tests for step-3 system-prompt passthrough that depend on the centralized provider identity.
  • Colinho22/maestro#35: Related changes around --model filtering and control-vs-real strategy behavior in run matrix construction.
  • Colinho22/maestro#13: Prior work touching the Mermaid prompt contract and strategy wiring.

Suggested labels

enhancement, bug

Poem

🐰 A rabbit’s prompt parade

Five prompts once scattered, now one shining seed,
Rules stitched together so diagrams succeed.
Labels trimmed and CSVs neatly split,
Tests hop along — the refactor is fit. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title directly summarizes the main change: unifying the Mermaid output contract and addressing metric/CLI fixes. Clear and specific. |
| Linked Issues check | ✅ Passed | All objectives from issue #60 are met: prompts.py added with constants/helper, SYSTEM_PROMPT unified on base, strategies use render_rules(), tests verify contract. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: prompt unification, metric extractor fix for inline labels, CLI filter validation, and BPMN data corrections. No unrelated changes present. |
| Docstring Coverage (Src Only) | ✅ Passed | Checked docstrings (module/class/function; public names only) for changed src files in PR #61 via AST: 44/44 public definitions had docstrings (100% coverage). |


@Colinho22 Colinho22 added the chore Maintenance, dependencies and infra stuff label Jun 12, 2026
@Colinho22 Colinho22 self-assigned this Jun 12, 2026
@Colinho22 Colinho22 added this to the 🧪 Experimental Artifact milestone Jun 12, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_prompts.py (1)

97-106: ⚡ Quick win

test_identity_reaches_complete does not exercise the fallback path it describes.

On Line 104 and Line 105, the test only checks the provider attribute identity and never calls complete(..., system_prompt=None), so provider-boundary fallback behavior can regress undetected. Please invoke complete and assert the recorded effective system prompt equals MERMAID_SYSTEM_IDENTITY.
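A sketch of what the requested test could look like. `RecordingProvider` here is a hypothetical stand-in written for this example, its `recorded` list and the `is not None` fallback are assumptions about how the real provider resolves the prompt, and the identity text is a placeholder.

```python
MERMAID_SYSTEM_IDENTITY = "You generate Mermaid diagrams."  # assumed text


class RecordingProvider:
    """Hypothetical stand-in that records the system prompt it would send."""

    SYSTEM_PROMPT = MERMAID_SYSTEM_IDENTITY

    def __init__(self):
        self.recorded = []

    def complete(self, prompt, system_prompt=None):
        # Fallback under test: None resolves to the class-level identity.
        effective = system_prompt if system_prompt is not None else self.SYSTEM_PROMPT
        self.recorded.append(effective)
        return "flowchart LR"


def test_identity_reaches_complete():
    provider = RecordingProvider()
    provider.complete("draw a diagram", system_prompt=None)
    # Assert on what was actually sent, not on the class attribute.
    assert provider.recorded == [MERMAID_SYSTEM_IDENTITY]


test_identity_reaches_complete()
```

Asserting on the recorded value means a regression in the fallback expression fails the test even when the class attribute is still correct.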

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_prompts.py` around lines 97 - 106, The test
test_identity_reaches_complete only asserts provider.SYSTEM_PROMPT and never
exercises the fallback used by the completion API; call provider.complete(...,
system_prompt=None) (using recording_provider_factory to create the provider and
a minimal prompt) so the provider records the effective system prompt, then
assert that the recorded system prompt equals MERMAID_SYSTEM_IDENTITY (use the
RecordingProvider's recorded request/results structure your tests use to inspect
the effective system prompt). Ensure you pass system_prompt=None into the
complete call to trigger the fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/maestro/run.py`:
- Around line 327-338: The strategy validation is strict; make --model and
--example validation consistent by detecting any unknown entries and failing
fast: for model_names and example_ids mirror the Strategy check (compute a valid
set for registered models and for registered example IDs, build unknown = [x for
x in model_names/example_ids if x not in valid], print an error to sys.stderr
with the unknown items and sorted valid set, then sys.exit(2) if unknown).
Preserve the intentional control-only no-op for --model by only performing the
error exit for unknown models when at least one real strategy is selected (use
the existing strategy_names/Strategy check to decide); reference the variables
model_names, example_ids, and the registered-model and registered-example name
sets when implementing.
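A minimal sketch of the fail-fast validation this comment asks for. The function names and the `real_strategy_selected` flag are illustrative assumptions, not the code in run.py.

```python
import sys
from typing import Iterable, List, Set


def unknown_values(requested: Iterable[str], valid: Set[str]) -> List[str]:
    """Return the requested values that are not registered, in input order."""
    return [x for x in requested if x not in valid]


def validate_filter(flag: str, requested: Iterable[str], valid: Set[str]) -> None:
    """Exit with status 2 if any requested value is unknown."""
    unknown = unknown_values(requested, valid)
    if unknown:
        print(
            f"unknown {flag} value(s): {unknown}; valid: {sorted(valid)}",
            file=sys.stderr,
        )
        sys.exit(2)


def validate_models(
    requested: Iterable[str], valid: Set[str], real_strategy_selected: bool
) -> None:
    # Control-only runs ignore --model on purpose, so the check is skipped there.
    if real_strategy_selected:
        validate_filter("--model", requested, valid)
```

Rejecting any unknown entry, rather than erroring only when the filter empties the matrix, means a typo in a comma list cannot silently shrink a run.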


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7933d8b1-ba70-49ad-b006-cdbae7a0045d

📥 Commits

Reviewing files that changed from the base of the PR and between 1121d78 and 4442de8.

📒 Files selected for processing (13)
  • pyproject.toml
  • src/maestro/analysis/metrics.py
  • src/maestro/prompts.py
  • src/maestro/providers/anthropic.py
  • src/maestro/providers/base.py
  • src/maestro/providers/gemini.py
  • src/maestro/providers/mistral.py
  • src/maestro/providers/openai.py
  • src/maestro/run.py
  • src/maestro/strategies/_extraction.py
  • src/maestro/strategies/single.py
  • tests/analysis/test_extraction.py
  • tests/test_prompts.py

Comment thread src/maestro/run.py Outdated
Address review feedback. --example had no validation and --model only errored
when the filter dropped every model, so a typo in a comma list silently shrank
the matrix. Reject any unknown --example/--strategy value (strict) and any
unknown --model value when a real LLM strategy is selected, preserving the
control-only no-op where --model is intentionally ignored. Add test_run_filters
covering all paths.

Also make test_fallback_identity_resolves_to_shared exercise the provider
system-prompt fallback expression rather than only the class attribute.

Smoke runs showed models emitting parse-breaking labels: spaces inside edge
pipes (-->| "x" |) and empty brackets (node[""]). Add rules pinning a
flowchart LR header, bare-id for unlabelled nodes, and tight quoted edge labels
with no empty labels. Update the rules snapshot.

Four exclusive-gateway nodes had name="" in the JSON but a meaningful label in
the ground truth (gw_result->Result, gw_manager_decision->Manager Decision,
xgw_approval_result->Vacation Approval, xgw_manual_result->Vacation Approved),
so models could not produce a label the expected output required. Add the names
to the input. Generic event labels (Start/End/Error) and BPMN notation (+) are
left for a separate dataset audit.
@Colinho22 Colinho22 merged commit 6f5d819 into main Jun 12, 2026
2 checks passed
@Colinho22 Colinho22 deleted the refactor/unify-mermaid-contract branch June 12, 2026 19:55

Labels

chore Maintenance, dependencies and infra stuff

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore(refactor): Unify the Mermaid output contract into a single source of truth

1 participant