refactor(prompts): unify Mermaid output contract + metric/CLI fixes surfaced by smoke test#61
Conversation
…f truth Consolidate the system identity + output rules into maestro.prompts so every provider and strategy shares one byte-identical contract, resolving the drift between single-agent and multi-step step 3. Pin flowchart dialect and require quoted labels so model output stays scoreable and parseable. Closes #60.
extract_relationships/extract_attachments returned [] when both edge endpoints redeclared a node label inline (a["A"] --> b["B"]) — valid Mermaid that silently zeroed relationship/attachment scores. Collapse inline labels to bare ids before the operator scan; add regression tests.
Each filter previously took one value; comma lists let a single command target a subset matrix (e.g. --model a,b --strategy single_agent,lang_graph).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR centralizes the Mermaid prompt contract in ChangesMermaid Output Contract Unification and Enhancement
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_prompts.py (1)
97-106: ⚡ Quick win
test_identity_reaches_completedoes not exercise the fallback path it describes.On Line 104 and Line 105, the test only checks the provider attribute identity and never calls
complete(..., system_prompt=None), so provider-boundary fallback behavior can regress undetected. Please invokecompleteand assert the recorded effective system prompt equalsMERMAID_SYSTEM_IDENTITY.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_prompts.py` around lines 97 - 106, The test test_identity_reaches_complete only asserts provider.SYSTEM_PROMPT and never exercises the fallback used by the completion API; call provider.complete(..., system_prompt=None) (using recording_provider_factory to create the provider and a minimal prompt) so the provider records the effective system prompt, then assert that the recorded system prompt equals MERMAID_SYSTEM_IDENTITY (use the RecordingProvider's recorded request/results structure your tests use to inspect the effective system prompt). Ensure you pass system_prompt=None into the complete call to trigger the fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/maestro/run.py`:
- Around line 327-338: The strategy validation is strict; make --model and
--example validation consistent by detecting any unknown entries and failing
fast: for model_names and example_ids mirror the Strategy check (compute a valid
set for registered models and for registered example IDs, build unknown = [x for
x in model_names/example_ids if x not in valid], print an error to sys.stderr
with the unknown items and sorted valid set, then sys.exit(2] if unknown).
Preserve the intentional control-only no-op for --model by only performing the
error exit for unknown models when at least one real strategy is selected (use
the existing strategy_names/Strategy check to decide); reference the variables
model_names, example_ids, and the registered-model and registered-example name
sets when implementing.
---
Nitpick comments:
In `@tests/test_prompts.py`:
- Around line 97-106: The test test_identity_reaches_complete only asserts
provider.SYSTEM_PROMPT and never exercises the fallback used by the completion
API; call provider.complete(..., system_prompt=None) (using
recording_provider_factory to create the provider and a minimal prompt) so the
provider records the effective system prompt, then assert that the recorded
system prompt equals MERMAID_SYSTEM_IDENTITY (use the RecordingProvider's
recorded request/results structure your tests use to inspect the effective
system prompt). Ensure you pass system_prompt=None into the complete call to
trigger the fallback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7933d8b1-ba70-49ad-b006-cdbae7a0045d
📒 Files selected for processing (13)
pyproject.tomlsrc/maestro/analysis/metrics.pysrc/maestro/prompts.pysrc/maestro/providers/anthropic.pysrc/maestro/providers/base.pysrc/maestro/providers/gemini.pysrc/maestro/providers/mistral.pysrc/maestro/providers/openai.pysrc/maestro/run.pysrc/maestro/strategies/_extraction.pysrc/maestro/strategies/single.pytests/analysis/test_extraction.pytests/test_prompts.py
Address review feedback. --example had no validation and --model only errored when the filter dropped every model, so a typo in a comma list silently shrank the matrix. Reject any unknown --example/--strategy value (strict) and any unknown --model value when a real LLM strategy is selected — preserving the control-only no-op where --model is intentionally ignored. Add test_run_filters covering all paths. Also make test_fallback_identity_resolves_to_shared exercise the provider system-prompt fallback expression rather than only the class attribute.
Smoke runs showed models emitting parse-breaking labels: spaces inside edge pipes (-->| "x" |) and empty brackets (node[""]). Add rules pinning a flowchart LR header, bare-id for unlabelled nodes, and tight quoted edge labels with no empty labels. Update the rules snapshot.
Four exclusive-gateway nodes had name="" in the JSON but a meaningful label in the ground truth (gw_result->Result, gw_manager_decision->Manager Decision, xgw_approval_result->Vacation Approval, xgw_manual_result->Vacation Approved), so models could not produce a label the expected output required. Add the names to the input. Generic event labels (Start/End/Error) and BPMN notation (+) are left for a separate dataset audit.
Summary
Consolidates the Mermaid output contract into a single source of truth (closes #60), then fixes two issues a real smoke run against OpenAI + DeepSeek surfaced: a scoring-core extractor bug and a contract gap that made model output unscoreable. Three independent commits, all on this branch.
This lands before any scored v1.0.0 runs, so the intentional prompt change (below) doesn't mix pre-/post-refactor data.
What changed
1. Unify the Mermaid output contract (#60)
The output rules were hand-copied in 5+ places (
SYSTEM_PROMPTin 4 providers, plus inlined rules insingle.pyand_extraction.pystep 3) and had drifted — single-agent and the multi-step strategies were given different output instructions, a latent confound since orchestration strategy is the independent variable.src/maestro/prompts.py—MERMAID_SYSTEM_IDENTITY,MERMAID_RULES, andrender_rules(skill=None)(the append-only hook for the future "skills" condition). Dependency-free to avoid import cycles.SYSTEM_PROMPTnow defined once onLLMProvider; removed from the four subclasses (deepseek still inherits viaOpenAIProvider).single.pyand step 3 build their prompts fromrender_rules()— no inline rule blocks remain. Drift resolved.Intentional behavior change: single-agent now receives the hierarchy/subgraph guidance it previously lacked, which will change its container scores. This is the point — it removes the confound — and it's landing before scored runs.
2. Pin dialect + require quoted labels (contract content)
A smoke run revealed models default to output that can't be scored:
C4Containersyntax for IT diagrams, but all 30 ground truths areflowchartand the extractor can't parse C4 → guaranteed 0 on containers. The contract now pinsflowchart/graph.\n, parentheses, or slashes broke Mermaid parsing →parses_valid=0. The contract now requires quoted node labels (and quoted edge labels only when an edge has one — an over-broad first version produced empty|""|labels and was corrected).3. Fix metric extractor for inline-labeled edges (scoring-core bug)
extract_relationships/extract_attachmentsreturned[]when both edge endpoints redeclared a node label inline (a["A"] --> b["B"]) — valid Mermaid that renders fine, but silently zeroed the relationship/attachment score on otherwise-correct diagrams. GPT does this routinely, so this would have corrupted the scored run. Fixed by collapsing inline labels to bare ids before the operator scan; 4 regression tests added.4. CLI: comma-separated filters (
run.py)--example,--model,--strategynow accept comma lists, so one command can target a subset matrix (e.g.--model a,b --strategy single_agent,lang_graph). Release polish; useful for targeted runs.Smoke-test evidence (2 inputs × 2 providers × 2 strategies × 2 repeats)
| Metric | Original prompt | Af
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Chores