fix: make check_mermaid_valid work on Windows (replace /dev/stdin + /dev/null)

## Motivation

`maestro.analysis.metrics.check_mermaid_valid` is the function that validates whether a generated diagram parses as Mermaid (via the `mmdc` CLI). Its result populates `metric_results.parses_valid` for every successful run — a core column for both the metric pipeline and the `GroundTruthEchoControlStrategy` ceiling test.

The current implementation uses two Unix-only paths:

```python
result = subprocess.run(
    [mmdc, "-i", "/dev/stdin", "-o", out.name, "-e", "png"],
    input=diagram_code,
    ...
)
```

There is already an in-source `NOTE`: Windows-incompatible comment acknowledging this, but the workaround was deferred to "when/if Windows matters".

**It now matters.** Thesis reviewers are expected to verify reproducibility on Windows, and the institute may pick up the artifact for downstream benchmark work. On Windows:

- `/dev/stdin` does not exist → `mmdc` fails with a file-not-found error
- `tempfile.NamedTemporaryFile(delete=True)` holds the file open exclusively → `mmdc` cannot write the out PNG even if the input path worked

The result is that every `parses_valid` column would be `False` on a Windows run, with a path error in `parse_error`. That contaminates every metric row that uses validity as a filter, and breaks the `GroundTruthEchoControl` ceiling invariant (it would no longer hit `parses_valid=True`).

## Proposal

Replace both `/dev/stdin` and the auto-deleted temp file with explicit, cross-platform temp files. mmdc accepts a file path for `-i` just as readily as it does `/dev/stdin`.

```python
def check_mermaid_valid(diagram_code: str) -> tuple[bool | None, str | None]:
    """
    ... (existing docstring)
    """
    mmdc = shutil.which("mmdc")
    if mmdc is None:
        return (None, "mmdc not found — validation skipped")

    # Cross-platform: write the diagram to a real temp file (avoids
    # /dev/stdin which doesn't exist on Windows), let mmdc render to
    # another temp file (avoids /dev/null and NamedTemporaryFile's
    # Windows exclusive-lock behaviour). Both files are explicitly
    # cleaned up in a finally block.
    in_fd, in_path = tempfile.mkstemp(suffix=".mmd")
    out_fd, out_path = tempfile.mkstemp(suffix=".png")
    os.close(in_fd)
    os.close(out_fd)
    try:
        Path(in_path).write_text(diagram_code, encoding="utf-8")
        result = subprocess.run(
            [mmdc, "-i", in_path, "-o", out_path, "-e", "png"],
            capture_output=True,
            text=True,
            timeout=15,
        )
        if result.returncode == 0:
            return (True, None)
        return (False, result.stderr.strip()[:500])
    except subprocess.TimeoutExpired:
        return (False, "mmdc timed out after 15s")
    except Exception as e:
        return (False, str(e)[:500])
    finally:
        # Best-effort cleanup; don't let teardown errors mask the result.
        for p in (in_path, out_path):
            try:
                Path(p).unlink(missing_ok=True)
            except OSError:
                pass
```

Notes on the approach:
- `tempfile.mkstemp` returns a file descriptor + path. Closing the fd immediately and writing via a separate `Path.write_text` avoids both the Windows exclusive-lock issue and the "did the fd actually get closed before subprocess opens it" race that the with `NamedTemporaryFile` pattern has.
- We `subprocess.run` _without_ `input=`, since we're now passing a real file path.
- The `.mmd` suffix on the input file is more accurate than no suffix; `mmdc` uses it for diagram-type heuristics in some versions.
- The output file is created so `mmdc` has somewhere to write, then immediately deleted. We never read it (we only care about the exit code), but Windows requires the file to exist before subprocess opens it.

## Scope

**In scope:**
- Rewrite of `check_mermaid_valid` per the proposal above
- Update / extend the existing test in tests/analysis/test_metrics.py:
   - The `test_ground_truth_echo_scores_perfect_ceiling` test already exercises the success path; verify it still passes
   - Add a test for the `parse-failure path` (input that should fail `mmdc`) — not mermaid at all is already in the parametrized `test_zero_entities_in_output_never_crashes`, but that test doesn't assert on `parses_valid`. New explicit test welcome.
- Drop the existing NOTE: `Windows-incompatible` comment from the docstring (it'll no longer be true)

**Out of scope:**
- Adding Windows to CI matrix — that's a bigger conversation (the `ubuntu-latest` runner gives the project its Linux baseline; verifying Windows would need a separate runner and adds CI minutes)
- Manual Windows-machine verification — filing this as a follow-up, but the change is mechanically correct enough to ship without it as long as Linux/macOS `pytest` still passes

## Acceptance criteria
- [ ] `check_mermaid_valid` no longer references `/dev/stdin` or `/dev/null`
- [ ] Both input and output paths come from `tempfile.mkstemp` with explicit `.unlink(missing_ok=True)` in a finally block
- [ ] `ruff check . + ruff format --check .` clean
- [ ] `pytest -v `green on the existing 20 tests (especially `test_ground_truth_echo_scores_perfect_ceiling` which exercises the success path through this function)
- [ ] At least one new test asserting `parses_valid=False` for known-bad input (the negative path through the new code)
- [ ] The `Windows-incompatible` NOTE comment in `metrics.py` is removed
- [ ] `python -m maestro.run --strategy ground_truth_control --dry-run` still produces 1 row (no regression in the matrix builder)

## Risk
Small. The function has one clear responsibility (call `mmdc`, return validity), no callers depend on the temp-file mechanics, and the test suite already covers the happy path with the real ground-truth file. The cleanup-in-finally pattern is standard Python.

Main thing to watch: ensure the `out_path` is created (zero-byte) on disk before subprocess starts, since `mmdc` on some platforms / versions checks existence rather than opening with `O_CREAT. mkstemp` creates the file as part of its contract, so we're fine — but worth a sanity check via the existing ceiling test.

## Verification path

- `pytest -v` on the dev machine (Linux/macOS) — all 20 tests pass, same as before.
- **Optional**: on a Windows machine with `mmdc` on `PATH`, run the same `pytest` invocation. The ceiling test should produce `parses_valid=True` for the bpmn ground truth. Without this PR, it would have produced False with a `/dev/stdin not found` error.

## Related

- `src/maestro/analysis/metrics.py:check_mermaid_valid` (the function being fixed)
- `tests/analysis/test_metrics.py:test_ground_truth_echo_scores_perfect_ceiling` (already exercises the happy path)
- PR [#42](https://github.com/Colinho22/maestro/issues/42) / Issue [#14](https://github.com/Colinho22/maestro/issues/14) (tzdata for Windows in PR [#42](https://github.com/Colinho22/maestro/issues/42) was the first Windows-compat fix; this is the second)
- Issue for `v1.0.0` (#43 )release tagging — this PR should land before `v1.0.0` since reviewer Windows compatibility is a thesis-defense concern
- Issues [#20](https://github.com/Colinho22/maestro/issues/20) / [#21](https://github.com/Colinho22/maestro/issues/21) (Docker setup) — Docker is the belt; this PR is the braces. A Windows reviewer who can install Docker is covered by #20; a Windows reviewer who runs pytest natively to sanity-check the test suite is covered by this PR.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make check_mermaid_valid work on Windows (replace /dev/stdin + /dev/null) #44

Motivation

Proposal

Scope

Acceptance criteria

Risk

Verification path

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

fix: make check_mermaid_valid work on Windows (replace /dev/stdin + /dev/null) #44

Description

Motivation

Proposal

Scope

Acceptance criteria

Risk

Verification path

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions