feat(check): MultiTool Checks — multi check subcommand (M0–M6 + tests & docs)#140

Merged
RobbieMcKinstry merged 9 commits into trunk from robbie/mt-check
Jun 24, 2026

@RobbieMcKinstry

MultiTool Checks — the multi check subcommand

Implements the full MultiTool Checks project: `multi check` discovers
declared non-functional ("ility") requirements in `CHECKS.md` files and
validates them with sandboxed Claude Code agents, reporting verdicts through a
trustworthy in-process MCP guardrail. It's a test suite for requirements that
have no programmatic unit to test.

Closes every issue in milestones M0–M6 plus Tests & docs of the
MultiTool Checks project.

What's here, by milestone

| Milestone | Summary |
| --- | --- |
| M0 Skeleton | multi check subcommand; sync dispatch() → tokio runtime → 4-phase pipeline (mirrors Run). |
| M1 Discovery | ignore-walk for CHECKS.md; comrak AST parse; sentinel extraction (# Requirement/# Req, ## Check); anonymous-check inference; aggregated miette validation (orphan / checkless). |
| M2 Config & executor | Boxed CheckExecutor trait; concrete claude -p executor; hardcoded-but-injected Config (haiku). |
| M3 Sandboxing | cfg-gated boxed Sandbox trait; macOS APFS clonefile impl with RAII teardown; non-macOS stub. |
| M4 MCP server | One in-process rmcp server on a localhost port (dedicated task); N per-check endpoints with single-call semantics + result channel; lifecycle (shutdown-after-all / missing-report timeout); per-check --mcp-config payloads. |
| M5 Execution | Bounded-parallel JoinSet orchestrator (sandbox → endpoint → executor); agent instructions binding the report tool; reconcile (MCP report authoritative) + logical-AND aggregation. |
| M6 Reporting & exit | Green/red requirement titles via Terminal honoring --enable-colors; failing checks + evidence in red; passing checks omitted; exit 0 iff all satisfied (empty = 0), else 1. |
| Tests & docs | Fake executor + no-op sandbox; AST extractor unit tests (incl. duplicate-title preservation); end-to-end pipeline test (discovery → execution → reporting → exit code); guides/checks.md authoring guide + README link. |
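
The M5 and M6 rows describe logical-AND aggregation and the exit-code rule. A minimal sketch of that rule, for illustration only: the `Verdict` and `Requirement` types and both function names are hypothetical and are not taken from `src/checks/`.

```rust
/// Verdict for a single check (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Satisfied,
    Failed,
}

/// A requirement together with the verdicts of its checks (hypothetical shape).
struct Requirement {
    title: String,
    checks: Vec<Verdict>,
}

/// Logical AND: a requirement holds only if every one of its checks is satisfied.
/// A requirement with no checks would pass vacuously here; per the M1 row,
/// discovery is described as rejecting checkless requirements before this point.
fn requirement_satisfied(req: &Requirement) -> bool {
    req.checks.iter().all(|v| *v == Verdict::Satisfied)
}

/// Exit 0 iff all requirements are satisfied (an empty set yields 0), else 1.
fn exit_code(reqs: &[Requirement]) -> i32 {
    if reqs.iter().all(requirement_satisfied) {
        0
    } else {
        1
    }
}

fn main() {
    let reqs = vec![
        Requirement {
            title: "Reports are single-call".to_string(),
            checks: vec![Verdict::Satisfied, Verdict::Satisfied],
        },
        Requirement {
            title: "Sandbox is torn down".to_string(),
            checks: vec![Verdict::Satisfied, Verdict::Failed],
        },
    ];
    for req in &reqs {
        println!("{}: satisfied = {}", req.title, requirement_satisfied(req));
    }
    println!("exit code: {}", exit_code(&reqs));
}
```

Using `Iterator::all` at both levels gives the "empty = 0" behaviour for free, since `all` over an empty iterator is `true`.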

Verification

  • cargo test: 72/72 lib unit tests pass — including the real APFS clone
    test, a live MCP server bind/collect/shutdown, and the E2E pipeline (satisfied
    / failed / multi-check AND / anonymous-check, all with the fake executor — no
    claude, network, or APFS dependency).
  • cargo clippy --all-targets --workspace: zero warnings in the new code.
  • cargo fmt applied. CLI smoke-tested: empty tree → exit 0; orphan CHECKS.md
    → diagnostic + exit 1.
  • New deps: comrak, rmcp, axum, schemars, libc, tempfile.

Notes

  • New deps and the feature module live in the main multitool crate
    (src/checks/), since the executor / sandbox / MCP server are CLI-bound.
  • The MCP server's over-the-wire smoke test is covered at the handler
    (single-call + delivery), config-payload, and real server-lifecycle levels; a
    full rmcp-client round-trip was skipped because its reqwest transport pulls
    reqwest 0.13 vs. the project's pinned 0.12.
  • The branch also carries two pre-existing commits (refresh toolchain and makefile, Add storybloq) that predate this work.
  • One pre-existing, unrelated doctest in src/stats/categorical.rs (references
    the old canary crate name) fails independently of this change.

🤖 Generated with Claude Code

https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu

RobbieMcKinstry and others added 9 commits June 24, 2026 15:44
Implement the `multi check` subcommand: discover declared non-functional
("ility") requirements in CHECKS.md files and validate them with sandboxed
Claude Code agents, reporting verdicts through an in-process MCP guardrail.

- M0: subcommand wiring + sync dispatch → tokio runtime → 4-phase pipeline
- M1: CHECKS.md discovery (ignore), comrak AST parse, requirement/check
  extraction, anonymous-check inference, aggregated miette validation
- M2: boxed CheckExecutor trait + `claude -p` executor + injected Config
- M3: cfg-gated copy-on-write Sandbox trait + macOS APFS clonefile impl
- M4: in-process rmcp result server, N per-check endpoints, single-call
  semantics, lifecycle/shutdown, per-check --mcp-config payloads
- M5: bounded-parallel orchestrator, agent instructions, reconcile +
  logical-AND aggregation
- M6: colored requirement/check reporting + CI exit code
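
The "single-call semantics" in the M4 item can be sketched with the standard library alone. This is an illustration under assumptions, not the rmcp handler from this PR: `CheckEndpoint`, `Report`, and `ReportError` are hypothetical names, and a `std::sync::mpsc` channel stands in for the result channel.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::Mutex;

/// What an agent reports for one check (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Report {
    satisfied: bool,
    evidence: String,
}

#[derive(Debug, PartialEq, Eq)]
enum ReportError {
    /// The endpoint already accepted a report for this check.
    AlreadyReported,
    /// Nobody is waiting for the result any more.
    ReceiverGone,
}

/// One endpoint per check. The sender sits in an `Option` so the first
/// call can take it out, which makes every later call fail.
struct CheckEndpoint {
    slot: Mutex<Option<Sender<Report>>>,
}

impl CheckEndpoint {
    fn new(tx: Sender<Report>) -> Self {
        Self { slot: Mutex::new(Some(tx)) }
    }

    /// Accept exactly one report; reject any subsequent call.
    fn report(&self, report: Report) -> Result<(), ReportError> {
        let tx = self
            .slot
            .lock()
            .unwrap()
            .take()
            .ok_or(ReportError::AlreadyReported)?;
        tx.send(report).map_err(|_| ReportError::ReceiverGone)
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let endpoint = CheckEndpoint::new(tx);

    let first = endpoint.report(Report { satisfied: true, evidence: "looks fine".to_string() });
    let second = endpoint.report(Report { satisfied: false, evidence: "changed my mind".to_string() });

    println!("first call:  {:?}", first);
    println!("second call: {:?}", second);
    println!("delivered:   {:?}", rx.recv());
}
```

Taking the sender out of the slot means the first verdict is the only one that can reach the orchestrator, so an agent cannot overwrite its own report.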

New deps: comrak, rmcp, axum, schemars, libc, tempfile.

Implements milestones M0–M6 of the MultiTool Checks project
(MULTI-1331, MULTI-1332, MULTI-1333..1353).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
- Add an end-to-end pipeline test driving discovery → execution →
  reporting → exit code over a fixture tree, with the fake executor and
  no-op sandbox injected (no claude/network/APFS): covers a satisfied
  requirement, a failed requirement, a multi-check AND requirement, and
  an anonymous check; asserts aggregated verdicts and exit codes.
- Add a parser test asserting duplicate (non-unique) requirement/check
  titles are preserved, never deduped.
- Add guides/checks.md: `multi check` usage (working dir, CI exit codes,
  MVP constraints, trust model) and the full CHECKS.md authoring format
  with examples; link it from the README.

Implements the Tests & docs milestone (MULTI-1354, MULTI-1355, MULTI-1356).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
The doc example imported `canary::stats::Categorical` — a stale crate
name, and `stats` is a crate-internal module, so the trait can't be
reached from an external doctest at all. It was the only `rust` doctest
in the crate and failed the build.

Mark the example `ignore` (it documents an internal trait) and add a
unit test that actually exercises it, so `cargo test` is fully green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
The toolchain-refresh commit on this branch surfaced ~60 warnings across
the pre-existing (largely abandoned) modules, which `cargo make ci-flow`
fails on via clippy's `-D warnings`. Apply clippy's machine-applicable
fixes, update the deprecated `aws_config::BehaviorVersion` to v2025_08_07,
restore the feature-gated `ProxySubcommand` re-export, and allow the
remaining abandoned-code lint categories (dead_code, private_interfaces,
wrong_self_convention, large_enum_variant) crate-wide. The `checks`
feature added in this PR is clippy-clean on its own.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
Dogfooding `multi check` on this repository surfaced two operational
issues with real `claude` agents:

- The result server ran in stateless Streamable HTTP mode, which stalls
  the Claude Code MCP client's multi-step session handshake (rmcp only
  serves POST when stateless). Run in stateful mode instead.
- Full `claude` agents are heavy; 4+ concurrent starve each other past
  the timeout. Lower the default concurrency to 2 and the per-agent
  timeout to 120s, and retry a check (up to max_attempts) whose agent
  hangs or stops without reporting — a check only errors after all
  attempts are exhausted.
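
The retry rule in the second bullet could look roughly like the sketch below. It assumes a hypothetical `run_once` closure standing in for "launch one agent and wait up to the timeout"; `Attempt`, `Outcome`, and `run_with_retry` are illustrative names, not the identifiers used in this commit.

```rust
/// Result of one agent run (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Attempt {
    /// The agent called the report tool with a verdict.
    Reported(bool),
    /// The agent hung past the timeout or exited without reporting.
    NoReport,
}

/// Final outcome for a check after retries (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Verdict { satisfied: bool, attempts: u32 },
    Errored { attempts: u32 },
}

/// Retry only when no report arrived. A reported verdict, passing or
/// failing, is final. The check errors after `max_attempts` silent runs.
fn run_with_retry<F>(max_attempts: u32, mut run_once: F) -> Outcome
where
    F: FnMut(u32) -> Attempt,
{
    for attempt in 1..=max_attempts {
        if let Attempt::Reported(satisfied) = run_once(attempt) {
            return Outcome::Verdict { satisfied, attempts: attempt };
        }
    }
    Outcome::Errored { attempts: max_attempts }
}

fn main() {
    // Simulated agent: silent on the first attempt, reports on the second.
    let outcome = run_with_retry(3, |attempt| {
        if attempt < 2 {
            Attempt::NoReport
        } else {
            Attempt::Reported(true)
        }
    });
    println!("{:?}", outcome);
}
```

A failing verdict is deliberately not retried in this sketch, matching the description that only hangs and missing reports trigger another attempt.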

Also add the CHECKS.md self-validation suite (the dogfooding target),
including a requirement asserting the server runs in stateful mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
Empty commit to start a fresh workflow run so on-push.yml resolves
wack/gh-actions validate.yml@trunk to the version that installs
cargo-nextest and cargo-llvm-cov (re-runs pin the original reusable
workflow resolution).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
…0/10

Dogfooding `multi check` on this repo (the CHECKS.md self-validation
suite) surfaced two more issues that prevented a clean pass:

- Agents that reported still hung on connection cleanup, eating the
  whole timeout. The agent's job is done the moment it calls
  report-check-result, so deliver the verdict via a per-check Notify +
  shared store and race it against the process in execution::run_one:
  on a report we drop the run future, which kills the agent
  (kill_on_drop). This eliminates the post-report hang and cut a
  2-check smoke from 49s to 7s. Replaces the central mpsc/collect plumbing.

- The `haiku` family reliably spins in a runaway exploration loop on the
  multi-file *reasoning* checks (observed: 435% CPU for 5 min on one
  check, never reporting). Switch the hardcoded model to `sonnet`, which
  reasons efficiently and reports in well under a minute.

With these (+ concurrency 2, 240s timeout, retry), `multi check` passes
all 8 requirements / 10 checks of its own self-validation suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
@RobbieMcKinstry RobbieMcKinstry added this pull request to the merge queue Jun 24, 2026
Merged via the queue into trunk with commit 7b1ca74 Jun 24, 2026
8 checks passed
@RobbieMcKinstry RobbieMcKinstry deleted the robbie/mt-check branch June 24, 2026 19:58