feat(check): MultiTool Checks — `multi check` subcommand (M0–M6 + tests & docs) by RobbieMcKinstry · Pull Request #140 · wack/multitool

RobbieMcKinstry · 2026-06-24T03:56:45Z

MultiTool Checks — the `multi check` subcommand

Implements the full MultiTool Checks project: `multi check` discovers
declared non-functional ("ility") requirements in `CHECKS.md` files and
validates them with sandboxed Claude Code agents, reporting verdicts through a
trustworthy in-process MCP guardrail. It's a test suite for requirements that
have no programmatic unit to test.

Closes every issue in milestones M0–M6 plus Tests & docs of the
MultiTool Checks project.

What's here, by milestone

| Milestone | Summary |
| --- | --- |
| M0 Skeleton | `multi check` subcommand; sync `dispatch()` → tokio runtime → 4-phase pipeline (mirrors `Run`). |
| M1 Discovery | `ignore`-walk for `CHECKS.md`; comrak AST parse; sentinel extraction (`# Requirement`/`# Req`, `## Check`); anonymous-check inference; aggregated `miette` validation (orphan / checkless). |
| M2 Config & executor | Boxed `CheckExecutor` trait; concrete `claude -p` executor; hardcoded-but-injected `Config` (haiku). |
| M3 Sandboxing | `cfg`-gated boxed `Sandbox` trait; macOS APFS `clonefile` impl with RAII teardown; non-macOS stub. |
| M4 MCP server | One in-process rmcp server on a localhost port (dedicated task); N per-check endpoints with single-call semantics + result channel; lifecycle (shutdown-after-all / missing-report timeout); per-check `--mcp-config` payloads. |
| M5 Execution | Bounded-parallel `JoinSet` orchestrator (sandbox → endpoint → executor); agent instructions binding the report tool; reconcile (MCP report authoritative) + logical-AND aggregation. |
| M6 Reporting & exit | Green/red requirement titles via `Terminal` honoring `--enable-colors`; failing checks + evidence in red; passing checks omitted; exit `0` iff all satisfied (empty = `0`), else `1`. |
| Tests & docs | Fake executor + no-op sandbox; AST extractor unit tests (incl. duplicate-title preservation); end-to-end pipeline test (discovery → execution → reporting → exit code); `guides/checks.md` authoring guide + README link. |
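
The aggregation and exit-code rules in the M5 and M6 rows can be sketched roughly as follows. This is a minimal illustration, not the code in `src/checks/`: the `Verdict` and `Requirement` types and the function names are hypothetical. It assumes a requirement is satisfied only when every one of its checks is satisfied (logical AND), and that the exit code is `0` exactly when every requirement is satisfied, which also covers the empty case.

```rust
/// Verdict for a single check (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Satisfied,
    Failed,
}

/// A requirement with the verdicts of its checks (hypothetical type).
struct Requirement {
    title: String,
    checks: Vec<Verdict>,
}

/// Logical-AND aggregation: every check must be satisfied.
fn requirement_satisfied(req: &Requirement) -> bool {
    req.checks.iter().all(|v| *v == Verdict::Satisfied)
}

/// Titles of the requirements that would be reported as failing.
fn failing_titles(reqs: &[Requirement]) -> Vec<&str> {
    reqs.iter()
        .filter(|r| !requirement_satisfied(r))
        .map(|r| r.title.as_str())
        .collect()
}

/// Exit code: 0 iff all requirements are satisfied, else 1.
/// An empty slice yields 0 because `all` is vacuously true.
fn exit_code(reqs: &[Requirement]) -> i32 {
    if reqs.iter().all(requirement_satisfied) {
        0
    } else {
        1
    }
}
```

A requirement with no checks would also count as satisfied under this sketch; per the M1 row, checkless requirements are rejected during validation, so that case should not reach aggregation.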

Verification

- `cargo test`: 72/72 lib unit tests pass, including the real APFS clone
  test, a live MCP server bind/collect/shutdown, and the E2E pipeline (satisfied
  / failed / multi-check AND / anonymous-check, all with the fake executor, so no
  `claude`, network, or APFS dependency).
- `cargo clippy --all-targets --workspace`: zero warnings in the new code.
- `cargo fmt` applied.
- CLI smoke-tested: empty tree → exit 0; orphan `CHECKS.md` → diagnostic + exit 1.
- New deps: comrak, rmcp, axum, schemars, libc, tempfile.

Notes

- New deps and the feature module live in the main `multitool` crate
  (`src/checks/`), since the executor / sandbox / MCP server are CLI-bound.
- Over-the-wire behavior of the MCP server is covered at the handler
  (single-call + delivery), config-payload, and real server-lifecycle levels; a
  full rmcp-client round-trip was skipped because the client's reqwest transport
  pulls in reqwest 0.13, which conflicts with the project's pinned 0.12.
- The branch also carries two pre-existing commits (`refresh toolchain and makefile`, `Add storybloq`) that predate this work.
- One pre-existing, unrelated doctest in `src/stats/categorical.rs` (it references
  the old `canary` crate name) fails independently of this change.

🤖 Generated with Claude Code

https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu

Implement the `multi check` subcommand: discover declared non-functional ("ility") requirements in CHECKS.md files and validate them with sandboxed Claude Code agents, reporting verdicts through an in-process MCP guardrail.

- M0: subcommand wiring + sync dispatch → tokio runtime → 4-phase pipeline
- M1: CHECKS.md discovery (ignore), comrak AST parse, requirement/check extraction, anonymous-check inference, aggregated miette validation
- M2: boxed CheckExecutor trait + `claude -p` executor + injected Config
- M3: cfg-gated copy-on-write Sandbox trait + macOS APFS clonefile impl
- M4: in-process rmcp result server, N per-check endpoints, single-call semantics, lifecycle/shutdown, per-check --mcp-config payloads
- M5: bounded-parallel orchestrator, agent instructions, reconcile + logical-AND aggregation
- M6: colored requirement/check reporting + CI exit code

New deps: comrak, rmcp, axum, schemars, libc, tempfile.

Implements milestones M0–M6 of the MultiTool Checks project (MULTI-1331, MULTI-1332, MULTI-1333..1353).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
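
To illustrate the M3 idea of a boxed `Sandbox` trait with RAII teardown, here is a minimal sketch. It is not the PR's implementation: it swaps the macOS APFS `clonefile` copy-on-write clone for a plain directory under the system temp dir, omits the `cfg` gating, and the `TempDirSandbox` type, the `root` method, and the `new_sandbox` constructor are all hypothetical names. The point it shows is only that teardown lives in `Drop`, so the sandbox is removed when its owner goes out of scope.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical trait: a sandbox exposes the directory an agent works in.
trait Sandbox {
    fn root(&self) -> &Path;
}

/// Stand-in sandbox backed by a plain temp directory.
/// (Assumption: the real macOS impl clones the tree with APFS `clonefile`.)
struct TempDirSandbox {
    root: PathBuf,
}

impl TempDirSandbox {
    fn create(name: &str) -> io::Result<Self> {
        // Include the process id so concurrent runs do not collide.
        let root = std::env::temp_dir().join(format!("{}-{}", name, std::process::id()));
        fs::create_dir_all(&root)?;
        Ok(Self { root })
    }
}

impl Sandbox for TempDirSandbox {
    fn root(&self) -> &Path {
        &self.root
    }
}

impl Drop for TempDirSandbox {
    /// RAII teardown: remove the directory when the sandbox is dropped.
    fn drop(&mut self) {
        let _ = fs::remove_dir_all(&self.root);
    }
}

/// Callers only see the boxed trait object, so the backing
/// implementation can vary by platform.
fn new_sandbox(name: &str) -> io::Result<Box<dyn Sandbox>> {
    Ok(Box::new(TempDirSandbox::create(name)?))
}
```

Dropping the `Box<dyn Sandbox>` runs the concrete type's `Drop`, so callers do not need an explicit cleanup call, including on early returns.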

- Add an end-to-end pipeline test driving discovery → execution → reporting → exit code over a fixture tree, with the fake executor and no-op sandbox injected (no claude/network/APFS): covers a satisfied requirement, a failed requirement, a multi-check AND requirement, and an anonymous check; asserts aggregated verdicts and exit codes.
- Add a parser test asserting duplicate (non-unique) requirement/check titles are preserved, never deduped.
- Add guides/checks.md: `multi check` usage (working dir, CI exit codes, MVP constraints, trust model) and the full CHECKS.md authoring format with examples; link it from the README.

Implements the Tests & docs milestone (MULTI-1354, MULTI-1355, MULTI-1356).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu

The doc example imported `canary::stats::Categorical`. That is a stale crate name, and `stats` is a crate-internal module, so the trait can't be reached from an external doctest at all. It was the only `rust` doctest in the crate and failed the build. Mark the example `ignore` (it documents an internal trait) and add a unit test that actually exercises it, so `cargo test` is fully green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu

The toolchain-refresh commit on this branch surfaced ~60 warnings across the pre-existing (largely abandoned) modules, which `cargo make ci-flow` fails on via clippy's `-D warnings`. Apply clippy's machine-applicable fixes, update the deprecated `aws_config::BehaviorVersion` to v2025_08_07, restore the feature-gated `ProxySubcommand` re-export, and allow the remaining abandoned-code lint categories (dead_code, private_interfaces, wrong_self_convention, large_enum_variant) crate-wide. The `checks` feature added in this PR is clippy-clean on its own.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu

Dogfooding `multi check` on this repository surfaced two operational issues with real `claude` agents:

- The result server ran in stateless Streamable HTTP mode, which stalls the Claude Code MCP client's multi-step session handshake (rmcp only serves POST when stateless). Run in stateful mode instead.
- Full `claude` agents are heavy; 4+ concurrent agents starve each other past the timeout. Lower the default concurrency to 2 and the per-agent timeout to 120s, and retry a check (up to max_attempts) whose agent hangs or stops without reporting; a check only errors after all attempts are exhausted.

Also add the CHECKS.md self-validation suite (the dogfooding target), including a requirement asserting the server runs in stateful mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
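
The retry rule described in this commit could look something like the sketch below. It is an illustration under assumptions, not the code in this PR: the `Attempt` and `Outcome` types and `run_with_retry` are hypothetical names, the attempt is modelled as a synchronous closure rather than an async agent run, and only the `max_attempts` name comes from the commit message. A reported verdict, pass or fail, ends the loop at once; a hang or a silent exit consumes one attempt; the check errors only when every attempt is used up.

```rust
/// What a single agent attempt produced (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Attempt {
    /// The agent reported a verdict; `true` means satisfied.
    Reported(bool),
    /// The agent hung past the per-agent timeout.
    TimedOut,
    /// The agent stopped without calling the report tool.
    NoReport,
}

/// Final outcome of a check after retries (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Verdict(bool),
    Errored { attempts: u32 },
}

/// Run `attempt` up to `max_attempts` times. The closure receives the
/// 1-based attempt number. Any reported verdict is final; only a missing
/// report triggers a retry.
fn run_with_retry<F>(max_attempts: u32, mut attempt: F) -> Outcome
where
    F: FnMut(u32) -> Attempt,
{
    for n in 1..=max_attempts {
        if let Attempt::Reported(satisfied) = attempt(n) {
            return Outcome::Verdict(satisfied);
        }
        // TimedOut or NoReport: fall through and try again.
    }
    Outcome::Errored { attempts: max_attempts }
}
```

Note that a reported failure is not retried in this sketch: a failing verdict is a result, while a missing report is an infrastructure problem.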

Empty commit to start a fresh workflow run so on-push.yml resolves wack/gh-actions validate.yml@trunk to the version that installs cargo-nextest and cargo-llvm-cov (re-runs pin the original reusable workflow resolution).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu

…0/10

Dogfooding `multi check` on this repo (the CHECKS.md self-validation suite) surfaced two more issues that prevented a clean pass:

- Agents that reported still hung on connection cleanup, eating the whole timeout. The agent's job is done the moment it calls report-check-result, so deliver the verdict via a per-check Notify + shared store and race it against the process in execution::run_one: on a report we drop the run future, which kills the agent (kill_on_drop). This eliminates the post-report hang and cut a 2-check smoke from 49s to 7s. Replaces the central mpsc/collect plumbing.
- The `haiku` family reliably spins in a runaway exploration loop on the multi-file *reasoning* checks (observed: 435% CPU for 5 min on one check, never reporting). Switch the hardcoded model to `sonnet`, which reasons efficiently and reports in well under a minute.

With these (+ concurrency 2, 240s timeout, retry), `multi check` passes all 8 requirements / 10 checks of its own self-validation suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
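
The "race the report against the process" idea can be illustrated with a std-only analogue. This sketch deliberately swaps techniques: the commit describes a tokio `Notify` plus a shared store, with the agent killed by dropping its future (`kill_on_drop`), whereas the code below uses an `std::sync::mpsc` channel and sleeping threads as stand-ins for the report and the agent process. All names here (`Event`, `RunResult`, `race_report_against_agent`) are hypothetical. What it shows is the decision rule: whichever event arrives first wins, and a report ends the run without waiting for the agent to exit.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Events that can end a check run (hypothetical type).
enum Event {
    /// The agent called the report tool; `true` means satisfied.
    Reported(bool),
    /// The agent process exited on its own.
    AgentExited,
}

#[derive(Debug, PartialEq, Eq)]
enum RunResult {
    Verdict(bool),
    ExitedWithoutReport,
    TimedOut,
}

/// Simulate one run: the report and the agent exit each fire after an
/// optional delay, and the first event to arrive decides the result.
fn race_report_against_agent(
    report_after: Option<(Duration, bool)>,
    exit_after: Option<Duration>,
    timeout: Duration,
) -> RunResult {
    let (tx, rx) = mpsc::channel::<Event>();

    if let Some((delay, satisfied)) = report_after {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(delay);
            let _ = tx.send(Event::Reported(satisfied));
        });
    }
    if let Some(delay) = exit_after {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(delay);
            let _ = tx.send(Event::AgentExited);
        });
    }

    // `tx` stays alive until the function returns, so an error here can
    // only mean the timeout elapsed, never a disconnected channel.
    match rx.recv_timeout(timeout) {
        // Report wins: return now instead of waiting for the agent.
        // (In the design described above, dropping the run future is
        // what kills the agent; here the thread is simply detached.)
        Ok(Event::Reported(satisfied)) => RunResult::Verdict(satisfied),
        Ok(Event::AgentExited) => RunResult::ExitedWithoutReport,
        Err(_) => RunResult::TimedOut,
    }
}
```

In this model, an agent that reports after 10 ms but would linger for another 500 ms yields a verdict after roughly 10 ms, which mirrors the post-report hang the commit removes.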

RobbieMcKinstry force-pushed the robbie/mt-check branch from 0c689db to 56bc0f3 on June 24, 2026 19:42

RobbieMcKinstry and others added 9 commits June 24, 2026 15:44

refresh toolchain and makefile.

6ef94e8

Add storybloq.

df8736d

RobbieMcKinstry force-pushed the robbie/mt-check branch from 56bc0f3 to 5762a45 on June 24, 2026 19:45

RobbieMcKinstry added this pull request to the merge queue Jun 24, 2026

Merged via the queue into trunk with commit 7b1ca74 Jun 24, 2026
8 checks passed

RobbieMcKinstry deleted the robbie/mt-check branch June 24, 2026 19:58

RobbieMcKinstry merged 9 commits into trunk from robbie/mt-check
