feat(check): MultiTool Checks: multi check subcommand (M0–M6 + tests & docs) #140
Merged
0c689db to 56bc0f3
Implement the `multi check` subcommand: discover declared non-functional
("ility") requirements in CHECKS.md files and validate them with sandboxed
Claude Code agents, reporting verdicts through an in-process MCP guardrail.
- M0: subcommand wiring + sync dispatch → tokio runtime → 4-phase pipeline
- M1: CHECKS.md discovery (ignore), comrak AST parse, requirement/check
extraction, anonymous-check inference, aggregated miette validation
- M2: boxed CheckExecutor trait + `claude -p` executor + injected Config
- M3: cfg-gated copy-on-write Sandbox trait + macOS APFS clonefile impl
- M4: in-process rmcp result server, N per-check endpoints, single-call
semantics, lifecycle/shutdown, per-check --mcp-config payloads
- M5: bounded-parallel orchestrator, agent instructions, reconcile +
logical-AND aggregation
- M6: colored requirement/check reporting + CI exit code
New deps: comrak, rmcp, axum, schemars, libc, tempfile.
Implements milestones M0–M6 of the MultiTool Checks project
(MULTI-1331, MULTI-1332, MULTI-1333..1353).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
- Add an end-to-end pipeline test driving discovery → execution → reporting → exit code over a fixture tree, with the fake executor and no-op sandbox injected (no claude/network/APFS): covers a satisfied requirement, a failed requirement, a multi-check AND requirement, and an anonymous check; asserts aggregated verdicts and exit codes.
- Add a parser test asserting duplicate (non-unique) requirement/check titles are preserved, never deduped.
- Add guides/checks.md: `multi check` usage (working dir, CI exit codes, MVP constraints, trust model) and the full CHECKS.md authoring format with examples; link it from the README.
Implements the Tests & docs milestone (MULTI-1354, MULTI-1355, MULTI-1356).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
The doc example imported `canary::stats::Categorical`: `canary` is a stale crate name, and `stats` is a crate-internal module, so the trait can't be reached from an external doctest at all. It was the only `rust` doctest in the crate and failed the build. Mark the example `ignore` (it documents an internal trait) and add a unit test that actually exercises it, so `cargo test` is fully green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
The toolchain-refresh commit on this branch surfaced ~60 warnings across the pre-existing (largely abandoned) modules, which `cargo make ci-flow` fails on via clippy's `-D warnings`. Apply clippy's machine-applicable fixes, update the deprecated `aws_config::BehaviorVersion` to v2025_08_07, restore the feature-gated `ProxySubcommand` re-export, and allow the remaining abandoned-code lint categories (dead_code, private_interfaces, wrong_self_convention, large_enum_variant) crate-wide. The `checks` feature added in this PR is clippy-clean on its own.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
Dogfooding `multi check` on this repository surfaced two operational issues with real `claude` agents:
- The result server ran in stateless Streamable HTTP mode, which stalls the Claude Code MCP client's multi-step session handshake (rmcp only serves POST when stateless). Run in stateful mode instead.
- Full `claude` agents are heavy; 4+ concurrent starve each other past the timeout. Lower the default concurrency to 2 and the per-agent timeout to 120s, and retry a check (up to max_attempts) whose agent hangs or stops without reporting; a check only errors after all attempts are exhausted.
Also add the CHECKS.md self-validation suite (the dogfooding target), including a requirement asserting the server runs in stateful mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
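The retry rule in the commit above (retry a check whose agent hangs or stops without reporting, error only after all attempts) could look roughly like the sketch below. This is not the code in `src/checks/`: `Attempt`, `Outcome`, and `run_with_retry` are hypothetical names, and the sketch assumes that any report, pass or fail, is final and never retried.

```rust
/// Result of a single agent run (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Attempt {
    /// The agent reported a pass/fail verdict.
    Reported(bool),
    /// The agent hung past the timeout or stopped without reporting.
    NoReport,
}

/// Final outcome of a check after retries (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Verdict(bool),
    Errored { attempts: u32 },
}

/// Calls `run_once` up to `max_attempts` times. The first report wins;
/// the check only errors once every attempt ended without a report.
pub fn run_with_retry(max_attempts: u32, mut run_once: impl FnMut(u32) -> Attempt) -> Outcome {
    for attempt in 1..=max_attempts {
        if let Attempt::Reported(passed) = run_once(attempt) {
            return Outcome::Verdict(passed);
        }
    }
    Outcome::Errored { attempts: max_attempts }
}
```

A reported failure is returned immediately: only silence is treated as retryable, which keeps a genuinely failing requirement from being re-run until it happens to pass.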
Empty commit to start a fresh workflow run so on-push.yml resolves wack/gh-actions validate.yml@trunk to the version that installs cargo-nextest and cargo-llvm-cov (re-runs pin the original reusable workflow resolution).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
…0/10
Dogfooding `multi check` on this repo (the CHECKS.md self-validation suite) surfaced two more issues that prevented a clean pass:
- Agents that reported still hung on connection cleanup, eating the whole timeout. The agent's job is done the moment it calls report-check-result, so deliver the verdict via a per-check Notify + shared store and race it against the process in execution::run_one: on a report we drop the run future, which kills the agent (kill_on_drop). This eliminates the post-report hang and cut a 2-check smoke from 49s to 7s. Replaces the central mpsc/collect plumbing.
- The `haiku` family reliably spins in a runaway exploration loop on the multi-file *reasoning* checks (observed: 435% CPU for 5 min on one check, never reporting). Switch the hardcoded model to `sonnet`, which reasons efficiently and reports in well under a minute.
With these (+ concurrency 2, 240s timeout, retry), `multi check` passes all 8 requirements / 10 checks of its own self-validation suite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu
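To illustrate the "report wins the race" idea from the commit above, here is a std-only analogue. It deliberately swaps the described tokio machinery (per-check Notify, shared store, dropping the run future with kill_on_drop) for a plain thread and channel, so it only shows the control flow: the caller returns on the first event and never waits for the agent's cleanup. `Event` and this `run_one` are hypothetical stand-ins, not the real `execution::run_one`, and unlike the real code the worker thread is detached rather than killed.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Events a check run can produce (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    /// The agent called the report tool with a pass/fail verdict.
    Reported(bool),
    /// The agent stopped on its own.
    Exited,
}

/// Runs `agent` on a worker thread and returns as soon as the first event
/// arrives. A report ends the wait immediately, even if the agent keeps
/// running afterwards; an exit without a report yields `None`.
pub fn run_one<F>(agent: F) -> Option<bool>
where
    F: FnOnce(&Sender<Event>) + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        agent(&tx);
        // Ignore send errors: the receiver is dropped once a verdict was taken.
        let _ = tx.send(Event::Exited);
    });
    match rx.recv() {
        Ok(Event::Reported(passed)) => Some(passed),
        // The agent exited (or panicked) without reporting.
        Ok(Event::Exited) | Err(_) => None,
    }
}
```

An agent that reports and then stalls on cleanup no longer costs the caller the full timeout, which is the effect the commit attributes to the Notify-based design.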
56bc0f3 to 5762a45
MultiTool Checks: the `multi check` subcommand
Implements the full MultiTool Checks project: `multi check` discovers declared non-functional ("ility") requirements in `CHECKS.md` files and validates them with sandboxed Claude Code agents, reporting verdicts through a trustworthy in-process MCP guardrail. It's a test suite for requirements that have no programmatic unit to test.
Closes every issue in milestones M0–M6 plus Tests & docs of the MultiTool Checks project.
What's here, by milestone
- M0: `multi check` subcommand; sync `dispatch()` → tokio runtime → 4-phase pipeline (mirrors `Run`).
- M1: `ignore`-walk for `CHECKS.md`; comrak AST parse; sentinel extraction (`# Requirement`/`# Req`, `## Check`); anonymous-check inference; aggregated `miette` validation (orphan / checkless).
- M2: `CheckExecutor` trait; concrete `claude -p` executor; hardcoded-but-injected `Config` (haiku).
- M3: `cfg`-gated boxed `Sandbox` trait; macOS APFS `clonefile` impl with RAII teardown; non-macOS stub.
- M4: … `--mcp-config` payloads.
- M5: `JoinSet` orchestrator (sandbox → endpoint → executor); agent instructions binding the report tool; reconcile (MCP report authoritative) + logical-AND aggregation.
- M6: `Terminal` honoring `--enable-colors`; failing checks + evidence in red; passing checks omitted; exit `0` iff all satisfied (empty = `0`), else `1`.
- Tests & docs: `guides/checks.md` authoring guide + README link.
Verification
- `cargo test`: 72/72 lib unit tests pass, including the real APFS clone test, a live MCP server bind/collect/shutdown, and the E2E pipeline (satisfied / failed / multi-check AND / anonymous-check, all with the fake executor; no claude, network, or APFS dependency).
- `cargo clippy --all-targets --workspace`: zero warnings in the new code.
- `cargo fmt` applied.
- CLI smoke-tested: empty tree → exit 0; orphan `CHECKS.md` → diagnostic + exit 1.
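For reviewers, the aggregation and exit-code rules described above (logical-AND across a requirement's checks; exit 0 iff all requirements are satisfied, with an empty tree counting as 0) amount to something like the following sketch. The `Verdict` and `Requirement` types and both function names are hypothetical, not the types in `src/checks/`.

```rust
/// Outcome of one check as reported by its agent (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Fail,
}

/// A requirement together with the verdicts of its checks (hypothetical type).
pub struct Requirement {
    pub title: String,
    pub checks: Vec<Verdict>,
}

/// Logical AND: a requirement is satisfied only if every check passed.
/// Assumption: checkless requirements were already rejected by validation,
/// so the vacuous-truth case of `all` is not reached in practice.
pub fn is_satisfied(req: &Requirement) -> bool {
    req.checks.iter().all(|v| *v == Verdict::Pass)
}

/// CI exit code: 0 iff all requirements are satisfied (an empty set is 0), else 1.
pub fn exit_code(reqs: &[Requirement]) -> i32 {
    if reqs.iter().all(is_satisfied) {
        0
    } else {
        1
    }
}
```

Because `all` on an empty iterator is true, the "empty = 0" case falls out of the same expression rather than needing a special branch.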
New deps: `comrak`, `rmcp`, `axum`, `schemars`, `libc`, `tempfile`.
Notes
- … `multitool` crate (`src/checks/`), since the executor / sandbox / MCP server are CLI-bound.
- … (single-call + delivery), config-payload, and real server-lifecycle levels; a full rmcp-client round-trip was skipped because its reqwest transport pulls reqwest 0.13 vs. the project's pinned 0.12.
- … `refresh toolchain and makefile`, `Add storybloq`) that predate this work.
- … `src/stats/categorical.rs` (references the old `canary` crate name) fails independently of this change.
🤖 Generated with Claude Code
https://claude.ai/code/session_01CRSZZ6ft6j4pc8uvXVnpMu