feat(eval): grove eval harness — CLI, eval-loop preset, hooks.eval, Nexus-aware (#211) by windoliver · Pull Request #474 · windoliver/grove

windoliver · 2026-05-30T19:59:46Z

Closes #211.

Completes the eval harness on top of the already-shipped evalOperation + grove_eval MCP tool, then makes the CLI usable against live Nexus sessions.

What's added

Contract

hooks.eval in the GROVE.md hooks schema (string and { cmd, timeout } forms). Added to both src/core/hooks.ts and the separate HooksSchema in src/core/contract.ts (the strict contract parser has its own copy).
evalOperation resolves the command/timeout from contract.hooks.eval when no input override is given (backward compatible).
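
The resolution order described above can be sketched as follows. This is a minimal illustration, not the code in src/core: the names `EvalHook`, `EvalInput`, `resolveEval`, and the 60-second default are assumptions, and the hook timeout is assumed to be in milliseconds to match the CLI's --timeout <ms>.

```typescript
// Hypothetical shapes; the real HooksSchema in src/core/contract.ts may differ.
type EvalHook = string | { cmd: string; timeout?: number };

interface EvalInput {
  command?: string; // explicit override, e.g. from --eval-command
  timeoutMs?: number; // explicit override, e.g. from --timeout
}

interface ResolvedEval {
  command: string;
  timeoutMs: number;
}

// Assumed default, not taken from the PR.
const DEFAULT_TIMEOUT_MS = 60_000;

function resolveEval(input: EvalInput, hook?: EvalHook): ResolvedEval {
  // Normalize the two hook forms: bare string or { cmd, timeout }.
  const hookCmd = typeof hook === "string" ? hook : hook?.cmd;
  const hookTimeout = typeof hook === "object" ? hook.timeout : undefined;

  // An input override always wins; the contract hook is only a fallback.
  const command = input.command ?? hookCmd;
  if (!command) {
    throw new Error("no eval command: pass an override or set hooks.eval");
  }
  return {
    command,
    timeoutMs: input.timeoutMs ?? hookTimeout ?? DEFAULT_TIMEOUT_MS,
  };
}
```

Because the override is checked first, callers that already pass a command behave as before, which is the backward-compatibility property noted above.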

CLI — grove eval

grove eval <cid>                      # cmd from contract hooks.eval
grove eval --frontier <metric>        # eval the frontier leader
grove eval --latest                   # eval the most recent contribution
grove eval <cid> --eval-command "..."  # override command
grove eval <cid> --submit             # eval, then submit a scored reproduction
grove eval <cid> --timeout <ms> --json

--submit maps parsed scores → {value, direction} using contract metric directions and calls reproduceOperation.
Nexus-aware: prefers Nexus-backed stores (resolved from nexus.yaml/.state.json/namespace) so --latest/--frontier/--submit see live session contributions; falls back to local SQLite. Auto-scopes to the latest session when GROVE_SESSION_ID is unset. New reusable helper src/cli/utils/nexus-stores.ts.
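
As a rough sketch of the --submit mapping step, assuming parsed scores arrive as a metric-to-number record and the contract exposes a direction per metric. The names `Direction`, `Score`, and `toScoredMetrics` are hypothetical, and skipping undeclared metrics is an assumption rather than confirmed behavior.

```typescript
type Direction = "minimize" | "maximize";

interface Score {
  value: number;
  direction: Direction;
}

function toScoredMetrics(
  parsed: Record<string, number>,
  directions: Record<string, Direction>,
): Record<string, Score> {
  const out: Record<string, Score> = {};
  for (const [metric, value] of Object.entries(parsed)) {
    const direction = directions[metric];
    // Assumption: metrics the contract does not declare are skipped.
    if (!direction) continue;
    out[metric] = { value, direction };
  }
  return out;
}
```

The resulting record is the kind of payload that would then be handed to reproduceOperation; its exact input shape is not shown in this PR description.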

Preset — eval-loop

Hive-style competitive benchmark: flat topology, 8 competitor agents, score:maximize + metric_improves gate, hooks.eval placeholder, long-running budget. renderHooks now emits the eval hook.

Surfaces

docs/parity-matrix.md row + parity-matrix.test.ts CI gate (CLI + MCP + export).

Scope decision

Kept the shipped GROVE_SCORE <metric>=<value> line protocol (+ GROVE_TARGET_CID). The issue's prose sketched a JSON-stdout / GROVE_EVAL_DIR design, but the line protocol is already merged + tested + on the public MCP surface; rewriting it would break that contract for no functional gain. JSON-stdout and artifact-checkout are recorded as deliberate non-goals in the spec.
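
For readers unfamiliar with the line protocol, a minimal parser sketch is given here. It assumes one `GROVE_SCORE <metric>=<value>` entry per stdout line with a numeric value; the regular expression, the `parseScores` name, and the last-occurrence-wins rule are illustrative assumptions, not the shipped evalOperation parser.

```typescript
// Matches lines such as "GROVE_SCORE correctness=1.0"; all other output is ignored.
const SCORE_LINE = /^GROVE_SCORE\s+([A-Za-z0-9_.-]+)=(\S+)$/;

function parseScores(stdout: string): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const line of stdout.split(/\r?\n/)) {
    const match = SCORE_LINE.exec(line.trim());
    if (!match) continue;
    const metric = match[1];
    const raw = match[2];
    if (metric === undefined || raw === undefined) continue;
    const value = Number(raw);
    // Non-numeric values are dropped; a repeated metric keeps its last value.
    if (Number.isFinite(value)) scores[metric] = value;
  }
  return scores;
}
```

An eval command can therefore print arbitrary build or test output, as long as its scores appear on their own lines.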

Tests

bun test src/ → 7315 pass / 0 fail / 5 skip; tsc + biome clean. New coverage: core evalOperation (parsing/timeout/validation/contract-resolution), CLI parse+run, eval-loop preset round-trip, resolveNexusParams.

Live E2E (grove TUI + tmux, real codex↔claude review-loop)

Session d970ac2a → archived, 3 contribs, 2m5s, "Session signaled done". Coder (claude) real fix + commit b95a951; reviewer (codex) workspace had .codex/CODEX.md. eval-loop preset showed live in the TUI picker.
grove eval on the real coder workspace: fixed tree → correctness=1.0 (via --eval-command and via hooks.eval), buggy root → 0.5.
Nexus-aware path verified: grove eval --latest resolved the real Nexus work CID; grove eval <cid> --submit created a scored reproduction persisted in the Nexus session store.

Follow-up

The remaining contribution CLI commands (reproduce/discuss/review/goal/log/frontier/…) still read local SQLite only — tracked in #473, which can adopt the same nexus-stores.ts resolver.

Design spec: docs/superpowers/specs/2026-05-30-eval-harness-211-design.md.

Complete the eval harness on top of the shipped GROVE_SCORE-line evalOperation/grove_eval MCP tool:

- contract: add hooks.eval to HooksConfig + HooksSchema (string and { cmd, timeout } forms); evalOperation resolves the command/timeout from contract.hooks.eval when no input override is given
- cli: new `grove eval <cid> | --frontier <metric> | --latest` with --eval-command, --submit (scored reproduction), --timeout, --json
- preset: eval-loop (flat, 8 competitors, score:maximize, metric_improves gate, eval hook placeholder, long-running budget)
- builder: renderHooks emits the eval hook; threaded through presets
- parity: add eval row + CI gate entries (CLI + MCP + export)
- tests: core evalOperation coverage (parsing/timeout/validation/contract resolution), CLI parse+run, preset round-trip

Deliberately keeps the shipped GROVE_SCORE protocol (not the issue's JSON-stdout sketch) to avoid breaking the merged, tested MCP contract.

grove eval opened only local SQLite, so --latest/--frontier/--submit were blind to Nexus-managed session contributions (which live in the Nexus VFS).

- new src/cli/utils/nexus-stores.ts: resolveNexusParams (URL+key+namespace from env / nexus.yaml / .state.json / namespace file) + openNexusStores (NexusContributionStore/ClaimStore/Cas, mirroring serve.ts). When no GROVE_SESSION_ID is set, auto-scopes to the latest session so standalone `grove eval --latest` finds live session data.
- handleEval prefers Nexus stores when available, falls back to local SQLite; contract/goal resolution stays on the local store (GROVE.md read from disk).
- unit tests for resolveNexusParams (env-driven decision).

Verified live: review-loop session d970ac2a → `grove eval --latest` resolves the real Nexus work CID, `grove eval <cid> --submit` creates a scored reproduction persisted in the Nexus session store.
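
The two selection rules in this commit (Nexus first with a local fallback, and auto-scoping to the latest session) can be sketched roughly as below. All names here (`NexusParams`, `SessionInfo`, `openStores`, `pickSessionId`) are hypothetical stand-ins, the session timestamps are assumed to be ISO-8601 strings, and the real nexus-stores.ts helper may decide differently.

```typescript
// Hypothetical shapes for illustration only.
interface NexusParams {
  url: string;
  apiKey?: string;
  namespace?: string;
}

interface SessionInfo {
  id: string;
  createdAt: string; // assumed ISO-8601 timestamp
}

// Prefer Nexus-backed stores when params were resolved; otherwise use local SQLite.
function openStores<T>(
  params: NexusParams | undefined,
  openNexus: (p: NexusParams) => T,
  openLocal: () => T,
): T {
  return params ? openNexus(params) : openLocal();
}

// An explicit session ID (e.g. from GROVE_SESSION_ID) wins; otherwise pick the newest session.
function pickSessionId(
  envSessionId: string | undefined,
  sessions: SessionInfo[],
): string | undefined {
  if (envSessionId) return envSessionId;
  let latest: SessionInfo | undefined;
  for (const s of sessions) {
    if (!latest || Date.parse(s.createdAt) > Date.parse(latest.createdAt)) {
      latest = s;
    }
  }
  return latest?.id;
}
```

Passing the openers in as functions keeps the choice of backend separate from how each store is constructed, which is also what would let other commands reuse the same resolver.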

windoliver added 3 commits May 30, 2026 11:34

docs(eval): design spec for grove eval harness completion (#211)

3adc588

windoliver wants to merge 3 commits into main from feat/211-eval-harness
