
feat(eval): grove eval harness — CLI, eval-loop preset, hooks.eval, Nexus-aware (#211) #474

Open: windoliver wants to merge 3 commits into main from feat/211-eval-harness

@windoliver

Closes #211.

Completes the eval harness on top of the already-shipped evalOperation + grove_eval MCP tool, then makes the CLI usable against live Nexus sessions.

What's added

Contract

  • hooks.eval in the GROVE.md hooks schema (string and { cmd, timeout } forms). Added to both src/core/hooks.ts and the separate HooksSchema in src/core/contract.ts (the strict contract parser has its own copy).
  • evalOperation resolves the command/timeout from contract.hooks.eval when no input override is given (backward compatible).
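The resolution order described above (input override first, then contract.hooks.eval) can be sketched roughly as follows. This is a minimal illustration, not the code in src/core: the names `EvalHook`, `ResolvedEval`, `resolveEval`, and `DEFAULT_TIMEOUT_MS` are hypothetical, and it assumes the hook's `timeout` is in milliseconds, which this PR does not state.

```typescript
// Hypothetical shapes; the real HooksSchema in src/core/contract.ts may differ.
type EvalHook = string | { cmd: string; timeout?: number };

interface ResolvedEval {
  cmd: string;
  timeoutMs: number;
}

// Assumed fallback value; the actual default is not given in this PR.
const DEFAULT_TIMEOUT_MS = 60_000;

function resolveEval(
  input: { command?: string; timeoutMs?: number },
  hook: EvalHook | undefined,
): ResolvedEval {
  // Normalize the string form into the object form.
  const fromHook: { cmd: string; timeout?: number } | undefined =
    typeof hook === "string" ? { cmd: hook } : hook;

  // An explicit input override wins over the contract hook.
  const cmd = input.command ?? fromHook?.cmd;
  if (!cmd) {
    throw new Error("no eval command: pass an override or set hooks.eval");
  }
  const timeoutMs = input.timeoutMs ?? fromHook?.timeout ?? DEFAULT_TIMEOUT_MS;
  return { cmd, timeoutMs };
}
```

Because the hook is only consulted when the input leaves a field unset, callers that already pass a command keep their current behavior, which is the backward-compatibility property the bullet refers to.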

CLI — grove eval

grove eval <cid>                       # cmd from contract hooks.eval
grove eval --frontier <metric>         # eval the frontier leader
grove eval --latest                    # eval the most recent contribution
grove eval <cid> --eval-command "..."  # override command
grove eval <cid> --submit              # eval, then submit a scored reproduction
grove eval <cid> --timeout <ms> --json

  • --submit maps parsed scores → {value, direction} using contract metric directions and calls reproduceOperation.
  • Nexus-aware: prefers Nexus-backed stores (resolved from nexus.yaml/.state.json/namespace) so --latest/--frontier/--submit see live session contributions; falls back to local SQLite. Auto-scopes to the latest session when GROVE_SESSION_ID is unset. New reusable helper src/cli/utils/nexus-stores.ts.
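As a rough illustration of the --submit mapping described above, parsed scores can be paired with the direction each metric has in the contract. The types and the `toScores` helper are hypothetical, `"minimize"` is assumed as the counterpart of the `maximize` direction mentioned in this PR, and skipping metrics the contract does not declare is an assumption rather than confirmed behavior.

```typescript
// Hypothetical types for illustration only.
type Direction = "maximize" | "minimize";

interface Score {
  value: number;
  direction: Direction;
}

function toScores(
  parsed: Record<string, number>,
  directions: Record<string, Direction>,
): Record<string, Score> {
  const out: Record<string, Score> = {};
  for (const [metric, value] of Object.entries(parsed)) {
    const direction = directions[metric];
    // Assumption: metrics without a contract direction are skipped.
    if (direction === undefined) continue;
    out[metric] = { value, direction };
  }
  return out;
}
```

The resulting record is the kind of payload that could then be handed to reproduceOperation; its exact input shape is not shown in this PR.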

Preset — eval-loop

  • Hive-style competitive benchmark: flat topology, 8 competitor agents, score:maximize + metric_improves gate, hooks.eval placeholder, long-running budget. renderHooks now emits the eval hook.

Surfaces

  • docs/parity-matrix.md row + parity-matrix.test.ts CI gate (CLI + MCP + export).

Scope decision

Kept the shipped GROVE_SCORE <metric>=<value> line protocol (+ GROVE_TARGET_CID). The issue's prose sketched a JSON-stdout / GROVE_EVAL_DIR design, but the line protocol is already merged + tested + on the public MCP surface; rewriting it would break that contract for no functional gain. JSON-stdout and artifact-checkout are recorded as deliberate non-goals in the spec.
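For readers unfamiliar with the line protocol, a minimal parser sketch is shown here. It assumes one `GROVE_SCORE <metric>=<value>` line per score on stdout, ignores every other line, and lets a later line for the same metric overwrite an earlier one. The function name `parseScores` and those edge-case choices are assumptions and may not match the shipped evalOperation parser.

```typescript
const PREFIX = "GROVE_SCORE ";

function parseScores(stdout: string): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const raw of stdout.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line.startsWith(PREFIX)) continue; // not a score line

    const body = line.slice(PREFIX.length).trim();
    const eq = body.indexOf("=");
    if (eq <= 0) continue; // missing metric name or "="

    const metric = body.slice(0, eq);
    const rawValue = body.slice(eq + 1).trim();
    if (rawValue === "") continue; // Number("") would be 0, so reject it

    const value = Number(rawValue);
    if (!Number.isFinite(value)) continue; // assumption: non-numeric values are dropped
    scores[metric] = value;
  }
  return scores;
}
```

An eval command only needs to print lines such as `GROVE_SCORE correctness=1.0` among its normal output, which is part of why the protocol is cheap to keep.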

Tests

bun test src/: 7315 pass / 0 fail / 5 skip; tsc + biome clean. New coverage: core evalOperation (parsing/timeout/validation/contract-resolution), CLI parse+run, eval-loop preset round-trip, resolveNexusParams.

Live E2E (grove TUI + tmux, real codex↔claude review-loop)

  • Session d970ac2a: archived, 3 contribs, 2m5s, "Session signaled done". Coder (claude) produced a real fix and commit b95a951; reviewer (codex) workspace had .codex/CODEX.md. The eval-loop preset showed up live in the TUI picker.
  • grove eval on the real coder workspace: fixed tree → correctness=1.0 (via --eval-command and via hooks.eval), buggy root → 0.5.
  • Nexus-aware path verified: grove eval --latest resolved the real Nexus work CID; grove eval <cid> --submit created a scored reproduction persisted in the Nexus session store.

Follow-up

The remaining contribution CLI commands (reproduce/discuss/review/goal/log/frontier/…) still read local SQLite only — tracked in #473, which can adopt the same nexus-stores.ts resolver.
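A rough sketch of the pattern those commands could reuse: choose the Nexus backend when connection parameters resolve, otherwise fall back to SQLite, and scope to the newest session when no session ID is given. All names here (`NexusParams`, `SessionInfo`, `chooseBackend`, `pickSessionId`) and the `createdAt` field are hypothetical and do not reflect the actual exports of nexus-stores.ts.

```typescript
// Hypothetical shapes; the real resolver's types are not shown in this PR.
interface NexusParams {
  url: string;
  apiKey: string;
  namespace: string;
}

interface SessionInfo {
  id: string;
  createdAt: number; // assumed: epoch milliseconds
}

// Prefer Nexus when parameters could be resolved, else local SQLite.
function chooseBackend(params: NexusParams | undefined): "nexus" | "sqlite" {
  return params ? "nexus" : "sqlite";
}

// An explicit session ID (e.g. from GROVE_SESSION_ID) wins;
// otherwise pick the most recently created session, if any.
function pickSessionId(
  envSessionId: string | undefined,
  sessions: SessionInfo[],
): string | undefined {
  if (envSessionId) return envSessionId;
  let latest: SessionInfo | undefined;
  for (const s of sessions) {
    if (!latest || s.createdAt > latest.createdAt) latest = s;
  }
  return latest?.id;
}
```

Keeping both decisions as small pure functions would make them easy to unit-test without a live Nexus instance.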

Design spec: docs/superpowers/specs/2026-05-30-eval-harness-211-design.md.

Complete the eval harness on top of the shipped GROVE_SCORE-line
evalOperation/grove_eval MCP tool:

- contract: add hooks.eval to HooksConfig + HooksSchema (string and
  { cmd, timeout } forms); evalOperation resolves the command/timeout
  from contract.hooks.eval when no input override is given
- cli: new `grove eval <cid> | --frontier <metric> | --latest`
  with --eval-command, --submit (scored reproduction), --timeout, --json
- preset: eval-loop (flat, 8 competitors, score:maximize, metric_improves
  gate, eval hook placeholder, long-running budget)
- builder: renderHooks emits the eval hook; threaded through presets
- parity: add eval row + CI gate entries (CLI + MCP + export)
- tests: core evalOperation coverage (parsing/timeout/validation/contract
  resolution), CLI parse+run, preset round-trip

Deliberately keeps the shipped GROVE_SCORE protocol (not the issue's
JSON-stdout sketch) to avoid breaking the merged, tested MCP contract.

grove eval opened only local SQLite, so --latest/--frontier/--submit were
blind to Nexus-managed session contributions (which live in the Nexus VFS).

- new src/cli/utils/nexus-stores.ts: resolveNexusParams (URL+key+namespace
  from env / nexus.yaml / .state.json / namespace file) + openNexusStores
  (NexusContributionStore/ClaimStore/Cas, mirroring serve.ts). When no
  GROVE_SESSION_ID is set, auto-scopes to the latest session so standalone
  `grove eval --latest` finds live session data.
- handleEval prefers Nexus stores when available, falls back to local SQLite;
  contract/goal resolution stays on the local store (GROVE.md read from disk).
- unit tests for resolveNexusParams (env-driven decision).

Verified live: review-loop session d970ac2a → `grove eval --latest` resolves
the real Nexus work CID, `grove eval <cid> --submit` creates a scored
reproduction persisted in the Nexus session store.

Development

Successfully merging this pull request may close these issues.

Eval harness: grove eval command + hooks.eval contract support