feat(eval): grove eval harness — CLI, eval-loop preset, hooks.eval, Nexus-aware (#211)#474
Open
windoliver wants to merge 3 commits into
Complete the eval harness on top of the shipped GROVE_SCORE-line
evalOperation/grove_eval MCP tool:
- contract: add hooks.eval to HooksConfig + HooksSchema (string and
{ cmd, timeout } forms); evalOperation resolves the command/timeout
from contract.hooks.eval when no input override is given
- cli: new `grove eval <cid> | --frontier <metric> | --latest`
with --eval-command, --submit (scored reproduction), --timeout, --json
- preset: eval-loop (flat, 8 competitors, score:maximize, metric_improves
gate, eval hook placeholder, long-running budget)
- builder: renderHooks emits the eval hook; threaded through presets
- parity: add eval row + CI gate entries (CLI + MCP + export)
- tests: core evalOperation coverage (parsing/timeout/validation/contract
resolution), CLI parse+run, preset round-trip
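The `hooks.eval` resolution in the contract bullet above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the type names, the `evalCommand`/`timeout` input fields, the default timeout, and the timeout unit are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real HooksConfig / HooksSchema in the PR may differ.
type EvalHook = string | { cmd: string; timeout?: number };

interface EvalInput {
  evalCommand?: string; // assumed name for the per-call override
  timeout?: number;     // assumed name; unit is whatever the contract uses
}

interface ResolvedEval {
  cmd: string;
  timeout: number;
}

// Assumed fallback; the PR text does not state the real default.
const DEFAULT_TIMEOUT = 60_000;

function resolveEval(input: EvalInput, hook?: EvalHook): ResolvedEval {
  // Normalize the string form to the object form.
  const fromHook = typeof hook === "string" ? { cmd: hook } : hook;

  // An explicit input override wins; otherwise use contract.hooks.eval.
  const cmd = input.evalCommand ?? fromHook?.cmd;
  if (!cmd) {
    throw new Error("no eval command: pass an override or set hooks.eval");
  }
  const timeout = input.timeout ?? fromHook?.timeout ?? DEFAULT_TIMEOUT;
  return { cmd, timeout };
}
```

Because the override is checked first, callers that already pass a command keep their current behavior, which is the backward-compatibility property the bullet describes.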
Deliberately keeps the shipped GROVE_SCORE protocol (not the issue's
JSON-stdout sketch) to avoid breaking the merged, tested MCP contract.
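For readers unfamiliar with the line protocol being kept, here is a rough sketch of how `GROVE_SCORE <metric>=<value>` lines might be pulled out of an eval command's stdout. The function name, the exact regular expression, and the choices to skip non-numeric values and let later lines overwrite earlier ones are assumptions, not the shipped parser.

```typescript
// Hypothetical parser for the GROVE_SCORE line protocol.
function parseScores(stdout: string): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const raw of stdout.split(/\r?\n/)) {
    // Match lines of the form: GROVE_SCORE <metric>=<value>
    const m = /^GROVE_SCORE\s+([^\s=]+)=(\S+)$/.exec(raw.trim());
    if (!m) continue; // ordinary log output is ignored

    const metric = m[1];
    const rawValue = m[2];
    if (metric === undefined || rawValue === undefined) continue;

    const value = Number(rawValue);
    if (!Number.isFinite(value)) continue; // assumed: skip non-numeric values

    scores[metric] = value; // assumed: the last line for a metric wins
  }
  return scores;
}
```

A line-oriented format like this lets the eval command print arbitrary logs alongside its scores, since anything that does not match the prefix is passed over.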
grove eval opened only local SQLite, so --latest/--frontier/--submit were
blind to Nexus-managed session contributions (which live in the Nexus VFS).
- new src/cli/utils/nexus-stores.ts: resolveNexusParams (URL+key+namespace
from env / nexus.yaml / .state.json / namespace file) + openNexusStores
(NexusContributionStore/ClaimStore/Cas, mirroring serve.ts). When no
GROVE_SESSION_ID is set, auto-scopes to the latest session so standalone
`grove eval --latest` finds live session data.
- handleEval prefers Nexus stores when available, falls back to local
SQLite; contract/goal resolution stays on the local store (GROVE.md read
from disk).
- unit tests for resolveNexusParams (env-driven decision).
Verified live: review-loop session d970ac2a → `grove eval --latest`
resolves the real Nexus work CID, `grove eval <cid> --submit` creates a
scored reproduction persisted in the Nexus session store.
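The resolve-then-fall-back flow described in this commit could look something like the sketch below. Everything here is illustrative: the types, the per-field merge across sources, the rule that all three parameters must be present before Nexus is used, and the `createdAt` field used to find the latest session are assumptions, not the contents of `nexus-stores.ts`.

```typescript
// Hypothetical shapes for illustration only.
interface NexusParams {
  url: string;
  apiKey: string;
  namespace: string;
}

interface Session {
  id: string;
  createdAt: number;
}

// Each source (env, nexus.yaml, .state.json, namespace file) yields what it knows.
type Source = () => Partial<NexusParams>;

// Merge sources in priority order; earlier sources win per field (assumed).
function resolveParams(sources: Source[]): NexusParams | undefined {
  const merged: Partial<NexusParams> = {};
  for (const read of sources) {
    const part = read();
    merged.url ??= part.url;
    merged.apiKey ??= part.apiKey;
    merged.namespace ??= part.namespace;
  }
  const { url, apiKey, namespace } = merged;
  return url && apiKey && namespace ? { url, apiKey, namespace } : undefined;
}

// Use the explicit session (e.g. GROVE_SESSION_ID) if set, else the newest one.
function pickSessionId(explicit: string | undefined, sessions: Session[]): string | undefined {
  if (explicit) return explicit;
  return [...sessions].sort((a, b) => b.createdAt - a.createdAt)[0]?.id;
}

// Prefer Nexus when parameters resolved; otherwise fall back to local SQLite.
function chooseBackend(params: NexusParams | undefined): "nexus" | "sqlite" {
  return params ? "nexus" : "sqlite";
}
```

Keeping the decision in a small pure function is one way to make it unit-testable from injected inputs, in the spirit of the "env-driven decision" tests mentioned above.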
Closes #211.
Completes the eval harness on top of the already-shipped `evalOperation` + `grove_eval` MCP tool, then makes the CLI usable against live Nexus sessions.

What's added

Contract
- `hooks.eval` in the GROVE.md hooks schema (string and `{ cmd, timeout }` forms). Added to both `src/core/hooks.ts` and the separate `HooksSchema` in `src/core/contract.ts` (the strict contract parser has its own copy).
- `evalOperation` resolves the command/timeout from `contract.hooks.eval` when no input override is given (backward compatible).

CLI: `grove eval`
- `--submit` maps parsed scores → `{value, direction}` using contract metric directions and calls `reproduceOperation`.
- Resolves Nexus connection parameters (env / `nexus.yaml` / `.state.json` / namespace) so `--latest` / `--frontier` / `--submit` see live session contributions; falls back to local SQLite. Auto-scopes to the latest session when `GROVE_SESSION_ID` is unset. New reusable helper `src/cli/utils/nexus-stores.ts`.

Preset: `eval-loop`
- Flat, 8 `competitor` agents, `score:maximize` + `metric_improves` gate, `hooks.eval` placeholder, long-running budget.
- `renderHooks` now emits the eval hook.

Surfaces
- `docs/parity-matrix.md` row + `parity-matrix.test.ts` CI gate (CLI + MCP + export).

Scope decision

Kept the shipped `GROVE_SCORE <metric>=<value>` line protocol (+ `GROVE_TARGET_CID`). The issue's prose sketched a JSON-stdout / `GROVE_EVAL_DIR` design, but the line protocol is already merged + tested + on the public MCP surface; rewriting it would break that contract for no functional gain. JSON-stdout and artifact-checkout are recorded as deliberate non-goals in the spec.

Tests

`bun test src/` → 7315 pass / 0 fail / 5 skip; tsc + biome clean. New coverage: core `evalOperation` (parsing/timeout/validation/contract-resolution), CLI parse+run, `eval-loop` preset round-trip, `resolveNexusParams`.

Live E2E (grove TUI + tmux, real codex↔claude review-loop)
- Session `d970ac2a` → archived, 3 contribs, 2m5s, "Session signaled done". Coder (claude) real fix + commit `b95a951`; reviewer (codex) workspace had `.codex/CODEX.md`.
- `eval-loop` preset showed live in the TUI picker.
- `grove eval` on the real coder workspace: fixed tree → `correctness=1.0` (via `--eval-command` and via `hooks.eval`), buggy root → `0.5`.
- `grove eval --latest` resolved the real Nexus work CID; `grove eval <cid> --submit` created a scored reproduction persisted in the Nexus session store.

Follow-up

The remaining contribution CLI commands (`reproduce` / `discuss` / `review` / `goal` / `log` / `frontier` / …) still read local SQLite only. This is tracked in #473, which can adopt the same `nexus-stores.ts` resolver.

Design spec: `docs/superpowers/specs/2026-05-30-eval-harness-211-design.md`.
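As an illustration of the `--submit` step described in the CLI section, mapping parsed scores to `{value, direction}` pairs might look like the sketch below. The type names, the `"minimize"` direction, and the decision to reject metrics the contract does not declare are assumptions for the example; the actual call into `reproduceOperation` is not shown.

```typescript
// Hypothetical types; "maximize" appears in the PR text, "minimize" is assumed.
type Direction = "maximize" | "minimize";

interface Score {
  value: number;
  direction: Direction;
}

// Attach each metric's contract-declared direction to its parsed value.
function toSubmissionScores(
  parsed: Record<string, number>,
  directions: Record<string, Direction>,
): Record<string, Score> {
  const out: Record<string, Score> = {};
  for (const [metric, value] of Object.entries(parsed)) {
    const direction = directions[metric];
    if (!direction) {
      // Assumed behavior: fail loudly rather than guess a direction.
      throw new Error(`metric "${metric}" is not declared in the contract`);
    }
    out[metric] = { value, direction };
  }
  return out;
}
```

Taking the direction from the contract rather than from the eval output keeps the eval command's job limited to printing numbers.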