Skip to content

feat: add external eval result storage#204

Merged
steve-calvert-glean merged 3 commits into
mainfrom
codex/external-result-storage
Jun 15, 2026
Merged

feat: add external eval result storage#204
steve-calvert-glean merged 3 commits into
mainfrom
codex/external-result-storage

Conversation

@steve-calvert-glean

Copy link
Copy Markdown
Contributor

Summary

Adds external result storage for MCP eval runs, with a first-party GCS-backed implementation and a local file implementation behind the same EvalResultStore interface.

This makes eval outputs durable and reusable across local runs, CI, reporter history, baseline regression checks, tool override experiments, and server comparisons.

What Changed

  • Added EvalResultStore, FileEvalResultStore, GCSEvalResultStore, stored artifact envelopes, and a result-store factory.
  • Extended runEvalDataset so saveResultsTo and baselineResultsFrom support store-backed refs while preserving existing file-path behavior.
  • Added helpers to load stored eval runs into compareEvalRuns() and persist eval-run comparison artifacts.
  • Added optional persistence for runServerComparison() output.
  • Extended the MCP reporter to save MCPEvalRunData externally and load stored reporter history for trend charts.
  • Added docs and snippets for GCS auth, reporter storage, stored baselines, variant comparisons, and server comparison persistence.

Developer Impact

  • Existing local baseline/report workflows remain unchanged.
  • GCS users configure Application Default Credentials via GOOGLE_APPLICATION_CREDENTIALS; inline credentials are intentionally not supported.
  • @google-cloud/storage is optional and dynamically loaded, so non-GCS users do not need cloud configuration at runtime.

Validation

  • npm run format:check
  • npm run docs:check
  • npm run typecheck
  • npm run lint
  • npm test
  • npm run build

@steve-calvert-glean steve-calvert-glean added the enhancement New feature or request label May 22, 2026
@steve-calvert-glean steve-calvert-glean changed the title [codex] add external eval result storage feat: add external eval result storage May 23, 2026
@steve-calvert-glean steve-calvert-glean force-pushed the codex/external-result-storage branch from 85192ed to 3db10dc Compare May 24, 2026 23:39
@steve-calvert-glean

Copy link
Copy Markdown
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Three blockers surfaced during merge-readiness review:

1. Inconsistent redactStoredResponses defaults between the eval runner
   and the reporter. The runner store path used
   `redactStoredResponses ?? omitResponsesFromBaseline` (effective
   default true via fallback), while MCPReporter passed the option
   through unset, producing falsy-undefined behavior (effective default
   false). Users with both write paths configured received a mix of
   redacted and non-redacted artifacts depending on which code path
   wrote them — silent data inconsistency.

   Both sites now default to true with explicit `?? true` and the
   intent is documented inline. To opt into stored responses, pass
   `redactStoredResponses: false` on the matching site.

2. loadStoredEvalRunnerResult returned `Promise<EvalRunnerResult>` but
   the documented usage accessed `loaded.metadata?.toolOverrideVariantId`
   — a field that lives on the storage envelope, not on the runtime
   result. The optional-chain hid the type error at compile time and
   the test relied on `metadata` happening to be present on the runner
   result, which is not where the production save path puts it.

   Changed the return type to
   `Promise<StoredEvalArtifact<EvalRunnerResult>>`. Callers now access
   the runtime data via `loaded.data` and storage metadata (run id,
   timestamp, package version, tool-override variant id, git hash, etc.)
   via `loaded.metadata`. Snippet, docs, and test updated to match.

3. GCSEvalResultStore had only two tests (saveArtifact and
   loadLatestArtifact). The immutable load path (loadArtifact by id)
   and the listArtifacts pagination/limit path were untested. Both are
   now covered with nock-mocked HTTP fixtures, including assertions
   that listArtifacts excludes latest.json from the listing and
   excludes non-JSON files (partial uploads, etc.).

The missing-package error path is left for a follow-up — testing it
properly requires either refactoring GCSEvalResultStore.getBucket to
accept an injectable importer, or scoped vi.doMock that doesn't
contaminate sibling tests. Both are non-trivial enough to defer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@steve-calvert-glean steve-calvert-glean marked this pull request as ready for review May 28, 2026 16:56
@steve-calvert-glean steve-calvert-glean requested a review from a team as a code owner May 28, 2026 16:56
Comment thread src/types/reporter.ts Outdated

@aditya-scio aditya-scio left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aditya-scio aditya-scio left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aditya-scio aditya-scio left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like this separation. Making result storage a first-class interface, while preserving the old file-path workflows, is the right compatibility story. One small follow-up I’d make sure lands is the redactStoredResponses doc/default cleanup so the docs match runtime behavior.

Resolve additive type re-export conflicts in src/index.ts and
src/types/index.ts by keeping both the result-store exports
(SaveEvalRunComparisonOptions, StoredEvalRunRef) and main's
variantExperiment exports.

Also fix redactStoredResponses JSDoc default (false -> true) in
src/types/reporter.ts to match MCPReporter runtime behavior.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@steve-calvert-glean steve-calvert-glean merged commit 019b7a8 into main Jun 15, 2026
1 check passed
@steve-calvert-glean steve-calvert-glean deleted the codex/external-result-storage branch June 15, 2026 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants