feat: add external eval result storage by steve-calvert-glean · Pull Request #204 · gleanwork/mcp-server-tester

steve-calvert-glean · 2026-05-22T19:16:14Z

Summary

Adds external result storage for MCP eval runs, with a first-party GCS-backed implementation and a local file implementation behind the same EvalResultStore interface.

This makes eval outputs durable and reusable across local runs, CI, reporter history, baseline regression checks, tool override experiments, and server comparisons.

What Changed

Added EvalResultStore, FileEvalResultStore, GCSEvalResultStore, stored artifact envelopes, and a result-store factory.
Extended runEvalDataset so saveResultsTo and baselineResultsFrom support store-backed refs while preserving existing file-path behavior.
Added helpers to load stored eval runs into compareEvalRuns() and persist eval-run comparison artifacts.
Added optional persistence for runServerComparison() output.
Extended the MCP reporter to save MCPEvalRunData externally and load stored reporter history for trend charts.
Added docs and snippets for GCS auth, reporter storage, stored baselines, variant comparisons, and server comparison persistence.

Developer Impact

Existing local baseline/report workflows remain unchanged.
GCS users configure Application Default Credentials via GOOGLE_APPLICATION_CREDENTIALS; inline credentials are intentionally not supported.
@google-cloud/storage is optional and dynamically loaded, so non-GCS users do not need cloud configuration at runtime.

Validation

npm run format:check
npm run docs:check
npm run typecheck
npm run lint
npm test
npm run build

steve-calvert-glean · 2026-05-25T00:09:57Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Three blockers surfaced during merge-readiness review: 1. Inconsistent redactStoredResponses defaults between the eval runner and the reporter. The runner store path used `redactStoredResponses ?? omitResponsesFromBaseline` (effective default true via fallback), while MCPReporter passed the option through unset, producing falsy-undefined behavior (effective default false). Users with both write paths configured received a mix of redacted and non-redacted artifacts depending on which code path wrote them — silent data inconsistency. Both sites now default to true with explicit `?? true` and the intent is documented inline. To opt into stored responses, pass `redactStoredResponses: false` on the matching site. 2. loadStoredEvalRunnerResult returned `Promise<EvalRunnerResult>` but the documented usage accessed `loaded.metadata?.toolOverrideVariantId` — a field that lives on the storage envelope, not on the runtime result. The optional-chain hid the type error at compile time and the test relied on `metadata` happening to be present on the runner result, which is not where the production save path puts it. Changed the return type to `Promise<StoredEvalArtifact<EvalRunnerResult>>`. Callers now access the runtime data via `loaded.data` and storage metadata (run id, timestamp, package version, tool-override variant id, git hash, etc.) via `loaded.metadata`. Snippet, docs, and test updated to match. 3. GCSEvalResultStore had only two tests (saveArtifact and loadLatestArtifact). The immutable load path (loadArtifact by id) and the listArtifacts pagination/limit path were untested. Both are now covered with nock-mocked HTTP fixtures, including assertions that listArtifacts excludes latest.json from the listing and excludes non-JSON files (partial uploads, etc.). The missing-package error path is left for a follow-up — testing it properly requires either refactoring GCSEvalResultStore.getBucket to accept an injectable importer, or scoped vi.doMock that doesn't contaminate sibling tests. Both are non-trivial enough to defer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aditya-scio

LGTM

aditya-scio

LGTM

aditya-scio

Really like this separation. Making result storage a first-class interface, while preserving the old file-path workflows, is the right compatibility story. One small follow-up I’d make sure lands is the redactStoredResponses doc/default cleanup so the docs match runtime behavior.

Resolve additive type re-export conflicts in src/index.ts and src/types/index.ts by keeping both the result-store exports (SaveEvalRunComparisonOptions, StoredEvalRunRef) and main's variantExperiment exports. Also fix redactStoredResponses JSDoc default (false -> true) in src/types/reporter.ts to match MCPReporter runtime behavior. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

steve-calvert-glean added the enhancement New feature or request label May 22, 2026

steve-calvert-glean changed the title ~~[codex] add external eval result storage~~ feat: add external eval result storage May 23, 2026

feat: add external eval result storage

3db10dc

steve-calvert-glean force-pushed the codex/external-result-storage branch from 85192ed to 3db10dc Compare May 24, 2026 23:39

steve-calvert-glean marked this pull request as ready for review May 28, 2026 16:56

steve-calvert-glean requested a review from a team as a code owner May 28, 2026 16:56

glean-it Bot reviewed May 28, 2026

View reviewed changes

Comment thread src/types/reporter.ts Outdated

aditya-scio approved these changes Jun 15, 2026

View reviewed changes

aditya-scio reviewed Jun 15, 2026

View reviewed changes

steve-calvert-glean merged commit 019b7a8 into main Jun 15, 2026
1 check passed

steve-calvert-glean deleted the codex/external-result-storage branch June 15, 2026 22:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add external eval result storage#204

feat: add external eval result storage#204
steve-calvert-glean merged 3 commits into
mainfrom
codex/external-result-storage

steve-calvert-glean commented May 22, 2026

Uh oh!

steve-calvert-glean commented May 25, 2026

Uh oh!

Uh oh!

aditya-scio left a comment

Uh oh!

aditya-scio left a comment

Uh oh!

aditya-scio left a comment •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

steve-calvert-glean commented May 22, 2026

Summary

What Changed

Developer Impact

Validation

Uh oh!

steve-calvert-glean commented May 25, 2026

Code review

Uh oh!

Uh oh!

aditya-scio left a comment

Choose a reason for hiding this comment

Uh oh!

aditya-scio left a comment

Choose a reason for hiding this comment

Uh oh!

aditya-scio left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aditya-scio left a comment •

edited

Loading