experimental: Add external host driver runtime by steve-calvert-glean · Pull Request #196 · gleanwork/mcp-server-tester

steve-calvert-glean · 2026-05-11T23:44:26Z

Summary

This PR adds an experimental external_host eval mode that runs the existing dataset, expectation, iteration, and reporter flow through real MCP host applications instead of only the SDK-based mcp_host simulation path.

The implementation is capability-composed rather than host-runner-specific:

Adds structured external host driver identity, for example anthropic.claude.cowork.desktop-app.macos.
Adds a runtime that loads host configs into composed control, input, completion, trace, and normalize capabilities.
Adds built-in Claude Desktop drivers for separate Chat and Cowork surfaces.
Adds reusable macOS desktop app driving via AppleScript as builtin:desktop.macos.accessibilitySubmit, plus deterministic Cowork surface activation as builtin:anthropic.claude.activateCoworkSurface.
Adds high-confidence Claude Cowork trace collection from local-agent metadata, audit.jsonl, and embedded .claude/projects/**/*.jsonl transcripts.
Adds evidence gating so tool-call assertions require structured trace evidence rather than DOM/screenshot/accessibility text.
Adds external-host schema/reference helpers for docs, autocomplete, and agent-generated configs.
Updates reporter data and UI to show host identity, trace source/confidence, evidence source, artifacts, session metadata, usage, cost, and host/automation failure kinds.
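
The evidence-gating rule in the list above could be sketched roughly as follows. This is an illustration only: the type names, the `evidenceSource` field, and every source label other than `host-local-transcript` (which appears in this PR's validation output) are assumptions, not the PR's actual API.

```typescript
// Hypothetical types; the PR's real trace model may differ.
type EvidenceSource =
  | "host-local-transcript" // structured: parsed from the host's own session files
  | "dom" // unstructured (assumed label)
  | "screenshot" // unstructured (assumed label)
  | "accessibility-text"; // unstructured (assumed label)

interface ObservedToolCall {
  name: string;
  evidenceSource: EvidenceSource;
}

// Only structured sources may satisfy tool-call assertions.
const STRUCTURED_SOURCES = new Set<EvidenceSource>(["host-local-transcript"]);

function structuredToolCalls(calls: ObservedToolCall[]): ObservedToolCall[] {
  return calls.filter((call) => STRUCTURED_SOURCES.has(call.evidenceSource));
}

function checkToolCalled(
  calls: ObservedToolCall[],
  expectedName: string,
): { pass: boolean; reason: string } {
  const gated = structuredToolCalls(calls);
  if (gated.some((call) => call.name === expectedName)) {
    return { pass: true, reason: `structured trace shows a call to ${expectedName}` };
  }
  // Distinguish "seen, but only in weak evidence" from "not seen at all".
  const seenElsewhere = calls.some((call) => call.name === expectedName);
  return {
    pass: false,
    reason: seenElsewhere
      ? `${expectedName} appeared only in unstructured evidence`
      : `no evidence of a call to ${expectedName}`,
  };
}
```

Under this sketch, a tool name that shows up only in DOM or screenshot text fails the assertion with a distinct reason, which keeps weak evidence from being reported as a confirmed tool call.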

Why

The current eval path only approximates an MCP host. For real confidence, we need to run eval cases through actual hosts such as Claude Cowork, Claude Desktop Chat, CLI tools, browser hosts, and future desktop apps, while preserving the same dataset and expectation model.

This PR establishes the foundation for that by making the external-host runner declarative and capability-driven. New hosts should be addable by composing or registering capabilities instead of adding one-off imperative runners.

How It Works

An eval case can now specify:

{
  "mode": "external_host",
  "scenario": "Please reply with exactly: external host integration acknowledged.",
  "externalHost": {
    "driver": "anthropic.claude.cowork.desktop-app.macos",
    "timeoutMs": 60000
  },
  "expect": {
    "containsText": "external host integration acknowledged"
  }
}

The runtime generates a unique run marker, submits the marked scenario to the configured host, waits for host evidence, normalizes the response into the existing expectation-compatible result shape, and attaches external-host metadata for the reporter.
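
A minimal sketch of that flow, assuming a simplified capability surface. The interface, the function names, and the marker format are hypothetical; the PR does not state how the marker is embedded in the scenario.

```typescript
// Hypothetical capability surface; names are illustrative, not the PR's API.
interface HostCapabilities {
  submit(markedScenario: string): Promise<void>;
  waitForEvidence(marker: string, timeoutMs: number): Promise<string>;
  normalize(rawEvidence: string, marker: string): { text: string };
}

let markerCounter = 0;

// Combines time, a per-process counter, and randomness so markers do not repeat within a process.
function makeRunMarker(): string {
  markerCounter += 1;
  const random = Math.random().toString(36).slice(2, 10);
  return `eval-run-${Date.now().toString(36)}-${markerCounter}-${random}`;
}

// Assumed marker placement: appended in brackets after a blank line.
function markScenario(scenario: string, marker: string): string {
  return `${scenario}\n\n[${marker}]`;
}

async function runExternalHostCase(
  scenario: string,
  caps: HostCapabilities,
  timeoutMs: number,
): Promise<{ text: string; externalHost: { marker: string } }> {
  const marker = makeRunMarker();
  await caps.submit(markScenario(scenario, marker)); // drive the real host
  const raw = await caps.waitForEvidence(marker, timeoutMs); // wait for host evidence
  const { text } = caps.normalize(raw, marker); // expectation-compatible shape
  return { text, externalHost: { marker } }; // metadata for the reporter
}
```

Passing the capabilities in as an object keeps the orchestration independent of any one host, which matches the capability-composed design described in the summary.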

Claude Cowork is configured as a composition of capabilities:

capabilities: {
  control: [
    { uses: 'builtin:platform.macos' },
    {
      uses: 'builtin:anthropic.claude.activateCoworkSurface',
      with: { appName: 'Claude' },
    },
  ],
  input: {
    uses: 'builtin:desktop.macos.accessibilitySubmit',
    with: { appName: 'Claude', createNewConversation: true },
  },
  completion: {
    uses: 'builtin:anthropic.claude.localAgentTrace',
    provides: ['trace'],
  },
  normalize: {
    uses: 'builtin:anthropic.claude.localAgentNormalize',
  },
}
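
One way to reason about such a composition is to check that every capability slot is covered, either by its own entry or by another entry's `provides` list. The sketch below assumes that reading of `provides` and uses hypothetical type and function names; the five slot names come from the summary above.

```typescript
// Slot names come from the PR summary; everything else here is illustrative.
type Slot = "control" | "input" | "completion" | "trace" | "normalize";

const ALL_SLOTS: Slot[] = ["control", "input", "completion", "trace", "normalize"];

interface CapabilityRef {
  uses: string;
  with?: Record<string, unknown>;
  provides?: Slot[];
}

type CapabilityConfig = Partial<Record<Slot, CapabilityRef | CapabilityRef[]>>;

// Returns the slots that no entry declares or provides.
function missingSlots(config: CapabilityConfig): Slot[] {
  const covered = new Set<Slot>();
  for (const slot of ALL_SLOTS) {
    const entry = config[slot];
    if (entry === undefined) continue;
    covered.add(slot); // the slot has its own entry
    const refs = Array.isArray(entry) ? entry : [entry];
    for (const ref of refs) {
      for (const provided of ref.provides ?? []) covered.add(provided);
    }
  }
  return ALL_SLOTS.filter((slot) => !covered.has(slot));
}

// A single module-backed capability that provides the remaining four slots.
const singleModule: CapabilityConfig = {
  control: {
    uses: "module:file:///path/to/fake-external-host.mjs#fakeExternalHost",
    provides: ["input", "completion", "trace", "normalize"],
  },
};
console.log(missingSlots(singleModule)); // prints []
```

Applied to the Cowork configuration above, the `completion` entry's `provides: ['trace']` is what covers the `trace` slot.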

Custom/project-local host capabilities can also be loaded dynamically:

{
  "capabilities": {
    "control": {
      "uses": "module:file:///path/to/fake-external-host.mjs#fakeExternalHost",
      "provides": ["input", "completion", "trace", "normalize"]
    }
  }
}
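
A `uses` reference of this shape can be split into a module URL and an export name before loading. The following is a sketch under assumptions: the function names are hypothetical, and falling back to the `default` export when no `#fragment` is present is a guess rather than documented behavior.

```typescript
type ParsedUses =
  | { kind: "builtin"; name: string }
  | { kind: "module"; url: string; exportName: string };

// Splits "builtin:<name>" and "module:<url>#<export>" references.
function parseUses(uses: string): ParsedUses {
  if (uses.startsWith("builtin:")) {
    return { kind: "builtin", name: uses.slice("builtin:".length) };
  }
  if (uses.startsWith("module:")) {
    const rest = uses.slice("module:".length);
    const hashIndex = rest.lastIndexOf("#");
    if (hashIndex === -1) {
      // Assumption: no fragment means the module's default export.
      return { kind: "module", url: rest, exportName: "default" };
    }
    return {
      kind: "module",
      url: rest.slice(0, hashIndex),
      exportName: rest.slice(hashIndex + 1),
    };
  }
  throw new Error(`Unsupported capability reference: ${uses}`);
}

// Loads a module-backed capability with a dynamic import (ESM runtime assumed).
async function loadModuleCapability(uses: string): Promise<unknown> {
  const parsed = parseUses(uses);
  if (parsed.kind !== "module") {
    throw new Error(`Not a module reference: ${uses}`);
  }
  const mod: Record<string, unknown> = await import(parsed.url);
  if (!(parsed.exportName in mod)) {
    throw new Error(`Export ${parsed.exportName} not found in ${parsed.url}`);
  }
  return mod[parsed.exportName];
}
```

Keeping parsing separate from loading makes the reference format unit-testable without touching the file system.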

Cowork driver design

The Cowork driver is intentionally keyboard-only — no coordinate-based clicks, no recursive accessibility tree walks. The submit sequence is:

1. Activate Claude.app and verify it actually came to the foreground. tell-to-activate is unreliable on multi-monitor / multi-Space setups when another app holds focus precedence; the driver retries set frontmost to true for up to 2 seconds and errors fast with a clear message if the OS refuses, rather than letting downstream keystrokes route to the wrong app and surfacing as a 90-second eval timeout.
2. Send Cmd+2 to switch to the Cowork surface (idempotent: a no-op if already on Cowork).
3. Send Cmd+N to open a new conversation. Cowork's React app autofocuses the composer when a new conversation view renders, even though Chromium's macOS accessibility bridge does not expose AXFocusedUIElement for that focus.
4. Send Cmd+V to paste the marked scenario from the clipboard. Keystrokes route to whatever has DOM focus inside the active window.
5. Send Return to submit.
6. Watch ~/Library/Application Support/Claude/local-agent-mode-sessions/ for a fresh session containing the run marker, then parse its audit.jsonl and embedded .claude/projects/**/*.jsonl transcript for trace evidence.
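
The keystroke portion of the sequence above (Cmd+2, Cmd+N, Cmd+V, Return) might be generated along these lines. This is a sketch, not the PR's script: the function name and the delay values are assumptions, and it presumes the app is already frontmost and the marked scenario is already on the clipboard.

```typescript
// Hypothetical helper; delays are illustrative guesses, not values from the PR.
function buildKeyboardSubmitScript(): string {
  return [
    `tell application "System Events"`,
    `  keystroke "2" using command down -- switch to the Cowork surface`,
    `  delay 0.3`,
    `  keystroke "n" using command down -- new conversation; composer autofocuses`,
    `  delay 0.5`,
    `  keystroke "v" using command down -- paste the marked scenario`,
    `  delay 0.2`,
    `  key code 36 -- Return: submit`,
    `end tell`,
  ].join("\n");
}

console.log(buildKeyboardSubmitScript());
```

The script never queries the accessibility tree or computes screen coordinates, so it sidesteps both fragility classes described in this section.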

This avoids two known fragility classes: coordinate-based composer targeting (which breaks across window position, monitor placement, and layout drift) and recursive AX walks against Electron's Chromium accessibility bridge (which on a fully-loaded Cowork window can take 30+ seconds per query).

Validation

Validated locally with:

npm run build
npm run typecheck
npm run format:check
npm run lint
npm test
Focused external-host unit tests
Deterministic package-boundary e2e demo: npm test -- tests/mcp-tests.spec.ts

Manual end-to-end run against real Claude Cowork on macOS: a two-case dataset (echo + Glean MCP search tool call) passes end-to-end with traceConfidence: high and traceSource: host-local-transcript on both cases, regardless of which monitor Claude is on once it can be activated.

{
  "pass": true,
  "driverSlug": "anthropic.claude.cowork.desktop-app.macos",
  "traceSource": "host-local-transcript",
  "traceConfidence": "high",
  "artifacts": 3
}

Notes

GUI-based host evals depend on macOS automation:

Cowork must be running and signed in.
The terminal running the eval needs Accessibility/Automation permission for Claude.app.
The driver verifies foreground activation up to 2 seconds; if focus-prevention blocks the activation it errors fast rather than silently timing out.
external_host runs are inherently more dependent on the host process than direct or mcp_host runs. CI use should run on an isolated desktop session or VM where window state is predictable.

Several gaps prevented the anthropic.claude.cowork.desktop-app.macos driver from running reliably:

- runAppleScript hit the default 1MB stdout buffer when the Claude AX tree was fully loaded, surfacing as opaque "stdout maxBuffer length exceeded" errors. Bump to 64MB and add a 30s per-script timeout so individual osascript invocations cannot hang the eval.
- The reject-only coworkSurface capability required the user to be manually on the Cowork tab before a run, breaking CI use. Add activateCoworkSurface, which sends Cmd+2 to switch surfaces deterministically. Idempotent: a no-op if already on Cowork.
- Add wakeAccessibility capability that clicks the front window to force Chromium to populate its accessibility tree. Without this, recursive AX walks return empty on Electron apps until something interacts with the window.
- Replace recursive findTextArea/findSubmitButton AppleScript helpers with coordinate-based clicks plus paste-and-Return. The composer role varies across Claude versions (AXTextArea, AXTextField, contenteditable); coordinate clicks work regardless.

The Cowork driver registry now composes:

control: [platform.macos, activateCoworkSurface, wakeAccessibility]
input: accessibilitySubmit (coordinate-click + paste + Return)

Manually verified on macOS: a two-case dataset (echo + Glean search tool call) passes end-to-end with traceConfidence=high and traceSource=host-local-transcript on both cases. No Cowork pre-flight required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
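
The buffer and timeout change described in this commit message maps onto the `maxBuffer` and `timeout` options of Node's `child_process.execFile`. A minimal sketch, assuming the script is passed to `osascript -e`; the function name is a hypothetical stand-in for the PR's runAppleScript, whose real signature is not shown here.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Limits taken from the commit message; the option object itself is illustrative.
const APPLESCRIPT_EXEC_OPTIONS = {
  maxBuffer: 64 * 1024 * 1024, // 64MB, up from Node's 1MB default
  timeout: 30_000, // 30s cap per osascript invocation
};

// Hypothetical stand-in for runAppleScript; only meaningful on macOS.
async function runAppleScriptSketch(script: string): Promise<string> {
  const { stdout } = await execFileAsync(
    "osascript",
    ["-e", script],
    APPLESCRIPT_EXEC_OPTIONS,
  );
  return String(stdout).trim();
}
```

With `timeout` set, Node terminates the child process once the limit elapses, so a hung osascript call rejects instead of stalling the whole eval.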

The PNG at docs/img/external-host-cowork-reporter.png was only referenced from the PR body, not from any docs or code in the repo. Inlined screenshots in PR descriptions are better hosted via GitHub's drag-and-drop upload (user-images.githubusercontent.com) than committed to the source tree, where they bloat git history without serving documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous submit script computed a (centerX, bottom-90px) click target to focus the composer before pasting. This was fragile to:

- Window position (broke when Claude moved to a secondary display with negative Y coordinates)
- Window size (the 90px offset assumed a specific layout)
- UI layout drift (banners, status bars, or panel changes shift the composer's visible position)

Replace coordinate clicks with pure-keyboard input:

activate -> Cmd+2 (Cowork surface) -> Cmd+N (new conversation, autofocuses composer) -> Cmd+V (paste from clipboard) -> Return

Cowork's React app autofocuses the composer when a new conversation view renders. Chromium's macOS accessibility bridge does not expose this focus to AppleScript (AXFocusedUIElement returns missing value), but keystrokes route to whatever has DOM focus regardless. Cmd+N forces a known-focus state without needing to find the composer in the AX tree.

Side effects:

- The wakeAccessibility capability is no longer needed in the Cowork driver registry: Cmd+N implicitly wakes the AX tree by triggering UI re-render. The capability remains available for other drivers that need it explicitly.
- createNewConversation defaults to true on the Cowork driver. Each eval starts in a fresh conversation, isolating runs and ensuring the composer is autofocused.

Manually verified: a two-case dataset (echo + Glean search tool call) passes with traceConfidence=high and traceSource=host-local-transcript on both cases, regardless of which monitor Claude is on or what window size it has.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`tell application X to activate` plus `set frontmost to true` is not reliable on macOS in three common scenarios:

- Target app is on a different Space than the user's current Space.
- Target app is on a secondary monitor while the user's focus is on a different monitor.
- Another foreground app (browser, terminal) holds focus-prevention precedence over background activation requests.

When activation silently doesn't take, AppleScript's subsequent `keystroke` calls route to whatever app actually holds focus. The external_host runtime then waits up to its case-level timeout (default 90–120s) for a Cowork local-agent session that never appears, and fails with `failureKind: no_matching_session`. The user-facing error points at the trace step but the real cause is the activation step upstream.

Add a short verification loop around foregrounding the target app: poll `frontmost` for up to 2 seconds, retrying `set frontmost to true` each tick. If the app refuses to come forward, error out immediately with a message identifying the foreground problem rather than letting the eval timeout misattribute the failure.

This makes the keyboard-only submission path (introduced in c43d4b9) robust to multi-monitor and multi-Space setups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
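
The verification loop described in this commit message could look roughly like the generated AppleScript below. It is a sketch under assumptions: the function name, the 100ms tick, the error wording, and the premise that the System Events process name equals the app name are all illustrative; only the 2-second budget and the retry of `set frontmost to true` come from the text.

```typescript
// Hypothetical builder; not the PR's buildMacosDesktopSubmitScript.
function buildForegroundCheckScript(
  appName: string,
  timeoutMs = 2000,
  tickMs = 100,
): string {
  // Refuse names that would break out of the AppleScript string literal.
  if (/["\\]/.test(appName)) {
    throw new Error("appName must not contain quotes or backslashes");
  }
  const attempts = Math.max(1, Math.ceil(timeoutMs / tickMs));
  const tickSeconds = (tickMs / 1000).toFixed(2);
  return [
    `tell application "${appName}" to activate`,
    `tell application "System Events"`,
    `  set cameForward to false`,
    `  repeat ${attempts} times`,
    `    if frontmost of process "${appName}" then`,
    `      set cameForward to true`,
    `      exit repeat`,
    `    end if`,
    `    set frontmost of process "${appName}" to true`,
    `    delay ${tickSeconds}`,
    `  end repeat`,
    `  if not cameForward then error "Could not bring ${appName} to the foreground within ${timeoutMs}ms; another app may be holding focus."`,
    `end tell`,
  ].join("\n");
}

console.log(buildForegroundCheckScript("Claude"));
```

Raising an AppleScript `error` here makes the osascript call fail within about two seconds, so the failure is attributed to activation rather than to a later trace timeout.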

…foreground-verification test

Two follow-ups identified during PR review:

- The wakeAccessibility capability and its underlying wakeMacosAccessibility function were no longer wired into any driver after the keyboard-only refactor (Cmd+N replaces the AX-wake side effect that the coordinate click used to provide). The capability still used a coordinate-based click, contradicting the PR's stated keyboard-only design. Removed the function, the capability binding, and the test assertion. Can be reintroduced from git history if a future driver needs explicit AX wake.
- The foreground-verification retry loop in buildMacosDesktopSubmitScript (added previously) had no unit test. Added an explicit test asserting both the retry loop structure and the failure-path error message, since this is the safety net that prevents silent 90-second eval timeouts when another app holds focus precedence.

Also updates the docs/api-reference.md snippet line range to match the EvalRunnerResult interface position after rebasing onto main (L121-L195).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The api-reference.md inlined a copy of EvalRunnerResult that included the function-preamble JSDoc comment, but the snippet range L121-L195 in the rebased evalRunner.ts starts at the `export interface` line itself; the preamble lives at L118-L120. markdown-code's content check correctly flagged the divergence in CI; running `npx markdown-code sync` regenerates the inlined block to match the source exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aditya-scio

LGTM

aditya-scio

LGTM

aditya-scio

This is a meaningful step up in eval realism

steve-calvert-glean force-pushed the external-host-driver-runtime branch 2 times, most recently from c36b421 to 752f3d6 Compare May 12, 2026 00:21

steve-calvert-glean added the enhancement New feature or request label May 12, 2026

steve-calvert-glean changed the title ~~Add external host driver runtime~~ experimental: Add external host driver runtime May 22, 2026

steve-calvert-glean and others added 6 commits May 28, 2026 09:35

feat: add external host driver runtime

3446462

steve-calvert-glean force-pushed the external-host-driver-runtime branch from 36d5428 to 94ed5fc Compare May 28, 2026 16:40

steve-calvert-glean marked this pull request as ready for review May 28, 2026 16:56

steve-calvert-glean requested a review from a team as a code owner May 28, 2026 16:56

aditya-scio approved these changes Jun 15, 2026

aditya-scio reviewed Jun 15, 2026

steve-calvert-glean wants to merge 7 commits into main from external-host-driver-runtime
