experimental: Add external host driver runtime #196

Open
steve-calvert-glean wants to merge 7 commits into main from external-host-driver-runtime

@steve-calvert-glean steve-calvert-glean commented May 11, 2026

Contributor

Summary

This PR adds an experimental external_host eval mode that runs the existing dataset, expectation, iteration, and reporter flow through real MCP host applications instead of only the SDK-based mcp_host simulation path.

The implementation is capability-composed rather than host-runner-specific:

  • Adds structured external host driver identity, for example anthropic.claude.cowork.desktop-app.macos.
  • Adds a runtime that loads host configs into composed control, input, completion, trace, and normalize capabilities.
  • Adds built-in Claude Desktop drivers for separate Chat and Cowork surfaces.
  • Adds reusable macOS desktop app driving via AppleScript as builtin:desktop.macos.accessibilitySubmit, plus deterministic Cowork surface activation as builtin:anthropic.claude.activateCoworkSurface.
  • Adds high-confidence Claude Cowork trace collection from local-agent metadata, audit.jsonl, and embedded .claude/projects/**/*.jsonl transcripts.
  • Adds evidence gating so tool-call assertions require structured trace evidence rather than DOM/screenshot/accessibility text.
  • Adds external-host schema/reference helpers for docs, autocomplete, and agent-generated configs.
  • Updates reporter data and UI to show host identity, trace source/confidence, evidence source, artifacts, session metadata, usage, cost, and host/automation failure kinds.
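
The evidence-gating rule in the list above could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `HostEvidence`, `GateResult`, and `gateToolCallAssertion` are hypothetical names, and the only value taken from the PR is the `host-local-transcript` trace source.

```typescript
// Hypothetical types: only "host-local-transcript" is named in the PR.
// The other source labels stand in for DOM, screenshot, and accessibility text.
type EvidenceSource =
  | "host-local-transcript"
  | "dom"
  | "screenshot"
  | "accessibility-text";

interface HostEvidence {
  source: EvidenceSource;
  toolCalls?: { name: string }[];
}

type GateResult =
  | { ok: true; toolNames: string[] }
  | { ok: false; reason: string };

// Sources assumed to carry structured trace records.
const STRUCTURED_SOURCES: ReadonlySet<EvidenceSource> = new Set<EvidenceSource>([
  "host-local-transcript",
]);

// Tool-call assertions are only evaluated against structured trace evidence.
// Anything scraped from the UI is rejected before the expectation runs.
function gateToolCallAssertion(evidence: HostEvidence): GateResult {
  if (!STRUCTURED_SOURCES.has(evidence.source)) {
    return {
      ok: false,
      reason: `source "${evidence.source}" is not structured trace evidence`,
    };
  }
  if (!evidence.toolCalls) {
    return { ok: false, reason: "trace carries no tool-call records" };
  }
  return { ok: true, toolNames: evidence.toolCalls.map((call) => call.name) };
}
```

Returning a reason on rejection, rather than a bare boolean, would let a reporter distinguish "the tool was not called" from "there was no trustworthy evidence either way".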

Why

The current eval path only approximates an MCP host. For real confidence, we need to run eval cases through actual hosts such as Claude Cowork, Claude Desktop Chat, CLI tools, browser hosts, and future desktop apps, while preserving the same dataset and expectation model.

This PR establishes the foundation for that by making the external-host runner declarative and capability-driven. New hosts should be addable by composing or registering capabilities instead of adding one-off imperative runners.

How It Works

An eval case can now specify:

{
  "mode": "external_host",
  "scenario": "Please reply with exactly: external host integration acknowledged.",
  "externalHost": {
    "driver": "anthropic.claude.cowork.desktop-app.macos",
    "timeoutMs": 60000
  },
  "expect": {
    "containsText": "external host integration acknowledged"
  }
}

The runtime generates a unique run marker, submits the marked scenario to the configured host, waits for host evidence, normalizes the response into the existing expectation-compatible result shape, and attaches external-host metadata for the reporter.
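
As a rough illustration of the run-marker idea, a sketch might look like the following. The marker format and the function names (`createRunMarker`, `markScenario`, `belongsToRun`) are assumptions made for this example; the PR does not specify how markers are rendered or embedded.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical marker format. A UUID keeps concurrent or repeated runs distinct.
function createRunMarker(): string {
  return `eval-run-${randomUUID()}`;
}

// Append the marker to the scenario so it travels with the prompt into the host.
// Where and how the real runtime embeds the marker is an assumption here.
function markScenario(scenario: string, marker: string): string {
  return `${scenario}\n\n[${marker}]`;
}

// A host session belongs to this run only if its recorded text contains the marker.
function belongsToRun(sessionText: string, marker: string): boolean {
  return sessionText.includes(marker);
}
```

Matching on a per-run marker, rather than on timestamps alone, is what would let the runtime ignore unrelated sessions the user or another eval created in the same host.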

Claude Cowork is configured as a composition of capabilities:

capabilities: {
  control: [
    { uses: 'builtin:platform.macos' },
    {
      uses: 'builtin:anthropic.claude.activateCoworkSurface',
      with: { appName: 'Claude' },
    },
  ],
  input: {
    uses: 'builtin:desktop.macos.accessibilitySubmit',
    with: { appName: 'Claude', createNewConversation: true },
  },
  completion: {
    uses: 'builtin:anthropic.claude.localAgentTrace',
    provides: ['trace'],
  },
  normalize: {
    uses: 'builtin:anthropic.claude.localAgentNormalize',
  },
}

Custom/project-local host capabilities can also be loaded dynamically:

{
  "capabilities": {
    "control": {
      "uses": "module:file:///path/to/fake-external-host.mjs#fakeExternalHost",
      "provides": ["input", "completion", "trace", "normalize"]
    }
  }
}
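
The two `uses` forms shown above (`builtin:<name>` and `module:<url>#<export>`) could be resolved along these lines. This is a sketch under assumptions: `CapabilityRef`, `parseCapabilityRef`, and `loadModuleCapability` are hypothetical names, and the PR's actual resolution logic may differ.

```typescript
// Hypothetical parsed form of a `uses` reference.
type CapabilityRef =
  | { kind: "builtin"; name: string }
  | { kind: "module"; url: string; exportName: string };

function parseCapabilityRef(uses: string): CapabilityRef {
  if (uses.startsWith("builtin:")) {
    const name = uses.slice("builtin:".length);
    if (!name) throw new Error(`empty builtin name in "${uses}"`);
    return { kind: "builtin", name };
  }
  if (uses.startsWith("module:")) {
    const rest = uses.slice("module:".length);
    // The export name follows the last "#", mirroring the example config.
    const hash = rest.lastIndexOf("#");
    if (hash <= 0 || hash === rest.length - 1) {
      throw new Error(`expected "module:<url>#<export>" but got "${uses}"`);
    }
    return {
      kind: "module",
      url: rest.slice(0, hash),
      exportName: rest.slice(hash + 1),
    };
  }
  throw new Error(`unsupported capability reference "${uses}"`);
}

// Dynamically import a project-local module and pick the named export.
async function loadModuleCapability(
  ref: Extract<CapabilityRef, { kind: "module" }>,
): Promise<unknown> {
  const mod: Record<string, unknown> = await import(ref.url);
  if (!(ref.exportName in mod)) {
    throw new Error(`module ${ref.url} has no export "${ref.exportName}"`);
  }
  return mod[ref.exportName];
}
```

Parsing the reference separately from loading it keeps config validation cheap: a malformed `uses` string can be reported before any module is imported.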

Cowork driver design

The Cowork driver is intentionally keyboard-only — no coordinate-based clicks, no recursive accessibility tree walks. The submit sequence is:

  1. Activate Claude.app and verify it actually came to the foreground. AppleScript's tell application ... to activate is unreliable on multi-monitor / multi-Space setups when another app holds focus precedence. The driver retries set frontmost to true for up to 2 seconds and, if the OS still refuses, errors fast with a clear message. Otherwise downstream keystrokes would route to the wrong app and the failure would surface only as a 90-second eval timeout.
  2. Send Cmd+2 to switch to the Cowork surface (idempotent — no-op if already on Cowork).
  3. Send Cmd+N to open a new conversation. Cowork's React app autofocuses the composer when a new conversation view renders, even though Chromium's macOS accessibility bridge does not expose AXFocusedUIElement for that focus.
  4. Send Cmd+V to paste the marked scenario from the clipboard. Keystrokes route to whatever has DOM focus inside the active window.
  5. Send Return to submit.
  6. Watch ~/Library/Application Support/Claude/local-agent-mode-sessions/ for a fresh session containing the run marker, then parse its audit.jsonl and embedded .claude/projects/**/*.jsonl transcript for trace evidence.

This avoids two known fragility classes: coordinate-based composer targeting (which breaks across window position, monitor placement, and layout drift) and recursive AX walks against Electron's Chromium accessibility bridge (which on a fully-loaded Cowork window can take 30+ seconds per query).
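
The keyboard-only submit sequence, including the foreground check, could be generated by something like the sketch below. The PR's real builder is `buildMacosDesktopSubmitScript`, whose contents are not shown here; `buildKeyboardSubmitScript`, its options, the poll interval, and the delays between keystrokes are illustrative assumptions. The sketch also assumes the caller has already placed the marked scenario on the clipboard.

```typescript
interface SubmitScriptOptions {
  appName: string;
  foregroundTimeoutMs?: number; // the PR states "up to 2 seconds"
  pollIntervalMs?: number; // assumption: not specified in the PR
}

// Escape backslashes and double quotes for an AppleScript string literal.
function escapeAppleScriptString(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

function buildKeyboardSubmitScript(opts: SubmitScriptOptions): string {
  const app = escapeAppleScriptString(opts.appName);
  const timeoutMs = opts.foregroundTimeoutMs ?? 2000;
  const pollMs = opts.pollIntervalMs ?? 100;
  const maxAttempts = Math.max(1, Math.ceil(timeoutMs / pollMs));
  const pollSeconds = (pollMs / 1000).toFixed(2);

  return [
    `tell application "${app}" to activate`,
    `tell application "System Events"`,
    `  tell process "${app}"`,
    // Poll until the app is frontmost, retrying the request on each tick.
    `    set attempts to 0`,
    `    repeat while (frontmost is false) and (attempts < ${maxAttempts})`,
    `      set frontmost to true`,
    `      delay ${pollSeconds}`,
    `      set attempts to attempts + 1`,
    `    end repeat`,
    // Fail fast instead of sending keystrokes to whichever app has focus.
    `    if frontmost is false then error "${app} did not come to the foreground"`,
    `  end tell`,
    `  keystroke "2" using command down`, // switch to the Cowork surface
    `  delay 0.3`,
    `  keystroke "n" using command down`, // new conversation, composer autofocus
    `  delay 0.5`,
    `  keystroke "v" using command down`, // paste the marked scenario
    `  delay 0.2`,
    `  key code 36`, // Return
    `end tell`,
  ].join("\n");
}
```

The resulting string would be passed to `osascript`. Building the script as data also makes the retry loop and the failure message straightforward to assert in a unit test without driving a real desktop session.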

Validation

Validated locally with:

  • npm run build
  • npm run typecheck
  • npm run format:check
  • npm run lint
  • npm test
  • Focused external-host unit tests
  • Deterministic package-boundary e2e demo: npm test -- tests/mcp-tests.spec.ts

Manual end-to-end run against real Claude Cowork on macOS: a two-case dataset (echo + Glean MCP search tool call) passes end-to-end with traceConfidence: high and traceSource: host-local-transcript on both cases, regardless of which monitor Claude is on once it can be activated.

{
  "pass": true,
  "driverSlug": "anthropic.claude.cowork.desktop-app.macos",
  "traceSource": "host-local-transcript",
  "traceConfidence": "high",
  "artifacts": 3
}

Notes

GUI-based host evals depend on macOS automation:

  • Cowork must be running and signed in.
  • The terminal running the eval needs Accessibility/Automation permission for Claude.app.
  • The driver retries foreground activation for up to 2 seconds; if focus prevention blocks the activation, it errors fast rather than silently timing out.
  • external_host runs depend more heavily on the host process than direct or mcp_host runs. In CI, run them on an isolated desktop session or VM where window state is predictable.

@steve-calvert-glean steve-calvert-glean force-pushed the external-host-driver-runtime branch 2 times, most recently from c36b421 to 752f3d6 Compare May 12, 2026 00:21
@steve-calvert-glean steve-calvert-glean added the enhancement New feature or request label May 12, 2026
@steve-calvert-glean steve-calvert-glean changed the title from "Add external host driver runtime" to "experimental: Add external host driver runtime" May 22, 2026
steve-calvert-glean and others added 6 commits May 28, 2026 09:35
Several gaps prevented the anthropic.claude.cowork.desktop-app.macos
driver from running reliably:

- runAppleScript hit the default 1MB stdout buffer when the Claude
  AX tree was fully loaded, surfacing as opaque "stdout maxBuffer
  length exceeded" errors. Bump to 64MB and add a 30s per-script
  timeout so individual osascript invocations cannot hang the eval.

- The reject-only coworkSurface capability required the user to be
  manually on the Cowork tab before a run, breaking CI use. Add
  activateCoworkSurface, which sends Cmd+2 to switch surfaces
  deterministically. Idempotent — no-op if already on Cowork.

- Add wakeAccessibility capability that clicks the front window to
  force Chromium to populate its accessibility tree. Without this,
  recursive AX walks return empty on Electron apps until something
  interacts with the window.

- Replace recursive findTextArea/findSubmitButton AppleScript
  helpers with coordinate-based clicks plus paste-and-Return. The
  composer role varies across Claude versions (AXTextArea,
  AXTextField, contenteditable); coordinate clicks work regardless.

The Cowork driver registry now composes:
  control: [platform.macos, activateCoworkSurface, wakeAccessibility]
  input:   accessibilitySubmit (coordinate-click + paste + Return)

Manually verified on macOS: a two-case dataset (echo + Glean search
tool call) passes end-to-end with traceConfidence=high and
traceSource=host-local-transcript on both cases. No Cowork
pre-flight required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PNG at docs/img/external-host-cowork-reporter.png was only
referenced from the PR body, not from any docs or code in the repo.
Inlined screenshots in PR descriptions are better hosted via
GitHub's drag-and-drop upload (user-images.githubusercontent.com)
than committed to the source tree, where they bloat git history
without serving documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous submit script computed a (centerX, bottom-90px) click target
to focus the composer before pasting. This was fragile to:

- Window position (broke when Claude moved to a secondary display with
  negative Y coordinates)
- Window size (the 90px offset assumed a specific layout)
- UI layout drift (banners, status bars, or panel changes shift the
  composer's visible position)

Replace coordinate clicks with pure-keyboard input:

  activate -> Cmd+2 (Cowork surface)
            -> Cmd+N (new conversation, autofocuses composer)
            -> Cmd+V (paste from clipboard)
            -> Return

Cowork's React app autofocuses the composer when a new conversation
view renders. Chromium's macOS accessibility bridge does not expose this
focus to AppleScript (AXFocusedUIElement returns missing value), but
keystrokes route to whatever has DOM focus regardless. Cmd+N forces a
known-focus state without needing to find the composer in the AX tree.

Side effects:

- The wakeAccessibility capability is no longer needed in the Cowork
  driver registry — Cmd+N implicitly wakes the AX tree by triggering UI
  re-render. The capability remains available for other drivers that
  need it explicitly.
- createNewConversation defaults to true on the Cowork driver. Each
  eval starts in a fresh conversation, isolating runs and ensuring
  the composer is autofocused.

Manually verified: a two-case dataset (echo + Glean search tool call)
passes with traceConfidence=high and traceSource=host-local-transcript
on both cases, regardless of which monitor Claude is on or what window
size it has.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`tell application X to activate` plus `set frontmost to true` is not
reliable on macOS in three common scenarios:

- Target app is on a different Space than the user's current Space.
- Target app is on a secondary monitor while the user's focus is on a
  different monitor.
- Another foreground app (browser, terminal) holds focus-prevention
  precedence over background activation requests.

When activation silently doesn't take, AppleScript's subsequent
`keystroke` calls route to whatever app actually holds focus. The
external_host runtime then waits up to its case-level timeout (default
90–120s) for a Cowork local-agent session that never appears, and
fails with `failureKind: no_matching_session`. The user-facing error
points at the trace step but the real cause is the activation step
upstream.

Add a short verification loop around foregrounding the target app:
poll `frontmost` for up to 2 seconds, retrying `set frontmost to true`
each tick. If the app refuses to come forward, error out immediately
with a message identifying the foreground problem rather than letting
the eval timeout misattribute the failure.

This makes the keyboard-only submission path (introduced in c43d4b9)
robust to multi-monitor and multi-Space setups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…foreground-verification test

Two follow-ups identified during PR review:

- The wakeAccessibility capability and its underlying wakeMacosAccessibility
  function were no longer wired into any driver after the keyboard-only
  refactor (Cmd+N replaces the AX-wake side effect that the coordinate click
  used to provide). The capability still used a coordinate-based click,
  contradicting the PR's stated keyboard-only design. Removed the function,
  the capability binding, and the test assertion. Can be reintroduced from
  git history if a future driver needs explicit AX wake.

- The foreground-verification retry loop in buildMacosDesktopSubmitScript
  (added previously) had no unit test. Added an explicit test asserting both
  the retry loop structure and the failure-path error message, since this
  is the safety net that prevents silent 90-second eval timeouts when
  another app holds focus precedence.

Also updates the docs/api-reference.md snippet line range to match the
EvalRunnerResult interface position after rebasing onto main (L121-L195).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@steve-calvert-glean steve-calvert-glean force-pushed the external-host-driver-runtime branch from 36d5428 to 94ed5fc Compare May 28, 2026 16:40
The api-reference.md inlined a copy of EvalRunnerResult that included
the function-preamble JSDoc comment, but the snippet range L121-L195
in the rebased evalRunner.ts starts at the `export interface` line
itself — the preamble lives at L118-L120. markdown-code's content
check correctly flagged the divergence in CI; running `npx
markdown-code sync` regenerates the inlined block to match the source
exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@steve-calvert-glean steve-calvert-glean marked this pull request as ready for review May 28, 2026 16:56
@steve-calvert-glean steve-calvert-glean requested a review from a team as a code owner May 28, 2026 16:56

@aditya-scio aditya-scio left a comment

LGTM

@aditya-scio aditya-scio left a comment

LGTM

@aditya-scio aditya-scio left a comment

This is a meaningful step up in eval realism
