experimental: Add external host driver runtime#196
Open
steve-calvert-glean wants to merge 7 commits into
Open
Conversation
c36b421 to
752f3d6
Compare
Several gaps prevented the anthropic.claude.cowork.desktop-app.macos driver from running reliably: - runAppleScript hit the default 1MB stdout buffer when the Claude AX tree was fully loaded, surfacing as opaque "stdout maxBuffer length exceeded" errors. Bump to 64MB and add a 30s per-script timeout so individual osascript invocations cannot hang the eval. - The reject-only coworkSurface capability required the user to be manually on the Cowork tab before a run, breaking CI use. Add activateCoworkSurface, which sends Cmd+2 to switch surfaces deterministically. Idempotent — no-op if already on Cowork. - Add wakeAccessibility capability that clicks the front window to force Chromium to populate its accessibility tree. Without this, recursive AX walks return empty on Electron apps until something interacts with the window. - Replace recursive findTextArea/findSubmitButton AppleScript helpers with coordinate-based clicks plus paste-and-Return. The composer role varies across Claude versions (AXTextArea, AXTextField, contenteditable); coordinate clicks work regardless. The Cowork driver registry now composes: control: [platform.macos, activateCoworkSurface, wakeAccessibility] input: accessibilitySubmit (coordinate-click + paste + Return) Manually verified on macOS: a two-case dataset (echo + Glean search tool call) passes end-to-end with traceConfidence=high and traceSource=host-local-transcript on both cases. No Cowork pre-flight required. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PNG at docs/img/external-host-cowork-reporter.png was only referenced from the PR body, not from any docs or code in the repo. Inlined screenshots in PR descriptions are better hosted via GitHub's drag-and-drop upload (user-images.githubusercontent.com) than committed to the source tree, where they bloat git history without serving documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous submit script computed a (centerX, bottom-90px) click target
to focus the composer before pasting. This was fragile to:
- Window position (broke when Claude moved to a secondary display with
negative Y coordinates)
- Window size (the 90px offset assumed a specific layout)
- UI layout drift (banners, status bars, or panel changes shift the
composer's visible position)
Replace coordinate clicks with pure-keyboard input:
activate -> Cmd+2 (Cowork surface)
-> Cmd+N (new conversation, autofocuses composer)
-> Cmd+V (paste from clipboard)
-> Return
Cowork's React app autofocuses the composer when a new conversation
view renders. Chromium's macOS accessibility bridge does not expose this
focus to AppleScript (AXFocusedUIElement returns missing value), but
keystrokes route to whatever has DOM focus regardless. Cmd+N forces a
known-focus state without needing to find the composer in the AX tree.
Side effects:
- The wakeAccessibility capability is no longer needed in the Cowork
driver registry — Cmd+N implicitly wakes the AX tree by triggering UI
re-render. The capability remains available for other drivers that
need it explicitly.
- createNewConversation defaults to true on the Cowork driver. Each
eval starts in a fresh conversation, isolating runs and ensuring
the composer is autofocused.
Manually verified: a two-case dataset (echo + Glean search tool call)
passes with traceConfidence=high and traceSource=host-local-transcript
on both cases, regardless of which monitor Claude is on or what window
size it has.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`tell application X to activate` plus `set frontmost to true` is not reliable on macOS in three common scenarios: - Target app is on a different Space than the user's current Space. - Target app is on a secondary monitor while the user's focus is on a different monitor. - Another foreground app (browser, terminal) holds focus-prevention precedence over background activation requests. When activation silently doesn't take, AppleScript's subsequent `keystroke` calls route to whatever app actually holds focus. The external_host runtime then waits up to its case-level timeout (default 90–120s) for a Cowork local-agent session that never appears, and fails with `failureKind: no_matching_session`. The user-facing error points at the trace step but the real cause is the activation step upstream. Add a short verification loop around foregrounding the target app: poll `frontmost` for up to 2 seconds, retrying `set frontmost to true` each tick. If the app refuses to come forward, error out immediately with a message identifying the foreground problem rather than letting the eval timeout misattribute the failure. This makes the keyboard-only submission path (introduced in c43d4b9) robust to multi-monitor and multi-Space setups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…foreground-verification test Two follow-ups identified during PR review: - The wakeAccessibility capability and its underlying wakeMacosAccessibility function were no longer wired into any driver after the keyboard-only refactor (Cmd+N replaces the AX-wake side effect that the coordinate click used to provide). The capability still used a coordinate-based click, contradicting the PR's stated keyboard-only design. Removed the function, the capability binding, and the test assertion. Can be reintroduced from git history if a future driver needs explicit AX wake. - The foreground-verification retry loop in buildMacosDesktopSubmitScript (added previously) had no unit test. Added an explicit test asserting both the retry loop structure and the failure-path error message, since this is the safety net that prevents silent 90-second eval timeouts when another app holds focus precedence. Also updates the docs/api-reference.md snippet line range to match the EvalRunnerResult interface position after rebasing onto main (L121-L195). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
36d5428 to
94ed5fc
Compare
The api-reference.md inlined a copy of EvalRunnerResult that included the function-preamble JSDoc comment, but the snippet range L121-L195 in the rebased evalRunner.ts starts at the `export interface` line itself — the preamble lives at L118-L120. markdown-code's content check correctly flagged the divergence in CI; running `npx markdown-code sync` regenerates the inlined block to match the source exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aditya-scio
reviewed
Jun 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds an experimental
external_hosteval mode that runs the existing dataset, expectation, iteration, and reporter flow through real MCP host applications instead of only the SDK-basedmcp_hostsimulation path.The implementation is capability-composed rather than host-runner-specific:
anthropic.claude.cowork.desktop-app.macos.control,input,completion,trace, andnormalizecapabilities.builtin:desktop.macos.accessibilitySubmit, plus deterministic Cowork surface activation asbuiltin:anthropic.claude.activateCoworkSurface.audit.jsonl, and embedded.claude/projects/**/*.jsonltranscripts.Why
The current eval path only approximates an MCP host. For real confidence, we need to run eval cases through actual hosts such as Claude Cowork, Claude Desktop Chat, CLI tools, browser hosts, and future desktop apps, while preserving the same dataset and expectation model.
This PR establishes the foundation for that by making the external-host runner declarative and capability-driven. New hosts should be addable by composing or registering capabilities instead of adding one-off imperative runners.
How It Works
An eval case can now specify:
{ "mode": "external_host", "scenario": "Please reply with exactly: external host integration acknowledged.", "externalHost": { "driver": "anthropic.claude.cowork.desktop-app.macos", "timeoutMs": 60000 }, "expect": { "containsText": "external host integration acknowledged" } }The runtime generates a unique run marker, submits the marked scenario to the configured host, waits for host evidence, normalizes the response into the existing expectation-compatible result shape, and attaches external-host metadata for the reporter.
Claude Cowork is configured as a composition of capabilities:
Custom/project-local host capabilities can also be loaded dynamically:
{ "capabilities": { "control": { "uses": "module:file:///path/to/fake-external-host.mjs#fakeExternalHost", "provides": ["input", "completion", "trace", "normalize"] } } }Cowork driver design
The Cowork driver is intentionally keyboard-only — no coordinate-based clicks, no recursive accessibility tree walks. The submit sequence is:
Claude.appand verify it actually came to the foreground.tell-to-activateis unreliable on multi-monitor / multi-Space setups when another app holds focus precedence; the driver retriesset frontmost to truefor up to 2 seconds and errors fast with a clear message if the OS refuses, rather than letting downstream keystrokes route to the wrong app and surfacing as a 90-second eval timeout.Cmd+2to switch to the Cowork surface (idempotent — no-op if already on Cowork).Cmd+Nto open a new conversation. Cowork's React app autofocuses the composer when a new conversation view renders, even though Chromium's macOS accessibility bridge does not exposeAXFocusedUIElementfor that focus.Cmd+Vto paste the marked scenario from the clipboard. Keystrokes route to whatever has DOM focus inside the active window.Returnto submit.~/Library/Application Support/Claude/local-agent-mode-sessions/for a fresh session containing the run marker, then parse itsaudit.jsonland embedded.claude/projects/**/*.jsonltranscript for trace evidence.This avoids two known fragility classes: coordinate-based composer targeting (which breaks across window position, monitor placement, and layout drift) and recursive AX walks against Electron's Chromium accessibility bridge (which on a fully-loaded Cowork window can take 30+ seconds per query).
Validation
Validated locally with:
npm run buildnpm run typechecknpm run format:checknpm run lintnpm testnpm test -- tests/mcp-tests.spec.tsManual end-to-end run against real Claude Cowork on macOS: a two-case dataset (echo + Glean MCP search tool call) passes end-to-end with
traceConfidence: highandtraceSource: host-local-transcripton both cases, regardless of which monitor Claude is on once it can be activated.{ "pass": true, "driverSlug": "anthropic.claude.cowork.desktop-app.macos", "traceSource": "host-local-transcript", "traceConfidence": "high", "artifacts": 3 }Notes
GUI-based host evals depend on macOS automation:
Claude.app.external_hostruns are inherently more dependent on the host process thandirectormcp_hostruns. CI use should run on an isolated desktop session or VM where window state is predictable.