Skip to content

Phase 4a: Python local runner — run_local_tests + runner_doctor + strict-mode gate#39

Merged
SPerekrestova merged 6 commits into
devin/1778098342-phase-0-hygienefrom
devin/1778179706-phase-4a-python-runner
Jun 1, 2026
Merged

Phase 4a: Python local runner — run_local_tests + runner_doctor + strict-mode gate#39
SPerekrestova merged 6 commits into
devin/1778098342-phase-0-hygienefrom
devin/1778179706-phase-4a-python-runner

Conversation

@devin-ai-integration
Copy link
Copy Markdown

Summary

Phase 4a of the redesign plan. Adds a local subprocess runner so the inner-loop "edit → run → fix" cycle no longer burns real LeetCode submissions. Stacked on top of #38; base auto-rebases as #38 reaches main.

No Docker. No new runtime deps. Subprocess + PATH auto-discovery, with cheap safety nets (timeout, clean env, temp cwd, output ceiling) and transparent OS-sandbox detection (bwrapfirejailsandbox-exec → none).

Type layer (src/types/runner.ts, one commit):

  • RunnerLanguage (python3 | go | java), SandboxKind, RunInput, RunResult, RunnerCapabilities.
  • New error codes: RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, LANGUAGE_RUNTIME_NOT_FOUND, RUNNER_TIMEOUT, SANDBOX_REQUIRED, LOCAL_TESTS_NOT_PASSED.

Runner layer (src/runner/, one commit):

  • LocalRunner interface, SUPPORTED_LANGUAGES (all three), IMPLEMENTED_LANGUAGES (Phase 4a ships python3 only — go and java land in 4b/4c).
  • SubprocessRunner: lazy probe-and-cache for each language's runtime; spawn with PATH/HOME/LANG forwarded only (parent env secrets never leak in); 5 s default wall-clock timeout with SIGTERM-then-SIGKILL escalation; 1 MB output ceiling per stream; temp mkdtemp cwd that's removed regardless of outcome.
  • sandbox.ts: detects best available OS sandbox at first use, wraps the subprocess transparently. RunResult.sandbox reports which kind was applied; RunResult.warning surfaces when no sandbox is available so the agent can mention it.

Application layer (src/domain/session-service.ts, one commit):

  • requireSession(slug) is now public so the runner-tools layer can reuse the same SESSION_NOT_FOUND envelope as the rest of the pedagogy gate.
  • recordLocalRun(slug, passed) bumps attempts, sets lastLocalRunPassed, and promotes status from startedattempting on first run. The strict-mode submit gate reads from this.

Wire layer (src/mcp/tools/, one commit):

  • New tool run_local_tests (titleSlug, language, code, timeoutMs?): requires an active session, delegates to LocalRunner, records attempts + lastLocalRunPassed. Pre-run rejections (no session, language not implemented) do not bump the counters.
  • New tool runner_doctor (no args): reports detected runtime versions and sandbox kind. Drives the troubleshooting flow for LANGUAGE_RUNTIME_NOT_FOUND.
  • LEETCODE_MCP_STRICT_MODE=1 gates submit_solution: when set, refuses to submit unless the active session's lastLocalRunPassed === true. Off by default (preserves current behavior). Non-disruptive when no session exists for the slug — strict mode only fires inside the tutoring flow.
  • index.ts constructs a single SubprocessRunner per server lifetime and threads the session service through the submission registry.

Tests (one commit, additive):

  • tests/runner/subprocess-runner.test.ts (9): happy path, non-zero exit, captured stderr, timeout (kills within budget+grace), RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE for go, clean-env (parent secret never reaches child), capabilities snapshot.
  • tests/domain/session-service.test.ts (5): requireSession + recordLocalRun unit coverage including persistence across service instances.
  • tests/integration/runner-tools-integration.test.ts (5): SESSION_NOT_FOUND pre-run, happy delegate, attempts bump on failure, error surface for RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, doctor capability snapshot.
  • tests/integration/submission-tools-integration.test.ts (+4): strict-mode gate (blocks/permits/no-session-passthrough/default-off).
  • tests/e2e/runner.test.ts (7): doctor over the wire, no-session, pass, fail, timeout, unimplemented language, end-to-end strict-mode gate. Uses describe.skipIf(!python3) so the suite stays portable.
  • tests/e2e/lifecycle.test.ts: extend tool list to include run_local_tests and runner_doctor (22 → 24).

Tests: unit/integration 178 → 201 (+23); e2e 9 → 16 (+7). npm run build clean; npm run test:types clean; npm run format clean; full npm run test:all exits 0 locally.

Why subprocess and not Docker:

  • The threat model is single-user-running-their-own-code. They could trivially python solution.py themselves; a sandbox doesn't add safety they don't already have. Docker would impose a multi-GB install, daemon dependency, and Mac/Windows/WSL2 friction for zero meaningful gain.
  • OS-native sandboxes (bwrap / firejail / sandbox-exec) are detected and applied transparently when available. RunResult surfaces which one was used; users who refuse to run unsandboxed will get LEETCODE_MCP_REQUIRE_SANDBOX=1 enforcement in Phase 4d. (The error code is reserved here but not yet enforced — keeps this PR scoped.)

What didn't change:

  • Existing tools keep their current signatures.
  • submit_solution still works for all LeetCode languages (the runner is purely additive; only run_local_tests is gated by IMPLEMENTED_LANGUAGES).
  • Strict mode is off by default and never blocks ad-hoc submissions outside the tutoring flow.

Review & Testing Checklist for Human

  • Spot-check src/runner/subprocess-runner.ts — the timeout + kill escalation, output ceiling, and clean-env wiring are the only places "user code does something we don't want" can hurt the host.
  • Sanity-check src/runner/sandbox.ts profiles, especially the macOS sandbox-exec profile — it's terse and deny-default, but I'd appreciate a second pair of eyes confirming we're not denying something legitimately needed for python3.
  • Try LEETCODE_MCP_STRICT_MODE=1 end-to-end: open start_problem, attempt submit_solution (should reject with LOCAL_TESTS_NOT_PASSED), run_local_tests until passed, retry submit_solution (should now go through).
  • Confirm runner_doctor reports what you expect on your machine — and that an unavailable language degrades to available: false rather than throwing.

Notes

  • Go and Java are present in RunnerLanguage and SUPPORTED_LANGUAGES so types stay honest, but their buildArgs throws RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE. Phase 4b will hook in Go (go run + cache); Phase 4c will hook in Java (javac + classfile cache).
  • SANDBOX_REQUIRED error code exists but isn't enforced yet — Phase 4d will gate runs on LEETCODE_MCP_REQUIRE_SANDBOX=1.
  • Workspace integration (run_local_tests reads the user's actual file when code is omitted) lands in Phase 5.

Link to Devin session: https://app.devin.ai/sessions/d003a60939484686b2953ae32fe2794d
Requested by: @SPerekrestova

- New types: RunnerLanguage, SandboxKind, RunInput, RunResult, RunnerCapabilities
- New error codes: RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, LANGUAGE_RUNTIME_NOT_FOUND,
  RUNNER_TIMEOUT, SANDBOX_REQUIRED, LOCAL_TESTS_NOT_PASSED
- Re-export hub now exposes runner.ts
Subprocess-based local runner with PATH probe-and-cache:
- LocalRunner interface, SUPPORTED_LANGUAGES, IMPLEMENTED_LANGUAGES (Phase 4a ships
  python3; go and java land in 4b/4c)
- SubprocessRunner: lazy probe, clean env (PATH/HOME/LANG only), temp cwd,
  5s default wall-clock timeout with SIGTERM-then-SIGKILL escalation,
  1 MB output ceiling per stream
- Sandbox detection: bwrap > firejail > sandbox-exec > none, applied
  transparently. RunResult.warning surfaces when no sandbox is found.
- Test helpers __resetProbeCacheForTest / __resetSandboxCacheForTest

No Docker, no extra runtime deps.
- requireSession(slug) is now public so the runner-tools layer can
  reuse the same SESSION_NOT_FOUND envelope as the rest of the
  pedagogy gate.
- recordLocalRun(slug, passed) bumps attempts, sets lastLocalRunPassed,
  and promotes status from 'started' to 'attempting' on first run.
…on gate

- New tool 'run_local_tests' (titleSlug, language, code, timeoutMs?):
  requires an active session, delegates to LocalRunner, records
  attempts + lastLocalRunPassed on the session.
- New tool 'runner_doctor' (no args): reports detected runtime
  versions and sandbox kind. Drives the README troubleshooting flow
  for LANGUAGE_RUNTIME_NOT_FOUND.
- LEETCODE_MCP_STRICT_MODE=1 gates submit_solution: when set, refuses
  to submit unless the session's lastLocalRunPassed === true. Off by
  default; non-disruptive when no session exists for the slug.
- index.ts wires SubprocessRunner into the server and threads the
  session service through the submission registry.
- tests/runner/subprocess-runner.test.ts (9): happy path, exit codes,
  stderr capture, timeout (SIGTERM+SIGKILL), language-not-implemented,
  clean env (no parent secrets leak)
- tests/domain/session-service.test.ts (5): requireSession +
  recordLocalRun unit coverage
- tests/integration/runner-tools-integration.test.ts (5): SESSION_NOT_FOUND
  pre-run, happy delegate, attempts bump on failure, error surface for
  RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, doctor capability snapshot
- tests/integration/submission-tools-integration.test.ts (+4):
  strict-mode gate (blocks/permits/no-session-passthrough/default-off)
- tests/e2e/runner.test.ts (7): doctor over the wire, no-session,
  pass, fail, timeout, unimplemented language, end-to-end strict-mode
  gate; skipped automatically on hosts without python3
- tests/e2e/lifecycle.test.ts: extend tool list to include run_local_tests
  and runner_doctor (22 -> 24)

Totals: 178 -> 201 unit/integration; 9 -> 16 e2e.
@devin-ai-integration
Copy link
Copy Markdown
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Copy Markdown
Owner

@SPerekrestova SPerekrestova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed for reliability, security, simplicity, idiomacy, maintainability, and code style. One concrete bug worth flagging:

src/runner/sandbox.ts — macOS sandbox-exec detection contradicts its own comment (likely silently broken)

// sandbox-exec is /usr/bin/sandbox-exec on every macOS we care
// about. It accepts no `--version`; probe with `-help` (any
// exit code is fine — it always prints to stderr).
try {
    await execFile("/usr/bin/sandbox-exec -help", { timeout: 1500 });
    cached = { kind: "sandbox-exec", path: "/usr/bin/sandbox-exec" };
    return cached;
} catch {
    /* fall through to "none" */
}

The comment explicitly states "any exit code is fine", but execFile here is promisify(exec) — it rejects on any non-zero exit. sandbox-exec -help is an undocumented flag and on at least some macOS versions exits non-zero (usage). When that happens, the catch swallows it and we fall through to kind: "none", even though sandbox-exec is installed and usable. The RunResult.warning will then mislead users into thinking no sandbox exists on their host.

Two ways to fix:

  • Probe by file existence (fs.access("/usr/bin/sandbox-exec", fs.constants.X_OK)) — matches the comment’s intent and avoids spawning a process whose exit semantics are unspecified.
  • Or actually run sandbox-exec end-to-end with a no-op profile (e.g. sandbox-exec -p '(version 1)(allow default)' /usr/bin/true) so a non-zero exit genuinely means the tool is broken.

Minor (not blockers)

  • src/runner/subprocess-runner.ts:42 and src/runner/sandbox.ts:26 alias promisify(exec) as execFile. The name is misleading — every probe call is going through /bin/sh -c. Inputs are static today so there’s no exploitable path, but the pattern is a foot-gun if anyone ever interpolates a dynamic value into one of these probes. Prefer the actual execFile (no shell), or spawn, for binary probes.
  • src/runner/subprocess-runner.ts:260-271: the per-chunk guard stdoutBytes < MAX_OUTPUT_BYTES lets the buffer grow up to MAX_OUTPUT_BYTES + chunk_size before stopping. Cosmetic — clampOutput enforces the final cap — but worth tightening if a tighter bound matters.

Otherwise the layering (types → runner → application → wire), error-envelope reuse, strict-mode passthrough when no session exists, and test coverage all look solid.


Generated by Claude Code

…tighten output guard

Three issues raised on PR #39:

1. sandbox-exec detection on macOS was effectively broken. The probe
   spawned 'sandbox-exec -help' and trusted any non-zero exit to mean
   'not installed', but '-help' is undocumented and exits non-zero on
   some macOS versions. The catch swallowed it and we fell through to
   kind: 'none', misleading users that no sandbox was available.
   Now probe by file existence + executable bit (fs.access X_OK on
   /usr/bin/sandbox-exec), which is what the comment always said the
   intent was.

2. The probes in sandbox.ts and subprocess-runner.ts aliased
   promisify(exec) as 'execFile'. The name was misleading: exec routes
   through /bin/sh -c and is a shell-expansion foot-gun if anyone ever
   interpolates a dynamic value. Switched to the actual node
   execFile (no shell) for every probe — same call sites, just the
   no-shell variant. Static inputs only today; hardened for future.

3. The per-chunk capture in subprocess-runner allowed the buffer to
   grow up to MAX_OUTPUT_BYTES + chunk_size before stopping (the guard
   only checked 'is bytes already at the cap?'). Now slice the
   overflowing chunk to the exact remaining headroom and drop the
   rest, so the buffered total never exceeds MAX_OUTPUT_BYTES.
   clampOutput stays as a final cap.

No behaviour change for the happy path. 201/201 unit/integration,
16/16 e2e.
@devin-ai-integration
Copy link
Copy Markdown
Author

@SPerekrestova good catches — addressed all three in dbabfbe. Pushed to the same branch; CI re-running.

1. macOS sandbox-exec detection bug — fixed. You were right that promisify(exec) rejects on any non-zero exit, so the catch was swallowing real installs and silently falling through to kind: "none". The comment's stated intent ("any exit code is fine") was unimplementable that way. Switched to the alternative you suggested — probe by file existence + executable bit:

await access("/usr/bin/sandbox-exec", fsConstants.X_OK);
cached = { kind: "sandbox-exec", path: "/usr/bin/sandbox-exec" };

sandbox-exec ships at that path on every macOS version we support and the X_OK check is enough to know we can spawn it. The end-to-end no-op-profile alternative would also work but adds a real subprocess at startup just to learn what fs.access already tells us.

2. Misleading execFile alias for promisify(exec) — fixed. You're right, this was a foot-gun even if the inputs are static today. Replaced with the actual node:child_process.execFile (no shell) in both sandbox.ts and subprocess-runner.ts:

import { execFile as execFileCb } from "node:child_process";
const execFile = promisify(execFileCb);
// ...
await execFile(spec.probe.cmd, spec.probe.args, { timeout: 2000 });
await execFile("which", [spec.probe.cmd], { timeout: 1000 });

Same probe semantics, no /bin/sh -c involvement, future maintainers can't accidentally interpolate a dynamic value into a shell.

3. Per-chunk buffer overshoot up to MAX_OUTPUT_BYTES + chunk_size — fixed. The capture path now slices the overflowing chunk to the exact remaining headroom and drops the rest, so the buffered total never exceeds MAX_OUTPUT_BYTES. clampOutput stays at finalize as a final-cap belt-and-braces:

const captureChunk = (buffers, bytes, chunk) => {
    const remaining = MAX_OUTPUT_BYTES - bytes;
    if (remaining <= 0) return bytes;
    if (chunk.length <= remaining) {
        buffers.push(chunk);
        return bytes + chunk.length;
    }
    buffers.push(chunk.subarray(0, remaining));
    return bytes + remaining;
};

No behaviour change on the happy path. 201/201 unit+integration and 16/16 e2e still pass locally.

@SPerekrestova SPerekrestova merged commit 4a125de into devin/1778098342-phase-0-hygiene Jun 1, 2026
1 check passed
@SPerekrestova SPerekrestova deleted the devin/1778179706-phase-4a-python-runner branch June 1, 2026 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant