Phase 4a: Python local runner — run_local_tests + runner_doctor + strict-mode gate#39
Conversation
- New types: RunnerLanguage, SandboxKind, RunInput, RunResult, RunnerCapabilities - New error codes: RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, LANGUAGE_RUNTIME_NOT_FOUND, RUNNER_TIMEOUT, SANDBOX_REQUIRED, LOCAL_TESTS_NOT_PASSED - Re-export hub now exposes runner.ts
Subprocess-based local runner with PATH probe-and-cache: - LocalRunner interface, SUPPORTED_LANGUAGES, IMPLEMENTED_LANGUAGES (Phase 4a ships python3; go and java land in 4b/4c) - SubprocessRunner: lazy probe, clean env (PATH/HOME/LANG only), temp cwd, 5s default wall-clock timeout with SIGTERM-then-SIGKILL escalation, 1 MB output ceiling per stream - Sandbox detection: bwrap > firejail > sandbox-exec > none, applied transparently. RunResult.warning surfaces when no sandbox is found. - Test helpers __resetProbeCacheForTest / __resetSandboxCacheForTest No Docker, no extra runtime deps.
- requireSession(slug) is now public so the runner-tools layer can reuse the same SESSION_NOT_FOUND envelope as the rest of the pedagogy gate. - recordLocalRun(slug, passed) bumps attempts, sets lastLocalRunPassed, and promotes status from 'started' to 'attempting' on first run.
…on gate - New tool 'run_local_tests' (titleSlug, language, code, timeoutMs?): requires an active session, delegates to LocalRunner, records attempts + lastLocalRunPassed on the session. - New tool 'runner_doctor' (no args): reports detected runtime versions and sandbox kind. Drives the README troubleshooting flow for LANGUAGE_RUNTIME_NOT_FOUND. - LEETCODE_MCP_STRICT_MODE=1 gates submit_solution: when set, refuses to submit unless the session's lastLocalRunPassed === true. Off by default; non-disruptive when no session exists for the slug. - index.ts wires SubprocessRunner into the server and threads the session service through the submission registry.
- tests/runner/subprocess-runner.test.ts (9): happy path, exit codes, stderr capture, timeout (SIGTERM+SIGKILL), language-not-implemented, clean env (no parent secrets leak) - tests/domain/session-service.test.ts (5): requireSession + recordLocalRun unit coverage - tests/integration/runner-tools-integration.test.ts (5): SESSION_NOT_FOUND pre-run, happy delegate, attempts bump on failure, error surface for RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, doctor capability snapshot - tests/integration/submission-tools-integration.test.ts (+4): strict-mode gate (blocks/permits/no-session-passthrough/default-off) - tests/e2e/runner.test.ts (7): doctor over the wire, no-session, pass, fail, timeout, unimplemented language, end-to-end strict-mode gate; skipped automatically on hosts without python3 - tests/e2e/lifecycle.test.ts: extend tool list to include run_local_tests and runner_doctor (22 -> 24) Totals: 178 -> 201 unit/integration; 9 -> 16 e2e.
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
SPerekrestova
left a comment
There was a problem hiding this comment.
Reviewed for reliability, security, simplicity, idiomacy, maintainability, and code style. One concrete bug worth flagging:
src/runner/sandbox.ts — macOS sandbox-exec detection contradicts its own comment (likely silently broken)
// sandbox-exec is /usr/bin/sandbox-exec on every macOS we care
// about. It accepts no `--version`; probe with `-help` (any
// exit code is fine — it always prints to stderr).
try {
await execFile("/usr/bin/sandbox-exec -help", { timeout: 1500 });
cached = { kind: "sandbox-exec", path: "/usr/bin/sandbox-exec" };
return cached;
} catch {
/* fall through to "none" */
}The comment explicitly states "any exit code is fine", but execFile here is promisify(exec) — it rejects on any non-zero exit. sandbox-exec -help is an undocumented flag and on at least some macOS versions exits non-zero (usage). When that happens, the catch swallows it and we fall through to kind: "none", even though sandbox-exec is installed and usable. The RunResult.warning will then mislead users into thinking no sandbox exists on their host.
Two ways to fix:
- Probe by file existence (
fs.access("/usr/bin/sandbox-exec", fs.constants.X_OK)) — matches the comment’s intent and avoids spawning a process whose exit semantics are unspecified. - Or actually run sandbox-exec end-to-end with a no-op profile (e.g.
sandbox-exec -p '(version 1)(allow default)' /usr/bin/true) so a non-zero exit genuinely means the tool is broken.
Minor (not blockers)
src/runner/subprocess-runner.ts:42andsrc/runner/sandbox.ts:26aliaspromisify(exec)asexecFile. The name is misleading — every probe call is going through/bin/sh -c. Inputs are static today so there’s no exploitable path, but the pattern is a foot-gun if anyone ever interpolates a dynamic value into one of these probes. Prefer the actualexecFile(no shell), orspawn, for binary probes.src/runner/subprocess-runner.ts:260-271: the per-chunk guardstdoutBytes < MAX_OUTPUT_BYTESlets the buffer grow up toMAX_OUTPUT_BYTES + chunk_sizebefore stopping. Cosmetic —clampOutputenforces the final cap — but worth tightening if a tighter bound matters.
Otherwise the layering (types → runner → application → wire), error-envelope reuse, strict-mode passthrough when no session exists, and test coverage all look solid.
Generated by Claude Code
…tighten output guard Three issues raised on PR #39: 1. sandbox-exec detection on macOS was effectively broken. The probe spawned 'sandbox-exec -help' and trusted any non-zero exit to mean 'not installed', but '-help' is undocumented and exits non-zero on some macOS versions. The catch swallowed it and we fell through to kind: 'none', misleading users that no sandbox was available. Now probe by file existence + executable bit (fs.access X_OK on /usr/bin/sandbox-exec), which is what the comment always said the intent was. 2. The probes in sandbox.ts and subprocess-runner.ts aliased promisify(exec) as 'execFile'. The name was misleading: exec routes through /bin/sh -c and is a shell-expansion foot-gun if anyone ever interpolates a dynamic value. Switched to the actual node execFile (no shell) for every probe — same call sites, just the no-shell variant. Static inputs only today; hardened for future. 3. The per-chunk capture in subprocess-runner allowed the buffer to grow up to MAX_OUTPUT_BYTES + chunk_size before stopping (the guard only checked 'is bytes already at the cap?'). Now slice the overflowing chunk to the exact remaining headroom and drop the rest, so the buffered total never exceeds MAX_OUTPUT_BYTES. clampOutput stays as a final cap. No behaviour change for the happy path. 201/201 unit/integration, 16/16 e2e.
|
@SPerekrestova good catches — addressed all three in dbabfbe. Pushed to the same branch; CI re-running. 1. macOS await access("/usr/bin/sandbox-exec", fsConstants.X_OK);
cached = { kind: "sandbox-exec", path: "/usr/bin/sandbox-exec" };
2. Misleading import { execFile as execFileCb } from "node:child_process";
const execFile = promisify(execFileCb);
// ...
await execFile(spec.probe.cmd, spec.probe.args, { timeout: 2000 });
await execFile("which", [spec.probe.cmd], { timeout: 1000 });Same probe semantics, no 3. Per-chunk buffer overshoot up to const captureChunk = (buffers, bytes, chunk) => {
const remaining = MAX_OUTPUT_BYTES - bytes;
if (remaining <= 0) return bytes;
if (chunk.length <= remaining) {
buffers.push(chunk);
return bytes + chunk.length;
}
buffers.push(chunk.subarray(0, remaining));
return bytes + remaining;
};No behaviour change on the happy path. 201/201 unit+integration and 16/16 e2e still pass locally. |
4a125de
into
devin/1778098342-phase-0-hygiene
Summary
Phase 4a of the redesign plan. Adds a local subprocess runner so the inner-loop "edit → run → fix" cycle no longer burns real LeetCode submissions. Stacked on top of #38; base auto-rebases as #38 reaches main.
No Docker. No new runtime deps. Subprocess + PATH auto-discovery, with cheap safety nets (timeout, clean env, temp cwd, output ceiling) and transparent OS-sandbox detection (
bwrap→firejail→sandbox-exec→ none).Type layer (
src/types/runner.ts, one commit):RunnerLanguage(python3 | go | java),SandboxKind,RunInput,RunResult,RunnerCapabilities.RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE,LANGUAGE_RUNTIME_NOT_FOUND,RUNNER_TIMEOUT,SANDBOX_REQUIRED,LOCAL_TESTS_NOT_PASSED.Runner layer (
src/runner/, one commit):LocalRunnerinterface,SUPPORTED_LANGUAGES(all three),IMPLEMENTED_LANGUAGES(Phase 4a shipspython3only — go and java land in 4b/4c).SubprocessRunner: lazy probe-and-cache for each language's runtime;spawnwith PATH/HOME/LANG forwarded only (parent env secrets never leak in); 5 s default wall-clock timeout with SIGTERM-then-SIGKILL escalation; 1 MB output ceiling per stream; tempmkdtempcwd that's removed regardless of outcome.sandbox.ts: detects best available OS sandbox at first use, wraps the subprocess transparently.RunResult.sandboxreports which kind was applied;RunResult.warningsurfaces when no sandbox is available so the agent can mention it.Application layer (
src/domain/session-service.ts, one commit):requireSession(slug)is now public so the runner-tools layer can reuse the sameSESSION_NOT_FOUNDenvelope as the rest of the pedagogy gate.recordLocalRun(slug, passed)bumpsattempts, setslastLocalRunPassed, and promotesstatusfromstarted→attemptingon first run. The strict-mode submit gate reads from this.Wire layer (
src/mcp/tools/, one commit):run_local_tests(titleSlug,language,code,timeoutMs?): requires an active session, delegates toLocalRunner, recordsattempts+lastLocalRunPassed. Pre-run rejections (no session, language not implemented) do not bump the counters.runner_doctor(no args): reports detected runtime versions and sandbox kind. Drives the troubleshooting flow forLANGUAGE_RUNTIME_NOT_FOUND.LEETCODE_MCP_STRICT_MODE=1gatessubmit_solution: when set, refuses to submit unless the active session'slastLocalRunPassed === true. Off by default (preserves current behavior). Non-disruptive when no session exists for the slug — strict mode only fires inside the tutoring flow.index.tsconstructs a singleSubprocessRunnerper server lifetime and threads the session service through the submission registry.Tests (one commit, additive):
tests/runner/subprocess-runner.test.ts(9): happy path, non-zero exit, captured stderr, timeout (kills within budget+grace),RUNNER_NOT_IMPLEMENTED_FOR_LANGUAGEforgo, clean-env (parent secret never reaches child), capabilities snapshot.tests/domain/session-service.test.ts(5):requireSession+recordLocalRununit coverage including persistence across service instances.tests/integration/runner-tools-integration.test.ts(5):SESSION_NOT_FOUNDpre-run, happy delegate,attemptsbump on failure, error surface forRUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE, doctor capability snapshot.tests/integration/submission-tools-integration.test.ts(+4): strict-mode gate (blocks/permits/no-session-passthrough/default-off).tests/e2e/runner.test.ts(7): doctor over the wire, no-session, pass, fail, timeout, unimplemented language, end-to-end strict-mode gate. Usesdescribe.skipIf(!python3)so the suite stays portable.tests/e2e/lifecycle.test.ts: extend tool list to includerun_local_testsandrunner_doctor(22 → 24).Tests: unit/integration 178 → 201 (+23); e2e 9 → 16 (+7).
npm run buildclean;npm run test:typesclean;npm run formatclean; fullnpm run test:allexits 0 locally.Why subprocess and not Docker:
python solution.pythemselves; a sandbox doesn't add safety they don't already have. Docker would impose a multi-GB install, daemon dependency, and Mac/Windows/WSL2 friction for zero meaningful gain.bwrap/firejail/sandbox-exec) are detected and applied transparently when available.RunResultsurfaces which one was used; users who refuse to run unsandboxed will getLEETCODE_MCP_REQUIRE_SANDBOX=1enforcement in Phase 4d. (The error code is reserved here but not yet enforced — keeps this PR scoped.)What didn't change:
submit_solutionstill works for all LeetCode languages (the runner is purely additive; onlyrun_local_testsis gated byIMPLEMENTED_LANGUAGES).Review & Testing Checklist for Human
src/runner/subprocess-runner.ts— the timeout + kill escalation, output ceiling, and clean-env wiring are the only places "user code does something we don't want" can hurt the host.src/runner/sandbox.tsprofiles, especially the macOSsandbox-execprofile — it's terse and deny-default, but I'd appreciate a second pair of eyes confirming we're not denying something legitimately needed forpython3.LEETCODE_MCP_STRICT_MODE=1end-to-end: openstart_problem, attemptsubmit_solution(should reject withLOCAL_TESTS_NOT_PASSED),run_local_testsuntil passed, retrysubmit_solution(should now go through).runner_doctorreports what you expect on your machine — and that an unavailable language degrades toavailable: falserather than throwing.Notes
RunnerLanguageandSUPPORTED_LANGUAGESso types stay honest, but theirbuildArgsthrowsRUNNER_NOT_IMPLEMENTED_FOR_LANGUAGE. Phase 4b will hook in Go (go run+ cache); Phase 4c will hook in Java (javac+ classfile cache).SANDBOX_REQUIREDerror code exists but isn't enforced yet — Phase 4d will gate runs onLEETCODE_MCP_REQUIRE_SANDBOX=1.run_local_testsreads the user's actual file whencodeis omitted) lands in Phase 5.Link to Devin session: https://app.devin.ai/sessions/d003a60939484686b2953ae32fe2794d
Requested by: @SPerekrestova