Force-terminate worker subprocesses on shutdown to prevent orphan accumulation — Closes #260#276
Merged
conradbzura merged 4 commits intoJul 2, 2026
Conversation
Worker subprocesses are non-daemon spawn processes whose termination depended entirely on the graceful stop RPC, so any teardown that never completed it left an orphan that survived the parent and accumulated across runs, eventually exhausting process-table and port resources. Reap workers on every stop: LocalWorker stop now always joins the subprocess after the graceful attempt, escalating to SIGTERM and then SIGKILL when it lingers. The stop RPC also carries a deadline so an unresponsive worker can no longer hang stop() and dodge the fallback. Inside the worker, a parent-death watchdog thread ties the process to its parent: when the parent dies, including by SIGKILL, the worker initiates the same graceful shutdown as SIGTERM and hard-exits if the grace window elapses. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg
Releasing a resource with a positive TTL parked a task on a TTL sleep; loops that closed before the TTL elapsed destroyed it pending, and a task that never started emitted a coroutine-never-awaited RuntimeWarning in the warnings summary of sub-second runs. Arm a plain call_later timer instead and only spawn the cleanup task once the TTL actually fires: an unfired TimerHandle is discarded silently at loop close. Cleanup cancellation on re-acquire and clear now cancels the timer or the in-flight task, and cleanup no longer relies on Task internals when it runs inside its own finalize task. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg
Unit tests pin the reap escalation ladder and its timeout defaulting, the watchdog's daemon flag, stop dispatch, and hard-exit guarantees, and the stop paths that must always reap: success, RPC failure, dead process, and cancellation mid RPC. Two stale stop tests subsumed by the new reap assertions are removed. Integration tests prove the contracts on real subprocesses: a stopped worker is fully reaped before stop returns, pool exit leaves no live workers and tolerates a crashed one, an unresponsive worker is killed within the RPC deadline, and a worker whose parent is SIGKILLed exits on its own. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg
Pin the no-pending-work-at-loop-close regression, the in-flight cleanup cancellation races on the pool lock for both re-acquire and clear, and the bookkeeping invariants across arbitrary acquire and release sequences via Hypothesis. Rewrite the TTL and cross-loop tests for the timer design, dropping the stale asyncio.sleep scaffolding. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Worker subprocesses could outlive the process that spawned them: a parent that crashed, was SIGKILLed, or never completed the graceful stop RPC left orphaned gRPC workers reparented to init and accumulating across runs. Close every escape path with layered defenses on both sides of the process boundary: the parent always force-terminates workers through a bounded reap escalation, the graceful stop RPC carries a deadline so a wedged worker cannot dodge that reap, and the worker itself watches for parent death and hard-exits as a backstop when no parent-side path runs at all. Rework ResourcePool's TTL deferral from a parked sleep-task to a plain timer so the worker's own teardown no longer leaks pending cleanup tasks or emits "coroutine never awaited" warnings. Closes #260.
Proposed changes
Bounded reap escalation in
WorkerProcessAdd
WorkerProcess.reap(timeout=None): join with the caller's bound (defaulting to the worker's shutdown grace period), escalate to SIGTERM with a bounded follow-up join (_REAP_GRACE, 5s), then SIGKILL.LocalWorker._stopreaps unconditionally in afinally, running the blocking escalation in an executor thread so it completes even when the stopping coroutine is cancelled by a pool teardown deadline. Reuse the same escalation onstart()'s failure path viareap(timeout=0), replacing a weaker inline terminate/unbounded-join.Deadline on the graceful stop RPC
Bound
stub.stopwithtimeout + _STOP_RPC_MARGIN(5s over the worker's drain bound, which the server may fully consume before responding);timeout=Nonepreserves the caller's explicit unbounded-graceful contract. A wedged worker now surfacesDEADLINE_EXCEEDEDinstead of hangingstop()forever, and the finally-reap escalates to force-termination.Parent-death watchdog in the worker
Install
_parent_watchdogin_serve: a daemon thread blocks onmultiprocessing.parent_process().join()— which fires even for SIGKILL — then schedules the same graceful stop as SIGTERM and hard-exits viaos._exit(1)once the grace window elapses, surviving an event loop that closes mid-dispatch. Extract the shared_schedule_stop(loop, service, timeout)dispatcher used by the watchdog and both signal handlers, documenting the divergent semantics (SIGTERM and the watchdog cancel in-flight tasks immediately; SIGINT drains indefinitely).ResourcePool TTL deferral via timer
Replace the parked sleep-task with a
loop.call_latertimer whose_expirecallback spawns the cleanup task only when the TTL actually fires: an unfiredTimerHandleis discarded silently at loop close, eliminating the "Task was destroyed but it is pending!" teardown noise and the "coroutine never awaited" RuntimeWarning. Extract_cancel_timerand_cancel_cleanupfor the same-loop (cancel and await) and cross-loop (best-effortcall_soon_threadsafe) cancellation paths, guard the expiry path against cancelling its own_finalizetask, and correct the documented contracts (releaseis a silent no-op for missing keys and raisesValueErroron over-release;clearraisesKeyError).Documentation homes
Home the reap/cancellation contract — including the executor-thread cancellation survival and its second-cancel caveat — in
LocalWorker._stop's docstring, withWorkerPool.shutdown_timeoutdocumenting only the pool-observable consequence (teardown may overrun the deadline by the reap escalation) and pointing back.Test cases
TestWorkerOrphanPreventionstop()is calledTestWorkerOrphanPreventionTestWorkerOrphanPreventionTestWorkerOrphanPreventionTestWorkerOrphanPreventionreap()is called with a short timeoutTestWorkerOrphanPreventionstop()is awaited with a short timeouttest_process.py_sigterm_handlerfires and its scheduled callback executesservice.stopis awaited with a zero-timeout StopRequest and no contexttest_process.py_sigint_handlerfires and its scheduled callback executesservice.stopis awaited with a negative-timeout StopRequest — an unbounded draintest_process.py_parent_watchdogis calledtest_process.py_parent_watchdogis calledtest_process.pytest_process.pyservice.stopis awaited with a zero-timeout StopRequest and no contexttest_process.pyTestWorkerProcessrun()is calledTestWorkerProcessreap()is calledTestWorkerProcessreap()is calledTestWorkerProcessreap()is calledTestWorkerProcessreap(timeout=0)is calledis not NonepinTestWorkerProcessreap()escalatesTestWorkerProcessreap()escalates furtherTestWorkerProcessstart()times outtimeout=0TestLocalWorkerstop(timeout=12.5)is calledTestLocalWorkerstop()is called with itdeadline == timeout + _STOP_RPC_MARGIN, strictly above the timeout, and reap gets the same boundTestLocalWorkerstop()is calledTestLocalWorkerstop()is calledTestLocalWorkerstop()is calledTestLocalWorkerstop()task is cancelled mid-RPCTestResourcePoolTestResourcePoolTestResourcePoolTestResourcePoolTestResourcePoolTestResourcePoolhttps://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg