Force-terminate worker subprocesses on shutdown to prevent orphan accumulation (Closes #260) #276

Merged: conradbzura merged 4 commits into wool-labs:master from conradbzura:260-force-terminate-workers-on-shutdown on Jul 2, 2026
@conradbzura

Summary

Worker subprocesses could outlive the process that spawned them: a parent that crashed, was SIGKILLed, or never completed the graceful stop RPC left orphaned gRPC workers reparented to init and accumulating across runs. Close every escape path with layered defenses on both sides of the process boundary: the parent always force-terminates workers through a bounded reap escalation, the graceful stop RPC carries a deadline so a wedged worker cannot dodge that reap, and the worker itself watches for parent death and hard-exits as a backstop when no parent-side path runs at all. Rework ResourcePool's TTL deferral from a parked sleep-task to a plain timer so the worker's own teardown no longer leaks pending cleanup tasks or emits "coroutine never awaited" warnings. Closes #260.

Proposed changes

Bounded reap escalation in WorkerProcess

Add WorkerProcess.reap(timeout=None): join with the caller's bound (defaulting to the worker's shutdown grace period), escalate to SIGTERM with a bounded follow-up join (_REAP_GRACE, 5s), then SIGKILL. LocalWorker._stop reaps unconditionally in a finally, running the blocking escalation in an executor thread so it completes even when the stopping coroutine is cancelled by a pool teardown deadline. Reuse the same escalation on start()'s failure path via reap(timeout=0), replacing a weaker inline terminate/unbounded-join.
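
The escalation ladder described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name, the `shutdown_grace` default of 10 seconds, and the returned rung labels are assumptions made for the example; only the join, SIGTERM, SIGKILL ordering and the 5s follow-up bound come from the description.

```python
import multiprocessing
from typing import Optional

REAP_GRACE = 5.0  # assumed to mirror the PR's _REAP_GRACE (5s)


def reap(
    process: multiprocessing.Process,
    timeout: Optional[float] = None,
    shutdown_grace: float = 10.0,  # hypothetical default grace period
    reap_grace: float = REAP_GRACE,
) -> str:
    """Join, then SIGTERM, then SIGKILL; return the rung that ended it."""
    if process.pid is None:
        return "not-started"  # never started, so there is nothing to join
    # `is not None` matters: timeout=0 must mean "do not wait", not "default".
    bound = timeout if timeout is not None else shutdown_grace
    process.join(bound)
    if not process.is_alive():
        return "joined"
    process.terminate()  # SIGTERM on POSIX
    process.join(reap_grace)
    if not process.is_alive():
        return "terminated"
    process.kill()  # SIGKILL on POSIX
    process.join()  # collect the exit status so no zombie remains
    return "killed"
```

Calling `reap(process, timeout=0)` skips the graceful wait and goes straight to the SIGTERM rung if the process is still alive, which is the shape the start-failure path relies on.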

Deadline on the graceful stop RPC

Bound stub.stop with timeout + _STOP_RPC_MARGIN (5s over the worker's drain bound, which the server may fully consume before responding); timeout=None preserves the caller's explicit unbounded-graceful contract. A wedged worker now surfaces DEADLINE_EXCEEDED instead of hanging stop() forever, and the finally-reap escalates to force-termination.
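
A rough sketch of the deadline arithmetic and the finally-reap, under stated assumptions: `asyncio.wait_for` stands in for the gRPC deadline (so the example raises a timeout error rather than DEADLINE_EXCEEDED), and `send_stop`, `reap`, and the `margin` parameter are hypothetical names introduced only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

STOP_RPC_MARGIN = 5.0  # assumed value, taken from the "5s" in the description


def stop_rpc_deadline(
    timeout: Optional[float], margin: float = STOP_RPC_MARGIN
) -> Optional[float]:
    """Return the RPC deadline in seconds, or None for an unbounded stop."""
    # None keeps the caller's explicit "wait as long as it takes" contract.
    if timeout is None:
        return None
    # The server may spend the whole drain bound before it responds,
    # so the client deadline has to sit strictly above it.
    return timeout + margin


async def stop(
    send_stop: Callable[[Optional[float]], Awaitable[None]],
    reap: Callable[[Optional[float]], None],
    timeout: Optional[float],
    margin: float = STOP_RPC_MARGIN,
) -> None:
    try:
        # asyncio.wait_for is a stand-in for the gRPC deadline here.
        await asyncio.wait_for(send_stop(timeout), stop_rpc_deadline(timeout, margin))
    finally:
        # Runs on success, on RPC failure, and when the deadline expires.
        reap(timeout)
```

With `timeout=12.5` the deadline comes out at 17.5 seconds; with `timeout=None` no deadline is applied at all.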

Parent-death watchdog in the worker

Install _parent_watchdog in _serve: a daemon thread blocks on multiprocessing.parent_process().join() — which fires even for SIGKILL — then schedules the same graceful stop as SIGTERM and hard-exits via os._exit(1) once the grace window elapses, surviving an event loop that closes mid-dispatch. Extract the shared _schedule_stop(loop, service, timeout) dispatcher used by the watchdog and both signal handlers, documenting the divergent semantics (SIGTERM and the watchdog cancel in-flight tasks immediately; SIGINT drains indefinitely).
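
The watchdog's shape can be sketched like this. It is an illustration under assumptions rather than the PR's implementation: `request_stop` is a hypothetical zero-argument callback standing in for the shared stop dispatcher, and the `parent` and `hard_exit` parameters exist only so the sketch can be exercised without killing a real process.

```python
import asyncio
import multiprocessing
import os
import threading
import time


def parent_watchdog(loop, request_stop, grace, parent=None, hard_exit=os._exit):
    """Start a daemon thread that ends this process once its parent is gone."""
    if parent is None:
        parent = multiprocessing.parent_process()
    if parent is None:
        return None  # not a multiprocessing child, nothing to watch

    def watch():
        parent.join()  # blocks until the parent exits, however it died
        try:
            if loop.is_running():
                # Hand the graceful stop to the loop thread.
                loop.call_soon_threadsafe(request_stop)
        except RuntimeError:
            pass  # the loop closed between the check and the call
        time.sleep(grace)  # give the graceful stop its window
        hard_exit(1)  # backstop: leave even if the loop never finished

    thread = threading.Thread(target=watch, name="parent-watchdog", daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, it never keeps a worker alive on its own when the worker shuts down normally while the parent is still running.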

ResourcePool TTL deferral via timer

Replace the parked sleep-task with a loop.call_later timer whose _expire callback spawns the cleanup task only when the TTL actually fires: an unfired TimerHandle is discarded silently at loop close, eliminating the "Task was destroyed but it is pending!" teardown noise and the "coroutine never awaited" RuntimeWarning. Extract _cancel_timer and _cancel_cleanup for the same-loop (cancel and await) and cross-loop (best-effort call_soon_threadsafe) cancellation paths, guard the expiry path against cancelling its own _finalize task, and correct the documented contracts (release is a silent no-op for missing keys and raises ValueError on over-release; clear raises KeyError).
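
To make the timer design concrete, here is a minimal single-entry sketch. The `TtlEntry` class and its method names are hypothetical and much simpler than ResourcePool (no reference counts, no lock, no cross-loop path); the point is only that nothing but a TimerHandle exists until the TTL fires.

```python
import asyncio


class TtlEntry:
    """One cached object whose finalizer runs `ttl` seconds after release."""

    def __init__(self, obj, finalize, ttl):
        self.obj = obj
        self._finalize = finalize  # async callable taking the object
        self._ttl = ttl
        self._loop = None
        self._timer = None  # asyncio.TimerHandle while the TTL is pending
        self._cleanup = None  # asyncio.Task, created only once the TTL fires

    def release(self):
        self._loop = asyncio.get_running_loop()
        # No coroutine and no Task yet: if the loop closes before the TTL
        # elapses, the unfired handle is simply dropped.
        self._timer = self._loop.call_later(self._ttl, self._expire)

    def _expire(self):
        self._timer = None
        self._cleanup = self._loop.create_task(self._finalize(self.obj))

    def reacquire(self):
        # Same-loop path only: cancel the pending timer and reuse the object.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        return self.obj
```

A re-acquire before the TTL elapses cancels the handle, so the finalizer coroutine is never even created.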

Documentation homes

Home the reap/cancellation contract — including the executor-thread cancellation survival and its second-cancel caveat — in LocalWorker._stop's docstring, with WorkerPool.shutdown_timeout documenting only the pool-observable consequence (teardown may overrun the deadline by the reap escalation) and pointing back.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestWorkerOrphanPrevention | A started LocalWorker backed by a real subprocess | stop() is called | It returns only once the worker pid no longer exists | Parent-side reap |
| 2 | TestWorkerOrphanPrevention | A parent process holding no teardown path | The parent is killed with SIGKILL | The orphaned worker detects parent death and exits within the grace window | Parent-death watchdog |
| 3 | TestWorkerOrphanPrevention | An ephemeral WorkerPool with two spawned workers | The pool context exits normally | No worker subprocess remains alive | Pool-scale teardown |
| 4 | TestWorkerOrphanPrevention | An entered single-worker pool whose worker was SIGKILLed mid-context | The pool context exits | Teardown completes without raising and the corpse is reaped | Crashed-worker tolerance |
| 5 | TestWorkerOrphanPrevention | A started real WorkerProcess that never receives a stop RPC | reap() is called with a short timeout | The worker subprocess is terminated | Reap escalation against a live process |
| 6 | TestWorkerOrphanPrevention | A started worker suspended with SIGSTOP | stop() is awaited with a short timeout | DEADLINE_EXCEEDED surfaces within a bound and the worker is killed | Stop RPC deadline plus SIGKILL rung |
| 7 | test_process.py | A running event loop and worker service | _sigterm_handler fires and its scheduled callback executes | service.stop is awaited with a zero-timeout StopRequest and no context | SIGTERM semantics |
| 8 | test_process.py | A running event loop and worker service | _sigint_handler fires and its scheduled callback executes | service.stop is awaited with a negative-timeout StopRequest (an unbounded drain) | SIGINT semantics |
| 9 | test_process.py | A process not spawned by multiprocessing | _parent_watchdog is called | It returns None without starting a thread | Watchdog no-op outside a child |
| 10 | test_process.py | A watchdog whose parent exits immediately and a running loop | _parent_watchdog is called | A daemon thread schedules the service stop and hard-exits after the grace window | Watchdog happy path |
| 11 | test_process.py | A parent that exits and a loop no longer running | The watchdog runs through the grace window | It skips the dispatch and still hard-exits | Watchdog exit without a loop |
| 12 | test_process.py | A watchdog whose scheduled callback is captured | The callback executes on a real loop | service.stop is awaited with a zero-timeout StopRequest and no context | Watchdog dispatch payload |
| 13 | test_process.py | A loop that reports running but raises RuntimeError on scheduling | The watchdog runs through the grace window | It swallows the error and still hard-exits | Closed-loop race guard |
| 14 | TestWorkerProcess | A WorkerProcess with a custom grace period | run() is called | The watchdog is installed once with the running loop, service, and grace period | Watchdog wiring |
| 15 | TestWorkerProcess | A WorkerProcess never started | reap() is called | It returns without touching the process | Reap no-op guard |
| 16 | TestWorkerProcess | A process that exits within the join bound | reap() is called | It neither terminates nor kills | Graceful fast path |
| 17 | TestWorkerProcess | No explicit timeout | reap() is called | It joins with the shutdown grace period | Timeout defaulting |
| 18 | TestWorkerProcess | An explicit zero timeout | reap(timeout=0) is called | It joins with 0, not the grace default | is not None pin |
| 19 | TestWorkerProcess | A process that survives the graceful join | reap() escalates | SIGTERM is sent with a bounded follow-up join | Terminate rung |
| 20 | TestWorkerProcess | A process that survives SIGTERM | reap() escalates further | SIGKILL follows terminate, in order | Kill rung ordering |
| 21 | TestWorkerProcess | A pipe that never reports metadata | start() times out | RuntimeError is raised and the process is reaped with timeout=0 | Start-failure escalation |
| 22 | TestLocalWorker | A started worker whose stop RPC succeeds | stop(timeout=12.5) is called | The request carries the timeout and reap follows the RPC | Stop ordering contract |
| 23 | TestLocalWorker | Any finite positive stop timeout (Hypothesis) | stop() is called with it | deadline == timeout + _STOP_RPC_MARGIN, strictly above the timeout, and reap gets the same bound | Deadline-margin invariant |
| 24 | TestLocalWorker | No stop timeout | stop() is called | The RPC carries no gRPC deadline | Unbounded contract arm |
| 25 | TestLocalWorker | A stop RPC that raises | stop() is called | The error propagates, the process is still reaped, and the channel is closed | Failure-path reap |
| 26 | TestLocalWorker | A worker process no longer alive | stop() is called | The RPC is skipped but the process is still reaped | Dead-process arm |
| 27 | TestLocalWorker | A stop RPC that never completes | The stop() task is cancelled mid-RPC | CancelledError propagates and the process is still reaped | Cancellation-surviving reap |
| 28 | TestResourcePool | An unfired TTL timer bound to a since-closed loop | The key is re-acquired from another loop | The timer is cancelled threadsafe and the cached object returned | Cross-loop timer cancel |
| 29 | TestResourcePool | An unfired TTL timer bound to a since-closed loop | The key is cleared from another loop | The finalizer still runs and the entry is evicted | Cross-loop clear |
| 30 | TestResourcePool | A released entry on a dedicated loop | The loop closes and is collected before the TTL elapses | No pending task remains and no RuntimeWarning is emitted | Timer-deferral regression |
| 31 | TestResourcePool | A fired cleanup queued behind a re-acquire on the pool lock | The re-acquire runs | The in-flight cleanup is cancelled and the cached object survives | Fired-timer race, acquire |
| 32 | TestResourcePool | A fired cleanup queued behind a clear on the pool lock | The clear runs | The finalizer runs exactly once and the entry is evicted | Fired-timer race, clear |
| 33 | TestResourcePool | Any interleaved acquire/release sequence (Hypothesis) | Applied step by step to a long-TTL pool | Entry, reference, and pending-cleanup counts match the model at every step | TTL bookkeeping invariants |

https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Worker subprocesses are non-daemon spawn processes whose termination
depended entirely on the graceful stop RPC, so any teardown that never
completed it left an orphan that survived the parent and accumulated
across runs, eventually exhausting process-table and port resources.

Reap workers on every stop: LocalWorker stop now always joins the
subprocess after the graceful attempt, escalating to SIGTERM and then
SIGKILL when it lingers. The stop RPC also carries a deadline so an
unresponsive worker can no longer hang stop() and dodge the fallback.
Inside the worker, a parent-death watchdog thread ties the process to
its parent: when the parent dies, including by SIGKILL, the worker
initiates the same graceful shutdown as SIGTERM and hard-exits if the
grace window elapses.

Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Releasing a resource with a positive TTL parked a task on a TTL sleep;
loops that closed before the TTL elapsed destroyed it pending, and a
task that never started emitted a coroutine-never-awaited
RuntimeWarning in the warnings summary of sub-second runs.

Arm a plain call_later timer instead and only spawn the cleanup task
once the TTL actually fires: an unfired TimerHandle is discarded
silently at loop close. Cleanup cancellation on re-acquire and clear
now cancels the timer or the in-flight task, and cleanup no longer
relies on Task internals when it runs inside its own finalize task.

Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Unit tests pin the reap escalation ladder and its timeout defaulting,
the watchdog's daemon flag, stop dispatch, and hard-exit guarantees,
and the stop paths that must always reap: success, RPC failure, dead
process, and cancellation mid RPC. Two stale stop tests subsumed by
the new reap assertions are removed. Integration tests prove the
contracts on real subprocesses: a stopped worker is fully reaped
before stop returns, pool exit leaves no live workers and tolerates a
crashed one, an unresponsive worker is killed within the RPC deadline,
and a worker whose parent is SIGKILLed exits on its own.

Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Pin the no-pending-work-at-loop-close regression, the in-flight
cleanup cancellation races on the pool lock for both re-acquire and
clear, and the bookkeeping invariants across arbitrary acquire and
release sequences via Hypothesis. Rewrite the TTL and cross-loop tests
for the timer design, dropping the stale asyncio.sleep scaffolding.

Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg
@conradbzura conradbzura self-assigned this Jul 2, 2026
@conradbzura conradbzura marked this pull request as ready for review July 2, 2026 22:34
@conradbzura conradbzura merged commit a860300 into wool-labs:master Jul 2, 2026
11 checks passed