Force-terminate worker subprocesses on shutdown to prevent orphan accumulation — Closes #260 by conradbzura · Pull Request #276 · wool-labs/wool

conradbzura · 2026-07-02T19:49:06Z

Summary

Worker subprocesses could outlive the process that spawned them: a parent that crashed, was SIGKILLed, or never completed the graceful stop RPC left orphaned gRPC workers reparented to init and accumulating across runs. Close every escape path with layered defenses on both sides of the process boundary: the parent always force-terminates workers through a bounded reap escalation, the graceful stop RPC carries a deadline so a wedged worker cannot dodge that reap, and the worker itself watches for parent death and hard-exits as a backstop when no parent-side path runs at all. Rework ResourcePool's TTL deferral from a parked sleep-task to a plain timer so the worker's own teardown no longer leaks pending cleanup tasks or emits "coroutine never awaited" warnings. Closes #260.

Proposed changes

Bounded reap escalation in `WorkerProcess`

Add WorkerProcess.reap(timeout=None): join with the caller's bound (defaulting to the worker's shutdown grace period), escalate to SIGTERM with a bounded follow-up join (_REAP_GRACE, 5s), then SIGKILL. LocalWorker._stop reaps unconditionally in a finally, running the blocking escalation in an executor thread so it completes even when the stopping coroutine is cancelled by a pool teardown deadline. Reuse the same escalation on start()'s failure path via reap(timeout=0), replacing a weaker inline terminate/unbounded-join.
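For reviewers who want the ladder at a glance, here is a minimal sketch of the escalation described above. It is not the code in this PR: `reap` is written as a free function over any object with the `multiprocessing.Process` interface, `REAP_GRACE` stands in for `_REAP_GRACE`, and the `shutdown_grace` default of 60s is a made-up value for illustration.

```python
REAP_GRACE = 5.0  # assumed stand-in for the PR's _REAP_GRACE (5s)


def reap(process, timeout=None, shutdown_grace=60.0, reap_grace=REAP_GRACE):
    """Join, then SIGTERM, then SIGKILL; return the last rung reached.

    `process` is anything exposing pid, join, is_alive, terminate and kill,
    as multiprocessing.Process does. `shutdown_grace` is a hypothetical
    default, not the library's value.
    """
    if process.pid is None:
        # Never started: nothing to reap.
        return "noop"

    # `is not None` matters: an explicit timeout of 0 must not fall back
    # to the grace default.
    bound = timeout if timeout is not None else shutdown_grace
    process.join(bound)
    if not process.is_alive():
        return "joined"

    process.terminate()       # SIGTERM on POSIX
    process.join(reap_grace)  # bounded follow-up join
    if not process.is_alive():
        return "terminated"

    process.kill()            # SIGKILL on POSIX
    process.join()            # collect the exit status
    return "killed"
```

Calling `reap(process, timeout=0)` skips the graceful wait and goes straight to the terminate rung for a live process, which is the shape the `start()` failure path relies on.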

Deadline on the graceful stop RPC

Bound stub.stop with timeout + _STOP_RPC_MARGIN (5s over the worker's drain bound, which the server may fully consume before responding); timeout=None preserves the caller's explicit unbounded-graceful contract. A wedged worker now surfaces DEADLINE_EXCEEDED instead of hanging stop() forever, and the finally-reap escalates to force-termination.
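The deadline arithmetic is small enough to show directly. This is a sketch under assumptions: `stop_rpc_deadline` is a hypothetical helper name, and `STOP_RPC_MARGIN` stands in for `_STOP_RPC_MARGIN`; the PR may compute the value inline.

```python
STOP_RPC_MARGIN = 5.0  # assumed stand-in for the PR's _STOP_RPC_MARGIN (5s)


def stop_rpc_deadline(timeout, margin=STOP_RPC_MARGIN):
    """Return the gRPC deadline for the stop RPC, or None for no deadline.

    The worker may spend the full `timeout` draining before it responds,
    so the client-side deadline has to sit strictly above it.
    """
    if timeout is None:
        # The caller asked for an unbounded graceful stop: send no deadline.
        return None
    return timeout + margin
```

The result would be passed as the `timeout=` keyword that gRPC Python stubs accept on each call, so `stop(timeout=12.5)` yields a 17.5s RPC deadline while `stop(timeout=None)` leaves the call unbounded.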

Parent-death watchdog in the worker

Install _parent_watchdog in _serve: a daemon thread blocks on multiprocessing.parent_process().join() — which fires even for SIGKILL — then schedules the same graceful stop as SIGTERM and hard-exits via os._exit(1) once the grace window elapses, surviving an event loop that closes mid-dispatch. Extract the shared _schedule_stop(loop, service, timeout) dispatcher used by the watchdog and both signal handlers, documenting the divergent semantics (SIGTERM and the watchdog cancel in-flight tasks immediately; SIGINT drains indefinitely).
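A minimal sketch of the watchdog shape, for orientation only. The signature differs from the PR's `_parent_watchdog`: `stop` is a hypothetical zero-argument callback standing in for the `_schedule_stop` dispatch, and `get_parent` and `hard_exit` are injection points added here so the sketch can be exercised without killing a real process.

```python
import multiprocessing
import os
import threading
import time


def parent_watchdog(loop, stop, grace, *,
                    get_parent=multiprocessing.parent_process,
                    hard_exit=os._exit):
    """Start a daemon thread that exits this process once the parent dies.

    Returns the thread, or None when the process has no multiprocessing
    parent (i.e. it was not spawned as a child).
    """
    parent = get_parent()
    if parent is None:
        return None

    def watch():
        parent.join()  # returns once the parent exits, however it died
        try:
            if loop.is_running():
                # Hand the graceful stop to the loop thread.
                loop.call_soon_threadsafe(stop)
        except RuntimeError:
            # The loop closed between the check and the call; fall through.
            pass
        time.sleep(grace)  # give the graceful stop its window
        hard_exit(1)       # backstop: leave even if the loop is wedged

    thread = threading.Thread(target=watch, name="parent-watchdog", daemon=True)
    thread.start()
    return thread
```

The hard exit runs unconditionally after the grace window, so a loop that is closed, stopped, or stuck cannot keep the orphan alive.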

ResourcePool TTL deferral via timer

Replace the parked sleep-task with a loop.call_later timer whose _expire callback spawns the cleanup task only when the TTL actually fires: an unfired TimerHandle is discarded silently at loop close, eliminating the "Task was destroyed but it is pending!" teardown noise and the "coroutine never awaited" RuntimeWarning. Extract _cancel_timer and _cancel_cleanup for the same-loop (cancel and await) and cross-loop (best-effort call_soon_threadsafe) cancellation paths, guard the expiry path against cancelling its own _finalize task, and correct the documented contracts (release is a silent no-op for missing keys and raises ValueError on over-release; clear raises KeyError).
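The timer-based deferral can be sketched on a single slot. `TtlSlot` is a hypothetical, stripped-down stand-in for one `ResourcePool` entry: it has no lock, no reference counting, and no cross-loop path, and on re-acquire it only cancels the in-flight cleanup rather than cancelling and awaiting it.

```python
import asyncio


class TtlSlot:
    """Illustrative single-entry TTL deferral; not the real ResourcePool."""

    def __init__(self, finalize, ttl):
        self._finalize = finalize  # async callable that tears the resource down
        self._ttl = ttl
        self._loop = None
        self._timer = None    # asyncio.TimerHandle while the TTL is pending
        self._cleanup = None  # asyncio.Task only once the TTL has fired

    def release(self):
        # Arm a plain timer: no task or coroutine exists until it fires.
        self._loop = asyncio.get_running_loop()
        self._timer = self._loop.call_later(self._ttl, self._expire)

    def _expire(self):
        # The TTL actually elapsed: only now spawn the cleanup task.
        self._timer = None
        self._cleanup = self._loop.create_task(self._finalize())

    def acquire(self):
        # Re-acquire before expiry: drop the timer or the in-flight cleanup.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if self._cleanup is not None and not self._cleanup.done():
            self._cleanup.cancel()
        self._cleanup = None
```

Because a released slot holds only a `TimerHandle`, closing the loop before the TTL elapses leaves no pending task and no unawaited coroutine behind.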

Documentation homes

Home the reap/cancellation contract — including the executor-thread cancellation survival and its second-cancel caveat — in LocalWorker._stop's docstring, with WorkerPool.shutdown_timeout documenting only the pool-observable consequence (teardown may overrun the deadline by the reap escalation) and pointing back.
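The cancellation contract is easier to follow with a small sketch. `stop_worker`, `rpc`, and `reap` are hypothetical names, and the sketch assumes the reap is awaited through `loop.run_in_executor` inside a `finally`; the real `LocalWorker._stop` may be structured differently.

```python
import asyncio


async def stop_worker(rpc, reap):
    """Attempt the graceful stop RPC, then always reap in a worker thread.

    `rpc` is an async callable for the stop RPC; `reap` is a blocking
    callable running the join/SIGTERM/SIGKILL escalation.
    """
    loop = asyncio.get_running_loop()
    try:
        await rpc()
    finally:
        # The escalation blocks, so it runs in the default executor. If this
        # task was cancelled during the RPC, the finally still runs and the
        # CancelledError propagates only after the reap completes.
        # Caveat: a second cancel() while this await is pending interrupts
        # the wait, and stop_worker returns before the thread has finished
        # (the thread itself keeps running to completion).
        await loop.run_in_executor(None, reap)
```

This is why pool teardown may overrun its deadline by the length of the escalation: the first cancellation does not cut the reap short.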

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | `TestWorkerOrphanPrevention` | A started LocalWorker backed by a real subprocess | `stop()` is called | It returns only once the worker pid no longer exists | Parent-side reap |
| 2 | `TestWorkerOrphanPrevention` | A parent process holding no teardown path | The parent is killed with SIGKILL | The orphaned worker detects parent death and exits within the grace window | Parent-death watchdog |
| 3 | `TestWorkerOrphanPrevention` | An ephemeral WorkerPool with two spawned workers | The pool context exits normally | No worker subprocess remains alive | Pool-scale teardown |
| 4 | `TestWorkerOrphanPrevention` | An entered single-worker pool whose worker was SIGKILLed mid-context | The pool context exits | Teardown completes without raising and the corpse is reaped | Crashed-worker tolerance |
| 5 | `TestWorkerOrphanPrevention` | A started real WorkerProcess that never receives a stop RPC | `reap()` is called with a short timeout | The worker subprocess is terminated | Reap escalation against a live process |
| 6 | `TestWorkerOrphanPrevention` | A started worker suspended with SIGSTOP | `stop()` is awaited with a short timeout | DEADLINE_EXCEEDED surfaces within a bound and the worker is killed | Stop RPC deadline plus SIGKILL rung |
| 7 | `test_process.py` | A running event loop and worker service | `_sigterm_handler` fires and its scheduled callback executes | `service.stop` is awaited with a zero-timeout StopRequest and no context | SIGTERM semantics |
| 8 | `test_process.py` | A running event loop and worker service | `_sigint_handler` fires and its scheduled callback executes | `service.stop` is awaited with a negative-timeout StopRequest (an unbounded drain) | SIGINT semantics |
| 9 | `test_process.py` | A process not spawned by multiprocessing | `_parent_watchdog` is called | It returns None without starting a thread | Watchdog no-op outside a child |
| 10 | `test_process.py` | A watchdog whose parent exits immediately and a running loop | `_parent_watchdog` is called | A daemon thread schedules the service stop and hard-exits after the grace window | Watchdog happy path |
| 11 | `test_process.py` | A parent that exits and a loop no longer running | The watchdog runs through the grace window | It skips the dispatch and still hard-exits | Watchdog exit without a loop |
| 12 | `test_process.py` | A watchdog whose scheduled callback is captured | The callback executes on a real loop | `service.stop` is awaited with a zero-timeout StopRequest and no context | Watchdog dispatch payload |
| 13 | `test_process.py` | A loop that reports running but raises RuntimeError on scheduling | The watchdog runs through the grace window | It swallows the error and still hard-exits | Closed-loop race guard |
| 14 | `TestWorkerProcess` | A WorkerProcess with a custom grace period | `run()` is called | The watchdog is installed once with the running loop, service, and grace period | Watchdog wiring |
| 15 | `TestWorkerProcess` | A WorkerProcess never started | `reap()` is called | It returns without touching the process | Reap no-op guard |
| 16 | `TestWorkerProcess` | A process that exits within the join bound | `reap()` is called | It neither terminates nor kills | Graceful fast path |
| 17 | `TestWorkerProcess` | No explicit timeout | `reap()` is called | It joins with the shutdown grace period | Timeout defaulting |
| 18 | `TestWorkerProcess` | An explicit zero timeout | `reap(timeout=0)` is called | It joins with 0, not the grace default | `is not None` pin |
| 19 | `TestWorkerProcess` | A process that survives the graceful join | `reap()` escalates | SIGTERM is sent with a bounded follow-up join | Terminate rung |
| 20 | `TestWorkerProcess` | A process that survives SIGTERM | `reap()` escalates further | SIGKILL follows terminate, in order | Kill rung ordering |
| 21 | `TestWorkerProcess` | A pipe that never reports metadata | `start()` times out | RuntimeError is raised and the process is reaped with `timeout=0` | Start-failure escalation |
| 22 | `TestLocalWorker` | A started worker whose stop RPC succeeds | `stop(timeout=12.5)` is called | The request carries the timeout and reap follows the RPC | Stop ordering contract |
| 23 | `TestLocalWorker` | Any finite positive stop timeout (Hypothesis) | `stop()` is called with it | `deadline == timeout + _STOP_RPC_MARGIN`, strictly above the timeout, and reap gets the same bound | Deadline-margin invariant |
| 24 | `TestLocalWorker` | No stop timeout | `stop()` is called | The RPC carries no gRPC deadline | Unbounded contract arm |
| 25 | `TestLocalWorker` | A stop RPC that raises | `stop()` is called | The error propagates, the process is still reaped, and the channel is closed | Failure-path reap |
| 26 | `TestLocalWorker` | A worker process no longer alive | `stop()` is called | The RPC is skipped but the process is still reaped | Dead-process arm |
| 27 | `TestLocalWorker` | A stop RPC that never completes | The `stop()` task is cancelled mid-RPC | CancelledError propagates and the process is still reaped | Cancellation-surviving reap |
| 28 | `TestResourcePool` | An unfired TTL timer bound to a since-closed loop | The key is re-acquired from another loop | The timer is cancelled threadsafe and the cached object returned | Cross-loop timer cancel |
| 29 | `TestResourcePool` | An unfired TTL timer bound to a since-closed loop | The key is cleared from another loop | The finalizer still runs and the entry is evicted | Cross-loop clear |
| 30 | `TestResourcePool` | A released entry on a dedicated loop | The loop closes and is collected before the TTL elapses | No pending task remains and no RuntimeWarning is emitted | Timer-deferral regression |
| 31 | `TestResourcePool` | A fired cleanup queued behind a re-acquire on the pool lock | The re-acquire runs | The in-flight cleanup is cancelled and the cached object survives | Fired-timer race, acquire |
| 32 | `TestResourcePool` | A fired cleanup queued behind a clear on the pool lock | The clear runs | The finalizer runs exactly once and the entry is evicted | Fired-timer race, clear |
| 33 | `TestResourcePool` | Any interleaved acquire/release sequence (Hypothesis) | Applied step by step to a long-TTL pool | Entry, reference, and pending-cleanup counts match the model at every step | TTL bookkeeping invariants |

https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Worker subprocesses are non-daemon spawn processes whose termination depended entirely on the graceful stop RPC, so any teardown that never completed it left an orphan that survived the parent and accumulated across runs, eventually exhausting process-table and port resources. Reap workers on every stop: LocalWorker stop now always joins the subprocess after the graceful attempt, escalating to SIGTERM and then SIGKILL when it lingers. The stop RPC also carries a deadline so an unresponsive worker can no longer hang stop() and dodge the fallback. Inside the worker, a parent-death watchdog thread ties the process to its parent: when the parent dies, including by SIGKILL, the worker initiates the same graceful shutdown as SIGTERM and hard-exits if the grace window elapses. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Releasing a resource with a positive TTL parked a task on a TTL sleep; loops that closed before the TTL elapsed destroyed it pending, and a task that never started emitted a coroutine-never-awaited RuntimeWarning in the warnings summary of sub-second runs. Arm a plain call_later timer instead and only spawn the cleanup task once the TTL actually fires: an unfired TimerHandle is discarded silently at loop close. Cleanup cancellation on re-acquire and clear now cancels the timer or the in-flight task, and cleanup no longer relies on Task internals when it runs inside its own finalize task. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Unit tests pin the reap escalation ladder and its timeout defaulting; the watchdog's daemon flag, stop dispatch, and hard-exit guarantees; and the stop paths that must always reap: success, RPC failure, dead process, and cancellation mid-RPC. Two stale stop tests subsumed by the new reap assertions are removed. Integration tests prove the contracts on real subprocesses: a stopped worker is fully reaped before stop returns, pool exit leaves no live workers and tolerates a crashed one, an unresponsive worker is killed within the RPC deadline, and a worker whose parent is SIGKILLed exits on its own. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

Pin the no-pending-work-at-loop-close regression, the in-flight cleanup cancellation races on the pool lock for both re-acquire and clear, and the bookkeeping invariants across arbitrary acquire and release sequences via Hypothesis. Rewrite the TTL and cross-loop tests for the timer design, dropping the stale asyncio.sleep scaffolding. Claude-Session: https://claude.ai/code/session_011Xw7kU5GN556rbn6sZBdzg

conradbzura added 4 commits July 2, 2026 15:41

conradbzura self-assigned this Jul 2, 2026

conradbzura marked this pull request as ready for review July 2, 2026 22:34

conradbzura merged commit a860300 into wool-labs:master Jul 2, 2026
11 checks passed

conradbzura merged 4 commits into wool-labs:master from conradbzura:260-force-terminate-workers-on-shutdown
