Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3)#43
Conversation
…(0.1.3) An ephemeral Linux container that registered but never connected (its egress failed) never exited, so the orchestrator dropped it from tracking and launched a replacement every reap cycle — piling up zombie containers (15 in one incident) that exhausted Apple `container`'s shared vmnet NAT bridge and wedged the network for the rest. Root cause: LinuxContainerProvider.stop() only SIGTERM'd the foreground `container run` client (an attachment to the daemon-managed container) and relied on its termination handler to force-delete it — which didn't fire reliably, so the container kept running. stop() now force-deletes the container by name directly (`container delete --force`, verified to reap a running one), so every reap actually removes its container. Hardening so a future blip can't snowball: - Liveness reaper: a runner alive past neverConfirmedReapInterval (2 min) that GitHub never confirmed online is force-deleted, deregistered, and counted toward the launch-failure backoff — not held for the 5-min sustained grace. - Bridge-safe concurrency ceiling (maxConcurrentContainersCeiling = 4): the effective Linux cap is the lower of the RAM/CPU divide and this ceiling, since the shared vmnet bridge — not RAM/CPU — binds first. - Sustained listRunners-outage backstop: hold the fleet instead of scaling up against unknown busy-ness. 4 new tests (budget ceiling x2, never-online reaper, outage backstop); full suite 220 passing; debug + release builds clean. Bumps MARKETING_VERSION 0.1.3.
There was a problem hiding this comment.
Pull request overview
Fixes a failure mode in the Linux container runner provider where runners that never successfully come online can accumulate indefinitely, eventually exhausting Apple container’s shared vmnet NAT bridge and preventing new runners from networking.
Changes:
- Force-delete Linux runner containers by name during
LinuxContainerProvider.stop()to ensure reaped runners actually remove their underlying containers. - Add orchestrator safeguards for “registered but never confirmed online” runners (fast reap + launch-failure backoff) and for sustained
listRunnersoutages (hold fleet instead of scaling up). - Clamp Linux container concurrency with a bridge-safe ceiling and update app budgeting logic; bump version/changelog entries for 0.1.3.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| Tests/MactionsCoreTests/OrchestratorTests.swift | Adds tests covering never-confirmed-online zombie reaping and sustained listRunners outage hold behavior. |
| Tests/MactionsCoreTests/LinuxContainerBudgetTests.swift | Adds tests for the new effective cap that clamps to the bridge-safe ceiling. |
| Sources/MactionsCore/Orchestrator.swift | Implements never-confirmed-online reaper and sustained runner-API outage backstop; extends init with neverConfirmedReapInterval. |
| Sources/MactionsCore/LinuxContainerProvider.swift | Makes stop() unconditionally force-delete the container by name (no longer relying on container run termination). |
| Sources/MactionsCore/LinuxContainerBudget.swift | Introduces a hard concurrency ceiling and an effectiveMaxConcurrentContainers helper. |
| Sources/Mactions/AppState.swift | Switches Linux budgeting to use the effective capped value. |
| project.yml | Bumps MARKETING_VERSION to 0.1.3. |
| CHANGELOG.md | Adds a 0.1.3 changelog entry documenting the fix and new safeguards. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| guard let health else { | ||
| // SUSTAINED runner-API outage → HOLD the fleet (don't spawn more). With | ||
| // busy-ness unknown, the scale-up below assumes queued jobs need NEW | ||
| // runners — but when the outage is an egress failure (often the same one | ||
| // stopping runners from connecting), those replacements can never come | ||
| // online, which is exactly the snowball that piled up containers. Mirrors | ||
| // the "failed queue poll HOLDS" rule above. A brief blip still scales up. | ||
| if let since = listRunnersFailingSince, | ||
| Date().timeIntervalSince(since) >= neverConfirmedReapInterval | ||
| { |
There was a problem hiding this comment.
Valid — fixed in #44 (folded into the re-cut v0.1.3). The outage backstop now uses its own listRunnersOutageHoldInterval (default 120s), decoupled from neverConfirmedReapInterval, so tuning or disabling the zombie reaper can no longer change (or disable) the backstop window.
| /// A slot ALIVE this long without GitHub ever confirming it `online` | ||
| /// (`confirmedOnline` never set) is a never-connected zombie: it minted a | ||
| /// registration but its egress failed, so it never reached GitHub and never | ||
| /// claimed a job — and, unlike a launch-FAILURE *exit*, its container keeps | ||
| /// running, so it never trips `launchFailureGraceInterval`. Reap it well before | ||
| /// the 5-min sustained-offline grace (there's no job to orphan: an unconfirmed | ||
| /// runner is never assigned work), and count it toward the launch-failure | ||
| /// backoff so a repeated failure surfaces instead of spawning endlessly. | ||
| private let neverConfirmedReapInterval: TimeInterval |
There was a problem hiding this comment.
Fixed in #44: documented neverConfirmedReapInterval's value semantics — a negative value disables the reaper, and a very small value (e.g. 0) reaps a runner before it can come online (test-only); the production default (120s) leaves a real connect window.
…rval PR #43 review (Copilot): the runner-API outage backstop reused neverConfirmedReapInterval as its "sustained" threshold, coupling two independent behaviors — tuning or disabling the zombie reaper would silently widen or disable the outage backstop and could re-enable the runaway scale-up during a prolonged outage. Give the backstop its own listRunnersOutageHoldInterval (default 120s). Also document neverConfirmedReapInterval's value semantics: a negative value disables the reaper; a very small value (e.g. 0) reaps a runner before it can come online (test-only) — the production default leaves a real connect window. OrchestratorTests 37 passing; build clean.
What happened
A repo's
test (linux)job sat queued for hours while 15 Linux runner containers piled up on the host (one spawned every ~5 min, none exiting). Each ephemeral container registered with GitHub but never came online because its network egress failed (SocketException 11 / TryAgainreaching*.actions.githubusercontent.com), and the accumulation exhausted Applecontainer's shared vmnet NAT bridge — which kept new containers from connecting too. A self-sustaining snowball.Root cause
LinuxContainerProvider.stop()onlyprocess.terminate()'d the foregroundcontainer runclient and relied on its termination handler to callcleanup()(container delete --force). But that client is just an attachment to a daemon-managed container, and SIGTERM doesn't reliably makecontainer runexit — so the handler never fired and the container kept running. The orchestrator's reaper did drop the slot from tracking and deregister it from GitHub every cycle, but the container itself survived → pileup.Verified empirically:
container delete --forcedoes reap a running container, so the fix is simply to call it directly.Changes
stop()force-deletes the container by name directly (keystone fix), so every reap actually removes its container.neverConfirmedReapInterval, default 2 min): a runner GitHub never confirmedonlineis reaped + deregistered + counted toward the launch-failure backoff, instead of waiting out the 5-min sustained-offline grace. SurfaceslaunchFailing("substrate broken") rather than respawning forever.maxConcurrentContainersCeiling = 4): the effective Linux cap is the lower of the RAM/CPU divide and this ceiling, since the shared vmnet bridge — not RAM/CPU — binds first.listRunners-outage backstop: holds the fleet instead of scaling up against unknown busy-ness.Tests
4 new (
testEffectiveCapClampsLargeHostToBridgeCeiling,testEffectiveCapNeverRaisesASmallHost,testReapsRegisteredButNeverConfirmedOnlineZombieFast,testSustainedListRunnersOutageHoldsTheFleet). Full suite 220 passing; debug + release builds clean.Bumps
MARKETING_VERSIONto 0.1.3; taggingv0.1.3cuts the signed/notarized release.