Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3) by NickReisenauer · Pull Request #43 · Kyter-com/Mactions

NickReisenauer · 2026-06-17T02:22:49Z

What happened

A repo's test (linux) job sat queued for hours while 15 Linux runner containers piled up on the host (one spawned every ~5 min, none exiting). Each ephemeral container registered with GitHub but never came online because its network egress failed (SocketException 11 / TryAgain reaching *.actions.githubusercontent.com), and the accumulation exhausted Apple container's shared vmnet NAT bridge — which kept new containers from connecting too. A self-sustaining snowball.

Root cause

LinuxContainerProvider.stop() only process.terminate()'d the foreground container run client and relied on its termination handler to call cleanup() (container delete --force). But that client is just an attachment to a daemon-managed container, and SIGTERM doesn't reliably make container run exit — so the handler never fired and the container kept running. The orchestrator's reaper did drop the slot from tracking and deregister it from GitHub every cycle, but the container itself survived → pileup.

Verified empirically: container delete --force does reap a running container, so the fix is simply to call it directly.

Changes

stop() force-deletes the container by name directly (keystone fix), so every reap actually removes its container.
Liveness reaper (neverConfirmedReapInterval, default 2 min): a runner GitHub never confirmed online is reaped + deregistered + counted toward the launch-failure backoff, instead of waiting out the 5-min sustained-offline grace. Surfaces launchFailing ("substrate broken") rather than respawning forever.
Bridge-safe concurrency ceiling (maxConcurrentContainersCeiling = 4): the effective Linux cap is the lower of the RAM/CPU divide and this ceiling, since the shared vmnet bridge — not RAM/CPU — binds first.
Sustained listRunners-outage backstop: holds the fleet instead of scaling up against unknown busy-ness.

Tests

4 new (testEffectiveCapClampsLargeHostToBridgeCeiling, testEffectiveCapNeverRaisesASmallHost, testReapsRegisteredButNeverConfirmedOnlineZombieFast, testSustainedListRunnersOutageHoldsTheFleet). Full suite 220 passing; debug + release builds clean.

Bumps MARKETING_VERSION to 0.1.3; tagging v0.1.3 cuts the signed/notarized release.

…(0.1.3) An ephemeral Linux container that registered but never connected (its egress failed) never exited, so the orchestrator dropped it from tracking and launched a replacement every reap cycle — piling up zombie containers (15 in one incident) that exhausted Apple `container`'s shared vmnet NAT bridge and wedged the network for the rest. Root cause: LinuxContainerProvider.stop() only SIGTERM'd the foreground `container run` client (an attachment to the daemon-managed container) and relied on its termination handler to force-delete it — which didn't fire reliably, so the container kept running. stop() now force-deletes the container by name directly (`container delete --force`, verified to reap a running one), so every reap actually removes its container. Hardening so a future blip can't snowball: - Liveness reaper: a runner alive past neverConfirmedReapInterval (2 min) that GitHub never confirmed online is force-deleted, deregistered, and counted toward the launch-failure backoff — not held for the 5-min sustained grace. - Bridge-safe concurrency ceiling (maxConcurrentContainersCeiling = 4): the effective Linux cap is the lower of the RAM/CPU divide and this ceiling, since the shared vmnet bridge — not RAM/CPU — binds first. - Sustained listRunners-outage backstop: hold the fleet instead of scaling up against unknown busy-ness. 4 new tests (budget ceiling x2, never-online reaper, outage backstop); full suite 220 passing; debug + release builds clean. Bumps MARKETING_VERSION 0.1.3.

Copilot

Pull request overview

Fixes a failure mode in the Linux container runner provider where runners that never successfully come online can accumulate indefinitely, eventually exhausting Apple container’s shared vmnet NAT bridge and preventing new runners from networking.

Changes:

Force-delete Linux runner containers by name during LinuxContainerProvider.stop() to ensure reaped runners actually remove their underlying containers.
Add orchestrator safeguards for “registered but never confirmed online” runners (fast reap + launch-failure backoff) and for sustained listRunners outages (hold fleet instead of scaling up).
Clamp Linux container concurrency with a bridge-safe ceiling and update app budgeting logic; bump version/changelog entries for 0.1.3.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
Tests/MactionsCoreTests/OrchestratorTests.swift	Adds tests covering never-confirmed-online zombie reaping and sustained `listRunners` outage hold behavior.
Tests/MactionsCoreTests/LinuxContainerBudgetTests.swift	Adds tests for the new effective cap that clamps to the bridge-safe ceiling.
Sources/MactionsCore/Orchestrator.swift	Implements never-confirmed-online reaper and sustained runner-API outage backstop; extends init with `neverConfirmedReapInterval`.
Sources/MactionsCore/LinuxContainerProvider.swift	Makes `stop()` unconditionally force-delete the container by name (no longer relying on `container run` termination).
Sources/MactionsCore/LinuxContainerBudget.swift	Introduces a hard concurrency ceiling and an `effectiveMaxConcurrentContainers` helper.
Sources/Mactions/AppState.swift	Switches Linux budgeting to use the effective capped value.
project.yml	Bumps `MARKETING_VERSION` to 0.1.3.
CHANGELOG.md	Adds a 0.1.3 changelog entry documenting the fix and new safeguards.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

NickReisenauer · 2026-06-17T02:31:59Z

    guard let health else {
+      // SUSTAINED runner-API outage → HOLD the fleet (don't spawn more). With
+      // busy-ness unknown, the scale-up below assumes queued jobs need NEW
+      // runners — but when the outage is an egress failure (often the same one
+      // stopping runners from connecting), those replacements can never come
+      // online, which is exactly the snowball that piled up containers. Mirrors
+      // the "failed queue poll HOLDS" rule above. A brief blip still scales up.
+      if let since = listRunnersFailingSince,
+        Date().timeIntervalSince(since) >= neverConfirmedReapInterval
+      {


Valid — fixed in #44 (folded into the re-cut v0.1.3). The outage backstop now uses its own listRunnersOutageHoldInterval (default 120s), decoupled from neverConfirmedReapInterval, so tuning or disabling the zombie reaper can no longer change (or disable) the backstop window.

NickReisenauer · 2026-06-17T02:32:01Z

+  /// A slot ALIVE this long without GitHub ever confirming it `online`
+  /// (`confirmedOnline` never set) is a never-connected zombie: it minted a
+  /// registration but its egress failed, so it never reached GitHub and never
+  /// claimed a job — and, unlike a launch-FAILURE *exit*, its container keeps
+  /// running, so it never trips `launchFailureGraceInterval`. Reap it well before
+  /// the 5-min sustained-offline grace (there's no job to orphan: an unconfirmed
+  /// runner is never assigned work), and count it toward the launch-failure
+  /// backoff so a repeated failure surfaces instead of spawning endlessly.
+  private let neverConfirmedReapInterval: TimeInterval


Fixed in #44: documented neverConfirmedReapInterval's value semantics — a negative value disables the reaper, and a very small value (e.g. 0) reaps a runner before it can come online (test-only); the production default (120s) leaves a real connect window.

…rval PR #43 review (Copilot): the runner-API outage backstop reused neverConfirmedReapInterval as its "sustained" threshold, coupling two independent behaviors — tuning or disabling the zombie reaper would silently widen or disable the outage backstop and could re-enable the runaway scale-up during a prolonged outage. Give the backstop its own listRunnersOutageHoldInterval (default 120s). Also document neverConfirmedReapInterval's value semantics: a negative value disables the reaper; a very small value (e.g. 0) reaps a runner before it can come online (test-only) — the production default leaves a real connect window. OrchestratorTests 37 passing; build clean.

NickReisenauer requested a review from Copilot June 17, 2026 02:23

Copilot started reviewing on behalf of NickReisenauer June 17, 2026 02:23 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

NickReisenauer merged commit 4d428fb into main Jun 17, 2026
1 check passed

NickReisenauer deleted the fix/linux-container-zombie-reaping branch June 17, 2026 02:25

NickReisenauer mentioned this pull request Jun 17, 2026

Decouple listRunners-outage backstop window from the zombie-reap interval #44

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3)#43

Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3)#43
NickReisenauer merged 1 commit into
mainfrom
fix/linux-container-zombie-reaping

NickReisenauer commented Jun 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

NickReisenauer Jun 17, 2026

Uh oh!

NickReisenauer Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

NickReisenauer commented Jun 17, 2026

What happened

Root cause

Changes

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

NickReisenauer Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

NickReisenauer Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants