Skip to content

Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3)#43

Merged
NickReisenauer merged 1 commit into
mainfrom
fix/linux-container-zombie-reaping
Jun 17, 2026
Merged

Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3)#43
NickReisenauer merged 1 commit into
mainfrom
fix/linux-container-zombie-reaping

Conversation

@NickReisenauer

Copy link
Copy Markdown
Contributor

What happened

A repo's test (linux) job sat queued for hours while 15 Linux runner containers piled up on the host (one spawned every ~5 min, none exiting). Each ephemeral container registered with GitHub but never came online because its network egress failed (SocketException 11 / TryAgain reaching *.actions.githubusercontent.com), and the accumulation exhausted Apple container's shared vmnet NAT bridge — which kept new containers from connecting too. A self-sustaining snowball.

Root cause

LinuxContainerProvider.stop() only process.terminate()'d the foreground container run client and relied on its termination handler to call cleanup() (container delete --force). But that client is just an attachment to a daemon-managed container, and SIGTERM doesn't reliably make container run exit — so the handler never fired and the container kept running. The orchestrator's reaper did drop the slot from tracking and deregister it from GitHub every cycle, but the container itself survived → pileup.

Verified empirically: container delete --force does reap a running container, so the fix is simply to call it directly.

Changes

  • stop() force-deletes the container by name directly (keystone fix), so every reap actually removes its container.
  • Liveness reaper (neverConfirmedReapInterval, default 2 min): a runner GitHub never confirmed online is reaped + deregistered + counted toward the launch-failure backoff, instead of waiting out the 5-min sustained-offline grace. Surfaces launchFailing ("substrate broken") rather than respawning forever.
  • Bridge-safe concurrency ceiling (maxConcurrentContainersCeiling = 4): the effective Linux cap is the lower of the RAM/CPU divide and this ceiling, since the shared vmnet bridge — not RAM/CPU — binds first.
  • Sustained listRunners-outage backstop: holds the fleet instead of scaling up against unknown busy-ness.

Tests

4 new (testEffectiveCapClampsLargeHostToBridgeCeiling, testEffectiveCapNeverRaisesASmallHost, testReapsRegisteredButNeverConfirmedOnlineZombieFast, testSustainedListRunnersOutageHoldsTheFleet). Full suite 220 passing; debug + release builds clean.

Bumps MARKETING_VERSION to 0.1.3; tagging v0.1.3 cuts the signed/notarized release.

…(0.1.3)

An ephemeral Linux container that registered but never connected (its egress
failed) never exited, so the orchestrator dropped it from tracking and launched
a replacement every reap cycle — piling up zombie containers (15 in one
incident) that exhausted Apple `container`'s shared vmnet NAT bridge and wedged
the network for the rest.

Root cause: LinuxContainerProvider.stop() only SIGTERM'd the foreground
`container run` client (an attachment to the daemon-managed container) and
relied on its termination handler to force-delete it — which didn't fire
reliably, so the container kept running. stop() now force-deletes the container
by name directly (`container delete --force`, verified to reap a running one),
so every reap actually removes its container.

Hardening so a future blip can't snowball:
- Liveness reaper: a runner alive past neverConfirmedReapInterval (2 min) that
  GitHub never confirmed online is force-deleted, deregistered, and counted
  toward the launch-failure backoff — not held for the 5-min sustained grace.
- Bridge-safe concurrency ceiling (maxConcurrentContainersCeiling = 4): the
  effective Linux cap is the lower of the RAM/CPU divide and this ceiling, since
  the shared vmnet bridge — not RAM/CPU — binds first.
- Sustained listRunners-outage backstop: hold the fleet instead of scaling up
  against unknown busy-ness.

4 new tests (budget ceiling x2, never-online reaper, outage backstop); full
suite 220 passing; debug + release builds clean. Bumps MARKETING_VERSION 0.1.3.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a failure mode in the Linux container runner provider where runners that never successfully come online can accumulate indefinitely, eventually exhausting Apple container’s shared vmnet NAT bridge and preventing new runners from networking.

Changes:

  • Force-delete Linux runner containers by name during LinuxContainerProvider.stop() to ensure reaped runners actually remove their underlying containers.
  • Add orchestrator safeguards for “registered but never confirmed online” runners (fast reap + launch-failure backoff) and for sustained listRunners outages (hold fleet instead of scaling up).
  • Clamp Linux container concurrency with a bridge-safe ceiling and update app budgeting logic; bump version/changelog entries for 0.1.3.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
Tests/MactionsCoreTests/OrchestratorTests.swift Adds tests covering never-confirmed-online zombie reaping and sustained listRunners outage hold behavior.
Tests/MactionsCoreTests/LinuxContainerBudgetTests.swift Adds tests for the new effective cap that clamps to the bridge-safe ceiling.
Sources/MactionsCore/Orchestrator.swift Implements never-confirmed-online reaper and sustained runner-API outage backstop; extends init with neverConfirmedReapInterval.
Sources/MactionsCore/LinuxContainerProvider.swift Makes stop() unconditionally force-delete the container by name (no longer relying on container run termination).
Sources/MactionsCore/LinuxContainerBudget.swift Introduces a hard concurrency ceiling and an effectiveMaxConcurrentContainers helper.
Sources/Mactions/AppState.swift Switches Linux budgeting to use the effective capped value.
project.yml Bumps MARKETING_VERSION to 0.1.3.
CHANGELOG.md Adds a 0.1.3 changelog entry documenting the fix and new safeguards.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 416 to +425
guard let health else {
// SUSTAINED runner-API outage → HOLD the fleet (don't spawn more). With
// busy-ness unknown, the scale-up below assumes queued jobs need NEW
// runners — but when the outage is an egress failure (often the same one
// stopping runners from connecting), those replacements can never come
// online, which is exactly the snowball that piled up containers. Mirrors
// the "failed queue poll HOLDS" rule above. A brief blip still scales up.
if let since = listRunnersFailingSince,
Date().timeIntervalSince(since) >= neverConfirmedReapInterval
{

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — fixed in #44 (folded into the re-cut v0.1.3). The outage backstop now uses its own listRunnersOutageHoldInterval (default 120s), decoupled from neverConfirmedReapInterval, so tuning or disabling the zombie reaper can no longer change (or disable) the backstop window.

Comment on lines +147 to +155
/// A slot ALIVE this long without GitHub ever confirming it `online`
/// (`confirmedOnline` never set) is a never-connected zombie: it minted a
/// registration but its egress failed, so it never reached GitHub and never
/// claimed a job — and, unlike a launch-FAILURE *exit*, its container keeps
/// running, so it never trips `launchFailureGraceInterval`. Reap it well before
/// the 5-min sustained-offline grace (there's no job to orphan: an unconfirmed
/// runner is never assigned work), and count it toward the launch-failure
/// backoff so a repeated failure surfaces instead of spawning endlessly.
private let neverConfirmedReapInterval: TimeInterval

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #44: documented neverConfirmedReapInterval's value semantics — a negative value disables the reaper, and a very small value (e.g. 0) reaps a runner before it can come online (test-only); the production default (120s) leaves a real connect window.

@NickReisenauer NickReisenauer merged commit 4d428fb into main Jun 17, 2026
1 check passed
@NickReisenauer NickReisenauer deleted the fix/linux-container-zombie-reaping branch June 17, 2026 02:25
NickReisenauer added a commit that referenced this pull request Jun 17, 2026
…rval

PR #43 review (Copilot): the runner-API outage backstop reused
neverConfirmedReapInterval as its "sustained" threshold, coupling two
independent behaviors — tuning or disabling the zombie reaper would silently
widen or disable the outage backstop and could re-enable the runaway scale-up
during a prolonged outage. Give the backstop its own
listRunnersOutageHoldInterval (default 120s).

Also document neverConfirmedReapInterval's value semantics: a negative value
disables the reaper; a very small value (e.g. 0) reaps a runner before it can
come online (test-only) — the production default leaves a real connect window.

OrchestratorTests 37 passing; build clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants