-
Notifications
You must be signed in to change notification settings - Fork 0
Fix Linux runner-container pileup when a runner never reaches GitHub (0.1.3) #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # Changelog | ||
|
|
||
| All notable changes to Mactions are documented here. This project adheres to | ||
| [Semantic Versioning](https://semver.org/). | ||
|
|
||
| ## [0.1.3] - 2026-06-16 | ||
|
|
||
| ### Fixed | ||
|
|
||
| - **Linux runner containers could pile up when a runner never reached GitHub.** | ||
| If an ephemeral Linux container registered but never connected — e.g. its | ||
| network egress failed — it never exited, so the orchestrator dropped it from | ||
| tracking and launched a replacement every reap cycle, accumulating zombie | ||
| containers (15 observed in one incident) that exhausted Apple `container`'s | ||
| shared vmnet NAT bridge and wedged the network for the rest. | ||
|
|
||
| Root cause: `LinuxContainerProvider.stop()` only SIGTERM'd the foreground | ||
| `container run` client (an attachment to the daemon-managed container) and | ||
| relied on its termination handler to force-delete the container — which did | ||
| not fire reliably, so the container kept running. `stop()` now force-deletes | ||
| the container by name directly (`container delete --force`, verified to reap a | ||
| running container), so every reap actually removes its container. | ||
|
|
||
| ### Added | ||
|
|
||
| - **Liveness reaper for "registered but never online" runners.** A runner alive | ||
| past `neverConfirmedReapInterval` (default 2 min) that GitHub never once | ||
| confirmed `online` is force-deleted, deregistered, and counted toward the | ||
| launch-failure backoff — rather than waiting out the 5-minute sustained-offline | ||
| grace (which exists to protect a mid-job runner from a transient blip). A | ||
| repeated failure now flips `launchFailing` ("substrate broken") instead of | ||
| respawning forever. | ||
| - **Bridge-safe Linux concurrency ceiling** (`maxConcurrentContainersCeiling`, | ||
| default 4). The effective per-host container cap is the lower of the RAM/CPU | ||
| divide and this ceiling, because Apple `container`'s single shared vmnet NAT | ||
| bridge — not RAM/CPU — is the binding constraint past a handful of containers. | ||
| - **Sustained runner-API outage backstop.** When `listRunners` has been failing | ||
| for a sustained window, the orchestrator holds the fleet instead of scaling up | ||
| against unknown busy-ness (which, during an egress outage, would spawn | ||
| replacements that can never connect). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -144,6 +144,15 @@ public final class RunnerOrchestrator { | |
| /// don't record a run, don't hot-loop a replacement — the periodic tick | ||
| /// paces the retry. | ||
| private let launchFailureGraceInterval: TimeInterval | ||
| /// A slot ALIVE this long without GitHub ever confirming it `online` | ||
| /// (`confirmedOnline` never set) is a never-connected zombie: it minted a | ||
| /// registration but its egress failed, so it never reached GitHub and never | ||
| /// claimed a job — and, unlike a launch-FAILURE *exit*, its container keeps | ||
| /// running, so it never trips `launchFailureGraceInterval`. Reap it well before | ||
| /// the 5-min sustained-offline grace (there's no job to orphan: an unconfirmed | ||
| /// runner is never assigned work), and count it toward the launch-failure | ||
| /// backoff so a repeated failure surfaces instead of spawning endlessly. | ||
| private let neverConfirmedReapInterval: TimeInterval | ||
|
|
||
| /// What the last queue poll saw — the UI's window into scale-from-zero | ||
| /// (zero runners is the normal armed state, so the demand signal itself is | ||
|
|
@@ -169,6 +178,12 @@ public final class RunnerOrchestrator { | |
| /// broken (dead daemon, bad base image) and "starting…" would be a lie. | ||
| private var consecutiveLaunchFailures = 0 | ||
| public var launchFailing: Bool { consecutiveLaunchFailures >= 2 } | ||
| /// When `listRunners` first started failing (cleared on the next success). A | ||
| /// SUSTAINED runner-API outage must NOT read as "spawn more": with busy-ness | ||
| /// unknown, the converge step would scale up to a fixed target every tick, and | ||
| /// when the outage is an egress failure those new runners can never connect — | ||
| /// the snowball. Used by the `health == nil` backstop in `reconcile`. | ||
| private var listRunnersFailingSince: Date? | ||
| /// Fired (on the main actor) whenever `state`/`runners` change. | ||
| public var onChange: (() -> Void)? | ||
| /// Fired (on the main actor) once for each runner that finishes — a clean | ||
|
|
@@ -264,7 +279,8 @@ public final class RunnerOrchestrator { | |
| idleJITRefreshInterval: TimeInterval? = 8 * 60, | ||
| idleTrimGraceInterval: TimeInterval = 90, | ||
| trimConfirmInterval: TimeInterval = 25, | ||
| launchFailureGraceInterval: TimeInterval = 30 | ||
| launchFailureGraceInterval: TimeInterval = 30, | ||
| neverConfirmedReapInterval: TimeInterval = 120 | ||
| ) { | ||
| self.controlPlane = controlPlane | ||
| self.factory = factory | ||
|
|
@@ -279,6 +295,7 @@ public final class RunnerOrchestrator { | |
| self.idleTrimGraceInterval = idleTrimGraceInterval | ||
| self.trimConfirmInterval = trimConfirmInterval | ||
| self.launchFailureGraceInterval = launchFailureGraceInterval | ||
| self.neverConfirmedReapInterval = neverConfirmedReapInterval | ||
| } | ||
|
|
||
| public var runners: [ManagedRunner] { | ||
|
|
@@ -397,6 +414,19 @@ public final class RunnerOrchestrator { | |
|
|
||
| // 3. Converge. | ||
| guard let health else { | ||
| // SUSTAINED runner-API outage → HOLD the fleet (don't spawn more). With | ||
| // busy-ness unknown, the scale-up below assumes queued jobs need NEW | ||
| // runners — but when the outage is an egress failure (often the same one | ||
| // stopping runners from connecting), those replacements can never come | ||
| // online, which is exactly the snowball that piled up containers. Mirrors | ||
| // the "failed queue poll HOLDS" rule above. A brief blip still scales up. | ||
| if let since = listRunnersFailingSince, | ||
| Date().timeIntervalSince(since) >= neverConfirmedReapInterval | ||
| { | ||
|
Comment on lines
416
to
+425
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Valid — fixed in #44 (folded into the re-cut v0.1.3). The outage backstop now uses its own |
||
| lastError = "Can't reach GitHub's runner API — holding the fleet (not spawning more)." | ||
| notify() | ||
| return | ||
| } | ||
| // The queue read is real but busy-ness is unknown (listRunners failed). | ||
| // Assume every current slot is busy: demand that PERSISTS alongside | ||
| // genuinely idle runners doesn't happen (GitHub assigns within seconds), | ||
|
|
@@ -538,10 +568,12 @@ public final class RunnerOrchestrator { | |
| do { | ||
| remote = try await controlPlane.listRunners() | ||
| } catch { | ||
| if listRunnersFailingSince == nil { listRunnersFailingSince = Date() } | ||
| lastError = String(describing: error) | ||
| notify() | ||
| return nil | ||
| } | ||
| listRunnersFailingSince = nil // a successful fetch clears the outage clock | ||
| guard epoch == myEpoch, state == .online else { return nil } | ||
|
|
||
| let byId = Dictionary(remote.map { ($0.id, $0) }, uniquingKeysWith: { first, _ in first }) | ||
|
|
@@ -556,6 +588,26 @@ public final class RunnerOrchestrator { | |
| let healthy = remoteRunner?.status.lowercased() == "online" | ||
|
|
||
| if !healthy { | ||
| // NEVER-CONNECTED ZOMBIE: appended, alive past `neverConfirmedReapInterval`, | ||
| // and GitHub has NEVER once confirmed it online. This is the egress-failed | ||
| // container — it registered but never reached GitHub, so it has no job to | ||
| // orphan (an unconfirmed runner is never assigned work) and its container | ||
| // won't exit on its own (so the launch-failure-on-EXIT path never fires). | ||
| // Reap it FAST instead of waiting out the 5-min sustained-offline grace | ||
| // (which exists to protect a mid-job runner from a transient blip) — else | ||
| // replacements pile up behind it. Count it toward the launch-failure | ||
| // backoff so a repeated failure flips `launchFailing` ("substrate broken") | ||
| // rather than silently respawning forever. | ||
| if slot.appended, !slot.confirmedOnline, neverConfirmedReapInterval >= 0, | ||
| age >= neverConfirmedReapInterval | ||
| { | ||
| consecutiveLaunchFailures += 1 | ||
| lastError = | ||
| "Runner \(slot.name) registered but never reached GitHub within " | ||
| + "\(Int(neverConfirmedReapInterval))s — reaping (check network/egress)." | ||
| stale.append((slot, remoteRunner)) | ||
| continue | ||
| } | ||
| // Missing from GitHub's list, or not online. Prune ONLY after SUSTAINED | ||
| // unhealth (≥ grace), never on a single bad reading — a transient | ||
| // eventual-consistency / connection blip must not tear down a runner | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #44: documented
neverConfirmedReapInterval's value semantics — a negative value disables the reaper, and a very small value (e.g. 0) reaps a runner before it can come online (test-only); the production default (120s) leaves a real connect window.