feat(ci): stale-run janitor — cancel zombie Gate A runs #2250
Merged
A dead Namespace runner leaves its Gate A run wedged `in_progress` past the timeout (GitHub never gets the completion signal); abandoned runs sit `queued` forever. Either way they keep counting against the submission governor's in-flight budget (UNSORRY_MAX_GATE_A_IN_FLIGHT), starving real verification and triggering cancellation cascades: the "70 PRs failing" that was really one zombie jam occupying the budget.

Adds tools/repo/stale_runs.py + the stale-run-janitor workflow (every 20 min + manual). Cancels gate-a runs `in_progress` past a normal run (default 90 min) or `queued` long enough to be abandoned (default 180 min): conservative limits that never touch a normal ~15 min run or a briefly-queued one. Uses the default GITHUB_TOKEN (actions: write); no REFRESH_TOKEN needed. Pure is_stale predicate is unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why
This is the fix for the recurring "many PRs failing" symptom that is really one zombie jam. When a Namespace runner dies mid-job, GitHub never gets the completion signal and the Gate A run wedges `in_progress` past the timeout; abandoned runs sit `queued` forever. Either way they keep counting against the submission governor's in-flight budget (`UNSORRY_MAX_GATE_A_IN_FLIGHT`), starving real verification and triggering cancellation cascades. (Today there were 24 such zombies, up to 7.7 h old, eating the budget.)

What
- `tools/repo/stale_runs.py`: cancels `gate-a.yml` runs stuck `in_progress` past a normal run (default 90 min; normal incremental Gate A is ~15 min) or `queued` long enough to be abandoned (default 180 min; legitimate queueing waits minutes, not hours).
- `.github/workflows/stale-run-janitor.yml`: runs every 20 min + `workflow_dispatch`. Uses the default `GITHUB_TOKEN` (`actions: write`); no PR creation, so no REFRESH_TOKEN needed.

Safety
Conservative per-status thresholds never touch a genuinely-running (~15 min) or briefly-queued job. The `is_stale` predicate is pure and unit-tested (7 tests).

Proof it works
The live dry-run found 6 zombies my manual sweep missed (221–465 min `in_progress`); running it live cancelled all 6.

Performance impact
Keeps the in-flight budget reflecting live work, so the cap means what it says and zombie-induced cancellation cascades stop recurring — i.e. you stop wasting the verification ceiling on dead runs. Complements the in-flight cap (#2198) and auto-archive (#2249).
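
For readers skimming without the diff open, here is a minimal sketch of what a pure per-status staleness check along these lines could look like. It is not the code in `tools/repo/stale_runs.py`: the signature of `is_stale`, the helper `stale_run_ids`, the constant names, the strict "older than the limit" comparison, and the choice of `run_started_at` (falling back to `created_at`) as the clock are all assumptions made for illustration. Only the 90 and 180 minute defaults come from this PR.

```python
from datetime import datetime, timedelta, timezone

# Defaults quoted in this PR; the constant names are hypothetical.
IN_PROGRESS_LIMIT = timedelta(minutes=90)
QUEUED_LIMIT = timedelta(minutes=180)


def is_stale(status, started_at, now,
             in_progress_limit=IN_PROGRESS_LIMIT,
             queued_limit=QUEUED_LIMIT):
    """Return True if a run has sat in `status` longer than its limit.

    Pure: no I/O and no clock access, so it can be unit-tested directly.
    Assumption: a run exactly at the limit is not yet stale.
    """
    age = now - started_at
    if status == "in_progress":
        return age > in_progress_limit
    if status == "queued":
        return age > queued_limit
    # Completed runs, or any status not listed above, are never cancelled.
    return False


def stale_run_ids(runs, now):
    """Pick the ids of stale runs from GitHub-style workflow run dicts."""
    stale = []
    for run in runs:
        # Assumed clock: run_started_at when present, else created_at.
        stamp = run.get("run_started_at") or run["created_at"]
        started = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if is_stale(run["status"], started, now):
            stale.append(run["id"])
    return stale


# Made-up runs: a normal ~15 min run, a wedged one, a briefly queued one,
# and one queued long enough to count as abandoned.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
example_runs = [
    {"id": 1, "status": "in_progress",
     "created_at": "2025-01-01T11:44:00Z",
     "run_started_at": "2025-01-01T11:45:00Z"},
    {"id": 2, "status": "in_progress",
     "created_at": "2025-01-01T07:59:00Z",
     "run_started_at": "2025-01-01T08:00:00Z"},
    {"id": 3, "status": "queued", "created_at": "2025-01-01T10:00:00Z"},
    {"id": 4, "status": "queued", "created_at": "2025-01-01T08:30:00Z"},
]
print(stale_run_ids(example_runs, now))  # prints [2, 4]
```

Passing `now` in as an argument, rather than reading the clock inside the function, is what keeps the check deterministic under test; the selection step stays separate from the actual cancel call so a dry-run can list candidates without touching anything.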
🤖 Generated with Claude Code