feat(ci): stale-run janitor — cancel zombie Gate A runs #2250

Merged: perttu merged 1 commit into main from feat/stale-run-janitor on Jun 18, 2026
Conversation

@perttu (Collaborator) commented Jun 18, 2026

Why

This fixes the recurring "many PRs failing" symptom, which is really a single zombie jam. When a Namespace runner dies mid-job, GitHub never receives the completion signal, so the Gate A run stays wedged in_progress past its timeout; abandoned runs, meanwhile, sit queued forever. Either way, they keep counting against the submission governor's in-flight budget (UNSORRY_MAX_GATE_A_IN_FLIGHT), starving real verification and triggering cancellation cascades. (Today there were 24 such zombies, up to 7.7 h old, eating the budget.)
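To make the budget arithmetic concrete, here is a minimal sketch of how zombie runs crowd out live work. The function name `free_slots` and the cap of 30 are hypothetical illustrations; this PR does not state the actual value of UNSORRY_MAX_GATE_A_IN_FLIGHT or how the governor counts runs internally.

```python
def free_slots(cap, statuses):
    """Return how many new runs fit under the cap.

    Assumption: any run that is in_progress or queued counts as in flight,
    whether or not a live runner is still behind it.
    """
    in_flight = sum(1 for s in statuses if s in ("in_progress", "queued"))
    return max(cap - in_flight, 0)


HYPOTHETICAL_CAP = 30                 # illustrative only, not the real setting
zombies = ["in_progress"] * 24        # the 24 zombies mentioned above
live = ["in_progress"] * 3            # a few genuinely running jobs

slots_with_zombies = free_slots(HYPOTHETICAL_CAP, zombies + live)
slots_after_cleanup = free_slots(HYPOTHETICAL_CAP, live)
```

Under these assumed numbers, cancelling the zombies frees most of the budget for real verification runs.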

What

  • tools/repo/stale_runs.py: cancels gate-a.yml runs that have been in_progress longer than any normal run (default 90 min; a normal incremental Gate A run takes ~15 min) or queued long enough to be considered abandoned (default 180 min; legitimate queueing waits minutes, not hours).
  • .github/workflows/stale-run-janitor.yml: runs every 20 min and on workflow_dispatch. Uses the default GITHUB_TOKEN (actions: write); it creates no PRs, so no REFRESH_TOKEN is needed.
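The per-status thresholds above could be expressed as a pure predicate along the following lines. This is a sketch, not the code in tools/repo/stale_runs.py: the signature, the constant names, and the choice of which timestamp to measure from are all assumptions; only the 90 and 180 minute defaults come from the description.

```python
from datetime import datetime, timedelta, timezone

# Defaults from the PR description; the constant names are hypothetical.
IN_PROGRESS_LIMIT = timedelta(minutes=90)
QUEUED_LIMIT = timedelta(minutes=180)


def is_stale(status, since, now,
             in_progress_limit=IN_PROGRESS_LIMIT,
             queued_limit=QUEUED_LIMIT):
    """Return True if a run has sat in `status` longer than its limit.

    `since` is the time the run entered its current state (an assumption;
    the real tool may measure from a different timestamp).
    """
    age = now - since
    if status == "in_progress":
        return age > in_progress_limit
    if status == "queued":
        return age > queued_limit
    # Completed or otherwise finished runs are never stale.
    return False


now = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
normal_run = is_stale("in_progress", now - timedelta(minutes=15), now)
zombie_run = is_stale("in_progress", now - timedelta(minutes=221), now)
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the predicate pure and easy to unit-test.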

Safety

Conservative per-status thresholds never touch a genuinely-running (~15 min) or briefly-queued job. The is_stale predicate is pure and unit-tested (7 tests).

Proof it works

A dry run against the live repository found 6 zombies that my manual sweep had missed (221–465 min in_progress); running it for real then cancelled all 6.
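The dry-run-then-live workflow described here can be sketched as a sweep that reports stale runs and only cancels them when asked. The `sweep` function, the run dictionary shape, and the injected `cancel` callable are hypothetical. In a real tool, `cancel` would presumably wrap GitHub's "cancel a workflow run" REST endpoint (POST /repos/{owner}/{repo}/actions/runs/{run_id}/cancel), though this PR does not show how the script does it.

```python
# Thresholds in minutes, from the PR description.
LIMITS_MIN = {"in_progress": 90, "queued": 180}


def sweep(runs, cancel, dry_run=True):
    """Return the ids of stale runs, cancelling them unless dry_run is set.

    `runs` is an iterable of dicts with 'id', 'status' and 'age_min' keys
    (a hypothetical shape chosen for this sketch).
    """
    stale = [
        r for r in runs
        if r["age_min"] > LIMITS_MIN.get(r["status"], float("inf"))
    ]
    if not dry_run:
        for run in stale:
            cancel(run["id"])
    return [r["id"] for r in stale]


runs = [
    {"id": 1, "status": "in_progress", "age_min": 221},  # zombie
    {"id": 2, "status": "in_progress", "age_min": 15},   # normal run
    {"id": 3, "status": "queued", "age_min": 465},       # abandoned
    {"id": 4, "status": "queued", "age_min": 5},         # briefly queued
]

cancelled = []
reported = sweep(runs, cancelled.append, dry_run=True)   # reports only
dry_run_cancelled = list(cancelled)
sweep(runs, cancelled.append, dry_run=False)             # actually cancels
```

Injecting the cancel action keeps the selection logic testable without network access, and the dry run shows exactly what a live run would cancel.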

Performance impact

Keeps the in-flight budget reflecting live work, so the cap means what it says and zombie-induced cancellation cascades stop recurring; in other words, the verification ceiling is no longer wasted on dead runs. Complements the in-flight cap (#2198) and auto-archive (#2249).

/.github/ + /tools/ are code-owned (@cgbarlow) — needs a code-owner review.

🤖 Generated with Claude Code

A dead Namespace runner leaves its Gate A run wedged `in_progress` past the
timeout (GitHub never gets the completion signal); abandoned runs sit `queued`
forever. Either way they keep counting against the submission governor's
in-flight budget (UNSORRY_MAX_GATE_A_IN_FLIGHT), starving real verification and
triggering cancellation cascades — the "70 PRs failing" that was really one
zombie jam occupying the budget.

Adds tools/repo/stale_runs.py + the stale-run-janitor workflow (every 20 min +
manual). Cancels gate-a runs in_progress past a normal run (default 90 min) or
queued long enough to be abandoned (default 180 min) — conservative limits that
never touch a normal ~15 min run or a briefly-queued one. Uses the default
GITHUB_TOKEN (actions: write); no REFRESH_TOKEN needed. Pure is_stale predicate
is unit-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@perttu perttu requested a review from cgbarlow as a code owner June 18, 2026 18:47
@github-actions github-actions Bot added the feat New machinery label Jun 18, 2026
@perttu perttu merged commit 702e564 into main Jun 18, 2026
20 checks passed