Skip to content

Proofs stalled: gate-a-audit serialized on a single online audit runner (33-deep queue) #2172

@cgbarlow

Description

@cgbarlow

Summary

Proof throughput stalled on 18 Jun 2026. Root cause: the gate-a-audit job (authoritative axiom audit) is serialized on a single online namespace-profile-unsorry-audit runner, and the proof gate-a backlog has grown faster than one runner can drain it. Nothing downstream is broken — queued PRs are MERGEABLE with --auto armed and will merge themselves once Gate A goes green.

Evidence (as of 03:06 UTC / 15:06 NZT, 18 Jun)

Timeline

Event UTC NZT (18 Jun)
Last proof merged to main (#2064) 22:34 10:34
Last proof PR opened (#2167) 01:48 13:48
Current audit started 02:34 14:34

Swarm stopped opening new PRs at ~13:48 NZT (#2167 last), consistent with the Gate A in-flight cap throttling submissions once the queue exceeded it.

Not broken (ruled out)

  • Auto-merge — 30+ queued PRs MERGEABLE + --auto armed.
  • Gate B, lint, dispatcher, non-audit gate-a jobs — all green.
  • Board backlog (324 queued / 65 in-flight / 6 solvers) is real claimed work waiting on this same gate.

Root cause

Audit concurrency collapsed to ~1 because the namespace-profile-unsorry-audit runner pool is down to a single online instance. Each audit is a full mathlib build + axiom audit, so one runner cannot drain a 33-deep (and growing) queue.

Unverified

Why the pool is at 1 is unconfirmed — the org-runner API returns 403 for the available token, so autoscaler quota / billing / dropped instances couldn't be inspected. Needs a look at the Namespace dashboard.

Recommended action

Restore namespace-profile-unsorry-audit runner capacity (check Namespace pool/quota/billing). No code/PR fix required — with 2–3 audit runners back, the queued proofs auto-merge and the swarm resumes opening PRs as the in-flight cap clears.

Possible follow-ups worth deciding on:

  • Whether the audit-runner pool size lives in-repo (workflow/ADR) or purely in Namespace config.
  • An alert when proof gate-a queue depth or audit wait time crosses a threshold, so this surfaces before throughput visibly drops.

Filed from a live investigation; numbers are point-in-time snapshots.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingciCI/CD and automation

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions