Leaked ephemeral macOS runner VMs accumulating on host (no cleanup) #7

Description

@diranged

Summary

Ephemeral per-job macOS runner VMs created for GitHub Actions jobs are not being deleted after their job completes. Over time they accumulate on the Mac host, eventually filling its internal disk.

What I observed

On a host running graftery-managed runners, tart list showed 22 stopped ephemeral VMs from a single workflow, all cloned from the current arc-prepared-macos-* base image:

local  arc-prepared-macos-tahoe-xcode-26-4-<hash>     140  86   86  stopped   ← base (keep)
local  <project>-macos-26-4-070ad969                  140  88   88  stopped   ← leaked
local  <project>-macos-26-4-1aadff6f                  140  88   88  stopped   ← leaked
...  (20 more of the same pattern)

Each was ~88 GiB on disk (they are APFS clones, so actual block allocation is lower, but not zero). Cumulatively they filled the host's /System/Volumes/Data volume to 100%, leaving only ~100 MiB free.

Impact

Once the host disk filled, the k3s control-plane VM (also running on this host via Lima) suffered sqlite/kine corruption (database disk image is malformed), which cascaded:

  • ARC controller, storage operator, and several grafana-agent operators went into CrashLoopBackOff (900-1000+ restarts over 24 days)
  • Runner scale set stopped picking up jobs — pending runner pods got stuck in ContainerCreating for 44h
  • Lima guest agent became unresponsive (host fs full ⇒ macOS couldn't grow swap ⇒ guest at memory ceiling)

Manual tart delete of the 22 orphans recovered enough space to unblock recovery.

Expected behavior

Each ephemeral runner VM should be deleted once its associated GitHub Actions job terminates (success, failure, or cancellation). Today nothing appears to reap VMs that outlive the process that spawned them.

Possible causes (not verified)

  1. Teardown relies on the runner's own post-step, which doesn't run if the runner pod dies or the host loses power mid-job.
  2. Teardown relies on the ARC controller firing the delete — but when the controller itself is unhealthy (e.g. control-plane trouble), VMs leak and there's no independent recovery.
  3. tart delete may silently fail in some states and the caller ignores the exit code.
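To illustrate cause 3, here is a minimal Python sketch of a delete wrapper that surfaces a non-zero exit code instead of ignoring it. The helper name `delete_vm` and the injectable `run` parameter are hypothetical and only there to make the sketch testable; the only tart invocation assumed is `tart delete <name>`, as used in the manual cleanup above. It does not reflect how graftery currently calls tart.

```python
import subprocess
from typing import Callable


def delete_vm(name: str, run: Callable = subprocess.run) -> bool:
    """Run `tart delete <name>` and report whether it succeeded.

    `run` is injectable (hypothetical, for testing); it defaults to
    subprocess.run.
    """
    result = run(["tart", "delete", name], capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure so the caller can retry or alert,
        # rather than assuming the VM is gone.
        print(
            f"tart delete {name} failed "
            f"(exit {result.returncode}): {result.stderr.strip()}"
        )
        return False
    return True
```

A caller could keep the VM on a retry list whenever this returns False, so that a failed delete does not silently become a leak.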

Suggested fix

Add an independent reaper that periodically sweeps tart list for VMs that:

  • match the ephemeral-VM naming pattern
  • are in the stopped state
  • have no corresponding active job / runner pod (or have exceeded a max-age threshold, e.g. 2h)

This gives us a belt-and-suspenders guarantee that controller outages or crashed runners don't turn into unbounded disk growth.
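The selection rule above could be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: the regex is only a guess at the ephemeral naming pattern based on the names in the listing above, the parser assumes the six-column text layout shown there (source, name, disk, size, size on disk, state), and `active` and `ages` are hypothetical inputs that would have to come from the ARC/runner side and from some VM creation timestamp respectively.

```python
import re
from typing import Iterable, Mapping, Optional

# Hypothetical pattern: "<project>-macos-<version>-<8 hex chars>",
# explicitly excluding the arc-prepared-* base images.
EPHEMERAL_RE = re.compile(
    r"^(?!arc-prepared-).+-macos-\d+(?:-\d+)*-[0-9a-f]{8}$"
)


def parse_tart_list(text: str) -> list:
    """Return (name, state) pairs from `tart list` text output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything unexpected.
        if len(fields) < 6 or not fields[2].isdigit():
            continue
        rows.append((fields[1], fields[5]))
    return rows


def select_reapable(
    listing: str,
    active: Iterable[str],
    ages: Optional[Mapping[str, float]] = None,
    max_age_s: float = 2 * 60 * 60,
) -> list:
    """Pick stopped ephemeral VMs that are orphaned or too old."""
    active = set(active)
    ages = ages or {}
    reapable = []
    for name, state in parse_tart_list(listing):
        if state != "stopped" or not EPHEMERAL_RE.match(name):
            continue
        orphaned = name not in active
        too_old = ages.get(name, 0) > max_age_s
        if orphaned or too_old:
            reapable.append(name)
    return reapable
```

Keeping the selection a pure function of the listing and the active set makes it easy to run in a dry-run mode first and log what would be deleted before enabling real deletes.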

Related signals

  • Deletion load is probably bursty, so the reaper should batch and rate-limit tart delete calls.
  • Worth exposing a Prometheus counter for leaked/reaped VMs so this kind of drift is visible on a dashboard before it hits the disk.
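As a rough sketch of both points, the Python below deletes in small batches with a pause between them and tallies outcomes. The names `reap_in_batches`, `render_metrics`, and the metric name `graftery_reaper_vms_total` are hypothetical, as are the default batch size and pause. A plain `collections.Counter` stands in for a real Prometheus client counter here; `render_metrics` just shows what the tally would look like in the Prometheus text exposition format.

```python
import time
from collections import Counter
from typing import Callable, Iterable


def reap_in_batches(
    names: Iterable[str],
    delete: Callable[[str], bool],
    batch_size: int = 5,
    pause_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Counter:
    """Delete VMs batch by batch, pausing between batches.

    `delete` returns True on success; `sleep` is injectable for tests.
    """
    stats = Counter()
    names = list(names)
    for start in range(0, len(names), batch_size):
        if start:
            sleep(pause_s)  # rate-limit: pause before every batch but the first
        for name in names[start:start + batch_size]:
            stats["reaped" if delete(name) else "failed"] += 1
    return stats


def render_metrics(stats: Counter) -> str:
    """Render the tally in Prometheus text exposition format."""
    lines = ["# TYPE graftery_reaper_vms_total counter"]
    for result in ("reaped", "failed"):
        lines.append(
            f'graftery_reaper_vms_total{{result="{result}"}} {stats[result]}'
        )
    return "\n".join(lines) + "\n"
```

Splitting the counter by result means a dashboard can alert on both a rising reaped count (something upstream is leaking) and a rising failed count (the reaper itself cannot delete).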
