Summary
Ephemeral per-job macOS runner VMs created for GitHub Actions jobs are not being deleted after their job completes. Over time they accumulate on the Mac host, eventually filling its internal disk.
What I observed
On a host running graftery-managed runners, tart list showed 22 stopped ephemeral VMs from a single workflow, all cloned from the current arc-prepared-macos-* base image:
local arc-prepared-macos-tahoe-xcode-26-4-<hash> 140 86 86 stopped ← base (keep)
local <project>-macos-26-4-070ad969 140 88 88 stopped ← leaked
local <project>-macos-26-4-1aadff6f 140 88 88 stopped ← leaked
... (20 more of the same pattern)
Each was ~88 GiB on disk (APFS clones, so actual block allocation is less, but not zero), and together they filled the host's /System/Volumes/Data volume to 100%, leaving ~100 MiB free.
Impact
Once the host disk filled, the k3s control-plane VM (also running on this host via Lima) suffered sqlite/kine corruption (database disk image is malformed), which cascaded:
- ARC controller, storage operator, and several grafana-agent operators went into CrashLoopBackOff (900-1000+ restarts over 24 days)
- Runner scale set stopped picking up jobs; pending runner pods got stuck in ContainerCreating for 44h
- Lima guest agent became unresponsive (host fs full ⇒ macOS couldn't grow swap ⇒ guest at memory ceiling)
Manual tart delete of the 22 orphans recovered enough space to unblock recovery.
Expected behavior
Each ephemeral runner VM should be deleted once its associated GH Actions job terminates (success, failure, or cancellation). Today nothing appears to reap VMs that outlive the process that spawned them.
Possible causes (not verified)
- Teardown relies on the runner's own post-step, which doesn't run if the runner pod dies or the host loses power mid-job.
- Teardown relies on the ARC controller firing the delete — but when the controller itself is unhealthy (e.g. control-plane trouble), VMs leak and there's no independent recovery.
- tart delete may silently fail in some states and the caller ignores the exit code.
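To illustrate the last hypothesis, here is a minimal sketch of a delete wrapper that surfaces a non-zero exit code instead of dropping it. The function name delete_vm and the injectable run parameter are hypothetical and used only for illustration; this is not graftery's actual teardown code, which I have not inspected.

```python
import subprocess


def delete_vm(name, run=subprocess.run):
    """Run `tart delete <name>` and report failure instead of ignoring it.

    Returns (ok, detail). `run` is injectable so the logic can be
    exercised without tart installed (hypothetical helper).
    """
    result = run(["tart", "delete", name], capture_output=True, text=True)
    if result.returncode != 0:
        # Keep whatever tart printed so the caller can log or retry.
        detail = (result.stderr or "").strip() or f"exit code {result.returncode}"
        return False, detail
    return True, ""
```

A caller that logs the returned detail (or retries) would at least make a silent failure visible.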
Suggested fix
Add an independent reaper that periodically sweeps tart list for VMs that:
- match the ephemeral-VM naming pattern
- are in the stopped state
- have no corresponding active job / runner pod (or have exceeded a max-age threshold, e.g. 2h)
This gives us a belt-and-suspenders guarantee that controller outages or crashed runners don't turn into unbounded disk growth.
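The selection criteria above could be sketched roughly as follows. Everything here is an assumption for illustration: the name regex is guessed from the leaked VM names shown earlier, the "Name" and "State" keys assume records parsed from JSON-formatted tart list output, and VM ages are supplied by the caller because I have not checked whether tart reports a creation time.

```python
import re

# Hypothetical pattern: "<project>-macos-<version>-<8 hex chars>".
# The base image (arc-prepared-macos-tahoe-xcode-...) does not match it.
EPHEMERAL_NAME = re.compile(r"^.+-macos-[0-9-]+-[0-9a-f]{8}$")


def select_reapable(vms, active_names, ages_s=None, max_age_s=2 * 60 * 60):
    """Return names of VMs that look like leaked ephemeral runners.

    vms          -- list of dicts with "Name" and "State" keys (assumed shape)
    active_names -- names that still have an active job / runner pod
    ages_s       -- optional {name: age in seconds}; unknown ages count as 0
    """
    ages_s = ages_s or {}
    reapable = []
    for vm in vms:
        name = vm["Name"]
        if not EPHEMERAL_NAME.match(name) or vm["State"] != "stopped":
            continue
        too_old = ages_s.get(name, 0) > max_age_s
        if name not in active_names or too_old:
            reapable.append(name)
    return reapable
```

Keeping the selection as a pure function means it can be unit-tested against captured tart list output before it is ever allowed to delete anything.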
Related signals
- Deletion load is probably bursty, so the reaper should batch and rate-limit tart delete calls.
- Worth exposing a Prometheus counter for leaked/reaped VMs so this kind of drift is visible on a dashboard before it hits the disk.
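As a rough sketch of the batching idea, the loop below deletes in fixed-size batches with a pause between them and returns reaped and failed names, whose counts could feed a metric. The function name, batch size, and pause length are all hypothetical placeholders; delete is any callable that returns True on success.

```python
import time


def reap_in_batches(names, delete, batch_size=5, pause_s=10.0, sleep=time.sleep):
    """Delete VMs in batches, pausing between batches to smooth the load.

    Returns (reaped, failed) lists of names. `sleep` is injectable for tests.
    """
    reaped, failed = [], []
    for start in range(0, len(names), batch_size):
        if start:
            # Pause before every batch except the first.
            sleep(pause_s)
        for name in names[start:start + batch_size]:
            if delete(name):
                reaped.append(name)
            else:
                failed.append(name)
    return reaped, failed
```

The lengths of the two returned lists are natural inputs for a reaped/failed counter, so repeated failures show up before the disk fills.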