Leaked ephemeral macOS runner VMs accumulating on host (no cleanup)

## Summary

Ephemeral per-job macOS runner VMs created for GitHub Actions jobs are not being deleted after their job completes. Over time they accumulate on the Mac host, eventually filling its internal disk.

## What I observed

On a host running graftery-managed runners, `tart list` showed **22 stopped** ephemeral VMs from a single workflow, all cloned from the current `arc-prepared-macos-*` base image:

```
local  arc-prepared-macos-tahoe-xcode-26-4-<hash>     140  86   86  stopped   ← base (keep)
local  <project>-macos-26-4-070ad969                  140  88   88  stopped   ← leaked
local  <project>-macos-26-4-1aadff6f                  140  88   88  stopped   ← leaked
...  (20 more of the same pattern)
```

Each reported ~88 GiB on disk (they are APFS clones, so the actual block allocation is lower, but not zero). Cumulatively they took the host's `/System/Volumes/Data` to **100% used, with ~100 MiB free**.

## Impact

Once the host disk filled, the k3s control-plane VM (also running on this host via Lima) suffered **sqlite/kine corruption** (`database disk image is malformed`), which cascaded:

- ARC controller, storage operator, and several grafana-agent operators went into CrashLoopBackOff (900-1000+ restarts over 24 days)
- Runner scale set stopped picking up jobs — pending runner pods got stuck in `ContainerCreating` for 44h
- Lima guest agent became unresponsive (host fs full ⇒ macOS couldn't grow swap ⇒ guest at memory ceiling)

Manually running `tart delete` on the 22 orphans freed enough space to unblock recovery.

## Expected behavior

Each ephemeral runner VM should be deleted once its associated GitHub Actions job terminates (success, failure, or cancellation). Today nothing appears to reap VMs that outlive the process that spawned them.

## Possible causes (not verified)

1. Teardown relies on the runner's own post-step, which doesn't run if the runner pod dies or the host loses power mid-job.
2. Teardown relies on the ARC controller firing the delete — but when the controller itself is unhealthy (e.g. control-plane trouble), VMs leak and there's no independent recovery.
3. `tart delete` may silently fail in some states and the caller ignores the exit code.

## Suggested fix

Add an independent reaper that periodically sweeps `tart list` for VMs that:

- match the ephemeral-VM naming pattern
- are in the `stopped` state
- have no corresponding active job / runner pod (or have exceeded a max-age threshold, e.g. 2h)

This gives us a belt-and-suspenders guarantee that controller outages or crashed runners don't turn into unbounded disk growth.
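A minimal sketch of the selection step, to make the criteria above concrete. It assumes the text layout of `tart list` shown in the listing earlier (source, name, disk, size, size on disk, state); the `EPHEMERAL_NAME` regex and the `active_names` set are hypothetical stand-ins for graftery's real naming scheme and for whatever source of truth reports live jobs or runner pods. The max-age check is left out because the listing above carries no creation time.

```python
import re
from typing import List, Set

# Hypothetical pattern for names like "<project>-macos-26-4-070ad969".
# The real ephemeral-VM naming scheme may differ.
EPHEMERAL_NAME = re.compile(r"^.+-macos-[0-9-]+-[0-9a-f]{8}$")


def select_leaked(tart_list_output: str, active_names: Set[str]) -> List[str]:
    """Return local, stopped, ephemeral-looking VMs that have no active job."""
    leaked = []
    for line in tart_list_output.splitlines():
        fields = line.split()
        # Assumed columns: source, name, disk, size, size-on-disk, state.
        if len(fields) < 6:
            continue
        source, name, state = fields[0], fields[1], fields[5]
        if source != "local":
            continue  # also skips a header row, if one is present
        if state != "stopped":
            continue  # never touch running VMs
        if not EPHEMERAL_NAME.match(name):
            continue  # keeps the arc-prepared-macos-* base image
        if name in active_names:
            continue  # a job or runner pod still owns this VM
        leaked.append(name)
    return leaked
```

Keeping the selection a pure function of the listing text makes it easy to test against captured output before it is ever wired to `tart delete`. If the installed `tart` offers machine-readable output, parsing that would likely be more robust than splitting columns.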

## Related signals

- Deletion load is probably bursty, so the reaper should batch and rate-limit `tart delete` calls.
- Worth exposing a Prometheus counter for leaked/reaped VMs so this kind of drift is visible on a dashboard before it hits the disk.
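The two points above could be sketched together as follows: deletions run in small batches with a pause between them, every exit code is checked, and the outcome is tallied so it can be exported as a metric. The function names, `batch_size`, and `pause_seconds` are illustrative assumptions rather than existing graftery settings; the only external call is `tart delete <name>`.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable


def run_tart_delete(name: str) -> int:
    """Run `tart delete <name>` and return its exit code."""
    return subprocess.run(["tart", "delete", name]).returncode


def reap(
    names: Iterable[str],
    delete: Callable[[str], int] = run_tart_delete,
    batch_size: int = 5,          # illustrative default
    pause_seconds: float = 2.0,   # illustrative default
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Delete VMs in batches, pausing between batches, and count outcomes."""
    counts = {"reaped": 0, "failed": 0}
    pending = list(names)
    for start in range(0, len(pending), batch_size):
        if start > 0:
            sleep(pause_seconds)  # rate-limit: wait before each later batch
        for name in pending[start:start + batch_size]:
            # A non-zero exit code is recorded instead of being ignored.
            if delete(name) == 0:
                counts["reaped"] += 1
            else:
                counts["failed"] += 1
    return counts
```

The returned counts are plain integers; a real reaper would presumably feed them into whatever metrics client the project already uses, and a persistently non-zero `failed` value would be the signal to alert on.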
