Skip to content

feat(gc): add --snapshot LRU eviction flag#53

Merged
CMGS merged 9 commits into
masterfrom
feat/snapshot-lru-gc
May 18, 2026
Merged

feat(gc): add --snapshot LRU eviction flag#53
CMGS merged 9 commits into
masterfrom
feat/snapshot-lru-gc

Conversation

@CMGS

@CMGS CMGS commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements issue #33 (LRU snapshot GC + scheduled cleanup) by extending `cocoon gc` with opt-in LRU snapshot reclamation.

Flag Effect
`--snapshot` Bare flag: evict every non-pending snapshot
`--snapshot-keep N` Keep N most-recently-accessed
`--snapshot-age DUR` Evict snapshots last-accessed before DUR (e.g. `720h`)
`--snapshot-size SZ` Evict oldest until total ≤ SZ (e.g. `100GB`)
`--dry-run` Preview only

Sub-flags combine as union of evictions / intersection of kept — a snapshot survives only if it passes every active criterion. Mixing sub-flags without `--snapshot` is rejected.

What's added

  • `snapshot.SnapshotRecord`: `LastAccessedAt` + `SizeBytes` fields. Touched on Restore / DataDir / Export; sized at Create / Import.
  • `snapshot/localfile/gc.go`: `EvictionPolicy`, `pickLRU`, `logEvictions`, `cleanResolvedRecords` plus extended `gcModule` Resolve to apply LRU before returning candidates.
  • `utils.DirSize` helper.
  • `lookupRecord(ctx, ref, touch)` consolidates `resolveRecord` + `accessRecord` (deep-reuse).
  • 13 new unit tests covering all policy combinations, dry-run, zero-time ordering, end-to-end via gcModule, and orphan dir cleanup.
  • README "Snapshot LRU Eviction" + "Scheduled Snapshot GC" sections (systemd timer + cron templates).

Framework change (heads-up)

`gc.Module[S].Resolve` signature now takes `ctx` as the first arg — 6 implementations rippled (`hypervisor`, `images`, `network/bridge` linux+other, `network/cni`, `snapshot/localfile`). Needed so eviction logs reach the orchestrator ctx instead of `context.Background()`.

Design notes

  • No back-compat for old records: zero `LastAccessedAt` sorts oldest and gets evicted first under `--snapshot-age`. Documented in the README.
  • Pending records are still protected: `cleanResolvedRecords` only force-removes records that are non-pending; pending records still need the 24h `pendingGCGrace` window (protects in-flight Create).
  • No daemon mode: scheduled execution delegated to systemd timer / cron — keeps `cocoon gc` a one-shot, lock-safe operation.

Verification

```
✓ make lint # 0 issues × Linux + Darwin
✓ go test ./... # all packages green; 13 new LRU tests pass
✓ /code SKILL line-by-line review (round 1: 6 issues fixed)
✓ /simplify 3 parallel agents (round 2: reuse + quality + efficiency, 2 actionable fixes)
```

Test plan

  • On testbed: `cocoon gc --snapshot --snapshot-age=24h --dry-run` lists expected candidates
  • `cocoon gc --snapshot --snapshot-keep=5` retains 5 most-recently-accessed
  • `cocoon gc --snapshot --snapshot-size=10GB` enforces size cap
  • Bare `cocoon gc --snapshot` deletes every non-pending snapshot
  • `cocoon gc --snapshot-age=24h` (no `--snapshot`) errors out
  • Existing `cocoon gc` behavior unchanged (orphans + stale-pending only)
  • systemd timer template from README boots a daily `cocoon-gc.service`

CMGS added 4 commits May 18, 2026 12:19
Closes #33.

Adds opt-in snapshot reclamation to `cocoon gc`:

  --snapshot                evict every non-pending snapshot
  --snapshot-keep N         keep N most-recently-accessed
  --snapshot-age DUR        evict last-accessed older than DUR
  --snapshot-size SIZE      evict oldest until total <= SIZE
  --dry-run                 preview without acting

Sub-flags combine as a union of evictions (intersection of kept) —
a snapshot survives only if it passes every active criterion. Bare
`--snapshot` (no sub-flag) nukes all non-pending, useful for dev
resets. Mixing sub-flags without `--snapshot` errors out.

SnapshotRecord gains LastAccessedAt (touched on Restore / DataDir /
Export) and SizeBytes (filled at Create / Import). Old records have
zero LastAccessedAt and sort oldest by design — back-compat is not
a goal here.

gc.Module.Resolve now takes ctx (6 impls rippled) so eviction logs
flow through the orchestrator ctx instead of context.Background.

resolveRecord + accessRecord deduped into a single lookupRecord
helper with a touch bool; pickLRU + logEvictions + EvictionPolicy
live in snapshot/localfile/gc.go; DirSize utility added to utils
for Create/Import to cache record size cheaply.

README adds a systemd timer template for scheduled cleanup per
issue #33's "support clean up regularly" requirement.
1. Rename --dry-run to --snapshot-dry-run

   The flag only affected snapshot LRU eviction, not the rest of GC
   (orphans / stale-pending / image / VM / network all still ran).
   The old name promised "act on nothing"; the new name plus an
   explicit README note (`Snapshot-only — orphans and other GC
   modules still execute`) removes the ambiguity.

2. Skip cleanResolvedRecords on entries whose RemoveAll failed

   Collect previously called cleanResolvedRecords on the full input
   set even when individual RemoveAll calls errored — orphaning the
   data dir AND dropping the DB record. Now only successfully
   removed IDs are forwarded; failed ones are retried next cycle.

3. Reject negative --snapshot-keep / --snapshot-age / --snapshot-size

   parseSnapshotPolicy only checked `> 0` so `-1` silently fell into
   the "no sub-criteria" branch and triggered bare --snapshot
   semantics (delete everything). Now negative values error out
   explicitly with the flag name.

Tests: TestGCModule_RemovalFailureKeepsDBRecord + cmd/others/
handler_test.go (Defaults / NegativeRejected / SubFlagRequiresSnapshot
/ HappyPath).
- Wrap Collect's per-id RemoveAll error with `remove snapshot %s`
  (matches sibling Delete); wrap cleanResolvedRecords error with
  `clean DB records` so errors.Join output keeps the failure point.
- Godoc on Option + WithGCPolicy (exported, missing per SKILL).
- Rename `t` local in meta() test helper to `accessedAt` — `t` is
  conventionally `*testing.T` in test files.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements opt-in LRU snapshot reclamation in cocoon gc (issue #33 follow-up). Adds new CLI sub-flags driving an EvictionPolicy consumed by the localfile snapshot GC module, persists LastAccessedAt / SizeBytes on snapshot records, threads context.Context into every gc.Module.Resolve implementation, and documents systemd-timer / cron deployment patterns.

Changes:

  • New --snapshot, --snapshot-keep|age|size, --snapshot-dry-run flags wired through cmd/others, parsed into localfile.EvictionPolicy, validated for negatives and required-with---snapshot.
  • snapshot.SnapshotRecord gains LastAccessedAt + SizeBytes; touched on Restore/DataDir/Export, sized on Create/Import via new utils.DirSize. resolveRecord replaced with lookupRecord(ctx, ref, touch).
  • snapshot/localfile/gc.go extended with pickLRU (union of age/keep/size, dry-run logging) and cleanResolvedRecords; gc.Module.Resolve signature now takes ctx, rippled across 6 backends.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
utils/file.go Adds DirSize helper to sum regular file sizes under a tree.
snapshot/db.go Adds SizeBytes and LastAccessedAt fields to SnapshotRecord.
snapshot/localfile/localfile.go Adds Option/WithGCPolicy, replaces resolveRecord with lookupRecord, sizes & touches records on Create.
snapshot/localfile/import.go Records SizeBytes + LastAccessedAt on Import.
snapshot/localfile/gc.go Adds EvictionPolicy, pickLRU, logEvictions, cleanResolvedRecords; ctx-aware Resolve.
snapshot/localfile/gc_test.go New unit tests covering LRU policy combinations, dry-run, end-to-end GC, orphan dirs.
gc/module.go, gc/runner.go, gc/orchestrator.go Thread context.Context into Resolve / resolveTargets.
hypervisor/gc.go, images/gc.go, network/bridge/gc_{linux,other}.go, network/cni/gc.go Update Resolve signature to accept ctx.
cmd/core/helpers.go InitSnapshot accepts variadic localfile.Option.
cmd/others/commands.go Defines new gc sub-flags.
cmd/others/handler.go Parses sub-flags into EvictionPolicy, validates and wires to backend.
cmd/others/handler_test.go Tests flag parsing (defaults, negatives, requires-snapshot, happy path).
README.md New "Snapshot LRU Eviction" + "Scheduled Snapshot GC" sections.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread snapshot/localfile/localfile.go
Comment thread cmd/others/handler.go
Comment thread snapshot/localfile/localfile.go Outdated
Comment thread cmd/others/handler.go
Comment thread snapshot/localfile/gc.go
Comment thread snapshot/localfile/gc.go
Comment thread utils/file.go
…eate

- lookupRecord: spawn touchAccess goroutine and return immediately, so
  hot paths (clone DataDir, Export) never block on the index write lock.
  Touch is best-effort LRU bookkeeping — caller ctx cancellation must
  not abort it; failure logs warn.
- Close: blocks pending touches; tests register via t.Cleanup so temp
  dirs aren't removed mid-write. Production CLI relies on the
  substantial post-lookup work (clone/restore/export) to keep the
  process alive past the few-ms goroutine.
- Create: stamp LastAccessedAt at finalize via a fresh time.Now()
  instead of the now captured before tar extraction so freshly-finished
  snapshots don't look minutes old to --snapshot-age right after Create
  returns.
@CMGS CMGS force-pushed the feat/snapshot-lru-gc branch from b1f0a09 to bcfa989 Compare May 18, 2026 05:00
@CMGS CMGS requested a review from Copilot May 18, 2026 05:00

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

cmd/others/handler.go:95

  • --snapshot-size is parsed with units.RAMInBytes, which interprets all SI-style suffixes as base-2 (e.g. 100GB → 100 × 2³⁰ ≈ 107 GB decimal), not base-10. The README example (--snapshot-size=100GB) and the flag help text (e.g. 100GB) don't tell the operator which interpretation is used, and "100GB" colloquially means 10⁹ bytes for most users when talking about disk usage. Either switch to units.FromHumanSize (decimal SI) or document explicitly that suffixes are binary (KiB/MiB/GiB semantics).
	if sizeStr != "" {
		n, err := units.RAMInBytes(sizeStr)
		if err != nil {
			return localfile.EvictionPolicy{}, fmt.Errorf("--snapshot-size %q: %w", sizeStr, err)
		}
		if n < 0 {
			return localfile.EvictionPolicy{}, fmt.Errorf("--snapshot-size must be >= 0, got %s", sizeStr)
		}
		size = n
	}

Comment thread snapshot/localfile/gc_test.go
Comment thread snapshot/localfile/localfile.go Outdated
Comment thread cmd/others/handler.go
Comment thread snapshot/localfile/gc.go Outdated
- TestRestoreUpdatesLastAccessedAt: call lf.Close() before reading
  LastAccessedAt; the touch goroutine had no happens-before relation to
  the read, occasionally allowing the assertion to race the persist.
- gc.go Collect: thread ctx instead of dropping it; check ctx.Err()
  between os.RemoveAll calls so Ctrl-C / systemd cancellation can
  interrupt long eviction batches.
- Close godoc: drop the inaccurate "CLI invokes this" claim — only
  tests wire Close today.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.

Comment thread snapshot/localfile/gc.go
Copilot finished work on behalf of CMGS May 18, 2026 05:10
…size set

Records created before this PR have SizeBytes=0 (omitempty zero value). The
MaxSize loop in pickLRU summed those as 0, so an installation with 50 GB of
legacy snapshots could see "--snapshot-size=10GB" trigger no eviction at all.

ReadDB now calls DirSize for any record where SizeBytes==0 (only when
policy.MaxSize > 0 — opt-in cost) and persists the result so subsequent gc
invocations don't re-walk. After one gc cycle the corpus is fully sized.
@CMGS CMGS force-pushed the feat/snapshot-lru-gc branch from a0442d8 to ca3c38a Compare May 18, 2026 05:35
CMGS added 2 commits May 18, 2026 13:46
Two correctness bugs:

- backfillSizeBytes ran inside gcModule.ReadDB which is invoked by
  gc.Orchestrator.Run *under* the snapshot-store lock; calling
  store.Update re-acquires the same flock and the chan-based in-process
  guard is non-reentrant — gc --snapshot --snapshot-size deadlocks the
  moment any record has SizeBytes==0. Switch to store.WriteRaw (the
  orchestrator already holds the lock).

- async lookupRecord touch had no production drain: snapshot.Snapshot
  is the interface returned to CLI handlers, it has no Close method,
  and short-lived commands (Restore-then-exit) could race the goroutine
  past process exit — silently dropping the LastAccessedAt touch the
  README's LRU contract promises. Revert to synchronous touch under the
  existing Update path; cocoon's CLI model already serializes on flock
  across processes so the in-process channel contention is moot, and
  the per-call cost is sub-ms.

Drops the touchWG + Close machinery the async design required; tests
no longer need lf.Close() in t.Cleanup.
- pickLRU: use slices.SortedFunc(maps.Keys(records), ...) instead of the
  local entry struct + slices.SortFunc + manual []string fill. Looks
  records up by ID inline; drops 12 lines of plumbing.
- backfillSizeBytes: drop the parallel updates map; iterate records once
  in WriteRaw and only write when on-disk SizeBytes differs.
- Trim two godocs (backfillSizeBytes, EvictionPolicy) of origin chatter
  that belongs in the commit log.
@CMGS CMGS merged commit ce238b5 into master May 18, 2026
4 checks passed
@CMGS CMGS deleted the feat/snapshot-lru-gc branch May 18, 2026 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants