feat(gc): add --snapshot LRU eviction flag#53
Merged
Conversation
Closes #33. Adds opt-in snapshot reclamation to `cocoon gc`: --snapshot evict every non-pending snapshot --snapshot-keep N keep N most-recently-accessed --snapshot-age DUR evict last-accessed older than DUR --snapshot-size SIZE evict oldest until total <= SIZE --dry-run preview without acting Sub-flags combine as a union of evictions (intersection of kept) — a snapshot survives only if it passes every active criterion. Bare `--snapshot` (no sub-flag) nukes all non-pending, useful for dev resets. Mixing sub-flags without `--snapshot` errors out. SnapshotRecord gains LastAccessedAt (touched on Restore / DataDir / Export) and SizeBytes (filled at Create / Import). Old records have zero LastAccessedAt and sort oldest by design — back-compat is not a goal here. gc.Module.Resolve now takes ctx (6 impls rippled) so eviction logs flow through the orchestrator ctx instead of context.Background. resolveRecord + accessRecord deduped into a single lookupRecord helper with a touch bool; pickLRU + logEvictions + EvictionPolicy live in snapshot/localfile/gc.go; DirSize utility added to utils for Create/Import to cache record size cheaply. README adds a systemd timer template for scheduled cleanup per issue #33's "support clean up regularly" requirement.
1. Rename --dry-run to --snapshot-dry-run The flag only affected snapshot LRU eviction, not the rest of GC (orphans / stale-pending / image / VM / network all still ran). The old name promised "act on nothing"; the new name plus an explicit README note (`Snapshot-only — orphans and other GC modules still execute`) removes the ambiguity. 2. Skip cleanResolvedRecords on entries whose RemoveAll failed Collect previously called cleanResolvedRecords on the full input set even when individual RemoveAll calls errored — orphaning the data dir AND dropping the DB record. Now only successfully removed IDs are forwarded; failed ones are retried next cycle. 3. Reject negative --snapshot-keep / --snapshot-age / --snapshot-size parseSnapshotPolicy only checked `> 0` so `-1` silently fell into the "no sub-criteria" branch and triggered bare --snapshot semantics (delete everything). Now negative values error out explicitly with the flag name. Tests: TestGCModule_RemovalFailureKeepsDBRecord + cmd/others/ handler_test.go (Defaults / NegativeRejected / SubFlagRequiresSnapshot / HappyPath).
- Wrap Collect's per-id RemoveAll error with `remove snapshot %s` (matches sibling Delete); wrap cleanResolvedRecords error with `clean DB records` so errors.Join output keeps the failure point. - Godoc on Option + WithGCPolicy (exported, missing per SKILL). - Rename `t` local in meta() test helper to `accessedAt` — `t` is conventionally `*testing.T` in test files.
There was a problem hiding this comment.
Pull request overview
Implements opt-in LRU snapshot reclamation in cocoon gc (issue #33 follow-up). Adds new CLI sub-flags driving an EvictionPolicy consumed by the localfile snapshot GC module, persists LastAccessedAt / SizeBytes on snapshot records, threads context.Context into every gc.Module.Resolve implementation, and documents systemd-timer / cron deployment patterns.
Changes:
- New
--snapshot,--snapshot-keep|age|size,--snapshot-dry-runflags wired throughcmd/others, parsed intolocalfile.EvictionPolicy, validated for negatives and required-with---snapshot. snapshot.SnapshotRecordgainsLastAccessedAt+SizeBytes; touched on Restore/DataDir/Export, sized on Create/Import via newutils.DirSize.resolveRecordreplaced withlookupRecord(ctx, ref, touch).snapshot/localfile/gc.goextended withpickLRU(union of age/keep/size, dry-run logging) andcleanResolvedRecords;gc.Module.Resolvesignature now takesctx, rippled across 6 backends.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| utils/file.go | Adds DirSize helper to sum regular file sizes under a tree. |
| snapshot/db.go | Adds SizeBytes and LastAccessedAt fields to SnapshotRecord. |
| snapshot/localfile/localfile.go | Adds Option/WithGCPolicy, replaces resolveRecord with lookupRecord, sizes & touches records on Create. |
| snapshot/localfile/import.go | Records SizeBytes + LastAccessedAt on Import. |
| snapshot/localfile/gc.go | Adds EvictionPolicy, pickLRU, logEvictions, cleanResolvedRecords; ctx-aware Resolve. |
| snapshot/localfile/gc_test.go | New unit tests covering LRU policy combinations, dry-run, end-to-end GC, orphan dirs. |
| gc/module.go, gc/runner.go, gc/orchestrator.go | Thread context.Context into Resolve / resolveTargets. |
| hypervisor/gc.go, images/gc.go, network/bridge/gc_{linux,other}.go, network/cni/gc.go | Update Resolve signature to accept ctx. |
| cmd/core/helpers.go | InitSnapshot accepts variadic localfile.Option. |
| cmd/others/commands.go | Defines new gc sub-flags. |
| cmd/others/handler.go | Parses sub-flags into EvictionPolicy, validates and wires to backend. |
| cmd/others/handler_test.go | Tests flag parsing (defaults, negatives, requires-snapshot, happy path). |
| README.md | New "Snapshot LRU Eviction" + "Scheduled Snapshot GC" sections. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
71e25fd to
b1f0a09
Compare
…eate - lookupRecord: spawn touchAccess goroutine and return immediately, so hot paths (clone DataDir, Export) never block on the index write lock. Touch is best-effort LRU bookkeeping — caller ctx cancellation must not abort it; failure logs warn. - Close: blocks pending touches; tests register via t.Cleanup so temp dirs aren't removed mid-write. Production CLI relies on the substantial post-lookup work (clone/restore/export) to keep the process alive past the few-ms goroutine. - Create: stamp LastAccessedAt at finalize via a fresh time.Now() instead of the now captured before tar extraction so freshly-finished snapshots don't look minutes old to --snapshot-age right after Create returns.
b1f0a09 to
bcfa989
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
cmd/others/handler.go:95
--snapshot-sizeis parsed withunits.RAMInBytes, which interprets all SI-style suffixes as base-2 (e.g.100GB→ 100 × 2³⁰ ≈ 107 GB decimal), not base-10. The README example (--snapshot-size=100GB) and the flag help text (e.g. 100GB) don't tell the operator which interpretation is used, and "100GB" colloquially means 10⁹ bytes for most users when talking about disk usage. Either switch tounits.FromHumanSize(decimal SI) or document explicitly that suffixes are binary (KiB/MiB/GiB semantics).
if sizeStr != "" {
n, err := units.RAMInBytes(sizeStr)
if err != nil {
return localfile.EvictionPolicy{}, fmt.Errorf("--snapshot-size %q: %w", sizeStr, err)
}
if n < 0 {
return localfile.EvictionPolicy{}, fmt.Errorf("--snapshot-size must be >= 0, got %s", sizeStr)
}
size = n
}
- TestRestoreUpdatesLastAccessedAt: call lf.Close() before reading LastAccessedAt; the touch goroutine had no happens-before relation to the read, occasionally allowing the assertion to race the persist. - gc.go Collect: thread ctx instead of dropping it; check ctx.Err() between os.RemoveAll calls so Ctrl-C / systemd cancellation can interrupt long eviction batches. - Close godoc: drop the inaccurate "CLI invokes this" claim — only tests wire Close today.
…size set Records created before this PR have SizeBytes=0 (omitempty zero value). The MaxSize loop in pickLRU summed those as 0, so an installation with 50 GB of legacy snapshots could see "--snapshot-size=10GB" trigger no eviction at all. ReadDB now calls DirSize for any record where SizeBytes==0 (only when policy.MaxSize > 0 — opt-in cost) and persists the result so subsequent gc invocations don't re-walk. After one gc cycle the corpus is fully sized.
a0442d8 to
ca3c38a
Compare
Two correctness bugs: - backfillSizeBytes ran inside gcModule.ReadDB which is invoked by gc.Orchestrator.Run *under* the snapshot-store lock; calling store.Update re-acquires the same flock and the chan-based in-process guard is non-reentrant — gc --snapshot --snapshot-size deadlocks the moment any record has SizeBytes==0. Switch to store.WriteRaw (the orchestrator already holds the lock). - async lookupRecord touch had no production drain: snapshot.Snapshot is the interface returned to CLI handlers, it has no Close method, and short-lived commands (Restore-then-exit) could race the goroutine past process exit — silently dropping the LastAccessedAt touch the README's LRU contract promises. Revert to synchronous touch under the existing Update path; cocoon's CLI model already serializes on flock across processes so the in-process channel contention is moot, and the per-call cost is sub-ms. Drops the touchWG + Close machinery the async design required; tests no longer need lf.Close() in t.Cleanup.
- pickLRU: use slices.SortedFunc(maps.Keys(records), ...) instead of the local entry struct + slices.SortFunc + manual []string fill. Looks records up by ID inline; drops 12 lines of plumbing. - backfillSizeBytes: drop the parallel updates map; iterate records once in WriteRaw and only write when on-disk SizeBytes differs. - Trim two godocs (backfillSizeBytes, EvictionPolicy) of origin chatter that belongs in the commit log.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements issue #33 (LRU snapshot GC + scheduled cleanup) by extending `cocoon gc` with opt-in LRU snapshot reclamation.
Sub-flags combine as union of evictions / intersection of kept — a snapshot survives only if it passes every active criterion. Mixing sub-flags without `--snapshot` is rejected.
What's added
Framework change (heads-up)
`gc.Module[S].Resolve` signature now takes `ctx` as the first arg — 6 implementations rippled (`hypervisor`, `images`, `network/bridge` linux+other, `network/cni`, `snapshot/localfile`). Needed so eviction logs reach the orchestrator ctx instead of `context.Background()`.
Design notes
Verification
```
✓ make lint # 0 issues × Linux + Darwin
✓ go test ./... # all packages green; 13 new LRU tests pass
✓ /code SKILL line-by-line review (round 1: 6 issues fixed)
✓ /simplify 3 parallel agents (round 2: reuse + quality + efficiency, 2 actionable fixes)
```
Test plan