Skip to content

fix: data gc preview and --dry-run include version-count GC (#108)#1176

Merged
stack72 merged 1 commit intomainfrom
fix/data-gc-version-preview
Apr 14, 2026
Merged

fix: data gc preview and --dry-run include version-count GC (#108)#1176
stack72 merged 1 commit intomainfrom
fix/data-gc-version-preview

Conversation

@keeb
Copy link
Copy Markdown
Contributor

@keeb keeb commented Apr 14, 2026

Summary

Fixes #108 — interactive swamp data gc silently skipped version-count GC when no lifetime-expired data existed, so resources with lifetime: infinite, gc: N accumulated unbounded versions. --dry-run under-reported the same work. Only --force and --json executed correctly.

Root cause. dataGcPreview surfaced only lifetime-expired entries via findExpiredData, and the CLI early-returned on an empty preview. Phase 2 version GC (collectGarbage) lived inside deleteExpiredData and was skipped entirely. Separately, the Phase 2 loop was gated on if (!dryRun), so --dry-run never reported version GC counts.

Fix. Extend UnifiedDataRepository.collectGarbage with an optional { dryRun } option that computes would-be counts without deleting. Expose it via a new DataLifecycleService.previewVersionGarbage method and aggregate it into DataGcPreview.versionGcItems. Widen the CLI early-return to check both categories. Remove the if (!dryRun) guard in deleteExpiredData, passing dryRun through to collectGarbage so --dry-run reports faithfully.

Cleanup. Consolidate two identical parseDuration copies into src/domain/data/duration.ts. (The grammar exists in four other places; deliberately out of scope for this PR.)

Design

  • Repository-layer dry-run was chosen over a parallel detection method so preview counts can never drift from actual pruning — there is exactly one source of truth for what gets removed.
  • DataGcPreview grows additively (items field preserved) so downstream consumers keep working.
  • collectGarbage's options arg is optional, so all () => stubs in model/driver/validation tests continue to satisfy the interface without changes.

Test Plan

  • deno check — clean
  • deno lint — clean (1033 files)
  • deno fmt --check — clean (1047 files)
  • deno run test — 4345 passed, 0 failed
  • New unit tests: duration_test.ts; previewVersionGarbage + dry-run pass-through in data_lifecycle_service_test.ts; aggregation in gc_test.ts; renderer output in new data_gc_test.ts
  • New integration test: collectGarbage dry-run in data_versioning_test.ts
  • Manual reproduction at /tmp/swamp-repro-issue-108 (12 versions at gc=10):
    • Interactive path: preview surfaces "version gc: 1 models with 12 excess versions", confirms, prunes 13 → 10
    • --dry-run --json: reports versionsDeleted: 12 (was 0 before fix), no deletion
    • --force: prunes correctly, no regression
    • Empty repo: still shows "Nothing to clean up."

Follow-ups

  • File UAT coverage in systeminit/swamp-uat for tests/cli/data/gc_test.ts (interactive + --dry-run) plus an adversarial test for large version counts (>10k versions on a single model).

🤖 Generated with Claude Code

Interactive `swamp data gc` silently skipped version-count garbage
collection when no lifetime-expired data existed, and `--dry-run` under-
reported the same work. Both ran correctly only via `--force` or `--json`.

Add a `dryRun` option to `collectGarbage` so the repository can compute
would-be counts without deleting, and expose it through a new
`previewVersionGarbage` service method. Extend `DataGcPreview` with a
`versionGcItems` field, widen the CLI early-return to consider both
categories, and render the new category in log and json preview output.

Also consolidate the duplicated `parseDuration` helper into
`src/domain/data/duration.ts`, shared by the lifecycle service and the
unified data repository.

Closes #108

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI UX Review

Blocking

None.

Suggestions

  1. Inconsistent capitalization in log-mode preview (src/presentation/renderers/data_gc.ts lines 104, 111):
    The first log line uses GC preview: (uppercase GC) while the new version GC line uses version gc: (lowercase). Suggest either Version GC: or keeping the style uniform. Small but users scanning terminal output will notice the mismatch.

  2. "GC preview: 0 expired data items" when only version GC work exists (src/presentation/renderers/data_gc.ts line 104):
    When a repo has zero lifetime-expired entries but does have excess versions, the log output is:

    GC preview: 0 expired data items
    version gc: 2 models with 12 excess versions
    

    The first line reads oddly when its count is zero. Consider guarding it with if (preview.items.length > 0), or combining both into a single summary line.

  3. Log-mode completion message omits version deletion count (src/presentation/renderers/data_gc.ts line 37):
    GC complete: deleted N items does not surface versionsDeleted, so after --force or the interactive confirm path in log mode, users see no confirmation of how many versions were pruned. This gap is pre-existing, but the fix makes it more visible since version GC is now a first-class operation. A follow-up like GC complete: deleted N data items, M excess versions would close the loop.

  4. JSON branch of renderDataGcPreview is unreachable (src/presentation/renderers/data_gc.ts lines 86–101):
    Phase 1 (preview) is guarded by cliCtx.outputMode === "log", so the JSON branch of renderDataGcPreview is never called. The new JSON fields (versionGcModelCount, versionGcVersionCount, versionGcData) are correct and useful, but a script author running swamp data gc --json will only see the Phase 2 completed shape, not the preview shape. This is also pre-existing architecture, worth noting for a future ticket.

Verdict

PASS — the fix is correct and the user-visible changes are accurate: the early-return now covers both GC categories, the preview surfaces version GC work in both modes, and --dry-run now faithfully reports what would be pruned. No blocking issues.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Clean, well-structured fix for a real bug (version-count GC silently skipped in interactive + dry-run paths). The root cause analysis is accurate — the if (!dryRun) guard around Phase 2 and the early-return on empty preview.items were both problematic. The chosen approach of pushing dry-run support into the repository layer is sound: one source of truth for what gets removed.

Blocking Issues

None.

Suggestions

  1. Double findAllGlobal during interactive preview: In the interactive path (no --force, no --dry-run), dataGcPreview now calls both findExpiredData() and previewVersionGarbage(), each of which independently calls findAllGlobal(). Then the actual deleteExpiredData() call does a third scan. This is a reasonable correctness-over-performance trade-off for a GC operation, but if repos grow large, a future optimization could share the scan across preview methods (similar to how deleteExpiredData already reuses its single findAllGlobal result for both phases). Not blocking — just a note for if performance becomes an issue.

  2. totalVersions computed twice in renderer: In renderDataGcPreview (src/presentation/renderers/data_gc.ts:87 and :106), the reduce over versionGcItems is duplicated across the json and log branches. Minor, and since the branches are independent, this is perfectly fine as-is.

DDD-wise, the changes respect the existing layering:

  • parseDataDuration properly extracted to a domain value object operation in src/domain/data/duration.ts, re-exported through mod.ts
  • VersionGcPreviewInfo is a clean DTO on the domain service interface
  • Repository dry-run semantics stay at the infrastructure layer
  • CLI/renderers import from libswamp/mod.ts — import boundary respected
  • New types exported through src/libswamp/mod.ts

Test coverage is thorough: unit tests for the extracted duration parser, service-level tests for dry-run pass-through and previewVersionGarbage, libswamp-level tests for preview aggregation, renderer tests, and an integration test for the repository dry-run path.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None found. The core logic is correct: the dryRun flag is properly threaded through collectGarbage, the !dryRun guard correctly protects both Deno.remove and the post-deletion updateLatestMarker/catalog update, and the options parameter is optional with a safe ?? false default so all existing callers (including the () => stubs in model/driver/validation tests) continue to satisfy the interface.

Medium

  1. Preview can double-count when lifetime-expired data also has version GC excesssrc/libswamp/data/gc.ts:124. dataGcPreview runs findExpiredData() and previewVersionGarbage() independently via Promise.all. A data item with e.g. lifetime: "1d" (expired) and gc: 3 with 10 versions would appear in both: as an expired item (all 10 versions) AND as having 7 excess versions. The JSON preview would report expiredDataCount: 1 plus versionGcVersionCount: 7, overstating the total work. In practice this is mitigated: (a) log mode only shows dataEntriesExpired in the final result, not versionsDeleted, so the discrepancy is invisible; (b) --dry-run via deleteExpiredData({dryRun: true}) skips Phase 1 deletion, so Phase 2 correctly sees and counts the data. Still, the interactive JSON preview can mislead in this edge case.

  2. Triple findAllGlobal() scan in interactive pathsrc/domain/data/data_lifecycle_service.ts:247 and :285. In the interactive path: (1) findExpiredData()findAllGlobal(), (2) previewVersionGarbage()findAllGlobal() + collectGarbage(dry) per unique model, (3) deleteExpiredData()findAllGlobal() + collectGarbage per unique model. For repos with many models, this triples the I/O. Not a correctness issue but worth noting for repos with thousands of model instances.

Low

  1. parseDataDuration("0d") returns 0src/domain/data/duration.ts:29. The regex ^\d+(mo|y|h|m|d|w)$ matches "0d", yielding 0 milliseconds. This is safe in practice because GarbageCollectionSchema rejects zero-value strings via its .refine() check, and lifetime values normalize zero durations to "workflow". Only a direct call bypassing schema validation would hit this. Pre-existing behavior (moved, not introduced by this PR).

  2. Promise.all in dataGcPreview loses partial results on rejectionsrc/libswamp/data/gc.ts:124. If previewVersionGarbage() throws (e.g. findAllGlobal() fails), the already-resolved findExpiredData() result is lost. In practice both call findAllGlobal() so if one fails the other likely does too, and individual model errors within previewVersionGarbage are caught. Negligible risk.

Verdict

PASS — The fix is correct and well-tested. The core bug (interactive/dry-run paths skipping Phase 2 version GC) is properly resolved. The dryRun plumbing through the repository is clean and the interface change is backwards-compatible. The parseDuration consolidation is a faithful extraction with no behavioral change. Test coverage is thorough across unit, integration, and presentation layers. The medium-severity items are informational inaccuracies in edge-case previews, not execution correctness issues.

@stack72 stack72 merged commit 9984ef2 into main Apr 14, 2026
10 checks passed
@stack72 stack72 deleted the fix/data-gc-version-preview branch April 14, 2026 00:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants