
fix(datastore): exclude _catalog.db* from S3/GCS sync walker (swamp-club#29)#61

Merged
stack72 merged 1 commit into main from fix/swamp-club-29-exclude-catalog-from-sync on Apr 8, 2026

Conversation

@stack72 (Contributor) commented Apr 8, 2026

Summary

Fix for swamp-club issue #29: swamp datastore sync --push against @swamp/s3-datastore fails fatally on _catalog.db-wal because the local SQLite data catalog lives inside the sync cache directory and the walker unconditionally enqueues all three _catalog.db* files for upload. Between the walker's stat and the S3 PUT, SQLite rewrites the WAL under PRAGMA journal_mode=WAL, and the upload fails mid-flight.

Fix

Two layers, both in s3_cache_sync.ts and mirrored in gcs_cache_sync.ts:

Layer 1 — skip filter. Add an isInternalCacheFile(rel) helper that returns true for the three pre-existing internal files (.datastore-index.json, .push-queue.json, .datastore.lock) plus basenames _catalog.db and anything starting with _catalog.db- (matching the SQLite WAL, SHM, and journal sidecars). Applied to the walker in pushChanged and to the remote-index iteration loop in pullChanged as a belt-and-suspenders safety net. Basename matching (not full-path equality) makes the filter robust to any future change in the data tier subdirectory name.

Layer 2 — passive index cleanup. Users who hit the bug before this fix have _catalog.db* entries sitting in their remote .datastore-index.json. Without active cleanup those zombie entries would keep getting re-uploaded on every push forever. A new private scrubIndex() method removes them from the in-memory index, and a private indexMutated flag propagates the cleaned index back to the remote on the next push — even a no-op push (pushed === 0). The scrub runs inside loadIndex() and both branches of pullIndex() so the persistence boundary enforces the invariant for any future caller. On the cache-miss branch of pullIndex the local cache file is rewritten with the scrubbed JSON so on-disk and in-memory views stay consistent; the cache-hit branch scrubs in-memory only and relies on indexMutated propagation on the next push.
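The Layer 2 mechanics can be sketched roughly as follows. The index shape, the `loadIndex` signature, and the class name here are simplified stand-ins; only the scrub-on-load and `indexMutated` propagation pattern mirrors the PR description:

```typescript
type IndexEntry = { size: number; mtime: number };
type RemoteIndex = { entries: Record<string, IndexEntry> };

class CacheSyncSketch {
  private index: RemoteIndex | null = null;
  private indexMutated = false;

  // Remove zombie _catalog.db* entries from the in-memory index;
  // returns true if anything was removed.
  private scrubIndex(): boolean {
    if (!this.index) return false;
    let mutated = false;
    for (const key of Object.keys(this.index.entries)) {
      const name = key.slice(key.lastIndexOf("/") + 1);
      if (name === "_catalog.db" || name.startsWith("_catalog.db-")) {
        delete this.index.entries[key];
        mutated = true;
      }
    }
    return mutated;
  }

  // Scrub at the persistence boundary so every caller gets a clean
  // index; a mutation flags the next push to re-upload the index,
  // even when that push otherwise uploads nothing.
  loadIndex(parsed: RemoteIndex): void {
    this.index = parsed;
    if (this.scrubIndex()) this.indexMutated = true;
  }
}
```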

Why GCS is included in this PR

gcs_cache_sync.ts has the identical three-filename hardcoded skip list at line 221 and the identical pullChanged / pushChanged / loadIndex / pullIndex structure. Shipping without the GCS fix would leave a guaranteed defect in a sibling extension. The port is literally identical code — no added risk, no scope creep. The no-shared-internals pattern across extensions means the helper is duplicated verbatim rather than extracted.

Test coverage

Seven new unit tests per extension (s3_cache_sync_test.ts and gcs_cache_sync_test.ts) covering:

  • (a) pushChanged skips _catalog.db, _catalog.db-shm, _catalog.db-wal from the walk
  • (b) pushChanged still skips the three pre-existing internal files (regression guard)
  • (c) the walker-level skip in pullChanged is a true safety net — manually injects a zombie into this.index.entries post-scrub via a private-state cast, asserts the walker still skips it
  • (d) pullChanged skips zombie _catalog.db* entries from the remote index
  • (e1) pullIndex S3-fetch path scrubs in-memory AND rewrites the local cache file with the scrubbed JSON
  • (e2) pullIndex cache-hit path scrubs in-memory only, then the next pushChanged propagates the cleaned index to the remote via indexMutated
  • (f) no-op pushChanged with a polluted on-disk index propagates the scrubbed index to the remote, then the flag resets on a second call (verifies reset semantics)

Tests use an in-memory mock client with append-only putObject recording, cast via as unknown as S3Client — the same escape-hatch pattern already used in s3_lock_test.ts.

Tests use a private-state cast (as unknown as { index, indexMutated }) to observe internal state. Accepted trade-off vs. adding a test-only public getter.
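The escape-hatch pattern looks roughly like this. `MockClient` and `MinimalClient` are hypothetical stand-ins for illustration; the real `S3Client` surface is not reproduced here:

```typescript
// In-memory test double with append-only recording of put calls.
class MockClient {
  puts: string[] = []; // append-only record of uploaded keys
  objects = new Map<string, Uint8Array>();

  putObject(key: string, body: Uint8Array): Promise<void> {
    this.puts.push(key);
    this.objects.set(key, body);
    return Promise.resolve();
  }
}

// Stand-in for the client interface the service expects.
interface MinimalClient {
  putObject(key: string, body: Uint8Array): Promise<void>;
}

// The double cast: the service is typed against the real client,
// the test hands it the mock via `as unknown as`.
const mock = new MockClient();
const client = mock as unknown as MinimalClient;
```

The trade-off is the usual one for this pattern: no compile-time guarantee that the mock tracks the real client's signature, in exchange for zero test-only surface area in production code.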

Verification: 24 tests pass in S3 extension (7 new + 17 existing), 24 tests pass in GCS extension (7 new + 17 existing). Full deno task check && deno task lint && deno task fmt && deno task test clean in both datastore/s3/ and datastore/gcs/. Before/after verification against the reproduction script from triage confirms the fix: pre-fix walker queues _catalog.db, _catalog.db-shm, _catalog.db-wal alongside the legitimate payload; post-fix walker queues only the legitimate payload.

What this PR does NOT do

  • Does not move the catalog out of the data tier. The colocation is deliberate — the catalog indexes data and must follow it into the cache directory. The fix correctly lives in the sync walker, not the catalog path.
  • Does not address the double pushChanged() call in swamp datastore sync --push. Tracked in swamp-club#30. Separate risk profile — getting push lifecycle signalling wrong could cause silent data loss, warrants its own focused PR.
  • Does not clean up orphan physical _catalog.db* objects in existing buckets. Tracked in swamp-club#31. Accepted trade-off: the index cleanup removes zombie references, but the physical objects remain as unreferenced dead weight. Affected users can manually delete them with aws s3 rm --recursive 's3://<bucket>/data/_catalog.db' (or the gsutil equivalent for GCS). Few KB of storage; no functional impact.
  • Does not add CLI-level UAT coverage. A targeted follow-up is filed as systeminit/swamp-uat#131 — the extension unit tests definitively prove the filter logic, and the UAT addition is defense-in-depth for future regressions through new code paths.

Release notes

@swamp/s3-datastore and @swamp/gcs-datastore: fixed a bug where swamp datastore sync --push would fail on _catalog.db-wal because the local SQLite data catalog was incorrectly walked into the upload set. If you previously hit this error, the remote index will self-clean on your next sync. To reclaim the few KB of orphan bucket storage left behind by pre-fix pushes, run aws s3 rm --recursive 's3://<bucket>/data/_catalog.db' or the gsutil equivalent.

Manifest bumps

  • @swamp/s3-datastore: 2026.04.03.1 → 2026.04.08.1
  • @swamp/gcs-datastore: 2026.03.31.1 → 2026.04.08.1

CI auto-publishes both extensions when this lands on main.

Test plan

  • deno task check passes in datastore/s3/ and datastore/gcs/
  • deno task lint passes in both
  • deno task fmt is clean in both
  • deno task test — 24 tests pass in each
  • Verified against the original reproduction script: pre-fix walker queues catalog files; post-fix walker does not

🤖 Generated with Claude Code

…lub#29)

`swamp datastore sync --push` against an @swamp/s3-datastore fails
fatally on `_catalog.db-wal` because the SQLite data catalog lives
inside the sync cache directory and the walker enqueues all three
`_catalog.db*` files for upload. SQLite rewrites the WAL mid-upload
with PRAGMA journal_mode=WAL, and the S3 PUT fails.

Fix in two layers:

1. Add an `isInternalCacheFile` helper that filters the three
   pre-existing internal files PLUS basenames `_catalog.db` and
   anything starting with `_catalog.db-` (matching the SQLite WAL,
   SHM, and journal sidecars). The helper uses basename matching
   so it's robust to future changes in the data tier subdirectory.

2. Add passive index cleanup so polluted remote indexes from the
   bug period heal automatically. A `scrubIndex` method removes
   zombie `_catalog.db*` entries from the in-memory index on load,
   and an `indexMutated` flag propagates the cleaned index back to
   the remote on the next push (even a no-op push). The scrub runs
   inside `loadIndex` and both branches of `pullIndex` so the
   persistence boundary enforces the invariant. On the cache-miss
   branch of `pullIndex` the local cache file is rewritten with
   the scrubbed JSON for on-disk/in-memory consistency.

Applied identically to `@swamp/s3-datastore` and
`@swamp/gcs-datastore` since both carry the same defect — the
hardcoded three-filename skip list at `s3_cache_sync.ts:252` and
`gcs_cache_sync.ts:221`.

Seven new unit tests per extension cover: push skips catalog files,
push still skips pre-existing internal files, walker-level skip is
a true safety net via post-scrub zombie injection, pull skips
zombies from remote index, pullIndex S3/GCS-fetch path scrubs and
rewrites local cache file, pullIndex cache-hit path scrubs and
indexMutated propagates via next push, no-op push propagates
scrubbed index then flag resets on second call.

Out of scope (tracked separately):
- swamp-club#30: `sync --push` calls `pushChanged()` twice per
  invocation via the coordinator flush path.
- swamp-club#31: orphan physical `_catalog.db*` objects remaining
  in pre-fix buckets. Acceptable trade-off — release notes can
  direct users to `aws s3 rm` / `gsutil rm` if they want to
  reclaim the few KB of dead weight.
- systeminit/swamp-uat#131: CLI-level UAT coverage for the
  populated-catalog push scenario.

Manifest versions bumped to 2026.04.08.1 for CI auto-publish.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@github-actions bot left a comment


Code Review

Blocking Issues

None.

Suggestions

  1. Test (c) doesn't isolate the walker's belt-and-suspenders guard. In both s3_cache_sync_test.ts and gcs_cache_sync_test.ts, test (c) seeds a clean remote index, calls pullIndex() (which sets this.index non-null), then injects a zombie entry into the in-memory index, and finally calls pullChanged(). However, pullChanged() begins by calling pullIndex() again — and because this.index !== null, the cache-hit branch is skipped and the S3/GCS-fetch branch runs, re-fetching the clean remote index and overwriting the injected zombie before the walker loop runs. So the assertion passes because pullIndex() erased the zombie via a fresh remote fetch, not because the walker's isInternalCacheFile check fired. The production guard is real and correct — this is purely a test isolation issue. To truly exercise the walker guard in isolation, the zombie would need to be injected after pullIndex() is called internally (or via this.index set to non-null through an initial push path). Low priority, the overall behavior is verified.

  2. The seedFile helper in both test files (line ~264 in gcs test, ~938 in s3 test) uses a manual substring/lastIndexOf to derive the parent directory. dirname(full) from the already-imported @std/path would be cleaner, but this is minor test helper code.


The fix itself is well-designed and correct: isInternalCacheFile cleanly consolidates the skip logic, scrubIndex() + indexMutated correctly handles the migration path for zombie entries in existing remote indexes, and both layers are applied symmetrically to S3 and GCS. Test coverage for the six behavioral scenarios is thorough.


@github-actions bot left a comment


Adversarial Review

I read all six changed files (both sync implementations, both test files, both manifests). I traced every new code path, the ||= short-circuit semantics, the scrub-during-iteration safety, error-path consistency, and concurrent batch mutation safety. The production code is solid.

Medium

  1. Test (c) does not actually exercise the walker-level skip it claims to test (s3_cache_sync_test.ts:204-251, gcs_cache_sync_test.ts:195-232).

    The test calls service.pullIndex() (populates this.index), injects a zombie, then calls service.pullChanged(). But pullChanged calls pullIndex() again internally. Since this.index !== null, the cache-hit branch condition this.index === null is false, so it falls through to the S3/GCS-fetch branch. The fetch overwrites this.index with the clean remote index, erasing the injected zombie before the walker loop ever runs. The assertion mock.gets.includes("data/_catalog.db-wal") === false passes because re-fetch removed the zombie, not because the walker's isInternalCacheFile check filtered it.

    Breaking example: If you removed the isInternalCacheFile(rel) check from the pullChanged walker loop (lines 241-243 in S3, lines 225-227 in GCS), test (c) would still pass — proving it doesn't test what its name says.

    Suggested fix: to truly test the walker skip, the zombie has to survive (or bypass) the pullIndex() call that pullChanged() makes internally. Injecting it into the mock remote index doesn't work: a fresh parse goes through scrubIndex, which removes it. Seeding it in the local cache file with a fresh mtime doesn't work either, because pullChanged calls pullIndex atomically before iterating, so there is no window to re-inject the zombie post-scrub without modifying the class. The pragmatic fix is to accept this as a documentation issue and rename the test to clarify what it actually verifies (scrub plus re-fetch cleanup).

    Impact: No production risk — the walker-level skip works correctly. This is a test fidelity issue only.

Low

  1. ||= short-circuit skips scrubIndex() when indexMutated is already true (s3_cache_sync.ts:184,444; gcs_cache_sync.ts:170,390).

    this.indexMutated ||= this.scrubIndex() will not call scrubIndex() if indexMutated is already true. In theory, if the index is reloaded from a polluted source after indexMutated was set but before it was reset, zombies could survive in-memory. In practice this cannot happen: the ||= sites only execute when this.index was previously null (guarded by this.index === null in pullIndex or if (this.index) return in loadIndex), and indexMutated starts as false and is only true after a prior scrub — by which point this.index is already set and these paths are unreachable. No actual bug, but the ||= is semantically surprising; a plain if (this.scrubIndex()) this.indexMutated = true would be clearer and avoid the short-circuit footgun for future editors. Contrast with the S3-fetch branch in pullIndex which correctly uses direct assignment.

  2. The S3 pullChanged reconciles localMtime on size-match (lines 252-257) but GCS pullChanged does not (lines 230-233) — this is a pre-existing divergence, not introduced by this PR. Noting it since the PR description says the implementations are "literally identical code."
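The short-circuit in finding 1 can be demonstrated with a self-contained snippet (hypothetical stand-in values, not the extension's code):

```typescript
// Simulate the state the review describes: the flag is already set.
let indexMutated = true;
let scrubCalls = 0;
const scrubIndex = (): boolean => {
  scrubCalls++;
  return true;
};

// `a ||= b` only evaluates b when a is falsy, so with indexMutated
// already true the scrub is silently skipped here.
indexMutated ||= scrubIndex();
// scrubCalls is still 0 at this point

// The suggested clearer form always runs the scrub and only ever
// raises the flag, never lowers it.
if (scrubIndex()) indexMutated = true;
// scrubCalls is now 1
```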

Verdict

PASS — The core fix is correct and well-tested. isInternalCacheFile covers all SQLite sidecar patterns, scrubIndex correctly heals polluted indexes, the indexMutated flag propagates cleanup to the remote, and error paths don't leave inconsistent state. The one medium finding is a test fidelity issue with no production impact.

stack72 merged commit 4f22759 into main on Apr 8, 2026; 29 checks passed.
stack72 deleted the fix/swamp-club-29-exclude-catalog-from-sync branch on April 8, 2026 at 20:07.