fix: add application-level retry for SQLite init under concurrent access by stack72 · Pull Request #1154 · systeminit/swamp

stack72 · 2026-04-09T13:33:07Z

Summary

Adds exponential backoff retry (up to 5 attempts) around PRAGMA journal_mode=WAL and schema creation in both CatalogStore and ExtensionCatalogStore
Fixes "database is locked" crashes when multiple swamp workflow run processes start concurrently and race to initialize the same SQLite database
The existing busy_timeout=5000 pragma doesn't reliably cover the journal mode switch in Deno's node:sqlite, so application-level retry fills the gap

Test Plan

Existing CatalogStore: constructor retries under write lock contention test passes
Full test suite passes (4249 tests)
deno check, deno lint, deno fmt all clean
UAT adversarial test swamp workflow run concurrently produces distinct data versions should now pass

🤖 Generated with Claude Code

PRAGMA journal_mode=WAL requires an exclusive lock to switch modes. When multiple processes (e.g. concurrent workflow runs) open the same database simultaneously, the SQLite busy handler does not reliably cover the mode switch, causing "database is locked" crashes. Add exponential backoff retry around the WAL pragma and schema creation in both CatalogStore and ExtensionCatalogStore. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions

Code Review

Blocking Issues

None.

Suggestions

Code duplication: The initializeWithRetry() method is nearly identical in both CatalogStore and ExtensionCatalogStore. Consider extracting a shared helper (e.g., withSqliteRetry(fn: () => void)) into a common infrastructure utility to reduce duplication. Not blocking — the current approach is correct and scoped appropriately for a bug fix.
Minor: attempt count semantics: The loop runs 6 total attempts (0–5) with 5 retries, while the PR description says "up to 5 attempts." Cosmetic only — the behavior is sound.

Notes

DDD placement is correct: retry logic for SQLite locking is an infrastructure concern, properly located in src/infrastructure/persistence/.
The Atomics.wait pattern for synchronous sleep is the right choice here since the constructors are synchronous.
Backoff strategy is reasonable (~3.3s worst-case total wait), and all retried operations are idempotent.
Existing test coverage (CatalogStore: constructor retries under write lock contention) exercises the retry path via subprocess with an exclusive lock.
CI checks (lint, test, format) are all passing.

LGTM — clean, well-scoped fix for a real concurrency issue.

github-actions

Adversarial Review

Critical / High

None found.

Medium

catalog_store.ts:88-121 — Partial migration can leave populated=true for an empty catalog on retry. If migrateIfNeeded() succeeds through step 5 (setting schema_version) but step 6 (DELETE FROM catalog_meta WHERE key = 'populated') throws a lock error, the retry loop catches it. On the next attempt, migrateIfNeeded() sees the correct schema_version, returns early, and initialization "succeeds" — but the populated flag is still true for a freshly-dropped, empty catalog table. In practice this is extremely unlikely because the lock contention is on PRAGMA journal_mode=WAL (the first statement), not on DML after WAL is established. The migration operations are also not in a transaction, but that's a pre-existing condition this PR doesn't introduce. Flagging for awareness only — the probability of hitting this in production is negligible since all subsequent statements use the same connection that already acquired the WAL lock.

Low

Both files — 6 attempts, not 5. for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) with MAX_RETRIES = 5 executes 6 iterations (the initial attempt + 5 retries). This is actually fine and matches the intent described in the PR ("up to 5 attempts" of retry), but the constant name MAX_RETRIES accurately describes the behavior — just noting it since off-by-one is a common review blind spot and this one is correct.
Both files — SharedArrayBuffer allocation per retry. A new SharedArrayBuffer(4) and Int32Array are allocated on each retry iteration. This is harmless (4 bytes, GC'd promptly, max 5 allocations), but a single allocation hoisted above the loop would be marginally cleaner. Not worth changing.

Verdict

PASS. The retry logic is correct: exponential backoff with jitter, selective retry on lock/busy errors only, non-lock errors rethrow immediately, idempotent operations on retry. The Atomics.wait synchronous sleep is appropriate for Deno (available on main thread, unlike browsers). The busy_timeout=5000 is set before the retry loop, giving SQLite its own retry layer underneath. The code is well-contained, the two files are consistent, and existing tests cover the retry path.

github-actions bot approved these changes Apr 9, 2026

View reviewed changes

github-actions bot reviewed Apr 9, 2026

View reviewed changes

stack72 merged commit 0283af0 into main Apr 9, 2026
10 checks passed

stack72 deleted the worktree-flickering-popping-marshmallow branch April 9, 2026 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add application-level retry for SQLite init under concurrent access#1154

fix: add application-level retry for SQLite init under concurrent access#1154
stack72 merged 1 commit intomainfrom
worktree-flickering-popping-marshmallow

stack72 commented Apr 9, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

github-actions bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stack72 commented Apr 9, 2026

Summary

Test Plan

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Code Review

Blocking Issues

Suggestions

Notes

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Adversarial Review

Critical / High

Medium

Low

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant