fix: add application-level retry for SQLite init under concurrent access #1154
PRAGMA journal_mode=WAL requires an exclusive lock to switch modes. When multiple processes (e.g. concurrent workflow runs) open the same database simultaneously, the SQLite busy handler does not reliably cover the mode switch, causing "database is locked" crashes.

Add exponential backoff retry around the WAL pragma and schema creation in both CatalogStore and ExtensionCatalogStore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
Blocking Issues
None.
Suggestions
- Code duplication: The `initializeWithRetry()` method is nearly identical in both `CatalogStore` and `ExtensionCatalogStore`. Consider extracting a shared helper (e.g., `withSqliteRetry(fn: () => void)`) into a common infrastructure utility to reduce duplication. Not blocking: the current approach is correct and scoped appropriately for a bug fix.
- Minor: attempt count semantics: The loop runs 6 total attempts (0–5), i.e. 1 initial attempt plus 5 retries, while the PR description says "up to 5 attempts." Cosmetic only; the behavior is sound.
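A rough sketch of what such a shared helper could look like. This is illustrative only: the constants, the `isLockError` message check, and the injectable `sleep` parameter are assumptions for the example, not code taken from this PR.

```typescript
// Hypothetical shared helper; names and constants are illustrative, not from the PR.
const MAX_RETRIES = 5; // retries after the initial attempt
const BASE_DELAY_MS = 50; // assumed base delay

/** Block the current thread for `ms` milliseconds (usable from a synchronous constructor). */
function sleepSync(ms: number): void {
  const cell = new Int32Array(new SharedArrayBuffer(4));
  // The cell holds 0 and nothing notifies it, so this returns only on timeout.
  Atomics.wait(cell, 0, 0, ms);
}

/** Assumption: lock contention surfaces as an Error whose message mentions "locked" or "busy". */
function isLockError(err: unknown): boolean {
  return err instanceof Error && /locked|busy/i.test(err.message);
}

/** Run `fn`, retrying with exponential backoff plus jitter on lock errors only. */
function withSqliteRetry<T>(
  fn: () => T,
  sleep: (ms: number) => void = sleepSync,
): T {
  for (let attempt = 0; ; attempt++) {
    try {
      return fn();
    } catch (err) {
      // Non-lock errors, and lock errors after the last retry, propagate unchanged.
      if (!isLockError(err) || attempt >= MAX_RETRIES) throw err;
      const backoff = BASE_DELAY_MS * 2 ** attempt;
      const jitter = Math.random() * BASE_DELAY_MS;
      sleep(backoff + jitter);
    }
  }
}
```

Both stores could then wrap their existing initialization body in `withSqliteRetry(() => { ... })`. Making `sleep` injectable keeps unit tests fast without changing production behavior.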
Notes
- DDD placement is correct: retry logic for SQLite locking is an infrastructure concern, properly located in `src/infrastructure/persistence/`.
- The `Atomics.wait` pattern for synchronous sleep is the right choice here since the constructors are synchronous.
- Backoff strategy is reasonable (~3.3s worst-case total wait), and all retried operations are idempotent.
- Existing test coverage (`CatalogStore: constructor retries under write lock contention`) exercises the retry path via subprocess with an exclusive lock.
- CI checks (lint, test, format) are all passing.
LGTM — clean, well-scoped fix for a real concurrency issue.
Adversarial Review
Critical / High
None found.
Medium
`catalog_store.ts:88-121`: Partial migration can leave `populated=true` for an empty catalog on retry. If `migrateIfNeeded()` succeeds through step 5 (setting `schema_version`) but step 6 (`DELETE FROM catalog_meta WHERE key = 'populated'`) throws a lock error, the retry loop catches it. On the next attempt, `migrateIfNeeded()` sees the correct `schema_version`, returns early, and initialization "succeeds", but the `populated` flag is still `true` for a freshly-dropped, empty catalog table. In practice this is extremely unlikely because the lock contention is on `PRAGMA journal_mode=WAL` (the first statement), not on DML after WAL is established. The migration operations are also not in a transaction, but that's a pre-existing condition this PR doesn't introduce. Flagging for awareness only: the probability of hitting this in production is negligible since all subsequent statements use the same connection that already acquired the WAL lock.
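If this were ever worth closing off, one option would be to run the migration steps inside a single transaction so a mid-migration lock error rolls everything back and the retry starts from a clean state. A minimal sketch follows, assuming a handle with a synchronous `exec(sql)` method (as `DatabaseSync` in `node:sqlite` provides); the `runAtomically` helper and the example statements are hypothetical, not the actual migration.

```typescript
// Minimal surface of a synchronous SQLite handle; only exec(sql) is needed here.
interface SqlExecutor {
  exec(sql: string): void;
}

/** Run all statements in one transaction; on any failure, roll back and rethrow. */
function runAtomically(db: SqlExecutor, statements: string[]): void {
  // BEGIN IMMEDIATE takes the write lock up front, so contention shows up here,
  // before any statement has been applied.
  db.exec("BEGIN IMMEDIATE");
  try {
    for (const sql of statements) db.exec(sql);
    db.exec("COMMIT");
  } catch (err) {
    try {
      db.exec("ROLLBACK");
    } catch {
      // SQLite may already have rolled back; keep the original error.
    }
    throw err;
  }
}

// Hypothetical migration steps for illustration; table layout is assumed.
const exampleMigration = [
  "DROP TABLE IF EXISTS catalog",
  "DELETE FROM catalog_meta WHERE key = 'populated'",
  "INSERT OR REPLACE INTO catalog_meta (key, value) VALUES ('schema_version', '2')",
];
```

With this shape, a retry can never observe an updated `schema_version` alongside a stale `populated` flag, because either both changes commit or neither does.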
Low
- Both files: 6 attempts, not 5. `for (let attempt = 0; attempt <= MAX_RETRIES; attempt++)` with `MAX_RETRIES = 5` executes 6 iterations (the initial attempt + 5 retries). This is fine and matches the intent described in the PR ("up to 5 attempts" of retry), and the constant name `MAX_RETRIES` accurately describes the behavior. Noting it only because off-by-one is a common review blind spot and this one is correct.
- Both files: `SharedArrayBuffer` allocation per retry. A new `SharedArrayBuffer(4)` and `Int32Array` are allocated on each retry iteration. This is harmless (4 bytes, GC'd promptly, max 5 allocations), but a single allocation hoisted above the loop would be marginally cleaner. Not worth changing.
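For reference, both points can be seen in a small standalone sketch. The `countAttempts` function is hypothetical and uses a fixed 1 ms wait instead of the PR's backoff; it only illustrates the loop bounds and the hoisted buffer.

```typescript
const MAX_RETRIES = 5;

/** Run `op` with the same loop bounds as the PR and report how many times it was invoked. */
function countAttempts(op: () => void): number {
  // Single 4-byte buffer allocated once and reused for every sleep.
  const cell = new Int32Array(new SharedArrayBuffer(4));
  let attempts = 0;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    attempts++;
    try {
      op();
      break; // success: stop retrying
    } catch {
      // Sleep only if another attempt will follow.
      if (attempt < MAX_RETRIES) Atomics.wait(cell, 0, 0, 1);
    }
  }
  return attempts;
}
```

An operation that always fails is invoked 6 times (1 initial attempt + 5 retries), and one that succeeds immediately is invoked once.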
Verdict
PASS. The retry logic is correct: exponential backoff with jitter, selective retry on lock/busy errors only, non-lock errors rethrow immediately, idempotent operations on retry. The Atomics.wait synchronous sleep is appropriate for Deno (available on main thread, unlike browsers). The busy_timeout=5000 is set before the retry loop, giving SQLite its own retry layer underneath. The code is well-contained, the two files are consistent, and existing tests cover the retry path.
Summary
- Add exponential backoff retry around `PRAGMA journal_mode=WAL` and schema creation in both `CatalogStore` and `ExtensionCatalogStore`
- Fixes "database is locked" crashes when multiple `swamp workflow run` processes start concurrently and race to initialize the same SQLite database
- The `busy_timeout=5000` pragma doesn't reliably cover the journal mode switch in Deno's `node:sqlite`, so application-level retry fills the gap

Test Plan
- `CatalogStore: constructor retries under write lock contention` test passes
- `deno check`, `deno lint`, `deno fmt` all clean
- `swamp workflow run concurrently produces distinct data versions` should now pass

🤖 Generated with Claude Code