
fix: add application-level retry for SQLite init under concurrent access#1154

Merged
stack72 merged 1 commit into main from worktree-flickering-popping-marshmallow on Apr 9, 2026

Conversation

Contributor

stack72 commented on Apr 9, 2026

Summary

  • Adds exponential backoff retry (up to 5 attempts) around PRAGMA journal_mode=WAL and schema creation in both CatalogStore and ExtensionCatalogStore
  • Fixes "database is locked" crashes when multiple swamp workflow run processes start concurrently and race to initialize the same SQLite database
  • The existing busy_timeout=5000 pragma doesn't reliably cover the journal mode switch in Deno's node:sqlite, so application-level retry fills the gap
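The pattern described above can be sketched as follows. This is an illustrative sketch, not the PR's exact code: the constant values, the `isLockError` and `sleepSync` helpers, and the error-message matching are assumptions, and the retry wraps an arbitrary `init` callback so the example stays runnable without `node:sqlite`.

```typescript
// Illustrative sketch of the retry pattern; constants and helper names
// are assumptions, not the PR's exact code.
const MAX_RETRIES = 5;

// SQLite lock contention surfaces as "database is locked" / "database is busy".
function isLockError(err: unknown): boolean {
  const msg = err instanceof Error ? err.message : String(err);
  return /database is (locked|busy)/i.test(msg);
}

// Synchronous sleep: Atomics.wait on a value that never changes simply
// times out after `ms`. Allowed on the main thread in Deno and Node.
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Runs `init` (e.g. the WAL pragma plus schema creation) with exponential
// backoff and jitter; non-lock errors rethrow immediately.
function initializeWithRetry(init: () => void): void {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      init();
      return;
    } catch (err) {
      if (!isLockError(err) || attempt === MAX_RETRIES) throw err;
      sleepSync(100 * 2 ** attempt + Math.random() * 50);
    }
  }
}
```

In the stores, `init` would execute the `PRAGMA journal_mode=WAL` statement followed by the schema-creation DDL, all of which must be idempotent for retry to be safe.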

Test Plan

  • Existing "CatalogStore: constructor retries under write lock contention" test passes
  • Full test suite passes (4249 tests)
  • deno check, deno lint, deno fmt all clean
  • UAT adversarial test "swamp workflow run concurrently produces distinct data versions" should now pass

🤖 Generated with Claude Code

PRAGMA journal_mode=WAL requires an exclusive lock to switch modes.
When multiple processes (e.g. concurrent workflow runs) open the same
database simultaneously, the SQLite busy handler does not reliably
cover the mode switch, causing "database is locked" crashes. Add
exponential backoff retry around the WAL pragma and schema creation
in both CatalogStore and ExtensionCatalogStore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot left a comment


Code Review

Blocking Issues

None.

Suggestions

  1. Code duplication: The initializeWithRetry() method is nearly identical in both CatalogStore and ExtensionCatalogStore. Consider extracting a shared helper (e.g., withSqliteRetry(fn: () => void)) into a common infrastructure utility to reduce duplication. Not blocking — the current approach is correct and scoped appropriately for a bug fix.

  2. Minor: attempt count semantics: The loop runs 6 total attempts (0–5) with 5 retries, while the PR description says "up to 5 attempts." Cosmetic only — the behavior is sound.
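The extraction suggested in point 1 could look something like this. The helper name `withSqliteRetry` comes from the suggestion itself; everything else (the module location, the error-message pattern, the backoff constants) is a hypothetical sketch, not the PR's code:

```typescript
// Hypothetical shared helper, e.g. in src/infrastructure/persistence/.
// Retries `fn` on SQLite lock/busy errors with exponential backoff;
// any other error is rethrown immediately.
function withSqliteRetry(fn: () => void, maxRetries = 5): void {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      if (!/database is (locked|busy)/i.test(msg) || attempt === maxRetries) {
        throw err;
      }
      // Synchronous backoff: the store constructors are synchronous,
      // so we cannot await a timer here.
      Atomics.wait(
        new Int32Array(new SharedArrayBuffer(4)),
        0,
        0,
        100 * 2 ** attempt,
      );
    }
  }
}
```

Both constructors could then call `withSqliteRetry(() => this.initialize())`, removing the duplicated loop.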

Notes

  • DDD placement is correct: retry logic for SQLite locking is an infrastructure concern, properly located in src/infrastructure/persistence/.
  • The Atomics.wait pattern for synchronous sleep is the right choice here since the constructors are synchronous.
  • Backoff strategy is reasonable (~3.3s worst-case total wait), and all retried operations are idempotent.
  • Existing test coverage (CatalogStore: constructor retries under write lock contention) exercises the retry path via subprocess with an exclusive lock.
  • CI checks (lint, test, format) are all passing.
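The ~3.3s worst-case figure is consistent with a schedule of a 100ms base delay doubling per retry over five retries, jitter excluded (an assumption inferred from the number above, not confirmed against the diff):

```typescript
// Deterministic part of the worst-case wait: base 100ms doubled per
// retry across 5 retries (delay attempts 0..4), jitter excluded.
const delays = [0, 1, 2, 3, 4].map((attempt) => 100 * 2 ** attempt);
const totalMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalMs); // [ 100, 200, 400, 800, 1600 ] 3100
```

That is 3.1s deterministic, with per-retry jitter plausibly accounting for the remaining ~0.2s.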

LGTM — clean, well-scoped fix for a real concurrency issue.


github-actions bot left a comment


Adversarial Review

Critical / High

None found.

Medium

  1. catalog_store.ts:88-121 — Partial migration can leave populated=true for an empty catalog on retry. If migrateIfNeeded() succeeds through step 5 (setting schema_version) but step 6 (DELETE FROM catalog_meta WHERE key = 'populated') throws a lock error, the retry loop catches it. On the next attempt, migrateIfNeeded() sees the correct schema_version, returns early, and initialization "succeeds" — but the populated flag is still true for a freshly-dropped, empty catalog table. In practice this is extremely unlikely because the lock contention is on PRAGMA journal_mode=WAL (the first statement), not on DML after WAL is established. The migration operations are also not in a transaction, but that's a pre-existing condition this PR doesn't introduce. Flagging for awareness only — the probability of hitting this in production is negligible since all subsequent statements use the same connection that already acquired the WAL lock.

Low

  1. Both files — 6 attempts, not 5. for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) with MAX_RETRIES = 5 executes 6 iterations (the initial attempt + 5 retries). This is actually fine and matches the intent described in the PR ("up to 5 attempts" of retry), but the constant name MAX_RETRIES accurately describes the behavior — just noting it since off-by-one is a common review blind spot and this one is correct.

  2. Both files — SharedArrayBuffer allocation per retry. A new SharedArrayBuffer(4) and Int32Array are allocated on each retry iteration. This is harmless (4 bytes, GC'd promptly, max 5 allocations), but a single allocation hoisted above the loop would be marginally cleaner. Not worth changing.
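The hoisting described in point 2 would look roughly like this (a minimal sketch; the names are illustrative):

```typescript
// One scratch buffer for all sleeps instead of one per retry iteration.
// Atomics.wait on a value that never changes acts as a synchronous timeout;
// the buffer contents are never read or written, so sharing it is safe.
const sleepBuf = new Int32Array(new SharedArrayBuffer(4));

function sleepSync(ms: number): void {
  Atomics.wait(sleepBuf, 0, 0, ms);
}
```

As the review notes, the saving is negligible; this is a style nit, not a correctness issue.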

Verdict

PASS. The retry logic is correct: exponential backoff with jitter, selective retry on lock/busy errors only, non-lock errors rethrow immediately, idempotent operations on retry. The Atomics.wait synchronous sleep is appropriate for Deno (available on main thread, unlike browsers). The busy_timeout=5000 is set before the retry loop, giving SQLite its own retry layer underneath. The code is well-contained, the two files are consistent, and existing tests cover the retry path.

stack72 merged commit 0283af0 into main on Apr 9, 2026
10 checks passed
stack72 deleted the worktree-flickering-popping-marshmallow branch on April 9, 2026 at 13:38