fix(notifier): resolve waitWebSocket promise on WS error to prevent pull-loop wedge #72

Merged: jbiskur merged 1 commit into main from fix/notifier-ws-error-wedge on May 10, 2026
@jbiskur jbiskur commented May 10, 2026

Summary

  • Resolves waitWebSocket() promise immediately on RxJS subject error() instead of waiting up to 20 s for the timeout, eliminating the per-source pull-loop wedge observed in data-pathways production on 2026-05-10.
  • Binds the awaited promise + eventResolver before notificationClient.connect() so a synchronous error during connect cannot race against an unbound resolver.
  • Wraps the inner await in try/finally so notificationClient.disconnect() runs on every exit path (error, timeout, success, abort) — fixes a pre-existing socket leak on the error path.
  • Adds _internals.createNotificationClient test seam so the WS subject lifecycle can be driven deterministically in unit tests. Not part of the public API.
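
The first three changes can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/data-pump/notifier.ts: the `NotificationClient` and `Observer` shapes, the `WaitResult` type, and the function signature are hypothetical stand-ins, a plain observer object replaces the RxJS subject, and abort handling is left out for brevity.

```typescript
// Hypothetical shapes; the real notifier uses an RxJS Subject and its own client.
type Observer<T> = { next: (value: T) => void; error: (err: unknown) => void };

interface NotificationClient {
  connect(observer: Observer<string>): void;
  disconnect(): void;
}

type WaitResult = "event" | "error" | "timeout";

async function waitWebSocket(
  createClient: () => NotificationClient,
  timeoutMs: number,
): Promise<WaitResult> {
  // Bind the promise and its resolver BEFORE connect(), so an error raised
  // synchronously inside connect() already has a live resolver to call.
  let resolve!: (result: WaitResult) => void;
  const settled = new Promise<WaitResult>((r) => (resolve = r));

  const client = createClient();
  const timer = setTimeout(() => resolve("timeout"), timeoutMs);

  try {
    client.connect({
      next: () => resolve("event"),
      // Resolve on error instead of only logging, so the caller's loop can
      // re-enter the wait right away rather than sitting out the timeout.
      error: () => resolve("error"),
    });
    return await settled;
  } finally {
    // Runs on every exit path, so the socket is always released.
    clearTimeout(timer);
    client.disconnect();
  }
}
```

Because a promise settles only once, whichever of `next`, `error`, or the timer fires first decides the result, and later calls to `resolve` are ignored.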

Why

Production incident 2026-05-10: PAT events written to flowcore stopped propagating to Usable. Root cause located in src/data-pump/notifier.ts:107-140: the subject's error handler only logged, so the awaited promise never resolved on WS disconnect. The loop then spun on 20 s timeouts with no events flowing. Full investigation in Outage fragment bddefdd2-c377-472a-bc50-cd75f708f822.

Test plan

  • deno fmt --check
  • deno lint
  • deno check src/mod.ts
  • deno test -A test/tests/notifier.test.ts — 8 cases, 0 failures
  • deno test -A — full suite, 10 passed (47 steps), 0 failed
  • TDD discipline: test cases 1, 3, 6, 7, and 8 confirmed to fail meaningfully against the pre-fix code

Tests cover

  1. WS error during wait() resolves the promise within 50 ms (regression for the wedge)
  2. Subject is recreated per wait() call
  3. After a WS-error cycle, the next wait() delivers events normally
  4. Timeout-only path resolves around the configured boundary
  5. AbortSignal resolves promptly
  6. disconnect() runs on every exit path (error, timeout, success, abort)
  7. No double-resolve crash when next() fires on a terminated subject
  8. A synchronous error raised before connect() resolves still lets wait() resolve cleanly
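
A test seam of this kind might be driven roughly as shown here. Only the name `_internals.createNotificationClient` comes from this PR. The `Client` interface, the simplified `wait()` function, and the `installFakeClient` helper are hypothetical and exist only to show how a test can swap the factory and fire the error by hand.

```typescript
// Hypothetical client shape for this sketch.
interface Client {
  connect(onError: (err: unknown) => void): void;
  disconnect(): void;
}

// The code under test looks up the factory through this object on every
// call, so a test can replace it without touching module imports.
const _internals = {
  createNotificationClient: (): Client => {
    throw new Error("real client is not available in this sketch");
  },
};

// Simplified stand-in for the function under test.
async function wait(): Promise<string> {
  const client = _internals.createNotificationClient();
  try {
    return await new Promise<string>((resolve) => {
      client.connect(() => resolve("error"));
    });
  } finally {
    client.disconnect();
  }
}

// Test helper: installs a fake whose error callback the test fires manually,
// and returns a restore function that puts the original factory back.
function installFakeClient() {
  const original = _internals.createNotificationClient;
  const fake = { fireError: (_err: unknown) => {}, disconnects: 0 };
  _internals.createNotificationClient = () => ({
    connect: (onError) => {
      fake.fireError = onError;
    },
    disconnect: () => {
      fake.disconnects++;
    },
  });
  const restore = () => {
    _internals.createNotificationClient = original;
  };
  return { fake, restore };
}
```

A test would call `installFakeClient()`, start `wait()`, invoke `fake.fireError(...)`, and then assert on the resolved value and on `fake.disconnects` before calling `restore()`.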

🤖 Generated with Claude Code

…ull-loop wedge

Previously the RxJS subject's error handler only logged on subject.error(),
leaving the pull-loop awaiting the promise until the 20s timeout. With a
persistent WS disconnect every cycle went through the timeout-only path
with no events flowing — observed as a per-source pump wedge in
data-pathways production on 2026-05-10.

Three fixes: (1) error handler resolves the promise so loop re-enters
wait() within milliseconds; (2) promise + eventResolver bound before
connect() so a synchronous error during connect doesn't race against
an unbound resolver; (3) try/finally guarantees disconnect() runs on
every exit path, fixing a socket leak.

Adds _internals.createNotificationClient test seam so the WS subject
lifecycle can be driven deterministically in unit tests.

Refs: data-pathways outage bddefdd2-c377-472a-bc50-cd75f708f822

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbiskur jbiskur force-pushed the fix/notifier-ws-error-wedge branch from 0fe6aa8 to 87762d5 Compare May 10, 2026 23:30
@jbiskur jbiskur merged commit d76ee0f into main May 10, 2026
2 checks passed
@jbiskur jbiskur deleted the fix/notifier-ws-error-wedge branch May 10, 2026 23:34