feat: event-driven app state sync with event log #1726
Codecov Report
Coverage diff:
| Metric | development | #1726 | +/- |
|---|---|---|---|
| Coverage | 55.80% | 56.04% | +0.24% |
| Files | 138 | 141 | +3 |
| Lines | 28057 | 28834 | +777 |
| Hits | 15657 | 16161 | +504 |
| Misses | 12400 | 12673 | +273 |
…node
When a v2 broadcast arrives with fewer apps than previously stored, the location collection kept orphaned entries for removed apps. Now both the gossip path and batch sync path remove location entries for apps no longer in the v2 broadcast's app list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batch sync upserts filter by {name, ip} but the collection only had
an index on {name}. For popular apps running on ~100 nodes, each upsert
examined ~99 docs to find 1. The compound index reduces this to a single
key lookup — explain shows 99 docs examined → 1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a v2 broadcast arrives for an IP, any v1 signed docs for apps no longer in the v2's app list are now removed. Applies to both the gossip path (storeSignedAppRunningBroadcast) and batch sync path. Ensures the signed broadcast collection stays consistent with the location collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cleanup query filters by ip first with $nin on name. The previous
{name, ip} order couldn't use the index prefix for ip-first lookups,
causing 1131 key scans per IP. Reversing to {ip, name} reduces cleanup
to 4 key scans per IP. The upsert query {name, ip} still uses this
index efficiently: both fields are equality matches, so the order in
which the query lists them does not affect index selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The signed broadcast stores on the gossip path used the full TTL (125 min / 15 min / 24 hours) while the location stores used 5 min. Stale gossip (>5 min) would be stored in the signed collection but rejected from the location collection, causing inconsistency. Now both use 5 min on the gossip path. The batch sync path retains full TTL validity since sync is a point-in-time snapshot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gossip arrives in unpredictable order — a stale v2 relay can trigger cleanup that removes valid fresher v1 data. Remove all cleanup from the gossip path (storeSignedAppRunningBroadcast, storeAppRunningMessage). Batch sync cleanup is safe because it processes a consistent snapshot. Add broadcastedAt condition to cleanup deletes so concurrent gossip with fresher data survives. Merge upserts and cleanup into a single bulkWrite per collection for atomicity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
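The guarded cleanup described above can be sketched as a pure function that builds the operations for one `bulkWrite` call. This is a minimal sketch, not the PR's code: `buildLocationSyncOps` and the document shape (`name`, `ip`, `broadcastedAt`) are assumptions made for illustration. Only the operation format (`updateOne`, `deleteMany`) follows the MongoDB Node.js driver.

```javascript
// Hypothetical helper: builds the bulkWrite operations for one IP's
// batch-sync snapshot. Field names are assumed, not taken from the PR.
function buildLocationSyncOps(ip, apps, broadcastedAt) {
  const names = apps.map((app) => app.name);

  // One upsert per app present in the snapshot.
  const upserts = apps.map((app) => ({
    updateOne: {
      filter: { name: app.name, ip },
      update: { $set: { ...app, ip, broadcastedAt } },
      upsert: true,
    },
  }));

  // Remove entries for apps missing from the snapshot, but only when they
  // are older than the snapshot, so fresher concurrent gossip survives.
  const cleanup = {
    deleteMany: {
      filter: { ip, name: { $nin: names }, broadcastedAt: { $lt: broadcastedAt } },
    },
  };

  return [...upserts, cleanup];
}

// Usage (assumed collection handle):
// await collection.bulkWrite(buildLocationSyncOps(ip, apps, broadcastedAt));
```

Keeping the builder free of database calls makes the guard easy to unit test without a running MongoDB instance.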
Three changes to eliminate orphaned entries between collections:
1. break → continue in storeAppRunningMessage loop: for v2 messages
with multiple apps, skip apps that already have current data but
keep processing the rest. Previously broke out of the entire loop.
2. storeAppRunningMessage returns { stored, rebroadcast } instead of
true/false. The gossip handler only calls storeSignedAppRunningBroadcast
when stored is true, ensuring both collections accept or reject together.
3. Remove redundant 5-minute gossip validity check from
storeSignedAppRunningBroadcast — it's now gated on the location
store's acceptance, eliminating the timing edge where one store
accepts at the boundary and the other rejects milliseconds later.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
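The first two changes can be illustrated with an in-memory stand-in for the location collection. This is a sketch under stated assumptions: the `Map` keyed by `name:ip`, the `handleGossip` wrapper, and setting `rebroadcast` equal to `stored` are all hypothetical; only the function name `storeAppRunningMessage` and the `{ stored, rebroadcast }` return shape come from the commit message.

```javascript
// In-memory sketch; `locations` is a Map standing in for the location collection.
function storeAppRunningMessage(locations, message) {
  let stored = false;
  for (const app of message.apps) {
    const key = `${app.name}:${message.ip}`;
    const existing = locations.get(key);
    if (existing && existing.broadcastedAt >= message.broadcastedAt) {
      // Skip this app only. A `break` here would drop every remaining app
      // in a multi-app v2 message.
      continue;
    }
    locations.set(key, { ...app, ip: message.ip, broadcastedAt: message.broadcastedAt });
    stored = true;
  }
  // Assumption: rebroadcast mirrors stored in this sketch.
  return { stored, rebroadcast: stored };
}

// Hypothetical gossip handler: the signed store is gated on the location store.
function handleGossip(locations, signedBroadcasts, message) {
  const result = storeAppRunningMessage(locations, message);
  if (result.stored) signedBroadcasts.push(message);
  return result;
}
```

Because the signed store only runs when the location store accepted something, the two collections cannot disagree at a validity boundary.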
The sigterm handler was mutating broadcastedAt on location records to force 7-minute TTL expiry. This broke the data contract: broadcastedAt is derived from signed data and should never change. Stale gossip could also overwrite the sigterm by passing the "is newer" check against the fake broadcastedAt value.
Switch all 6 ephemeral collections to expireAt-based TTL (a TTL index on expireAt with expireAfterSeconds: 0). expireAt is operational metadata we control, not part of the signed payload. Sigterm now sets expireAt = now + 7min on both locations and signed broadcasts without touching broadcastedAt.
Also: split gossip validity (5min) from record expiry into named constants, add missing expireAt to error stores, fix empty-apps v2 handler to clean up signed broadcasts with broadcastedAt guard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nodeStatusMonitor and storeAppRemovedMessage deleted from zelappslocation without touching fluxapprunningbroadcasts, leaving orphaned signed broadcasts (~44 per 20-minute monitor cycle). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- storeAppRemovedMessage: $addToSet excludedApps on v2 broadcast docs so the derived view skips removed apps without mutating signed data
- storeSignedAppRunningBroadcast + batch sync: $unset excludedApps when a newer broadcast upserts (clears stale exclusions)
- appLocationFromBroadcasts: filter out excluded apps after v2 unwind
- reindexGlobalAppsLocation: also drop running broadcasts collection
- explorer rescan: also drop running + installing broadcasts
- Export handleMissingMasterSlaveContainer from stoppedAppsRecovery
- Fix all 10 CI test failures, add excludedApps tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single `appstateevents` collection replaces `fluxapprunningbroadcasts` as the source of truth. Event types include apprunning, sigterm, appremoved, and evicted, stored with dedupKey-based upserts and $cond timestamp guards. `zelappslocation` stays populated as materialized cache.
- storeAppStateEvent() dispatcher with APP_STATE_EVENT_TYPES enum
- storeBatchAppRunningEvents() for sync receiver
- Gossip handler writes event unconditionally, then materializes location
- Sigterm/appremoved/evicted all append events instead of mutating
- Sync sender/receiver stream from event log
- Remove storeSignedAppRunningBroadcast, excludedApps, gossip gating
- 99 tests passing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
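The dedupKey upsert with a timestamp guard can be modelled in memory as follows. This is a sketch, not the PR's implementation: the enum keys, the `type:ip` dedupKey format, the `eventAt` field, and the use of a `Map` in place of the `appstateevents` collection are assumptions. It mimics the effect of the `$cond` guard rather than building the actual update pipeline.

```javascript
// Assumed enum shape; the values are the event type names from the commit message.
const APP_STATE_EVENT_TYPES = Object.freeze({
  APPRUNNING: 'apprunning',
  SIGTERM: 'sigterm',
  APPREMOVED: 'appremoved',
  EVICTED: 'evicted',
});

// In-memory model of a dedupKey upsert: one document per key, replaced only
// by a strictly newer event (the "only if newer" guard).
function storeAppStateEvent(eventLog, event) {
  if (!Object.values(APP_STATE_EVENT_TYPES).includes(event.type)) {
    throw new Error(`Unknown event type: ${event.type}`);
  }
  const dedupKey = `${event.type}:${event.ip}`; // hypothetical key format
  const existing = eventLog.get(dedupKey);
  if (existing && !(event.eventAt > existing.eventAt)) {
    return false; // same age or older: keep what is stored
  }
  eventLog.set(dedupKey, { ...event, dedupKey });
  return true;
}
```

Using a strict comparison means that replaying the same event is a no-op, which keeps gossip relays idempotent.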
The view now filters appremoved, sigterm, and evicted events, excludes stale v1 broadcasts superseded by newer v2, and correctly handles expired shutdown events. Verified against charlie live data (0 diff). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The event store was accepting gossip up to 125min old (RUNNING_EXPIRY_MS) instead of 5min (GOSSIP_VALIDITY_MS). Only the batch sync path should accept older messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
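A minimal sketch of the per-path validity check, assuming a helper named `isFreshEnough` and a `path` argument of `'gossip'` or `'sync'` (both hypothetical). The constant names and durations are the ones quoted in the commit messages.

```javascript
const GOSSIP_VALIDITY_MS = 5 * 60 * 1000; // 5 min: live gossip
const RUNNING_EXPIRY_MS = 125 * 60 * 1000; // 125 min: full record lifetime

// Gossip must be recent; batch sync may carry anything still within its TTL,
// because a sync response is a point-in-time snapshot.
function isFreshEnough(broadcastedAt, now, path) {
  const maxAgeMs = path === 'gossip' ? GOSSIP_VALIDITY_MS : RUNNING_EXPIRY_MS;
  return now - broadcastedAt <= maxAgeMs;
}
```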
nodeStatusMonitor deletes locations immediately on eviction, but the view was giving evicted IPs the same 7-minute grace period as sigterm. Eviction should be immediate — the monitor already verified the node is gone. Also extend eviction TTL to match apprunning (125min) so the eviction event outlives the apprunning events it suppresses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
storeSignedAppRunningBroadcast no longer exists — stub storeAppStateEvent instead. Sigterm handler now calls updateInDatabase once (location expiry only) not twice, and storeAppStateEvent needs stubbing to prevent throw. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix undefined appsRunningBroadcasts in apiServer.js sigterm handler,
add storeAppStateEvent(SIGTERM) call for own shutdown
- Escape regex in appLocationFromBroadcasts to prevent injection
- Cap sync response batch size at 2500 in all 4 handlers
- Add IPCHANGED event type with view remapping so IP changes are
reflected in the event log view
- Await all storeAppStateEvent calls (was fire-and-forget)
- Use ?? instead of || for config fallbacks in orchestrator
- Optimise appLocationFromBroadcasts pipeline: $arrayToObject/$getField
for O(1) lookups instead of $filter scans (2900ms → 118ms), push
name filter into facet sub-pipelines (2666ms → 26ms for targeted)
- Standardise $gt (not $gte) for "only if newer" guards
- Add {createdAt: 1} index for sync sender evicted event queries
- Hash sync failure recovery: retry 3x with 5-min gap, block timer
fallback if retries exhausted, background 20-min recheck on
blockReceived for missing hashes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
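Two of the fixes in this list are small enough to show directly: escaping user input before building a regex, and using `??` rather than `||` for config fallbacks. The helper names below are hypothetical, and the anchored case-insensitive match is an assumption about how the app name filter is used.

```javascript
// Escape regex metacharacters so an app name is matched literally.
function escapeRegex(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical filter builder for an exact, case-insensitive name match.
function appNameFilter(appname) {
  return new RegExp(`^${escapeRegex(appname)}$`, 'i');
}

// `??` falls back only on null/undefined, so a configured 0 is preserved.
// `||` would also replace 0, '' and false with the fallback.
function withFallback(configured, fallback) {
  return configured ?? fallback;
}
```

Without escaping, a name such as `my.app` would also match `myXapp`, and a crafted name could inject an arbitrary pattern into the query.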
Tests cover: retry on failure, block timer fallback when retries exhausted, readiness via block timer when hash sync never completes, and DB rebuild failure not blocking the state machine. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract streamBatchedSync helper from 3 nearly identical respondWith* functions. Rename MIN_SYNC_PEERS to MIN_SYNC_COMPLETIONS for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
$getField with dynamic field references requires MongoDB 7.2+ (SERVER-74371). CI runs 7.0. Replaced $arrayToObject/$getField O(1) maps with $filter/$first lookups against small arrays. Structural optimization preserved: shutdown/v1 filtering at IP level before unwinding. Estimated ~200-300ms at full scale vs 118ms with $getField vs 2900ms with the original post-unwind approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- handleAppRunningEvent: reject empty-apps v2 when no prior events exist for that IP (matches location store behavior independently)
- handleNodeSigtermMessage: check event log for app events instead of zelappslocation, so sigterm handling works without locations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename to reflect event log architecture (broadcasts no longer exist).
Change signature from positional appname to options object { appname, ip }
to support IP filtering. Sigterm handler now uses the full view derivation
to check for apps instead of a naive event log findOne.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores the time each node received/processed the event, alongside the original broadcastedAt from the source node. The delta reveals gossip propagation latency and helps diagnose messages that arrive near the 5-minute validity boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gossip path sets receivedAt on insert. Batch sync path preserves the sender's receivedAt so the original gossip reception time is retained across sync. Enables propagation latency diagnostics on installing and install error broadcasts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sigterm events had 7-min TTL matching the grace period, but apprunning events have 125-min TTL. After the sigterm TTL'd away, apps reappeared in the view with nothing to suppress them. Same race as the evicted TTL bug. Fix: sigterm event expireAt uses RUNNING_EXPIRY_MS (125 min) so the document outlives every apprunning it suppresses. The 7-min grace period is computed from eventAt in the view pipeline, not from expireAt. Export SIGTERM_EXPIRY_MS and use it in fluxCommunication.js and apiServer.js instead of hardcoded 420*1000. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
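The separation between document lifetime and grace period can be sketched as below. This reflects one reading of the commit messages and is not the view pipeline itself: `buildShutdownEvent` and `isIpSuppressed` are hypothetical names, and the event shape is assumed. `SIGTERM_EXPIRY_MS` is taken to be the 7-minute value that replaced the hardcoded `420*1000`.

```javascript
const RUNNING_EXPIRY_MS = 125 * 60 * 1000; // lifetime of apprunning events
const SIGTERM_EXPIRY_MS = 420 * 1000; // 7-minute grace period

// The shutdown event document must outlive every apprunning event it
// suppresses, so its TTL uses the apprunning lifetime, not the grace period.
function buildShutdownEvent(type, ip, eventAt) {
  return { type, ip, eventAt, expireAt: new Date(eventAt + RUNNING_EXPIRY_MS) };
}

// View-side decision, derived from eventAt rather than expireAt:
// eviction hides the IP immediately, sigterm only after the grace period.
function isIpSuppressed(event, now) {
  if (event.type === 'evicted') return true;
  return now - event.eventAt >= SIGTERM_EXPIRY_MS;
}
```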
The previous hash sync sent fluxapprequest to a single random peer per attempt, with a fixed 30s wait that couldn't cover the 75s response time for 500 hashes (150ms per hash on the responder). It also broke out on zero progress and reused the same peers.
New algorithm:
- Bulk threshold lowered from 1000 to 500 (matching fluxapprequest v2 cap)
- Targeted path sends to 3 peers per round with poll-until-settled
- Timeout proportional to hash count (count × 150ms + 5s buffer)
- Settle detection: exits early when no new responses for 4s
- Tracks tried peers, never repeating them across rounds
- Continues through all rounds regardless of per-round progress
- Excludes deterministic peers (same-provider neighbors)
- Bulk path aggregates responses from all peers instead of picking largest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
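The timing and peer-selection rules of the new algorithm can be sketched as three small helpers. All function and constant names here are hypothetical; the numbers (150ms per hash, 5s buffer, 3 peers per round, 4s settle window) are the ones stated above. Peers are taken in list order for simplicity, which is an assumption, not necessarily how the PR selects them.

```javascript
const PER_HASH_MS = 150;
const TIMEOUT_BUFFER_MS = 5000;
const PEERS_PER_ROUND = 3;
const SETTLE_MS = 4000;

// Timeout scales with the number of hashes requested.
function roundTimeoutMs(hashCount) {
  return hashCount * PER_HASH_MS + TIMEOUT_BUFFER_MS;
}

// Pick up to three peers that have not been used in an earlier round,
// and record them in `triedPeers` (a Set) so they are never repeated.
function pickPeersForRound(peers, triedPeers) {
  const fresh = peers.filter((peer) => !triedPeers.has(peer)).slice(0, PEERS_PER_ROUND);
  fresh.forEach((peer) => triedPeers.add(peer));
  return fresh;
}

// Settle detection: stop polling once no response has arrived for 4s.
function hasSettled(lastResponseAt, now) {
  return now - lastResponseAt >= SETTLE_MS;
}
```

For 500 hashes the timeout comes to 80s, which covers the 75s responder time that the old fixed 30s wait could not.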
Moves the broadcast signing logic out of fluxCommunicationMessagesSender into utils/fluxBroadcastHelper. This breaks the circular dependency that prevented appHashSyncService from sending signed messages to peers (messageStore → messageVerifier → fluxCommunicationMessagesSender). appHashSyncService now uses fluxBroadcastHelper directly to sign and send fluxapprequest messages via peer.send(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add boot:context SSE event to orchestrator start(). Test shutdown detection via SSE event data instead of log grep. Use separate test environments with crafted bootContext to test clean vs unclean shutdown — no need to SIGKILL containers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The boot:context event fired before the SSE stream connected so tests never received it. Instead, include bootContext in the existing orchestrator:started event which is already in the buffer by the time tests run. No separate event needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The seed functions call the daemon stub control API which only exists after createTestEnv starts the containers. Moved seeding after env creation but before blocks advance, so bootstrapSoftForks sees the data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bootstrapSoftForks is already covered by 6 unit tests. The integration test required resetting the pre-seeded scanned height to trigger the fresh bootstrap path, adding complexity for minimal additional coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test config defaults to appSyncMinCompletions=1, so nodes only request sync from 1 peer. The test expects 3-peer sync behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds nodeConfigOverrides option to createTestEnv — a map of node index to config that merges on top of the global configOverrides. This allows setting different config on specific nodes, e.g. appSyncMinCompletions=3 only on the joining node without affecting source nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
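The merge order described above can be sketched as follows. `resolveNodeConfig` and its options object are hypothetical; only the option names `configOverrides` and `nodeConfigOverrides` and the setting `appSyncMinCompletions` come from the commit messages.

```javascript
// Later spreads win: defaults < global overrides < per-node overrides.
function resolveNodeConfig({ defaults = {}, configOverrides = {}, nodeConfigOverrides = {}, nodeIndex }) {
  return {
    ...defaults,
    ...configOverrides,
    ...(nodeConfigOverrides[nodeIndex] ?? {}),
  };
}

// Example: only node 3 (the joining node) requires three sync completions.
// resolveNodeConfig({
//   defaults: { appSyncMinCompletions: 1 },
//   nodeConfigOverrides: { 3: { appSyncMinCompletions: 3 } },
//   nodeIndex: 3,
// });
```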
Only one joining node is needed. Also set appSyncPeerThreshold=3 so the peer threshold fires after 3 peers connect, matching the appSyncMinCompletions=3 requirement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Responding to the second deep-dive review. Commit reference for this response:
🔴 MUST ADDRESS
Fixed in
Already addressed in the first review response — this is an intentional fix for a master bug. On master, non-Arcane nodes throw when they can't decrypt enterprise updates via
Not a real issue. Tracing the boot sequence:
Fixed in
Not a real issue in practice.
Fixed in
Fixed in
Fixed in
Fixed in
🟡 SHOULD CONSIDER
Fixed in
Not needed. The orchestrator is event-driven by design —
Acknowledged. With
Fixed in
Fixed in
Not an issue. Both gossip and sync paths use the same
Acknowledged. This is a pre-existing configuration, not introduced by this PR. The sync protocol adds a new data path but the WebSocket configuration hasn't changed. Sync handlers validate
Partially addressed in
Intentional trade-off, documented in the architecture doc under "Container Ownership." The old behavior (Docker auto-starting containers) caused cancelled/expired apps to be reinstalled, stale location data, and over-provisioning. The new behavior: FluxOS waits for DB readiness (~1-2 minutes on the happy path), then reconciles — only starts apps that are still valid. The cost is minutes of downtime after a power cut; the benefit is correctness.
Fixed in
Fixed in
Already addressed in the first review response — intentional behavior. Non-ArcaneOS nodes can't decrypt enterprise specs, so they store the encrypted blob as-is. The alternative (throwing) would mark the hash as
Already addressed in the first review response. The permanent message is the source of truth — it's already stored successfully. The spec write failure is secondary and self-heals via
Acknowledged. Empty state is broadcast once at boot (establishes baseline), then not re-broadcast (no value in repeating "zero apps" hourly). After install,
🟢 Nits
Fixed in
Fixed in
Acknowledged. Test-only feature (
Correct behavior. On a reorg the chain tip goes backward — the node genuinely hasn't accumulated that many blocks of gossip at the current tip. The counter accurately reflects this. Once READY is reached via
Not an issue. Master already sets
Fixed in
Fixed under finding G.
Acknowledged.
Health check timeout (5s) exceeded interval (3s), causing Docker's health state machine to produce spurious "unhealthy" on container restart. Reduced timeout to 2s across all container health checks. Docker's CloseMonitorChannel sets health status to "unhealthy" during monitor teardown (moby/daemon/container/health.go:80). On restart, HealthCheckWaitStrategy sees this transient state and destroys the container. Replaced restartNode to swap in an HTTP-polling wait strategy that bypasses Docker's health state machine entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bulk permanent message fetch now partitions missing hashes across peers and streams in parallel via Promise.allSettled, instead of sequential single-peer streaming. Each stream maintains its own 500-message backpressure — peak memory is ~1500 messages vs 500 previously. Targeted fetch and ephemeral rounds now chunk hashes into groups of 500 before calling broadcastHashRequest, fixing a latent bug where >500 hashes would exceed the fluxapprequest v2 message cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parallel bulk fetch caused ~10-25% failure rate per batch because update messages couldn't find predecessor specs processed on other streams. Reverted to sequential streaming which maintains height ordering across all messages. Kept the broadcastHashRequest chunking at 500 for targeted fetch rounds and ephemeral rounds (latent bug fix). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
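The chunking that was kept can be sketched as a simple helper. `chunkHashes` is a hypothetical name; the cap of 500 is the fluxapprequest v2 limit quoted above, and the call to `broadcastHashRequest` in the usage comment assumes it accepts one chunk at a time.

```javascript
const MAX_HASHES_PER_REQUEST = 500; // fluxapprequest v2 message cap

// Split a list of hashes into consecutive groups of at most `size` entries.
function chunkHashes(hashes, size = MAX_HASHES_PER_REQUEST) {
  const chunks = [];
  for (let i = 0; i < hashes.length; i += size) {
    chunks.push(hashes.slice(i, i + size));
  }
  return chunks;
}

// Usage (assumed call shape):
// for (const chunk of chunkHashes(remainingHashes)) {
//   await broadcastHashRequest(chunk);
// }
```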
The first checkAndNotifyPeersOfRunningApps call was triggered by peer threshold, before appStartupManager finished reconciling containers. This caused the broadcast to report 0 apps because Docker containers hadn't been started yet. The next broadcast wouldn't fire for an hour (peerNotifyIntervalMs). Gate the first broadcast behind waitForBootContainerStateSettled() so it runs after reconciliation completes and Docker state is accurate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first broadcast was racing with appStartupManager and reporting 0 apps. This test verifies the app:running SSE event includes the reconciled app after a simulated reboot, catching the race if the broadcast gate on boot:settled is removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the HTTP poll times out, log a warning instead of throwing. Throwing triggers testcontainers' waitForContainer error handler which destroys the container, making the failure undiagnosable. The test's own assertions will catch the actual problem. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Testing Update — 2026-05-21 (commit
| Scenario | Result | Notes |
|---|---|---|
| 1. Full Clean Ephemeral Sync | ✅ | All 4 sync types from 3 peers (3/3 completions each). Spawner started. DB counts match across all nodes. |
| 2. Full Hash Bootstrap | ✅ | 59,188 hashes bootstrapped via address-index in ~165s. 59,099 resolved via streaming bulk fetch, 0 processing failures. Spawner started. |
| 3. Partial Hash Sync (local) | ✅ | 59,097 hashes resolved from local permanent messages in ~10s. No network fetch. |
| 4. Degradation/Recovery | ✅ | DEGRADED at 2s (peers removed via API), recovery with full ephemeral resync from 3 peers, READY at 39s. Spawner paused and resumed correctly. |
| 5. Dual-Collection Consistency | ✅ | bc=loc exact match for installing (43/43) and errors (9130/9130). Running view count consistent at 5035 across all branch nodes. compare-nodes.sh shows all 10 nodes (including 3 master controls) agree on TotalLoc. |
| 6. Boot Lifecycle | ✅ | 6a: Reboot → containers don't auto-start (policy "no"), reconciliation starts after DB ready, app started, broadcast includes reconciled app (1 app). 6c: Simultaneous fluxd+fluxos restart → no collection dropped errors. 6d: Long downtime (579s) → fast-path removal triggered immediately. |
Bugs Found & Fixed
1. Broadcast race with appStartupManager (da358fc14)
After reboot, checkAndNotifyPeersOfRunningApps was triggered by peer threshold before appStartupManager finished reconciling containers. The broadcast reported 0 apps because Docker containers hadn't been started yet. Next broadcast wouldn't fire for 1 hour (peerNotifyIntervalMs). Fixed by gating the first broadcast on waitForBootContainerStateSettled().
2. broadcastHashRequest exceeding v2 cap (82254d3f0)
The targeted fetch rounds and ephemeral hash round sent all remaining hashes in a single broadcastHashRequest without chunking. If >500 hashes remained after bulk fetch, this exceeded the fluxapprequest v2 message cap. Fixed by chunking at 500 in both paths.
3. Docker health check restart race (32f59b337)
Docker's CloseMonitorChannel (moby daemon/container/health.go:80) sets health status to "unhealthy" during monitor teardown. On container restart, testcontainers' HealthCheckWaitStrategy sees this transient state and destroys the container. Fixed by swapping in an HTTP-polling wait strategy for restartNode that bypasses Docker's health state machine. Also reduced health check timeout (5s→2s) to be below the interval (3s).
4. Integration test for broadcast after reboot (9a38a9ee8)
Added test verifying the app:running SSE event includes the reconciled app after a simulated reboot. This catches the broadcast race if the waitForBootContainerStateSettled gate is removed.
Commit reference: 7d5015f70
Cabecinha84
left a comment
PR #1726 — Deep Re-Analysis (head 7d5015f, 2026-05-21)
I checked out the latest PR head, re-read the orchestrator, hash-sync, spawner, boot manager, sync handlers, and verifier in full, and verified every claimed fix against the actual code — plus re-tested the items the
author rejected.
Verdict
The PR is in good shape. Every genuine blocker from both review rounds is correctly fixed. I found 2 items the author dismissed that I believe still warrant a (trivial) fix, 1 rejection based on a factual error, and a
handful of real-but-minor bugs. Nothing here is a merge-stopper on the happy path — but items 1–4 below are all one-to-few-line changes worth a final pass.
✅ Verified correctly fixed (can be acked)
| Claim | Status |
|---|---|
| isSyncRequested gate on all 4 sync handlers | ✅ fluxCommunication.js:127,150,226,248 |
| insertMany partial-failure (ordered:true + insertedCount + hashMarkOps.slice) | ✅ appHashSyncService.js:400-415 |
| Streaming bulk fetch + missingSet filter + try/finally cleanup | ✅ appHashSyncService.js:133-223 |
| globalState.dbReady moved into #rebuildDb(): every DB-rebuild path opens the gate | ✅ appSyncOrchestrator.js:432 |
| #hashSyncAttempts reset in #resetSyncState() | ✅ :286 |
| #clearShutdownReason() on heartbeat start | ✅ :551-559,575 |
| Spawn loop → while loop + spawnLoopRunning re-entrancy guard | ✅ appSpawner.js:34-63 (verified race-free) |
| startDiscovery() exported, serviceManager calls it (guard covers all paths) | ✅ |
| runBatch uses 0-based array index as JSON-RPC id | ✅ fluxRpc.js:303 |
| writeShutdownReason 3s Promise.race timeout | ✅ :585-594 |
| #started double-invocation guard | ✅ :101 |
| verifyPool worker exit handler + stop() in SIGTERM | ✅ verifyPool.js:18-32, apiServer.js:474 |
| Version-marker reset (#checkVersionUpgrade → resetHashSyncForUpgrade) | ✅ actually implemented |
| boot_id read separated from heartbeat read | ✅ appSyncOrchestrator.js:521-529 |
The "Sync timeout → wipe all apps" path (the scariest one from review 2) is genuinely closed: dbReady now opens on the block-timer fallback too.
1. npm run lint is still broken: ~8 unused vars in PR code
The round-2 fix removed only the 4 named imports; new ones remain or regressed in the ~20 commits since:
- appHashSyncService.js:7,19 — messageStore, globalState (declared, never used)
- appStartupManager.js:21,23 — decryptEnterpriseApps, appUsesGSyncthingMode
- serviceManager.js:73-79 — hashSyncIntervalMs, peerNotifyIntervalMs, locationTtlS, installingTtlS, installErrorTtlS, removalSpacingMs (dead — old interval logic moved to orchestrator)
- nodeStatusMonitor.js:11 — fluxEventBus; messageVerifier.js:25 — scannedHeightCollection
Each appears exactly once (declaration only) — confirmed genuinely unused. Note: CI does not run lint (.github/workflows/nodejs.yml runs ciconfig/prebuild/test only), so the round-2 "High/CI" label was overstated —
but npm run lint is broken for anyone running it. Trivial cleanup.
2. null confirmation state still wipes all apps (Finding C: author dismissed; I partly disagree)
nodeConfirmationService.isConfirmed() returns daemonConfirmed, which is null until the first successful getFluxNodeStatus poll. Both consumers do if (!isConfirmed()) → !null === true → removeAllApps:
- appStartupManager.js:307 — removeAllApps('Node not confirmed')
- nodeStatusMonitor.js:74 — removeAllAppsLocally('node not confirmed')
The 125-min stale / 320-min expired grace backstops are gated on lastSuccessfulPoll !== null (nodeConfirmationService.js:75), so if the daemon RPC fails on the first poll and stays down, daemonConfirmed is stuck null,
the grace windows never apply, and apps are wiped immediately. The trigger is narrow (daemon must die in the window between the boot daemon-check and the first poll), but the effect is catastrophic and contradicts
the PR's own grace-period design — that grace exists precisely to avoid wiping apps on transient daemon loss. The author's rebuttal ("milliseconds window; removal is correct for an unresponsive daemon") is internally
inconsistent with the 125-min grace they built. Fix is trivial: treat null as "unknown" — apply the grace path, only wipe on a definitive false.
3. Finding E: the author's rejection rests on a factual error
insertAppSpecifications (registryManager.js:1509) does an unconditional replaceOne upsert with no height-downgrade guard (updateAppSpecifications right below it has one). The author rejected adding the guard, claiming
insertAppSpecifications is "called from processMessages which sorts by height" and a guard would cost "60k+ lookups during bulk sync."
That's wrong. insertAppSpecifications is not called from processMessages (hash sync inserts into globalAppsMessages, never touches it). Its only caller is messageVerifier.checkAndRequestApp:744 — a live, low-frequency
path. The "60k lookups" cost doesn't exist. Real-world risk is low (live block processing is height-ordered), but the guard is a one-line, ~zero-cost safety net against re-org / message re-request — exactly what
updateAppSpecifications already does. Recommend adding it; at minimum the rejection reasoning should be corrected.
4. usersToExtend consensus divergence: document it (B/21)
messageVerifier.js:357: on non-secure nodes, for v8+ enterprise apps canCompareSpecs=false, so (!canCompareSpecs || isExpireOnlyUpdate(...)) accepts any usersToExtend-signed change, not just expire-only. Secure nodes
still enforce expire-only → secure vs non-secure nodes can reach different validity verdicts for the same message. The author calls this an intentional interim fix (vs master's data-hole bug), pending Arcane
attestations — a defensible tradeoff. But the author promised in round 1 to "add a code comment" and there still is none at :354-357. Add it — the consensus implication should be explicit in the code.
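The null-versus-false distinction raised in the confirmation-state item above can be sketched as a three-way decision. `confirmationAction` and its return labels are hypothetical; the point is only that `!isConfirmed()` cannot tell "not yet polled" (`null`) apart from "definitively unconfirmed" (`false`).

```javascript
// `confirmed` models the value returned by isConfirmed():
//   true  -> daemon confirmed the node
//   false -> daemon definitively reports the node as not confirmed
//   null  -> no successful poll yet, state unknown
function confirmationAction(confirmed) {
  if (confirmed === false) return 'remove-all-apps';
  if (confirmed === null || confirmed === undefined) return 'apply-grace';
  return 'keep-apps';
}

// The pitfall in a plain negation: null and false are indistinguishable.
function naiveShouldRemove(confirmed) {
  return !confirmed;
}
```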
🟢 Accepted known risk (ack, but be aware)
Evicted events (fluxCommunication.js:161-175) are still processed with no per-event signature — a single malicious peer among the 3 solicited sync peers can wipe arbitrary IPs' location entries from this node's view
for ~60 min. The author explicitly acked this with a documented plan (quorum-signed peer-unreachable events) and it's genuinely bounded (solicited-only, self-heals). Fine to ack — just know it's an open item, not
closed.
Minor / NITs (real but low-impact)
- explorerService.js:452: apps.filter((item) => !appsToRemove.includes(item)); discards its return value (should be apps = apps.filter(...)). Already-resolved apps get re-requested via checkAndRequestMultipleApps. Harmless (idempotent) but wasteful, and a real bug.
- appSpawner.js:76-83: trySpawningGlobalApplication's first lines (enterprise identity, getSpawnDelays) are outside the inner try/catch. A throw there rejects spawnLoop unhandled and kills spawning permanently until a DEGRADED→READY flap. Low probability; widen the try.
- appHashSyncService.js:178: chunk.toString() (utf8) on raw stream chunks can corrupt multi-byte UTF-8 split across chunk boundaries → JSON.parse fails → message silently skipped. Use string_decoder.StringDecoder. Self-heals via retry, so low impact.
- verifyPool.js:22: worker exit handler resubmits pending batches only when code !== 0; a clean (code 0) exit with pending work leaves those promises unresolved → verify() hangs. Very unlikely (persistent message loop) but an asymmetric footgun.
- GitGuardian CI is RED: 2 test-fixture secrets (test-infra/fixtures/registry-tls/server-key.pem, a key in tests/unit/fluxCommunicationMessagesSender.test.js). The PR's .gitguardian.yaml doesn't whitelist them all. Either extend the whitelist (green check) or consciously accept the red.
- The network-wide errorCount >= 5 install-error skip is now effectively disabled in the spawner (logs + SSE event only, no block) pending error classification. Confirmed a conscious "off for now," documented by the author. Not a bug.
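For the appSpawner.js item, "widen the try" could look roughly like the sketch below. It is not the PR's code: only spawnerPaused, spawnLoopRunning and spawnLoop are names taken from the PR text, while trySpawn, delayMs and sleep are hypothetical stand-ins. The point is that every step of an iteration sits inside the inner try, so an early throw is logged and the loop keeps going instead of rejecting.

```javascript
// Hypothetical sketch: trySpawn, delayMs and sleep are illustrative names.
let spawnerPaused = false;
let spawnLoopRunning = false;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function spawnLoop(trySpawn, delayMs = 10) {
  if (spawnLoopRunning) return; // a second caller (e.g. a READY/DEGRADED flap) is a no-op
  spawnLoopRunning = true;
  try {
    while (!spawnerPaused) {
      try {
        // The whole iteration body is inside the try, so a throw from any
        // early step is logged and the loop carries on with the next pass.
        await trySpawn();
      } catch (err) {
        console.warn(`spawn iteration failed: ${err.message}`);
      }
      await sleep(delayMs);
    }
  } finally {
    spawnLoopRunning = false; // allow the loop to be started again later
  }
}
```

With this shape a failing iteration costs one delay rather than the whole loop, and the guard flag is always released on exit.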
Bottom line
CI build passes; the architecture, signature verification on sync paths, TTL/index migrations, and the destructive-removeAllApps paths are now sound. It's close to ack-able. I'd ask for one short final round covering
items 1–4 (all trivial: dead-var cleanup, null-vs-false confirmation check, the insertAppSpecifications guard, and a code comment) — and the author's response to E should be corrected since it's factually wrong. Item
5 (evicted events) is a legitimate ack. Everything else is genuinely fixed.
appHashSyncService.js: messageStore, globalState
appStartupManager.js: decryptEnterpriseApps, appUsesGSyncthingMode
serviceManager.js: hashSyncIntervalMs, peerNotifyIntervalMs, locationTtlS, installingTtlS, installErrorTtlS, removalSpacingMs (dead: old interval logic moved to orchestrator)
nodeStatusMonitor.js: fluxEventBus
messageVerifier.js: scannedHeightCollection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents a re-processed registration from overwriting a newer update spec. Mirrors the existing guard in updateAppSpecifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the known divergence between secure and non-secure nodes for enterprise usersToExtend updates, and the planned resolution via Arcane attestations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The return value of apps.filter() was discarded, causing already-resolved apps to be re-requested via checkAndRequestMultipleApps. Idempotent but wasteful. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
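A minimal illustration of this bug class, using made-up app names: `Array.prototype.filter` returns a new array and leaves the original untouched, so the result has to be assigned back. Only the `apps` / `appsToRemove` pattern mirrors the review note; the data is hypothetical.

```javascript
// Hypothetical data; only the filter/includes pattern comes from the review note.
let apps = ['appA', 'appB', 'appC'];
const appsToRemove = ['appB'];

// Bug: filter() returns a new array, so this statement does not change `apps`.
apps.filter((item) => !appsToRemove.includes(item));
const lengthAfterDiscardedFilter = apps.length;

// Fix: assign the filtered result back.
apps = apps.filter((item) => !appsToRemove.includes(item));
const lengthAfterAssignedFilter = apps.length;
```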
chunk.toString() can corrupt multi-byte UTF-8 characters split across chunk boundaries. StringDecoder buffers incomplete characters across writes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
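The difference can be reproduced in isolation with Node's built-in `string_decoder` module. The sketch below is not the code from appHashSyncService.js; it splits one three-byte character across two chunks by hand to show why per-chunk `toString()` breaks and a shared `StringDecoder` does not.

```javascript
const { StringDecoder } = require('node:string_decoder');

// The euro sign is three bytes in UTF-8 (0xE2 0x82 0xAC); split it across two chunks.
const euro = Buffer.from('€', 'utf8');
const chunks = [euro.subarray(0, 1), euro.subarray(1)];

// Naive approach: each partial chunk is decoded on its own, which yields
// U+FFFD replacement characters instead of the original character.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('');

// StringDecoder holds back incomplete bytes until the rest of the character arrives.
const decoder = new StringDecoder('utf8');
const decoded = chunks.map((chunk) => decoder.write(chunk)).join('') + decoder.end();
```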
Pre-existing test fixture private key, not introduced by this PR but file was modified. Added to GitGuardian ignored_paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review Response — Round 3 (commit
The broadcast gate change requires the globalState stub to provide waitForBootContainerStateSettled, otherwise the broadcast promise never resolves and the test fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #1726: Deep Re-Analysis (head 4f8823e, 2026-05-21 17:51 UTC)
I updated your local branch from 7d5015f → 4f8823e (fast-forward, 7 new commits pushed after your 15:30 review), then
What changed since your last review
The author pushed 7 commits that map exactly onto items 1, 3, 4 and the two NITs from your 2026-05-21 review:
Verification detail on the two non-trivial ones:
Tests: ran the 5 touched suites - registryManager 58, appHashSyncService 30, messageVerifier 15, explorerService 82,
So 3 of your 4 "should be addressed" items + both NITs you raised are fixed.
Item 2 was not. appStartupManager.js:306 and nodeStatusMonitor.js:73 are unchanged. nodeConfirmationService.isConfirmed() returns
I dug into reachability this round, and the author's earlier dismissal reasoning is wrong:
This is not a "milliseconds window" - it's the fluxd-startup window, seconds-to-minutes on every machine reboot, which
Fix (trivial, ~1 line each): only wipe on a definitive false.
🟢 Minor / ack-able (not addressed, low impact)
Bottom line
The PR is ack-ready bar one item. Every blocker from review rounds 1–3 is verified fixed and stable; the 7 new commits
The only thing I'd push back on is Finding C - the author silently skipped it, and their prior reasoning doesn't hold up
Replace .catch(() => {}) with warnings that include the network
name and component. Silent swallowing masked resource leaks that
caused intermittent failures in later suites.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
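As a rough sketch of the pattern this commit describes (not the harness code itself), a small factory can replace the empty catch handler so that every swallowed cleanup error still leaves a trace naming the component and network. `warnOnCleanupFailure` and `removeNetwork` are hypothetical names used only for illustration.

```javascript
// Hypothetical helper; component and network names are illustrative.
function warnOnCleanupFailure(component, networkName, log = console.warn) {
  return (err) => {
    const reason = err instanceof Error ? err.message : String(err);
    log(`[${component}] cleanup failed on network "${networkName}": ${reason}`);
  };
}

// Before: await removeNetwork(name).catch(() => {});
// After:  await removeNetwork(name).catch(warnOnCleanupFailure('harness', name));
```

The cleanup still never throws, so teardown continues, but a leaked resource now shows up in the log of the suite that caused it.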
This is just total garbage:
The first confirmation poll calls getFluxNodeStatus — a different RPC. On a machine reboot, fluxd answers
Claude is wrong - this is by design.
Summary
Replaces the timer-based spawner startup (125-minute fixed wait) with an event-driven architecture that syncs all ephemeral app state from peers in seconds. Rewrites the hash sync pipeline for correctness and performance (63 minutes → 47 seconds). Fixes zombie apps caused by stale hash sync state and a fire-and-forget race condition. Also fixes a broken TTL on the install errors collection. Offloads ECDSA signature verification to a worker thread pool, bypasses the gossip pipeline for sync responses, and adds sender-side backpressure to prevent WebSocket pong timeouts during large syncs.
Additionally, reworks the app lifecycle startup to give FluxOS full control of container management. Replaces Docker's `unless-stopped` restart policy with `no` so containers never auto-start: FluxOS decides what to run based on boot context (machine reboot detection, downtime, shutdown reason) and DB readiness. Fixes a reindex race condition where `reindexGlobalAppsInformation` dropped `zelappsinformation` while concurrent cursors were reading it.
Key changes
- Single append-only event log (`appstateevents`) replaces `fluxapprunning` broadcasts. `zelappslocation` remains as a materialized cache. Current running state derived via aggregation: no mutations, no deletes, no orphan risk
- `messageNotFound` reset + `hashesChanged` event for periodic recovery
- `updateAppSpecifications` split: Separated insert (registration, upsert) from update (no upsert, awaited) to prevent cancelled app resurrection
- Sync responses routed directly in `FluxPeerSocket.onmessage`, bypassing the gossip pipeline (no hash, no cache, no random delay). ECDSA verification offloaded to `cpus-1` worker threads via `verifyPool.js`. Per-peer FIFO chunk ordering prevents premature sync completion
- `sendAsync()` on `FluxPeerSocket` returns a Promise that resolves when data is flushed to the kernel TCP buffer. Sync senders use `awaitDrain: true` to prevent megabytes of queued data from blocking WebSocket pong frames
- `while (!spawnerPaused)` loop. `spawnLoopRunning` guard prevents concurrent loops from READY/DEGRADED flaps
- `bootstrapSoftForks()` uses `getaddressdeltas` to find foundation price fork transactions before the main hash loop, ensuring correct app pricing from the start of a fresh bootstrap
- `cachedAt` index replaced with working `broadcastedAt` (24-hour TTL)
- Restart policy `no` for all containers. FluxOS manages startup via boot context and lifecycle awaitables (`waitForDaemonReady`, `waitForDbReady`, `waitForBootComplete`)
- `machineBootId` (from `/proc/sys/kernel/random/boot_id`) for deterministic reboot detection. Shutdown reason written on SIGTERM. Boot decisions: FluxOS restart → skip, expired locations → fast-path removal, reboot → wait for sync then reconcile
- `dropCollection` → `deleteMany`, removed redundant corruption check and `validateAppsInformation`, single reindex path with integrated expired app removal
- `appStartupManager`: Replaces `stoppedAppsRecovery` with `manageAppsOnBoot()` (boot decisions), `reconcileAppsOnBoot()` (start valid containers), `monitorAndRecoverApps()` (ongoing health, gates on `bootComplete`)
The Problem
Cancelled enterprise apps (`expire: 100`) were being installed on nodes that restarted after the cancellation. Root causes:
- `globalAppsInformation` rebuilt at boot before hash sync
- `messageNotFound` flags preventing cancel messages from being fetched
- `updateAppSpecifications` racing with cancel deletes
Additionally, the install errors collection had a broken TTL (index on `cachedAt`, which was never set, so 19,000+ records were accumulating indefinitely) and an unsigned HTTP bulk fetch that trusted one peer's aggregate data without verification.
The startup rework was prompted by a race condition where `reindexGlobalAppsInformation` dropped `zelappsinformation` while a concurrent `FindCursor.getMore` was reading it. Investigation revealed deeper structural issues: split container ownership (Docker auto-starts on powercut, FluxOS manages on clean shutdown), setTimeout-based service coordination, and no mechanism for FluxOS to know its boot context.
Architecture
Before (timer-based)
After (event-driven)
Boot Flow
State Machine
Hash Sync Rewrite
Master's hash sync: sequential `storeAppTemporaryMessage` + `checkAndRequestApp` per message with 50ms delay. 10-14 DB operations per message. ~63 minutes for 58k messages.
New `processMessages`: streaming sequential fetch with async backpressure from up to 3 peers, batch existence check, batch insert via `insertMany`, incremental `prevSpecsMap` for same-chunk updates. Per chunk of 2000: one `$in` query → one `bulkWrite` → pre-load prev specs → verify each message (hash, specs, signature) → batch `insertMany` → batch mark hashes. 58k messages in ~2 minutes.
Bulk fetch uses `bulkFetchStreamAndProcess`, a zero-dependency streaming JSON parser with gzip decompression. Streams the 71MB `/apps/permanentmessages` response through a producer/consumer pipeline: the extractor parses objects one at a time, filters against `missingSet`, batches into groups of 500, pauses the stream while `processMessages` runs (async backpressure), then resumes. Peak heap 133MB vs 247MB for the old full-dump approach. Peers are tried sequentially: peer 1 usually resolves everything, and peers 2 and 3 are only contacted if `missingSet` still has entries.
Errors in `syncMissingHashes` propagate to the caller (`#runHashSync`) where retry logic lives. The previous catch block silently returned `{ missing: -1 }`, which made the retry mechanism dead code.
Signature verification fixes
Master's verification path has several bugs that cause valid on-chain messages to fail during hash sync replay. 9 bugs identified and fixed (see Bugs Found on Master):
- `usersToExtend` path
- `isExpireOnlyUpdate`: doesn't strip enterprise blob before comparison
- `processMessages`: doesn't decrypt previous spec before passing to signature verification
Zombie app recovery
Apps whose cancel messages were flagged `messageNotFound: true` from earlier sync failures become zombies: `reindexGlobalAppsInformation` only sees the registration, and the spawner installs it. Two recovery mechanisms:
- `nodeStartupTracker` before initial hash sync. On upgrade, resets all `messageNotFound` flags via `resetHashSyncForUpgrade()`. Marker written after successful hash sync.
- `reconstructAppMessagesHashCollection` on a block-height modulo (~5 days). Cross-checks `zelappshashes` against `zelappsmessages` and corrects mismatches. Emits `hashesChanged` if corrections were made. Orchestrator listens, schedules an immediate hash recheck on the next block.
Verified on chud:
`stardewvalley1777025188031` zombie (cancelled at h=2542210, expire=100) automatically detected, uninstalled, and broadcast removed within 3 minutes of deploying new code.
`updateAppSpecifications` split
Master's `updateAppSpecifications` used `upsert: true` and was called without await on the update path. Race condition: fire-and-forget upsert resolves after a cancel deletes the entry → re-inserts → zombie. Also had unbounded recursive retry (60s delay, infinite recursion).
Split into:
- `insertAppSpecifications`: registration path, `upsert: true`, awaited, returns `true`/`false`. Caller skips pending updates on failure.
- `updateAppSpecifications`: update path, `upsert: false`, awaited, returns `true`/`false`, no retry. If the entry was deleted, does nothing.
Signed Broadcast Sync Protocol
Two-layer verification: outer broadcast proves a real node sent the response, inner broadcasts individually verified against deterministic node list. Rate limited (5-min cooldown per peer per type). Clock offset adjusted.
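The per-peer, per-type cooldown could be tracked along the lines of the sketch below. This is an assumption about shape, not the PR's implementation: `createSyncRateLimiter`, `allowRequest` and the key format are hypothetical, and only the 5-minute window comes from the description.

```javascript
// Hypothetical helper; names are illustrative, not taken from the PR.
const COOLDOWN_MS = 5 * 60 * 1000; // 5-minute cooldown from the description

function createSyncRateLimiter(cooldownMs = COOLDOWN_MS) {
  const lastServed = new Map(); // key: `${peerIp}|${syncType}` -> timestamp in ms

  return function allowRequest(peerIp, syncType, now = Date.now()) {
    const key = `${peerIp}|${syncType}`;
    const previous = lastServed.get(key);
    if (previous !== undefined && now - previous < cooldownMs) {
      return false; // still cooling down for this peer and sync type
    }
    lastServed.set(key, now);
    return true;
  };
}
```

Keying on both peer and type means a peer that just requested one kind of sync can still request a different kind immediately. A rejected request does not refresh the timestamp, so the window is measured from the last request that was served.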
Sync Response Pipeline
Sync responses are routed directly in `FluxPeerSocket.onmessage` to `dispatchSyncResponse`, separate from the gossip message path. This avoids unnecessary per-chunk `object-hash`, message cache lookups, and random relay delays that gossip messages need but solicited sync responses don't.
ECDSA signature verification (`bitcoinMessage.verify`, ~3-4ms per call) is offloaded to a worker thread pool (`verifyPool.js`, `cpus-1` workers). The main thread does node list lookups and prepares verification batches; workers do the cryptographic work and return boolean arrays. Worker crash recovery respawns the worker and resubmits pending batches. Pool stopped on SIGTERM via `apiServer.js`.
Per-peer FIFO chunk queues ensure chunks from the same peer process in order, preventing a small final `done:true` chunk from completing before larger earlier chunks.
Sender-side backpressure via `sendAsync()` (Promise-based `ws.send`) prevents megabytes of queued sync data from blocking WebSocket pong frames. Sync senders use `awaitDrain: true` to wait for TCP drain between chunks.
Live test results: 14,200 broadcasts verified and stored in 21 seconds (1 peer), 42,000 broadcasts from 3 peers in 46 seconds. Zero missed pongs.
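The per-peer FIFO ordering described above can be approximated with one promise chain per peer. This is a minimal sketch under that assumption and not the PR's queue: `peerQueues` and `enqueueChunk` are hypothetical names, and the handler stands in for whatever verifies and stores a chunk.

```javascript
// Hypothetical sketch: serialize async chunk handlers per peer with a promise chain.
const peerQueues = new Map(); // peerId -> tail promise of that peer's chain

function enqueueChunk(peerId, handler) {
  const tail = peerQueues.get(peerId) || Promise.resolve();
  // Start this handler only after every earlier chunk from the same peer has settled.
  const result = tail.then(() => handler());
  // The stored tail never rejects, so one failed chunk does not block later ones.
  const nextTail = result.catch(() => {});
  peerQueues.set(peerId, nextTail);
  nextTail.then(() => {
    // Drop the entry once the queue for this peer has drained.
    if (peerQueues.get(peerId) === nextTail) peerQueues.delete(peerId);
  });
  return result;
}
```

Chunks from different peers still run concurrently, since each peer has its own chain. Within one peer, a fast `done:true` chunk queued last cannot finish before a slower chunk queued ahead of it.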
Event Log
Running app state uses a single append-only event log (`appstateevents`). Event types: `apprunning` (v1/v2), `sigterm`, `appremoved`, `evicted`, `ipchanged`. Dedup via unique compound index `{ip, type, dedupKey}`. Upserts use a `$gt` guard for strictly-newer overwrites. The `appLocationFromEvents()` aggregation view derives current state. Fixes the master staleness bug (dropped apps between v2 broadcasts). Handles v1/v2 overlap, shutdown grace periods, IP remapping. Pipeline filters at IP level before unwinding. 20 unit tests.
Install Errors Fix
- `cachedAt` to `broadcastedAt` (24-hour expiry)
- `fluxappinstallingerror` messages, so 5 transient infra errors from 5 nodes would suppress a healthy app network-wide. Error broadcasts are still generated, stored, and synced (useful for diagnostics)
- `spawnErrorsLongerAppCache` (don't retry on this node). Previously had no cache at all and retried every 60 seconds indefinitely
- `spawner:installFailed` on local failure
App Lifecycle Startup Rework
Reindex race fix
`reindexGlobalAppsInformation` uses `dropCollection` on `zelappsinformation`, which invalidates any open `FindCursor` on that collection. On master this can be triggered by `validateAppsInformation` or `initiateBlockProcessor`'s corruption check while concurrent readers (gossip handlers, API endpoints, other startup code) have open cursors.
Fixes:
- `dropCollection` → `deleteMany` (preserves collection + indexes, no cursor invalidation)
- Removed the redundant corruption check in `initiateBlockProcessor`
- Removed `validateAppsInformation` from startup (redundant with orchestrator reindex)
- `registryManager.reindexGlobalAppsInformation` now handles expired app removal via `syncAppsInformationCollection` return value (previously discarded, then recomputed by a redundant `expireGlobalApplications` call)
- Expire validation taken out of `specificationFormatter` (belongs in `appValidator.verifyAppSpecifications`)
- `createIndex` calls in `serviceManager` switched to `ensureIndex`, which tolerates conflicting index specs by finding the conflict via `listIndexes`, dropping by actual name, and recreating
FluxOS-managed container startup
Previously, container startup was split: clean shutdown was FluxOS-managed (`unless-stopped` respected explicit stops), powercuts were Docker-managed (auto-starts everything blindly). This made post-powercut behavior unpredictable: containers running with stale state, no DB sync, no location validation.
Now all containers use restart policy `no`. Docker never auto-starts containers. FluxOS decides what to run after syncing.
- Restart policy `no` hardcoded in `appDockerCreate`
- `getRestartPolicy()` function removed (was `unless-stopped`/`always`/`no` based on flags/owner)
- `restartAlwaysOwners` config removed (dead code: whitelisted address never deployed an app)
- `migrateContainerRestartPolicies()` updates existing containers on boot via `container.update()`: non-destructive, doesn't stop running containers
Boot context
A heartbeat document in `nodestartuptracker` (written every 30s) provides boot context:
- `lastAlive`: timestamp of last heartbeat
- `machineBootId`: from `/proc/sys/kernel/random/boot_id`, deterministic reboot detection
- `shutdownReason`: written by SIGTERM handler (3s timeout to prevent hang if Mongo shutting down), absent on powercut/crash
`readBootContext` separates the boot_id read from the DB read: if `/proc/sys/kernel/random/boot_id` is unreadable (containerized FluxOS), heartbeat data (`downtimeMs`, `cleanShutdown`, `firstBoot`) is still preserved. `machineRebooted` defaults to `true` in this case.
`shutdownReason` is `$unset` from the heartbeat at the start of each boot (before the heartbeat interval starts) so it only reflects the immediately preceding shutdown.
Decision matrix on boot:
Lifecycle coordination
Three awaitables on `globalState` replace setTimeout-based service starts:
- `waitForDaemonReady()`: bare promise, set after `waitForDaemonRpc`
- `waitForDbReady()`: event-driven via `DB_READY`, set after orchestrator reindex
- `waitForBootComplete()`: bare promise, set when `manageAppsOnBoot` finishes
Services moved from setTimeout delays to `waitForDbReady()`: enterprise app check, ownership cleanup, port restore interval.
`appStartupManager` module
Replaces `stoppedAppsRecovery`. Three functions with clear responsibilities:
- `manageAppsOnBoot(bootContext)`: top-level boot decision maker. Fast-path removal for expired locations, daemon/sync timeouts (5min each)
- `reconcileAppsOnBoot()`: boot-time container management. Checks location records, starts valid apps, skips expired
- `monitorAndRecoverApps()`: ongoing runtime health monitor. Detects vanished/stopped containers, recreates or restarts. Gates on `waitForBootComplete()` to prevent racing with boot reconciliation. Returns `startedApps` so the broadcast in `checkAndNotifyPeersOfRunningApps` reflects recovery state
Also renamed `checkStoppedApps` → `monitorAndRecoverApps` to reflect actual scope (recreation, master/slave routing, two-strike restart, failure removal).
Bugs Found on Master
1. `storeAppRunningMessage` never deletes locations for apps dropped between v2 broadcasts. 15 stale entries on barbados. Event log view fixes this.
2. `cachedAt` field never set. 15,640 records on barbados vs ~4,500 with fix.
3. `checkApplicationRegistrationNameConflicts` checks locally installed apps during hash sync, which is inappropriate for syncing blockchain records. Causes ~100 missing messages.
4. `storeAppTemporaryMessage` falls back to `daemonHeight` when hash not yet scanned. Pre-fork specs pass enforcement check. 4 affected messages.
5. `usersToExtend` throws: `checkAndDecryptAppSpecs` called unconditionally in usersToExtend path. Non-ArcaneOS nodes can't decrypt → throws → message not stored.
6. `isExpireOnlyUpdate` doesn't strip enterprise blob: Re-encrypted blobs differ even when content is identical. Comparison always fails after decryption.
7. `messageNotFound`: Cancel messages flagged unreachable → never retried → reindex rebuilds from incomplete data → spawner installs cancelled apps.
8. `updateAppSpecifications`: Unawaited upsert races with cancel delete → re-inserts deleted entry. Recursive 60s retry on failure.
9. `specificationFormatter` expire validation blocks removal: Validates `expire` against `maxBlocksAllowance` using current daemon height (which is 0 at early startup). Rejects valid post-PON apps (110 apps with `expire > 264000`). Validation belongs in `appValidator.verifyAppSpecifications` which uses registration height.
Rollout Behavior
During rollout (mixed network): Nodes running new code will find no `appStateSync` peers and fall back to the block-count timer: 125 minutes for non-enterprise, 62 minutes for enterprise. Hash sync improvements, zombie recovery, spawner expiration filter, and container lifecycle changes apply immediately. Existing containers are migrated from `unless-stopped` to `no` restart policy on first boot with new code.
After full network upgrade: Restarting nodes sync ephemeral data in ~1.5 seconds, hash sync completes in ~2 minutes. Boot-to-spawner: ~5 minutes. Containers managed entirely by FluxOS.
Test Results
Tested on 9 nodes (6 test + 3 production reference). 265+ unit tests across 8 suites. 27 integration test suites with 190+ tests running on a 5–12 node testcontainers environment on cindy.
Hash sync (cindy, clean deploy, 2026-05-06)
58,461 permanent messages. 4 failures (pre-fork spec versions, correctly rejected). All 5 previously-failing signature hashes now stored.
Zombie recovery (chud, code-only deploy, 2026-05-06)
Version upgrade reset 643 `messageNotFound` flags. Hash sync fetched stardewvalley cancel message. `checkAndRequestApp` detected expired → removed from `globalAppsInformation` → container uninstalled → `fluxappremoved` broadcast. Zombie resolved in ~3 minutes, zero manual intervention.
Cross-node comparison (2026-05-06, 9 nodes)
Errors and installing match exactly across all 9 nodes. Cindy's 2-location difference: confirmed master staleness bugs (apps dropped between v2 broadcasts).
App lifecycle startup (chalk, 2026-05-09)
Test plan
Live node testing
- (`appSyncMinPeerUptime: 7500`, `appSyncMinCompletions: 3`)
Integration test suites (27 suites, 187 tests)
Testcontainers-based harness on cindy with 5–10 FluxOS nodes, MongoDB (with failpoint injection), daemon stub, peer stubs, local Docker registry, and SSE event stream verification.
Unit tests (265+ tests across PR-touched files)
Test infrastructure
- (`enableTestCommands`) for fault testing
- `afterId` filtering for temporal event scoping
- `pushBrokenImage()` for creating images that fail to start
- (`nodeConfigOverrides`) for different config on specific nodes
- `spawner:installFailed` SSE event
- `hashSync:complete` and `hashSync:failed` SSE events
- `ephemeralSync:requested`, `ephemeralSync:peerComplete`, `ephemeralSync:allComplete`, `sync:chunkVerified` SSE events
🤖 Generated with Claude Code