feat: event-driven app state sync & spawner startup#1724
Closed
MorningLightMountain713 wants to merge 34 commits into
Closed
feat: event-driven app state sync & spawner startup#1724MorningLightMountain713 wants to merge 34 commits into
MorningLightMountain713 wants to merge 34 commits into
Conversation
The spawner had no awareness of block height or app expiration. Cancelled apps (expire=100) remained in globalAppsInformation for up to 3+ hours until the next expireGlobalApplications sweep, during which the spawner would install them on new nodes. Add a PON-fork-aware expiration filter to the aggregation pipeline that excludes apps expiring within 100 blocks (newMinBlocksAllowance). Filter runs before the $lookup join so expired apps never reach candidate selection. Also fix two existing bugs: - findIndex used >= instead of <=, popping deferred apps immediately instead of waiting their scheduled time - Array.includes() with a callback always returned false (should be .some()) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace timer/loop-based app hash sync with an event-driven architecture to fix cancelled apps being installed on nodes with stale data. AppSyncOrchestrator coordinates the sync lifecycle: - Listens for blockEmitter (explorer synced) and peerManager threshold events - Fetches temp messages from eligible peers on reconnect (new tempMessageSync P2P capability with binary 0x20 request + signed JSON response) - Runs unified syncMissingHashes (multi-peer, no 95% threshold, no 30-min loop) - Rebuilds globalAppsInformation AFTER sync via reindexGlobalAppsInformation - Emits spawnerReady when all readiness conditions met - Pauses spawner if peers drop below degraded threshold (hysteresis) Other changes: - FluxPeerManager extends EventEmitter with peerThresholdReached/peersBelowThreshold - X-Flux-Uptime header in WebSocket handshake for peer uptime tracking - messageStore triggers immediate promotion when temp message arrives for known hash - peerNotification.checkAndNotifyPeersOfRunningApps uses direct imports - Remove duplicate continuousFluxAppHashesCheck from messageVerifier - Remove timer-based spawner start and 30-min hash sync loop from serviceManager - Fix syncthingHealthMonitor test to match broadcast change (sendMessage=true) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add #syncInProgress flag to prevent concurrent syncs in orchestrator - Add per-peer 5-minute cooldown on temp message requests - Cap incoming temp sync messages at 500 - Remove dead ponFork variable and unused config import - Collapse identical conditional branches in #runHashSync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store signed apprunning broadcasts in new fluxapprunningbroadcasts collection alongside existing zelappslocation. On peer threshold, request signed broadcasts from capable peers via binary 0x21 with sinceTimestamp for delta sync. Sender streams results using cursor- based chunking (2000/chunk) with done flag. Receiver verifies each inner broadcast signature individually then bulk writes both collections. Location readiness now data-driven (appRunningSyncComplete) with block-count fallback for legacy networks. Also updates temp message sync (0x20) to include sinceTimestamp and use cursor-based streaming. Removes arbitrary 500-message cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract FLUX_CAPABILITIES to FluxPeerSocket.js as single source of truth. Both the WebSocket server (inbound upgrade response) and outbound client (upgrade request) now use the same list. Also adds X-Flux-Uptime to server response headers so peers learn our uptime regardless of connection direction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add signed appinstalling broadcast sync (0x22) using the same cursor-based streaming pattern as apprunning sync. Stores signed broadcasts in fluxappinstallingbroadcasts collection for sync, existing appsinstallinglocations unchanged. Unify capability names: replace tempMessageSync + appRunningSync with single appStateSync capability covering all three sync types (0x20 temp messages, 0x21 apprunning, 0x22 appinstalling). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When storeAppRunningMessage removes entries from appsinstallinglocations (app finished installing), also remove the corresponding signed broadcast from fluxappinstallingbroadcasts to keep both collections consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only send installing broadcasts newer than 5 minutes. The TTL on the collection is 15 minutes but the validity window is 5 minutes — records between 5-15 minutes old are stale (install likely finished/failed) and would be rejected by the receiver anyway. Filtering on the sender avoids wasting bandwidth on records that can't be stored. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sender filters to 15 minutes (matching TTL) instead of 5 minutes. Receiver signed storage and batch store also use 15 minutes. The gossip path retains 5-minute validity — gossip messages are relayed and could be stale, but sync is a point-in-time snapshot of current DB state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix broken TTL on appsInstallingErrorsLocations: index was on `cachedAt` (never set), changed to `broadcastedAt` with 24-hour expiry. Remove null-expiry logic that prevented 98% of records from ever expiring. Remove startup wipe for consistency with other ephemeral collections. Change gossip validity from 60min to 5min (sync path accepts 24 hours). Add signed error broadcast sync (0x23) following the same pattern as apprunning (0x21) and appinstalling (0x22). Sender filters to 24-hour window. Remove unsigned HTTP fetch (getPeerAppsInstallingErrorMessages) that was trusting one peer's aggregate data without verification. Enable spawner to check network-wide error count per app hash. If 5+ distinct nodes have reported install failures for the same hash, skip the app. Combined with the 7-day in-memory spawnErrorsLongerAppCache, this means: broken specs are skipped network-wide within minutes, retried once per day as errors expire from TTL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add orchestrator tests for apprunning sync requests, location readiness with appRunningSyncComplete flag, block-count fallback, and degradation reset. Add peerCodec tests for all four sync message types (0x20-0x23) with timestamp roundtrip verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change eligibility threshold from 60s (testing) to 7200s (2 hours) for all 4 sync types. A peer needs at least 2 hours of uptime to have accumulated a full cycle of apprunning broadcasts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change apprunning gossip store validity from 125 min to 5 min — consistent with installing and errors. The broadcast relay check already limits to ~5 minutes so no legitimate gossip message should be older. Change peer uptime threshold from 7200s to 7500s to match the apprunning TTL exactly. A peer needs a full TTL cycle (2h5m) to have complete data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Filter apprunning broadcasts to 125 minutes on sender side, consistent with installing and errors. Prevents sending records that expired but haven't been cleaned up by MongoDB's TTL thread. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
reindexGlobalAppsInformation was wiping the entire error locations collection on every reindex while leaving the signed broadcasts untouched, causing a growing mismatch (854 vs 301 on sandwich). expireGlobalApplications and updateAppSpecifications only removed from error locations, not broadcasts. All three now operate on both collections consistently. Explorer rescan also drops both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spawner was starting after syncing from just 1 peer — any single appStateSync peer responding with done:true set appRunningSyncComplete globally, bypassing the 125-minute block-count fallback entirely. Redesigned sync readiness: - Merged duplicate getEligibleTempSyncPeers/getEligibleAppRunningSyncPeers into getEligibleSyncPeers with missedPongs===0 check and shuffle - Orchestrator tracks asked peers per cycle and sync completions per type - Requires all 3 sync types (apprunning, appinstalling, apperrors) to complete from 3 peers before setting stateSyncComplete - Falls back to block-count timer if <3 eligible peers exist - 2-minute timeout falls back if syncs don't complete - Replaced globalState.appRunningSyncComplete with orchestrator-internal state, wired via setOnSyncComplete callback - Renamed isLocationReady to isStateSyncReady - Extracted peer uptime threshold to config (appSyncMinPeerUptime), set to 60s for testing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set to 1 in default config for testing on 2-node networks, 3 in test config for unit tests. Also adds appSyncMinPeerUptime and other sync config values to the test config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The signed broadcasts collection was upserting by {ip} only, so v1
gossip messages (one per app) from the same IP overwrote each other.
Only the last app's broadcast survived — losing data for sync
responses. Changed to upsert by {ip, data.name} for v1 messages,
keeping {ip} for v2 messages which already contain all apps.
Updated unique index to match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
peerNotification was sending v1 (per-app) for single-app nodes and v2 (apps array) for multi-app nodes, with duplicate local storage. appInstaller was also sending v1. Simplified both to always send v2, removing the v1/v2 branching on the sending side. Receivers still handle v1 from old nodes on the network. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aggregation function that produces the same output shape as
appLocation() but reads from the signed broadcasts collection.
Handles both v1 (per-app) and v2 (multi-app) documents via $facet,
deduplicates by {name, ip} taking the latest broadcastedAt. Adds
data.apps.name index for v2 app queries. Also reverts the v1
cleanup logic — v1 docs coexist with v2 and TTL out naturally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orage
Major refactoring of app state broadcasting:
1. Split peerNotification into broadcast-only + recovery:
- Stopped-app recovery logic (recreateMissingContainers,
handleMissingMasterSlaveContainer, checkStoppedApps) moved to
stoppedAppsRecovery.js where it belongs
- peerNotification.js now only handles broadcasting + timer
- Breaks circular dep: appInstaller no longer imports peerNotification
- appInstaller uses setOnInstallComplete callback wired by serviceManager
2. Always send v2 apprunning broadcasts:
- peerNotification always sends v2 with full apps array
- appInstaller triggers full checkAndNotifyPeersOfRunningApps via callback
- Removes v1/v2 branching on the sending side
3. Fix v1 broadcast storage overwriting:
- v1 broadcasts (from old nodes) now upsert by {ip, data.name} not {ip}
- Prevents losing per-app data when multiple v1 messages arrive
- Updated unique index to {ip, data.name}
4. Broadcast timer self-resets:
- Moved interval from orchestrator to peerNotification
- resetBroadcastInterval called in finally block
- Any call to checkAndNotifyPeersOfRunningApps resets the 1h timer
5. appLocationFromBroadcasts aggregation view:
- New function produces same output shape as appLocation()
- Handles both v1 and v2 via $facet, dedupes by {name, ip}
- Added data.apps.name index for v2 app queries
- Prepares for replacing locations collection with broadcast collection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## development #1724 +/- ##
===============================================
+ Coverage 55.02% 55.08% +0.05%
===============================================
Files 135 139 +4
Lines 27734 28442 +708
===============================================
+ Hits 15260 15666 +406
- Misses 12474 12776 +302 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
broadcastMessageToAll now returns the signed object. peerNotification stores it locally via storeSignedAppRunningBroadcast, closing the gap where a node's own running apps appeared in the locations collection but not in the signed broadcasts collection until gossip relayed the message back. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batch store path (used by sync responses) was doing unconditional $set on location collections, which could overwrite newer data from the gossip path or the node's own broadcast. Use aggregation pipeline updates with $cond to only write fields when the incoming broadcastedAt is newer than what's already in the DB. Affects all three batch stores: running, installing, and errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…node When a v2 broadcast arrives with fewer apps than previously stored, the location collection kept orphaned entries for removed apps. Now both the gossip path and batch sync path remove location entries for apps no longer in the v2 broadcast's app list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batch sync upserts filter by {name, ip} but the collection only had
an index on {name}. For popular apps running on ~100 nodes, each upsert
examined ~99 docs to find 1. The compound index reduces this to a single
key lookup — explain shows 99 docs examined → 1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a v2 broadcast arrives for an IP, any v1 signed docs for apps no longer in the v2's app list are now removed. Applies to both the gossip path (storeSignedAppRunningBroadcast) and batch sync path. Ensures the signed broadcast collection stays consistent with the location collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cleanup query filters by ip first with $nin on name. The previous
{name, ip} order couldn't use the index prefix for ip-first lookups,
causing 1131 key scans per IP. Reversing to {ip, name} reduces cleanup
to 4 key scans per IP. The upsert query {name, ip} still uses this
index efficiently (MongoDB handles field order).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The signed broadcast stores on the gossip path used the full TTL (125 min / 15 min / 24 hours) while the location stores used 5 min. Stale gossip (>5 min) would be stored in the signed collection but rejected from the location collection, causing inconsistency. Now both use 5 min on the gossip path. The batch sync path retains full TTL validity since sync is a point-in-time snapshot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gossip arrives in unpredictable order — a stale v2 relay can trigger cleanup that removes valid fresher v1 data. Remove all cleanup from the gossip path (storeSignedAppRunningBroadcast, storeAppRunningMessage). Batch sync cleanup is safe because it processes a consistent snapshot. Add broadcastedAt condition to cleanup deletes so concurrent gossip with fresher data survives. Merge upserts and cleanup into a single bulkWrite per collection for atomicity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes to eliminate orphaned entries between collections:
1. break → continue in storeAppRunningMessage loop: for v2 messages
with multiple apps, skip apps that already have current data but
keep processing the rest. Previously broke out of the entire loop.
2. storeAppRunningMessage returns { stored, rebroadcast } instead of
true/false. The gossip handler only calls storeSignedAppRunningBroadcast
when stored is true, ensuring both collections accept or reject together.
3. Remove redundant 5-minute gossip validity check from
storeSignedAppRunningBroadcast — it's now gated on the location
store's acceptance, eliminating the timing edge where one store
accepts at the boundary and the other rejects milliseconds later.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sigterm handler was mutating broadcastedAt on location records to force 7-minute TTL expiry. This broke the data contract — broadcastedAt is derived from signed data and should never change. Stale gossip could also overwrite the sigterm by passing the "is newer" check against the fake broadcastedAt value. Switch all 6 ephemeral collections to expireAt-based TTL (expireAt:0). expireAt is operational metadata we control, not part of the signed payload. Sigterm now sets expireAt = now + 7min on both locations and signed broadcasts without touching broadcastedAt. Also: split gossip validity (5min) from record expiry into named constants, add missing expireAt to error stores, fix empty-apps v2 handler to clean up signed broadcasts with broadcastedAt guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nodeStatusMonitor and storeAppRemovedMessage deleted from zelappslocation without touching fluxapprunningbroadcasts, leaving orphaned signed broadcasts (~44 per 20-minute monitor cycle). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- storeAppRemovedMessage: $addToSet excludedApps on v2 broadcast docs so the derived view skips removed apps without mutating signed data - storeSignedAppRunningBroadcast + batch sync: $unset excludedApps when a newer broadcast upserts (clears stale exclusions) - appLocationFromBroadcasts: filter out excluded apps after v2 unwind - reindexGlobalAppsLocation: also drop running broadcasts collection - explorer rescan: also drop running + installing broadcasts - Export handleMissingMasterSlaveContainer from stoppedAppsRecovery - Fix all 10 CI test failures, add excludedApps tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Author
|
Superseded by #1726 which includes the event log approach. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the timer-based spawner startup (125-minute fixed wait) with an event-driven architecture that syncs all ephemeral app state from peers in seconds. A restarting node now reaches spawner readiness in ~30 seconds instead of 2+ hours. Also fixes a broken TTL on the install errors collection and enables the spawner to skip apps with network-wide install failures.
Key changes:
cachedAtindex replaced with workingbroadcastedAt(24-hour TTL)The Problem
Cancelled enterprise apps (
expire: 100) were being installed on nodes that restarted after the cancellation. Six root causes identified:globalAppsInformationrebuilt at boot before hash syncAdditionally, the install errors collection had a broken TTL (index on
cachedAtwhich was never set — 19,000+ records accumulating indefinitely on production nodes) and an unsigned HTTP bulk fetch (getPeerAppsInstallingErrorMessages) that trusted one peer's aggregate data without verification.Architecture
Before (timer-based)
After (event-driven)
State Machine
Signed Broadcast Sync Protocol
How it works
[type:1][sinceTimestamp:8]done: true→ receiver knows sync is completeSecurity
bitcoinjs-messagesinceTimestampadjusted bypeer.remoteClockOffsetMsbefore sender queriesgetPeerAppsInstallingErrorMessages(which fetched unsigned data via HTTP from one peer), every synced record is cryptographically verified back to its originating nodeCapability
Single
appStateSynccapability covers all four sync types. Defined once inFluxPeerSocket.jsasFLUX_CAPABILITIES, used by both WebSocket client (outbound request headers) and server (inbound response headers). Previously capabilities were duplicated in two files with different values — fixed.Sender-Side Filtering
Each sync type filters by validity on the sender side to avoid sending records that have technically expired but haven't been cleaned up yet by MongoDB's TTL thread (~60s lag):
Dual-Collection Storage
Each ephemeral data type stored in two collections:
pubKey,signature,data— used for serving sync requestsBoth written on every gossip message. Installing collections cleaned up when app transitions to running.
Why duplicate instead of replacing? There are 15+ direct queries against the existing location collections scattered across the codebase (messageStore, peerNotification, fluxCommunication, nodeStatusMonitor, stoppedAppsRecovery, advancedWorkflows, appController, syncthingMonitor). A separate branch centralizes all these queries through
registryManager. Changing the document shape in this PR would conflict with that work. The signed collections add ~1.7MB overhead (running) — trivial. Once the centralization branch is rebased on top of this, the dual-collection pattern can be collapsed into a single collection with an aggregation-basedappLocation()query.fluxapprunningbroadcastszelappslocationfluxappinstallingbroadcastsappsinstallinglocationsfluxappinstallingerrorsbroadcastsappsInstallingErrorsLocationsValidity Windows
Gossip validity is short because gossip messages are relayed — stale messages indicate network issues. Sync validity matches TTL because sync is a point-in-time snapshot of the sender's DB.
Install Errors Fix
What was broken
cachedAt— field never set in any document, nothing ever expiredexpireAt: nullexpireAt: null(persist forever)removeDocumentsFromCollection({})wiped collection on every restart (inconsistent with other ephemeral collections)getPeerAppsInstallingErrorMessagesfetched unsigned error data via HTTP from one random peerWhat was fixed
cachedAttobroadcastedAtwith 86400s (24-hour) expirygetPeerAppsInstallingErrorMessages— replaced by signed 0x23 syncSpawner network-wide error check
Previously commented out (
appSpawner.js:336-340). Re-enabled with new logic:Two-layer error handling:
spawnErrorsLongerAppCacheprevents nodes that already failed from retrying for a weekOther Changes
FLUX_CAPABILITIESextracted toFluxPeerSocket.jsas single source of truth (was duplicated betweensocketServer.jsandfluxCommunication.jswith different values)X-Flux-Uptime: Now sent in both inbound and outbound connection headers (was outbound only)peerNotification.js: Simplified from 10-parameter injection to directglobalStateimportsmessageVerifier.js: Removed ~165 lines of duplicate codeappHashSyncService.js: Rewritten — multi-peer parallel fetch, no 95% threshold skip#syncInProgressflag prevents overlapping syncs in orchestratorRollout Behavior
During rollout (mixed network): Nodes running new code will find no
appStateSyncpeers and fall back to the block-count timer — 125 minutes for non-enterprise, 62 minutes for enterprise. This is the same behavior as today. All other improvements (hash sync, DB rebuild sequencing, spawner expiration filter, immediate promotion, network error check) apply immediately regardless.After full network upgrade: Restarting nodes sync all location data from peers in ~1.5 seconds. Spawner can start as soon as explorer syncs + hash sync + DB rebuild completes — typically 2-5 minutes after restart instead of 2+ hours.
Test Results
Tested on two dedicated test nodes.
Test 1: Temp Message Sync
/apps/temporarymessagesAPITest 2: App Running Sync (full, from clean slate)
1 broadcast failed:
nodeNotFound(node dropped off network between storage and verification — expected)Test 3: App Installing Sync
Test 4: Install Errors Sync
Test 5: Collection Consistency
Test 6: Spawner Error Check
Inserted 6 test error records for a fake hash. Verified:
Test 7: Capability Advertisement
Fixed and verified: both inbound (server response) and outbound (client request) now advertise
appStateSync. Previously server-side was missing new capabilities.Test 8: Installing Validity Window
Initial test showed 132 received, 44 stored. Root cause: sender was sending records up to 15 minutes old (TTL window) but receiver only accepted 5 minutes (gossip validity). Fixed: sender now filters to validity window, 100% stored on subsequent tests.
Performance
Total sync time: ~1.5 seconds for full network state
Files Changed
config/default.jsutils/peerCodec.jsutils/FluxPeerSocket.jsFLUX_CAPABILITIESconstantutils/FluxPeerManager.jsutils/globalState.jsappRunningSyncComplete,spawnerPausedlib/socketServer.jsFLUX_CAPABILITIES, addsX-Flux-UptimeappMessaging/messageStore.jsappMessaging/appSyncOrchestrator.jsappMessaging/appHashSyncService.jsappMessaging/messageVerifier.jsappMessaging/peerNotification.jsfluxCommunication.jsX-Flux-UptimefluxCommunicationMessagesSender.jsserviceManager.jsappLifecycle/appSpawner.jsappLifecycle/advancedWorkflows.jsgetPeerAppsInstallingErrorMessagesappDatabase/registryManager.jscountAppInstallingErrors, cleaned projectionsTest plan
appStateSyncpeers → block-count timer activatesPre-PR: Change uptime threshold from 60s to 7500s(done, 4 locations — matches apprunning TTL)Before merging