feat: event-driven app state sync & spawner startup by MorningLightMountain713 · Pull Request #1724 · RunOnFlux/flux

MorningLightMountain713 · 2026-04-30T08:18:00Z

Summary

Replaces the timer-based spawner startup (125-minute fixed wait) with an event-driven architecture that syncs all ephemeral app state from peers in seconds. On a fully upgraded network, a restarting node finishes peer state sync about 30 seconds after boot and reaches spawner readiness within minutes instead of 2+ hours. Also fixes a broken TTL on the install errors collection and enables the spawner to skip apps with network-wide install failures.

Key changes:

AppSyncOrchestrator: State machine coordinating hash sync, DB rebuild, and spawner startup via events instead of timers
Signed broadcast sync protocol: Four binary message types (0x20-0x23) for syncing temp messages, apprunning locations, appinstalling locations, and install errors from peers over WebSocket
Spawner expiration filter: Pipeline-level check prevents spawning expired/cancelled apps (the original bug fix)
Immediate promotion: Temp messages arriving for known hashes are promoted instantly instead of waiting for the 30-minute sweep
Network-wide error check: Spawner skips apps with 5+ install failures across distinct nodes
Install errors TTL fix: Broken cachedAt index replaced with working broadcastedAt (24-hour TTL)

The Problem

Cancelled enterprise apps (expire: 100) were being installed on nodes that restarted after the cancellation. Six root causes identified:

No expiration check in the spawner aggregation pipeline
Stale globalAppsInformation rebuilt at boot before hash sync
Slow hash sync: 30-minute loop, 500 at a time, one peer
95% threshold skipping bulk sync for nodes down briefly
Missed gossip: P2P gossip is fire-and-forget, no replay
Timer-based spawner: starts on fixed clock regardless of DB state

Additionally, the install errors collection had a broken TTL (index on cachedAt which was never set — 19,000+ records accumulating indefinitely on production nodes) and an unsigned HTTP bulk fetch (getPeerAppsInstallingErrorMessages) that trusted one peer's aggregate data without verification.

Architecture

Before (timer-based)

T+0        Boot, rebuild DB from incomplete data
T+15-30m   Hash sync starts (30-min loop)
T+62m      Enterprise spawner (TIMER) — may use stale data
T+125m     Non-enterprise spawner (TIMER) — may use stale data

After (event-driven)

T+0        Boot, AppSyncOrchestrator starts
T+~30s     Peer threshold (12) → sync all 4 data types from peers
T+~31s     All syncs complete, location data ready
T+varies   Explorer syncs → hash sync → DB rebuild
T+varies   Node confirmed → spawnerReady

State Machine

INITIALIZING ──► SYNCING ──► READY
                   ▲            |
                   |            v
                RESYNCING ◄── DEGRADED
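
The edges in the diagram can be sketched as a small transition table. This is only an illustration of the state names and arrows shown above; the class name and the `transition` method are hypothetical and are not taken from AppSyncOrchestrator.

```javascript
// Allowed edges, read directly off the diagram above.
const ALLOWED_TRANSITIONS = {
  INITIALIZING: ['SYNCING'],
  SYNCING: ['READY'],
  READY: ['DEGRADED'],
  DEGRADED: ['RESYNCING'],
  RESYNCING: ['SYNCING'],
};

// Hypothetical helper: rejects any edge that is not in the diagram.
class SyncStateMachine {
  constructor() {
    this.state = 'INITIALIZING';
  }

  transition(next) {
    if (!ALLOWED_TRANSITIONS[this.state].includes(next)) {
      throw new Error(`Illegal transition ${this.state} -> ${next}`);
    }
    this.state = next;
    return this.state;
  }
}
```

Keeping the edges in one table makes it easy to see that a DEGRADED node has to pass through RESYNCING and SYNCING again before it is READY.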

Signed Broadcast Sync Protocol

Message	Code	Data	Chunk Size
Temp Messages	0x20	App registrations/updates not yet on-chain	2000
App Running	0x21	Node → running apps mapping	2000
App Installing	0x22	In-progress installations	2000
Install Errors	0x23	Failed installations per node per app hash	2000

How it works

Receiver sends 9-byte binary request: [type:1][sinceTimestamp:8]
Sender opens MongoDB cursor (consistent snapshot), streams in chunks of 2000 — caps memory usage regardless of dataset size
Each chunk is a signed broadcast containing an array of inner signed broadcasts
Receiver verifies each inner broadcast signature against deterministic node list
Verified broadcasts bulk-written to both signed and location collections
Final chunk has done: true → receiver knows sync is complete
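
As a rough illustration of the 9-byte request layout described in the first step, the sketch below packs and unpacks `[type:1][sinceTimestamp:8]` with Node's `Buffer`. The function names are hypothetical, and the byte order (big-endian) and unit (milliseconds) are assumptions; the actual encoding lives in `utils/peerCodec.js` and may differ.

```javascript
// Message codes from the protocol table.
const SYNC_TYPES = {
  TEMP_MESSAGES: 0x20,
  APP_RUNNING: 0x21,
  APP_INSTALLING: 0x22,
  INSTALL_ERRORS: 0x23,
};

// Hypothetical encoder. Assumes big-endian and a millisecond timestamp.
function encodeSyncRequest(type, sinceTimestampMs) {
  const buf = Buffer.alloc(9);
  buf.writeUInt8(type, 0);
  buf.writeBigUInt64BE(BigInt(sinceTimestampMs), 1);
  return buf;
}

// Hypothetical decoder, the inverse of encodeSyncRequest.
function decodeSyncRequest(buf) {
  if (buf.length !== 9) {
    throw new Error(`Expected 9 bytes, got ${buf.length}`);
  }
  return {
    type: buf.readUInt8(0),
    sinceTimestamp: Number(buf.readBigUInt64BE(1)),
  };
}
```

A `sinceTimestamp` of 0 would request everything the sender still considers valid, which is what a node starting from a clean slate needs.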

Security

Two-layer verification: Outer broadcast proves a real node sent the response. Inner broadcasts are individually verified — pubkey checked against deterministic node list, signature verified via bitcoinjs-message
Rate limiting: 5-minute cooldown per peer per sync type
Clock offset adjustment: sinceTimestamp adjusted by peer.remoteClockOffsetMs before sender queries
No trust in aggregates: Unlike the old getPeerAppsInstallingErrorMessages (which fetched unsigned data via HTTP from one peer), every synced record is cryptographically verified back to its originating node
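
The per-peer, per-type cooldown could look roughly like the following. It is a minimal sketch: the class name, the `tryAcquire` method and the string key format are assumptions made for illustration, not the PR's code.

```javascript
const SYNC_COOLDOWN_MS = 5 * 60 * 1000; // 5-minute cooldown from the list above

// Hypothetical limiter keyed by peer and sync type.
class SyncRateLimiter {
  constructor(cooldownMs = SYNC_COOLDOWN_MS) {
    this.cooldownMs = cooldownMs;
    this.lastServed = new Map();
  }

  // Returns true and records the request if the peer may be served now.
  tryAcquire(peerKey, syncType, nowMs = Date.now()) {
    const key = `${peerKey}:${syncType}`;
    const last = this.lastServed.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastServed.set(key, nowMs);
    return true;
  }
}
```

Because the key includes the sync type, a peer that has just requested apprunning data can still request install errors without waiting.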

Capability

Single appStateSync capability covers all four sync types. Defined once in FluxPeerSocket.js as FLUX_CAPABILITIES, used by both WebSocket client (outbound request headers) and server (inbound response headers). Previously capabilities were duplicated in two files with different values — fixed.

Sender-Side Filtering

Each sync type filters by validity on the sender side to avoid sending records that have technically expired but haven't been cleaned up yet by MongoDB's TTL thread (~60s lag):

Running: Only sends records newer than 125 minutes (matches TTL)
Installing: Only sends records newer than 15 minutes (matches TTL)
Install errors: Only sends records newer than 24 hours (matches TTL)
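
One way to picture the sender-side filter is as a cutoff that is the later of the requester's `sinceTimestamp` and the start of the validity window. The sketch below assumes millisecond timestamps; the function name and the data type keys are hypothetical.

```javascript
// Sync validity windows from the list above, in milliseconds.
const SYNC_VALIDITY_MS = {
  running: 125 * 60 * 1000,
  installing: 15 * 60 * 1000,
  installErrors: 24 * 60 * 60 * 1000,
};

// Hypothetical helper: records older than the returned cutoff are not sent.
function senderCutoffMs(dataType, sinceTimestampMs, nowMs = Date.now()) {
  const windowMs = SYNC_VALIDITY_MS[dataType];
  if (windowMs === undefined) {
    throw new Error(`Unknown data type: ${dataType}`);
  }
  // Never send anything older than the validity window, even on a full sync.
  return Math.max(sinceTimestampMs, nowMs - windowMs);
}
```

The cutoff would then feed the query on the signed collection, for example as a lower bound on `broadcastedAt`.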

Dual-Collection Storage

Each ephemeral data type stored in two collections:

Signed collection (new): Full broadcast with pubKey, signature, data — used for serving sync requests
Location collection (existing, unchanged): Existing flat rows — used by all existing queries (spawner, advancedWorkflows, etc.)

Both written on every gossip message. Installing collections cleaned up when app transitions to running.

Why duplicate instead of replacing? There are 15+ direct queries against the existing location collections scattered across the codebase (messageStore, peerNotification, fluxCommunication, nodeStatusMonitor, stoppedAppsRecovery, advancedWorkflows, appController, syncthingMonitor). A separate branch centralizes all these queries through registryManager. Changing the document shape in this PR would conflict with that work. The signed collections add ~1.7MB overhead (running) — trivial. Once the centralization branch is rebased on top of this, the dual-collection pattern can be collapsed into a single collection with an aggregation-based appLocation() query.

Signed	Location	TTL
`fluxapprunningbroadcasts`	`zelappslocation`	125 min
`fluxappinstallingbroadcasts`	`appsinstallinglocations`	15 min
`fluxappinstallingerrorsbroadcasts`	`appsInstallingErrorsLocations`	24 hours

Validity Windows

Data Type	Gossip Validity	Sync Validity	TTL
Running	5 min	125 min	125 min
Installing	5 min	15 min	15 min
Install Errors	5 min	24 hours	24 hours

Gossip validity is short because gossip messages are relayed — stale messages indicate network issues. Sync validity matches TTL because sync is a point-in-time snapshot of the sender's DB.

Install Errors Fix

What was broken

TTL index on cachedAt — field never set in any document, nothing ever expired
19,304 records on production (chud), oldest 3 weeks, 98% had expireAt: null
Null-expiry logic: when 5+ errors accumulated for same app+hash, set expireAt: null (persist forever)
removeDocumentsFromCollection({}) wiped collection on every restart (inconsistent with other ephemeral collections)
getPeerAppsInstallingErrorMessages fetched unsigned error data via HTTP from one random peer

What was fixed

TTL index changed from cachedAt to broadcastedAt with 86400s (24-hour) expiry
Removed null-expiry logic
Removed startup wipe (consistent with other collections, TTL handles cleanup)
Removed getPeerAppsInstallingErrorMessages — replaced by signed 0x23 sync
Gossip validity reduced from 60 minutes to 5 minutes (consistent with installing messages)
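
For reference, a TTL index like the one described in the first item can be created with the MongoDB Node.js driver's `createIndex`. This is a sketch only: the function name is hypothetical and `collection` is assumed to be a driver `Collection` handle for the install errors collection. MongoDB's TTL monitor only expires documents whose indexed field holds a date, which is why an index on a field that is never set removes nothing.

```javascript
const INSTALL_ERRORS_TTL_SECONDS = 86400; // 24 hours, as stated above

// Hypothetical helper; `collection` is assumed to be a MongoDB driver Collection.
async function ensureInstallErrorsTtlIndex(collection) {
  // Documents expire once broadcastedAt is more than 24 hours in the past.
  return collection.createIndex(
    { broadcastedAt: 1 },
    { expireAfterSeconds: INSTALL_ERRORS_TTL_SECONDS },
  );
}
```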

Spawner network-wide error check

Previously commented out (appSpawner.js:336-340). Re-enabled with new logic:

const errorCount = await registryManager.countAppInstallingErrors(appHash);
if (errorCount >= 5) { // 5+ distinct nodes failed → broken spec
  globalState.spawnErrorsLongerAppCache.set(appHash, '');
  throw new Error(`...has ${errorCount} network-wide install failures, skipping`);
}

Two-layer error handling:

Network DB (24h TTL): Prevents untried nodes from wasting time. ~5 nodes retry per day as errors expire
Per-node in-memory (7 days): spawnErrorsLongerAppCache prevents nodes that already failed from retrying for a week
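
To make the threshold rule concrete, here is an in-memory sketch of the "5+ distinct nodes" check. It does not reproduce `registryManager.countAppInstallingErrors`; the function name and the record fields `hash` and `ip` are assumptions for illustration.

```javascript
const NETWORK_ERROR_THRESHOLD = 5;

// Hypothetical check: skip an app once enough distinct nodes reported a failure.
function shouldSkipApp(errorRecords, appHash, threshold = NETWORK_ERROR_THRESHOLD) {
  const failingNodes = new Set();
  for (const record of errorRecords) {
    if (record.hash === appHash) {
      failingNodes.add(record.ip);
    }
  }
  return failingNodes.size >= threshold;
}
```

Counting distinct nodes rather than raw records means one node failing repeatedly cannot by itself block an app network-wide.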

Other Changes

Capability unification: FLUX_CAPABILITIES extracted to FluxPeerSocket.js as single source of truth (was duplicated between socketServer.js and fluxCommunication.js with different values)
X-Flux-Uptime: Now sent in both inbound and outbound connection headers (was outbound only)
peerNotification.js: Simplified from 10-parameter injection to direct globalState imports
messageVerifier.js: Removed ~165 lines of duplicate code
appHashSyncService.js: Rewritten — multi-peer parallel fetch, no 95% threshold skip
Sync guard: #syncInProgress flag prevents overlapping syncs in orchestrator

Rollout Behavior

During rollout (mixed network): Nodes running new code will find no appStateSync peers and fall back to the block-count timer — 125 minutes for non-enterprise, 62 minutes for enterprise. This is the same behavior as today. All other improvements (hash sync, DB rebuild sequencing, spawner expiration filter, immediate promotion, network error check) apply immediately regardless.

After full network upgrade: Restarting nodes sync all location data from peers in ~1.5 seconds. Spawner can start as soon as explorer syncs + hash sync + DB rebuild completes — typically 2-5 minutes after restart instead of 2+ hours.

Test Results

Tested on two dedicated test nodes.

Test 1: Temp Message Sync

Stopped sandwich, registered 2 test apps while it was down
Restarted sandwich, peered to squidward
Sandwich received 3 temp messages (chunkymonkey + 2 test apps), all processed
Verified via /apps/temporarymessages API

Test 2: App Running Sync (full, from clean slate)

handleAppRunningSyncResponse - Received 2000 broadcasts (done: false)
handleAppRunningSyncResponse - Received 1336 broadcasts (done: true)
handleAppRunningSyncResponse - Stored 2000 of 2000 verified broadcasts
handleAppRunningSyncResponse - Stored 1335 of 1335 verified broadcasts

1 broadcast failed: nodeNotFound (node dropped off network between storage and verification — expected)

Test 3: App Installing Sync

handleAppInstallingSyncResponse - Received 57 broadcasts (done: true)
handleAppInstallingSyncResponse - Stored 57 of 57 verified broadcasts

Test 4: Install Errors Sync

handleAppInstallingErrorsSyncResponse - Received 3 broadcasts (done: true)
handleAppInstallingErrorsSyncResponse - Stored 3 of 3 verified broadcasts

Test 5: Collection Consistency

running broadcasts: 3335  apps: 5165  locations: 5166  match: ~✓ (1 from gossip)
installing broadcasts: 56  locations: 56  match: true
error broadcasts: 3  error locations: 3  match: true

Test 6: Spawner Error Check

Inserted 6 test error records for a fake hash. Verified:

countDocuments({hash: "testhash..."}) = 6 → would skip (>=5): true
countDocuments({hash: "nonexistent"}) = 0 → would skip (>=5): false

Test 7: Capability Advertisement

Fixed and verified: both inbound (server response) and outbound (client request) now advertise appStateSync. Previously server-side was missing new capabilities.

Test 8: Installing Validity Window

Initial test showed 132 received, 44 stored. Root cause: the sender was sending records up to 15 minutes old (TTL window) but the receiver only accepted records up to 5 minutes old (gossip validity). Fixed: the sync path now uses the 15-minute sync validity window on both sender and receiver, while gossip keeps its 5-minute window. 100% stored on subsequent tests.

Performance

T+0.0s   Boot
T+28.0s  Peer threshold reached (12 peers)
T+28.0s  All 4 sync requests sent
T+28.2s  Temp: 1 received, processed
T+28.3s  Installing: 57 received, verified, stored
T+28.3s  Errors: 3 received, verified, stored
T+28.7s  Running chunk 1: 2000 received
T+29.1s  Running chunk 2: 1336 received
T+29.7s  All syncs complete

Total sync time: ~1.5 seconds for full network state

Files Changed

File	Change
`config/default.js`	New collection names, peer thresholds
`utils/peerCodec.js`	0x20 sinceTimestamp, 0x21, 0x22, 0x23 encode/decode
`utils/FluxPeerSocket.js`	`FLUX_CAPABILITIES` constant
`utils/FluxPeerManager.js`	Binary handlers for 0x21-0x23, eligibility methods, threshold events
`utils/globalState.js`	`appRunningSyncComplete`, `spawnerPaused`
`lib/socketServer.js`	Uses shared `FLUX_CAPABILITIES`, adds `X-Flux-Uptime`
`appMessaging/messageStore.js`	Signed broadcast storage, batch operations, cleanup, error TTL fix
`appMessaging/appSyncOrchestrator.js`	Event-driven coordinator
`appMessaging/appHashSyncService.js`	Rewritten hash sync
`appMessaging/messageVerifier.js`	Removed duplicate code
`appMessaging/peerNotification.js`	Direct imports, simplified signature
`fluxCommunication.js`	Sync handlers, dispatch routes, capability, `X-Flux-Uptime`
`fluxCommunicationMessagesSender.js`	Cursor-based streaming responders
`serviceManager.js`	Collection indexes, orchestrator wiring, TTL fixes
`appLifecycle/appSpawner.js`	Event-driven startup, pause/resume, network error check
`appLifecycle/advancedWorkflows.js`	Removed `getPeerAppsInstallingErrorMessages`
`appDatabase/registryManager.js`	`countAppInstallingErrors`, cleaned projections

Test plan

Before merging

Final code review
Extended testing on test nodes (leave running for 24+ hours, verify TTL cleanup, error accumulation, spawner behavior)
Review all commit messages and squash if needed

The spawner had no awareness of block height or app expiration. Cancelled apps (expire=100) remained in globalAppsInformation for up to 3+ hours until the next expireGlobalApplications sweep, during which the spawner would install them on new nodes.

Add a PON-fork-aware expiration filter to the aggregation pipeline that excludes apps expiring within 100 blocks (newMinBlocksAllowance). Filter runs before the $lookup join so expired apps never reach candidate selection.

Also fix two existing bugs:
- findIndex used >= instead of <=, popping deferred apps immediately instead of waiting their scheduled time
- Array.includes() with a callback always returned false (should be .some())

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
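
The two bugs mentioned in this commit are easy to reproduce in isolation. The array contents and the field names `name` and `runAt` below are made up for the demonstration and are not the spawner's real data.

```javascript
// Made-up deferred queue: appA is due, appB is scheduled for later.
const deferredApps = [
  { name: 'appA', runAt: 1000 },
  { name: 'appB', runAt: 5000 },
];
const currentTime = 2000;

// includes() compares the callback itself against each element, so it never matches.
const foundWithIncludes = deferredApps.includes((app) => app.name === 'appA'); // false
// some() actually invokes the callback on each element.
const foundWithSome = deferredApps.some((app) => app.name === 'appA'); // true

// With >= the first app whose time has NOT yet arrived is selected (index 1).
const wrongIndex = deferredApps.findIndex((app) => app.runAt >= currentTime);
// With <= the first app that is actually due is selected (index 0).
const dueIndex = deferredApps.findIndex((app) => app.runAt <= currentTime);
```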

Replace timer/loop-based app hash sync with an event-driven architecture to fix cancelled apps being installed on nodes with stale data.

AppSyncOrchestrator coordinates the sync lifecycle:
- Listens for blockEmitter (explorer synced) and peerManager threshold events
- Fetches temp messages from eligible peers on reconnect (new tempMessageSync P2P capability with binary 0x20 request + signed JSON response)
- Runs unified syncMissingHashes (multi-peer, no 95% threshold, no 30-min loop)
- Rebuilds globalAppsInformation AFTER sync via reindexGlobalAppsInformation
- Emits spawnerReady when all readiness conditions met
- Pauses spawner if peers drop below degraded threshold (hysteresis)

Other changes:
- FluxPeerManager extends EventEmitter with peerThresholdReached/peersBelowThreshold
- X-Flux-Uptime header in WebSocket handshake for peer uptime tracking
- messageStore triggers immediate promotion when temp message arrives for known hash
- peerNotification.checkAndNotifyPeersOfRunningApps uses direct imports
- Remove duplicate continuousFluxAppHashesCheck from messageVerifier
- Remove timer-based spawner start and 30-min hash sync loop from serviceManager
- Fix syncthingHealthMonitor test to match broadcast change (sendMessage=true)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add #syncInProgress flag to prevent concurrent syncs in orchestrator
- Add per-peer 5-minute cooldown on temp message requests
- Cap incoming temp sync messages at 500
- Remove dead ponFork variable and unused config import
- Collapse identical conditional branches in #runHashSync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Store signed apprunning broadcasts in new fluxapprunningbroadcasts collection alongside existing zelappslocation. On peer threshold, request signed broadcasts from capable peers via binary 0x21 with sinceTimestamp for delta sync. Sender streams results using cursor-based chunking (2000/chunk) with done flag. Receiver verifies each inner broadcast signature individually then bulk writes both collections. Location readiness now data-driven (appRunningSyncComplete) with block-count fallback for legacy networks.

Also updates temp message sync (0x20) to include sinceTimestamp and use cursor-based streaming. Removes arbitrary 500-message cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
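
The chunking with a done flag described in this commit can be sketched as an async generator over any async iterable (MongoDB driver cursors are async iterable). The function name and the `broadcasts` field are hypothetical; only the 2000-record chunk size and the `done` flag come from the text.

```javascript
// Hypothetical streaming helper: yields { broadcasts, done } chunks.
async function* chunkWithDoneFlag(cursor, chunkSize = 2000) {
  let chunk = [];
  for await (const doc of cursor) {
    // Flush a full chunk only when another document follows it,
    // so the last chunk (full or not) is the one that carries done: true.
    if (chunk.length === chunkSize) {
      yield { broadcasts: chunk, done: false };
      chunk = [];
    }
    chunk.push(doc);
  }
  yield { broadcasts: chunk, done: true };
}
```

With 3336 records this yields one chunk of 2000 with `done: false` and one of 1336 with `done: true`; an empty result still yields a single empty chunk so the receiver learns the sync finished. At most one chunk is held in memory at a time.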

Extract FLUX_CAPABILITIES to FluxPeerSocket.js as single source of truth. Both the WebSocket server (inbound upgrade response) and outbound client (upgrade request) now use the same list. Also adds X-Flux-Uptime to server response headers so peers learn our uptime regardless of connection direction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add signed appinstalling broadcast sync (0x22) using the same cursor-based streaming pattern as apprunning sync. Stores signed broadcasts in fluxappinstallingbroadcasts collection for sync, existing appsinstallinglocations unchanged. Unify capability names: replace tempMessageSync + appRunningSync with single appStateSync capability covering all three sync types (0x20 temp messages, 0x21 apprunning, 0x22 appinstalling). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When storeAppRunningMessage removes entries from appsinstallinglocations (app finished installing), also remove the corresponding signed broadcast from fluxappinstallingbroadcasts to keep both collections consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Only send installing broadcasts newer than 5 minutes. The TTL on the collection is 15 minutes but the validity window is 5 minutes — records between 5-15 minutes old are stale (install likely finished/failed) and would be rejected by the receiver anyway. Filtering on the sender avoids wasting bandwidth on records that can't be stored. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sender filters to 15 minutes (matching TTL) instead of 5 minutes. Receiver signed storage and batch store also use 15 minutes. The gossip path retains 5-minute validity — gossip messages are relayed and could be stale, but sync is a point-in-time snapshot of current DB state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix broken TTL on appsInstallingErrorsLocations: index was on `cachedAt` (never set), changed to `broadcastedAt` with 24-hour expiry. Remove null-expiry logic that prevented 98% of records from ever expiring. Remove startup wipe for consistency with other ephemeral collections. Change gossip validity from 60min to 5min (sync path accepts 24 hours). Add signed error broadcast sync (0x23) following the same pattern as apprunning (0x21) and appinstalling (0x22). Sender filters to 24-hour window. Remove unsigned HTTP fetch (getPeerAppsInstallingErrorMessages) that was trusting one peer's aggregate data without verification. Enable spawner to check network-wide error count per app hash. If 5+ distinct nodes have reported install failures for the same hash, skip the app. Combined with the 7-day in-memory spawnErrorsLongerAppCache, this means: broken specs are skipped network-wide within minutes, retried once per day as errors expire from TTL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add orchestrator tests for apprunning sync requests, location readiness with appRunningSyncComplete flag, block-count fallback, and degradation reset. Add peerCodec tests for all four sync message types (0x20-0x23) with timestamp roundtrip verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change eligibility threshold from 60s (testing) to 7200s (2 hours) for all 4 sync types. A peer needs at least 2 hours of uptime to have accumulated a full cycle of apprunning broadcasts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change apprunning gossip store validity from 125 min to 5 min — consistent with installing and errors. The broadcast relay check already limits to ~5 minutes so no legitimate gossip message should be older. Change peer uptime threshold from 7200s to 7500s to match the apprunning TTL exactly. A peer needs a full TTL cycle (2h5m) to have complete data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Filter apprunning broadcasts to 125 minutes on sender side, consistent with installing and errors. Prevents sending records that expired but haven't been cleaned up by MongoDB's TTL thread. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

reindexGlobalAppsInformation was wiping the entire error locations collection on every reindex while leaving the signed broadcasts untouched, causing a growing mismatch (854 vs 301 on sandwich). expireGlobalApplications and updateAppSpecifications only removed from error locations, not broadcasts. All three now operate on both collections consistently. Explorer rescan also drops both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The spawner was starting after syncing from just 1 peer: any single appStateSync peer responding with done:true set appRunningSyncComplete globally, bypassing the 125-minute block-count fallback entirely.

Redesigned sync readiness:
- Merged duplicate getEligibleTempSyncPeers/getEligibleAppRunningSyncPeers into getEligibleSyncPeers with missedPongs===0 check and shuffle
- Orchestrator tracks asked peers per cycle and sync completions per type
- Requires all 3 sync types (apprunning, appinstalling, apperrors) to complete from 3 peers before setting stateSyncComplete
- Falls back to block-count timer if <3 eligible peers exist
- 2-minute timeout falls back if syncs don't complete
- Replaced globalState.appRunningSyncComplete with orchestrator-internal state, wired via setOnSyncComplete callback
- Renamed isLocationReady to isStateSyncReady
- Extracted peer uptime threshold to config (appSyncMinPeerUptime), set to 60s for testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
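
A minimal sketch of the readiness rule from this commit follows: state sync counts as complete only once every sync type has finished from the required number of peers. The class and method names are hypothetical, and it assumes completions are counted per type by distinct peer (the commit message does not say whether the same three peers must serve all types).

```javascript
// Hypothetical tracker for "all 3 sync types complete from 3 peers".
class SyncReadinessTracker {
  constructor(requiredPeers = 3, syncTypes = ['apprunning', 'appinstalling', 'apperrors']) {
    this.requiredPeers = requiredPeers;
    this.completions = new Map(syncTypes.map((type) => [type, new Set()]));
  }

  // Record a done:true response; returns whether state sync is now complete.
  recordCompletion(syncType, peerKey) {
    const peers = this.completions.get(syncType);
    if (!peers) {
      throw new Error(`Unknown sync type: ${syncType}`);
    }
    peers.add(peerKey); // a Set ignores repeat completions from the same peer
    return this.isComplete();
  }

  isComplete() {
    return [...this.completions.values()].every((peers) => peers.size >= this.requiredPeers);
  }
}
```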

Set to 1 in default config for testing on 2-node networks, 3 in test config for unit tests. Also adds appSyncMinPeerUptime and other sync config values to the test config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The signed broadcasts collection was upserting by {ip} only, so v1 gossip messages (one per app) from the same IP overwrote each other. Only the last app's broadcast survived — losing data for sync responses. Changed to upsert by {ip, data.name} for v1 messages, keeping {ip} for v2 messages which already contain all apps. Updated unique index to match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

peerNotification was sending v1 (per-app) for single-app nodes and v2 (apps array) for multi-app nodes, with duplicate local storage. appInstaller was also sending v1. Simplified both to always send v2, removing the v1/v2 branching on the sending side. Receivers still handle v1 from old nodes on the network. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Aggregation function that produces the same output shape as appLocation() but reads from the signed broadcasts collection. Handles both v1 (per-app) and v2 (multi-app) documents via $facet, deduplicates by {name, ip} taking the latest broadcastedAt. Adds data.apps.name index for v2 app queries. Also reverts the v1 cleanup logic — v1 docs coexist with v2 and TTL out naturally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
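
The deduplication this commit describes can be illustrated in plain JavaScript, outside MongoDB. This is not the `$facet` aggregation itself, just an in-memory equivalent of the same rule: flatten v1 (`data.name`) and v2 (`data.apps`) documents, then keep the latest `broadcastedAt` per `{name, ip}`. The function name and the exact output shape are assumptions.

```javascript
// Hypothetical in-memory equivalent of the dedupe rule.
function latestLocationsFromBroadcasts(broadcasts) {
  const latest = new Map();
  for (const doc of broadcasts) {
    // v2 documents carry an apps array, v1 documents carry a single name.
    const names = Array.isArray(doc.data.apps)
      ? doc.data.apps.map((app) => app.name)
      : [doc.data.name];
    for (const name of names) {
      const key = `${name}|${doc.ip}`;
      const previous = latest.get(key);
      if (!previous || doc.broadcastedAt > previous.broadcastedAt) {
        latest.set(key, { name, ip: doc.ip, broadcastedAt: doc.broadcastedAt });
      }
    }
  }
  return [...latest.values()];
}
```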

…orage

Major refactoring of app state broadcasting:

1. Split peerNotification into broadcast-only + recovery:
- Stopped-app recovery logic (recreateMissingContainers, handleMissingMasterSlaveContainer, checkStoppedApps) moved to stoppedAppsRecovery.js where it belongs
- peerNotification.js now only handles broadcasting + timer
- Breaks circular dep: appInstaller no longer imports peerNotification
- appInstaller uses setOnInstallComplete callback wired by serviceManager

2. Always send v2 apprunning broadcasts:
- peerNotification always sends v2 with full apps array
- appInstaller triggers full checkAndNotifyPeersOfRunningApps via callback
- Removes v1/v2 branching on the sending side

3. Fix v1 broadcast storage overwriting:
- v1 broadcasts (from old nodes) now upsert by {ip, data.name} not {ip}
- Prevents losing per-app data when multiple v1 messages arrive
- Updated unique index to {ip, data.name}

4. Broadcast timer self-resets:
- Moved interval from orchestrator to peerNotification
- resetBroadcastInterval called in finally block
- Any call to checkAndNotifyPeersOfRunningApps resets the 1h timer

5. appLocationFromBroadcasts aggregation view:
- New function produces same output shape as appLocation()
- Handles both v1 and v2 via $facet, dedupes by {name, ip}
- Added data.apps.name index for v2 app queries
- Prepares for replacing locations collection with broadcast collection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-05-01T21:04:57Z

Codecov Report

❌ Patch coverage is 35.02890% with 562 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.08%. Comparing base (8412550) to head (f2eb3f3).
⚠️ Report is 32 commits behind head on development.

Files with missing lines	Patch %	Lines
...k/src/services/appLifecycle/stoppedAppsRecovery.js	5.26%	126 Missing ⚠️
ZelBack/src/services/fluxCommunication.js	3.66%	105 Missing ⚠️
ZelBack/src/services/appMessaging/messageStore.js	20.38%	82 Missing ⚠️
...ck/src/services/fluxCommunicationMessagesSender.js	2.38%	82 Missing ⚠️
...ck/src/services/appMessaging/appHashSyncService.js	48.71%	60 Missing ⚠️
ZelBack/src/services/utils/FluxPeerManager.js	26.00%	37 Missing ⚠️
...Back/src/services/appMessaging/peerNotification.js	13.04%	20 Missing ⚠️
ZelBack/src/services/appLifecycle/appSpawner.js	34.78%	15 Missing ⚠️
...elBack/src/services/appDatabase/registryManager.js	7.14%	13 Missing ⚠️
...k/src/services/appMessaging/appSyncOrchestrator.js	92.39%	13 Missing ⚠️
... and 4 more

Additional details and impacted files

@@               Coverage Diff               @@
##           development    #1724      +/-   ##
===============================================
+ Coverage        55.02%   55.08%   +0.05%     
===============================================
  Files              135      139       +4     
  Lines            27734    28442     +708     
===============================================
+ Hits             15260    15666     +406     
- Misses           12474    12776     +302


broadcastMessageToAll now returns the signed object. peerNotification stores it locally via storeSignedAppRunningBroadcast, closing the gap where a node's own running apps appeared in the locations collection but not in the signed broadcasts collection until gossip relayed the message back. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The batch store path (used by sync responses) was doing unconditional $set on location collections, which could overwrite newer data from the gossip path or the node's own broadcast. Use aggregation pipeline updates with $cond to only write fields when the incoming broadcastedAt is newer than what's already in the DB. Affects all three batch stores: running, installing, and errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
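
A pipeline-style update of the kind this commit describes might be built as below. It is a sketch under assumptions: the helper name is hypothetical, the stored timestamp field is assumed to be `broadcastedAt`, and the real field list in messageStore.js is not reproduced. Within a single `$set` stage every expression reads the document as it was before the update, so each `$cond` compares against the stored `broadcastedAt`.

```javascript
// Hypothetical builder: write each field only if the incoming record is newer.
function buildNewerOnlyUpdate(incoming) {
  const incomingIsNewer = {
    $gt: [{ $literal: incoming.broadcastedAt }, '$broadcastedAt'],
  };
  const fields = {};
  for (const [field, value] of Object.entries(incoming)) {
    fields[field] = {
      // Keep the stored value ("$field") unless the incoming record wins.
      $cond: [incomingIsNewer, { $literal: value }, `$${field}`],
    };
  }
  return [{ $set: fields }];
}

// Intended use with the MongoDB driver (pipeline updates need MongoDB 4.2+):
//   collection.updateOne({ name, ip }, buildNewerOnlyUpdate(doc), { upsert: true });
```

On an upsert with no existing document the stored `broadcastedAt` is missing, which should compare as older than any date, so all fields are written.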

…node When a v2 broadcast arrives with fewer apps than previously stored, the location collection kept orphaned entries for removed apps. Now both the gossip path and batch sync path remove location entries for apps no longer in the v2 broadcast's app list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The batch sync upserts filter by {name, ip} but the collection only had an index on {name}. For popular apps running on ~100 nodes, each upsert examined ~99 docs to find 1. The compound index reduces this to a single key lookup — explain shows 99 docs examined → 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When a v2 broadcast arrives for an IP, any v1 signed docs for apps no longer in the v2's app list are now removed. Applies to both the gossip path (storeSignedAppRunningBroadcast) and batch sync path. Ensures the signed broadcast collection stays consistent with the location collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The cleanup query filters by ip first with $nin on name. The previous {name, ip} order couldn't use the index prefix for ip-first lookups, causing 1131 key scans per IP. Reversing to {ip, name} reduces cleanup to 4 key scans per IP. The upsert query {name, ip} still uses this index efficiently (MongoDB handles field order). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The signed broadcast stores on the gossip path used the full TTL (125 min / 15 min / 24 hours) while the location stores used 5 min. Stale gossip (>5 min) would be stored in the signed collection but rejected from the location collection, causing inconsistency. Now both use 5 min on the gossip path. The batch sync path retains full TTL validity since sync is a point-in-time snapshot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gossip arrives in unpredictable order — a stale v2 relay can trigger cleanup that removes valid fresher v1 data. Remove all cleanup from the gossip path (storeSignedAppRunningBroadcast, storeAppRunningMessage). Batch sync cleanup is safe because it processes a consistent snapshot. Add broadcastedAt condition to cleanup deletes so concurrent gossip with fresher data survives. Merge upserts and cleanup into a single bulkWrite per collection for atomicity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three changes to eliminate orphaned entries between collections:

1. break → continue in storeAppRunningMessage loop: for v2 messages with multiple apps, skip apps that already have current data but keep processing the rest. Previously broke out of the entire loop.
2. storeAppRunningMessage returns { stored, rebroadcast } instead of true/false. The gossip handler only calls storeSignedAppRunningBroadcast when stored is true, ensuring both collections accept or reject together.
3. Remove redundant 5-minute gossip validity check from storeSignedAppRunningBroadcast. It is now gated on the location store's acceptance, eliminating the timing edge where one store accepts at the boundary and the other rejects milliseconds later.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The sigterm handler was mutating broadcastedAt on location records to force 7-minute TTL expiry. This broke the data contract — broadcastedAt is derived from signed data and should never change. Stale gossip could also overwrite the sigterm by passing the "is newer" check against the fake broadcastedAt value. Switch all 6 ephemeral collections to expireAt-based TTL (expireAt:0). expireAt is operational metadata we control, not part of the signed payload. Sigterm now sets expireAt = now + 7min on both locations and signed broadcasts without touching broadcastedAt. Also: split gossip validity (5min) from record expiry into named constants, add missing expireAt to error stores, fix empty-apps v2 handler to clean up signed broadcasts with broadcastedAt guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nodeStatusMonitor and storeAppRemovedMessage deleted from zelappslocation without touching fluxapprunningbroadcasts, leaving orphaned signed broadcasts (~44 per 20-minute monitor cycle). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- storeAppRemovedMessage: $addToSet excludedApps on v2 broadcast docs so the derived view skips removed apps without mutating signed data
- storeSignedAppRunningBroadcast + batch sync: $unset excludedApps when a newer broadcast upserts (clears stale exclusions)
- appLocationFromBroadcasts: filter out excluded apps after v2 unwind
- reindexGlobalAppsLocation: also drop running broadcasts collection
- explorer rescan: also drop running + installing broadcasts
- Export handleMissingMasterSlaveContainer from stoppedAppsRecovery
- Fix all 10 CI test failures, add excludedApps tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MorningLightMountain713 · 2026-05-04T07:20:32Z

Superseded by #1726 which includes the event log approach.

MorningLightMountain713 requested review from Cabecinha84, TheTrunk, XK4MiLX and alihm April 30, 2026 08:31

MorningLightMountain713 and others added 12 commits April 30, 2026 18:11

fix: log verification failure reason in installing sync handler

2c12d2b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MorningLightMountain713 changed the title ~~fix: prevent spawner from installing expired/cancelled apps~~ feat: event-driven app state sync & spawner startup May 1, 2026

MorningLightMountain713 and others added 9 commits May 1, 2026 15:02

MorningLightMountain713 and others added 2 commits May 1, 2026 22:05

MorningLightMountain713 and others added 10 commits May 2, 2026 09:14

MorningLightMountain713 closed this May 4, 2026

MorningLightMountain713 wants to merge 34 commits into development from fix/spawner-expiration-check
