Skip to content

feat: event-driven app state sync & spawner startup#1724

Closed
MorningLightMountain713 wants to merge 34 commits into
developmentfrom
fix/spawner-expiration-check
Closed

feat: event-driven app state sync & spawner startup#1724
MorningLightMountain713 wants to merge 34 commits into
developmentfrom
fix/spawner-expiration-check

Conversation

@MorningLightMountain713
Copy link
Copy Markdown
Contributor

@MorningLightMountain713 MorningLightMountain713 commented Apr 30, 2026

Summary

Replaces the timer-based spawner startup (125-minute fixed wait) with an event-driven architecture that syncs all ephemeral app state from peers in seconds. A restarting node now reaches spawner readiness in ~30 seconds instead of 2+ hours. Also fixes a broken TTL on the install errors collection and enables the spawner to skip apps with network-wide install failures.

Key changes:

  • AppSyncOrchestrator: State machine coordinating hash sync, DB rebuild, and spawner startup via events instead of timers
  • Signed broadcast sync protocol: Four binary message types (0x20-0x23) for syncing temp messages, apprunning locations, appinstalling locations, and install errors from peers over WebSocket
  • Spawner expiration filter: Pipeline-level check prevents spawning expired/cancelled apps (the original bug fix)
  • Immediate promotion: Temp messages arriving for known hashes are promoted instantly instead of waiting for the 30-minute sweep
  • Network-wide error check: Spawner skips apps with 5+ install failures across distinct nodes
  • Install errors TTL fix: Broken cachedAt index replaced with working broadcastedAt (24-hour TTL)

The Problem

Cancelled enterprise apps (expire: 100) were being installed on nodes that restarted after the cancellation. Six root causes identified:

  1. No expiration check in the spawner aggregation pipeline
  2. Stale globalAppsInformation rebuilt at boot before hash sync
  3. Slow hash sync: 30-minute loop, 500 at a time, one peer
  4. 95% threshold skipping bulk sync for nodes down briefly
  5. Missed gossip: P2P gossip is fire-and-forget, no replay
  6. Timer-based spawner: starts on fixed clock regardless of DB state

Additionally, the install errors collection had a broken TTL (index on cachedAt which was never set — 19,000+ records accumulating indefinitely on production nodes) and an unsigned HTTP bulk fetch (getPeerAppsInstallingErrorMessages) that trusted one peer's aggregate data without verification.

Architecture

Before (timer-based)

T+0        Boot, rebuild DB from incomplete data
T+15-30m   Hash sync starts (30-min loop)
T+62m      Enterprise spawner (TIMER) — may use stale data
T+125m     Non-enterprise spawner (TIMER) — may use stale data

After (event-driven)

T+0        Boot, AppSyncOrchestrator starts
T+~30s     Peer threshold (12) → sync all 4 data types from peers
T+~31s     All syncs complete, location data ready
T+varies   Explorer syncs → hash sync → DB rebuild
T+varies   Node confirmed → spawnerReady

State Machine

INITIALIZING ──► SYNCING ──► READY
                   ▲            |
                   |            v
                RESYNCING ◄── DEGRADED

Signed Broadcast Sync Protocol

Message Code Data Chunk Size
Temp Messages 0x20 App registrations/updates not yet on-chain 2000
App Running 0x21 Node → running apps mapping 2000
App Installing 0x22 In-progress installations 2000
Install Errors 0x23 Failed installations per node per app hash 2000

How it works

  1. Receiver sends 9-byte binary request: [type:1][sinceTimestamp:8]
  2. Sender opens MongoDB cursor (consistent snapshot), streams in chunks of 2000 — caps memory usage regardless of dataset size
  3. Each chunk is a signed broadcast containing an array of inner signed broadcasts
  4. Receiver verifies each inner broadcast signature against deterministic node list
  5. Verified broadcasts bulk-written to both signed and location collections
  6. Final chunk has done: true → receiver knows sync is complete

Security

  • Two-layer verification: Outer broadcast proves a real node sent the response. Inner broadcasts are individually verified — pubkey checked against deterministic node list, signature verified via bitcoinjs-message
  • Rate limiting: 5-minute cooldown per peer per sync type
  • Clock offset adjustment: sinceTimestamp adjusted by peer.remoteClockOffsetMs before sender queries
  • No trust in aggregates: Unlike the old getPeerAppsInstallingErrorMessages (which fetched unsigned data via HTTP from one peer), every synced record is cryptographically verified back to its originating node

Capability

Single appStateSync capability covers all four sync types. Defined once in FluxPeerSocket.js as FLUX_CAPABILITIES, used by both WebSocket client (outbound request headers) and server (inbound response headers). Previously capabilities were duplicated in two files with different values — fixed.

Sender-Side Filtering

Each sync type filters by validity on the sender side to avoid sending records that have technically expired but haven't been cleaned up yet by MongoDB's TTL thread (~60s lag):

  • Running: Only sends records newer than 125 minutes (matches TTL)
  • Installing: Only sends records newer than 15 minutes (matches TTL)
  • Install errors: Only sends records newer than 24 hours (matches TTL)

Dual-Collection Storage

Each ephemeral data type stored in two collections:

  • Signed collection (new): Full broadcast with pubKey, signature, data — used for serving sync requests
  • Location collection (existing, unchanged): Existing flat rows — used by all existing queries (spawner, advancedWorkflows, etc.)

Both written on every gossip message. Installing collections cleaned up when app transitions to running.

Why duplicate instead of replacing? There are 15+ direct queries against the existing location collections scattered across the codebase (messageStore, peerNotification, fluxCommunication, nodeStatusMonitor, stoppedAppsRecovery, advancedWorkflows, appController, syncthingMonitor). A separate branch centralizes all these queries through registryManager. Changing the document shape in this PR would conflict with that work. The signed collections add ~1.7MB overhead (running) — trivial. Once the centralization branch is rebased on top of this, the dual-collection pattern can be collapsed into a single collection with an aggregation-based appLocation() query.

Signed Location TTL
fluxapprunningbroadcasts zelappslocation 125 min
fluxappinstallingbroadcasts appsinstallinglocations 15 min
fluxappinstallingerrorsbroadcasts appsInstallingErrorsLocations 24 hours

Validity Windows

Data Type Gossip Validity Sync Validity TTL
Running 5 min 125 min 125 min
Installing 5 min 15 min 15 min
Install Errors 5 min 24 hours 24 hours

Gossip validity is short because gossip messages are relayed — stale messages indicate network issues. Sync validity matches TTL because sync is a point-in-time snapshot of the sender's DB.

Install Errors Fix

What was broken

  • TTL index on cachedAt — field never set in any document, nothing ever expired
  • 19,304 records on production (chud), oldest 3 weeks, 98% had expireAt: null
  • Null-expiry logic: when 5+ errors accumulated for same app+hash, set expireAt: null (persist forever)
  • removeDocumentsFromCollection({}) wiped collection on every restart (inconsistent with other ephemeral collections)
  • getPeerAppsInstallingErrorMessages fetched unsigned error data via HTTP from one random peer

What was fixed

  • TTL index changed from cachedAt to broadcastedAt with 86400s (24-hour) expiry
  • Removed null-expiry logic
  • Removed startup wipe (consistent with other collections, TTL handles cleanup)
  • Removed getPeerAppsInstallingErrorMessages — replaced by signed 0x23 sync
  • Gossip validity reduced from 60 minutes to 5 minutes (consistent with installing messages)

Spawner network-wide error check

Previously commented out (appSpawner.js:336-340). Re-enabled with new logic:

const errorCount = await registryManager.countAppInstallingErrors(appHash);
if (errorCount >= 5) { // 5+ distinct nodes failed → broken spec
  globalState.spawnErrorsLongerAppCache.set(appHash, '');
  throw new Error(`...has ${errorCount} network-wide install failures, skipping`);
}

Two-layer error handling:

  • Network DB (24h TTL): Prevents untried nodes from wasting time. ~5 nodes retry per day as errors expire
  • Per-node in-memory (7 days): spawnErrorsLongerAppCache prevents nodes that already failed from retrying for a week

Other Changes

  • Capability unification: FLUX_CAPABILITIES extracted to FluxPeerSocket.js as single source of truth (was duplicated between socketServer.js and fluxCommunication.js with different values)
  • X-Flux-Uptime: Now sent in both inbound and outbound connection headers (was outbound only)
  • peerNotification.js: Simplified from 10-parameter injection to direct globalState imports
  • messageVerifier.js: Removed ~165 lines of duplicate code
  • appHashSyncService.js: Rewritten — multi-peer parallel fetch, no 95% threshold skip
  • Sync guard: #syncInProgress flag prevents overlapping syncs in orchestrator

Rollout Behavior

During rollout (mixed network): Nodes running new code will find no appStateSync peers and fall back to the block-count timer — 125 minutes for non-enterprise, 62 minutes for enterprise. This is the same behavior as today. All other improvements (hash sync, DB rebuild sequencing, spawner expiration filter, immediate promotion, network error check) apply immediately regardless.

After full network upgrade: Restarting nodes sync all location data from peers in ~1.5 seconds. Spawner can start as soon as explorer syncs + hash sync + DB rebuild completes — typically 2-5 minutes after restart instead of 2+ hours.

Test Results

Tested on two dedicated test nodes.

Test 1: Temp Message Sync

  • Stopped sandwich, registered 2 test apps while it was down
  • Restarted sandwich, peered to squidward
  • Sandwich received 3 temp messages (chunkymonkey + 2 test apps), all processed
  • Verified via /apps/temporarymessages API

Test 2: App Running Sync (full, from clean slate)

handleAppRunningSyncResponse - Received 2000 broadcasts (done: false)
handleAppRunningSyncResponse - Received 1336 broadcasts (done: true)
handleAppRunningSyncResponse - Stored 2000 of 2000 verified broadcasts
handleAppRunningSyncResponse - Stored 1335 of 1335 verified broadcasts

1 broadcast failed: nodeNotFound (node dropped off network between storage and verification — expected)

Test 3: App Installing Sync

handleAppInstallingSyncResponse - Received 57 broadcasts (done: true)
handleAppInstallingSyncResponse - Stored 57 of 57 verified broadcasts

Test 4: Install Errors Sync

handleAppInstallingErrorsSyncResponse - Received 3 broadcasts (done: true)
handleAppInstallingErrorsSyncResponse - Stored 3 of 3 verified broadcasts

Test 5: Collection Consistency

running broadcasts: 3335  apps: 5165  locations: 5166  match: ~✓ (1 from gossip)
installing broadcasts: 56  locations: 56  match: true
error broadcasts: 3  error locations: 3  match: true

Test 6: Spawner Error Check

Inserted 6 test error records for a fake hash. Verified:

countDocuments({hash: "testhash..."}) = 6 → would skip (>=5): true
countDocuments({hash: "nonexistent"}) = 0 → would skip (>=5): false

Test 7: Capability Advertisement

Fixed and verified: both inbound (server response) and outbound (client request) now advertise appStateSync. Previously server-side was missing new capabilities.

Test 8: Installing Validity Window

Initial test showed 132 received, 44 stored. Root cause: sender was sending records up to 15 minutes old (TTL window) but receiver only accepted 5 minutes (gossip validity). Fixed: sender now filters to validity window, 100% stored on subsequent tests.

Performance

T+0.0s   Boot
T+28.0s  Peer threshold reached (12 peers)
T+28.0s  All 4 sync requests sent
T+28.2s  Temp: 1 received, processed
T+28.3s  Installing: 57 received, verified, stored
T+28.3s  Errors: 3 received, verified, stored
T+28.7s  Running chunk 1: 2000 received
T+29.1s  Running chunk 2: 1336 received
T+29.7s  All syncs complete

Total sync time: ~1.5 seconds for full network state

Files Changed

File Change
config/default.js New collection names, peer thresholds
utils/peerCodec.js 0x20 sinceTimestamp, 0x21, 0x22, 0x23 encode/decode
utils/FluxPeerSocket.js FLUX_CAPABILITIES constant
utils/FluxPeerManager.js Binary handlers for 0x21-0x23, eligibility methods, threshold events
utils/globalState.js appRunningSyncComplete, spawnerPaused
lib/socketServer.js Uses shared FLUX_CAPABILITIES, adds X-Flux-Uptime
appMessaging/messageStore.js Signed broadcast storage, batch operations, cleanup, error TTL fix
appMessaging/appSyncOrchestrator.js Event-driven coordinator
appMessaging/appHashSyncService.js Rewritten hash sync
appMessaging/messageVerifier.js Removed duplicate code
appMessaging/peerNotification.js Direct imports, simplified signature
fluxCommunication.js Sync handlers, dispatch routes, capability, X-Flux-Uptime
fluxCommunicationMessagesSender.js Cursor-based streaming responders
serviceManager.js Collection indexes, orchestrator wiring, TTL fixes
appLifecycle/appSpawner.js Event-driven startup, pause/resume, network error check
appLifecycle/advancedWorkflows.js Removed getPeerAppsInstallingErrorMessages
appDatabase/registryManager.js countAppInstallingErrors, cleaned projections

Test plan

  • Deploy on 2 test nodes, verify full sync lifecycle from clean slate
  • Verify all 4 sync types: temp messages, running, installing, errors
  • Verify collection consistency after sync and gossip
  • Verify installing cleanup: both collections cleaned on app→running transition
  • Verify capability advertisement in both connection directions
  • Verify sender-side validity filtering (installing, errors)
  • Verify spawner error check logic (count >= 5 threshold)
  • Verify spawner starts after sync instead of after 125-minute timer (requires full network upgrade)
  • Verify cancelled app is not reinstalled after sync
  • Verify fallback: no appStateSync peers → block-count timer activates
  • Run on mixed network (new + old nodes) — new nodes fall back gracefully
  • Pre-PR: Change uptime threshold from 60s to 7500s (done, 4 locations — matches apprunning TTL)

Before merging

  • Final code review
  • Extended testing on test nodes (leave running for 24+ hours, verify TTL cleanup, error accumulation, spawner behavior)
  • Review all commit messages and squash if needed

The spawner had no awareness of block height or app expiration. Cancelled
apps (expire=100) remained in globalAppsInformation for up to 3+ hours
until the next expireGlobalApplications sweep, during which the spawner
would install them on new nodes.

Add a PON-fork-aware expiration filter to the aggregation pipeline that
excludes apps expiring within 100 blocks (newMinBlocksAllowance). Filter
runs before the $lookup join so expired apps never reach candidate selection.

Also fix two existing bugs:
- findIndex used >= instead of <=, popping deferred apps immediately
  instead of waiting their scheduled time
- Array.includes() with a callback always returned false (should be .some())

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MorningLightMountain713 and others added 12 commits April 30, 2026 18:11
Replace timer/loop-based app hash sync with an event-driven architecture
to fix cancelled apps being installed on nodes with stale data.

AppSyncOrchestrator coordinates the sync lifecycle:
- Listens for blockEmitter (explorer synced) and peerManager threshold events
- Fetches temp messages from eligible peers on reconnect (new tempMessageSync
  P2P capability with binary 0x20 request + signed JSON response)
- Runs unified syncMissingHashes (multi-peer, no 95% threshold, no 30-min loop)
- Rebuilds globalAppsInformation AFTER sync via reindexGlobalAppsInformation
- Emits spawnerReady when all readiness conditions met
- Pauses spawner if peers drop below degraded threshold (hysteresis)

Other changes:
- FluxPeerManager extends EventEmitter with peerThresholdReached/peersBelowThreshold
- X-Flux-Uptime header in WebSocket handshake for peer uptime tracking
- messageStore triggers immediate promotion when temp message arrives for known hash
- peerNotification.checkAndNotifyPeersOfRunningApps uses direct imports
- Remove duplicate continuousFluxAppHashesCheck from messageVerifier
- Remove timer-based spawner start and 30-min hash sync loop from serviceManager
- Fix syncthingHealthMonitor test to match broadcast change (sendMessage=true)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add #syncInProgress flag to prevent concurrent syncs in orchestrator
- Add per-peer 5-minute cooldown on temp message requests
- Cap incoming temp sync messages at 500
- Remove dead ponFork variable and unused config import
- Collapse identical conditional branches in #runHashSync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store signed apprunning broadcasts in new fluxapprunningbroadcasts
collection alongside existing zelappslocation. On peer threshold,
request signed broadcasts from capable peers via binary 0x21 with
sinceTimestamp for delta sync. Sender streams results using cursor-
based chunking (2000/chunk) with done flag. Receiver verifies each
inner broadcast signature individually then bulk writes both
collections. Location readiness now data-driven (appRunningSyncComplete)
with block-count fallback for legacy networks.

Also updates temp message sync (0x20) to include sinceTimestamp and
use cursor-based streaming. Removes arbitrary 500-message cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract FLUX_CAPABILITIES to FluxPeerSocket.js as single source of truth.
Both the WebSocket server (inbound upgrade response) and outbound client
(upgrade request) now use the same list. Also adds X-Flux-Uptime to
server response headers so peers learn our uptime regardless of
connection direction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add signed appinstalling broadcast sync (0x22) using the same
cursor-based streaming pattern as apprunning sync. Stores signed
broadcasts in fluxappinstallingbroadcasts collection for sync,
existing appsinstallinglocations unchanged.

Unify capability names: replace tempMessageSync + appRunningSync
with single appStateSync capability covering all three sync types
(0x20 temp messages, 0x21 apprunning, 0x22 appinstalling).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When storeAppRunningMessage removes entries from
appsinstallinglocations (app finished installing), also remove
the corresponding signed broadcast from fluxappinstallingbroadcasts
to keep both collections consistent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only send installing broadcasts newer than 5 minutes. The TTL on the
collection is 15 minutes but the validity window is 5 minutes — records
between 5-15 minutes old are stale (install likely finished/failed) and
would be rejected by the receiver anyway. Filtering on the sender
avoids wasting bandwidth on records that can't be stored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sender filters to 15 minutes (matching TTL) instead of 5 minutes.
Receiver signed storage and batch store also use 15 minutes. The
gossip path retains 5-minute validity — gossip messages are relayed
and could be stale, but sync is a point-in-time snapshot of current
DB state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix broken TTL on appsInstallingErrorsLocations: index was on
`cachedAt` (never set), changed to `broadcastedAt` with 24-hour
expiry. Remove null-expiry logic that prevented 98% of records
from ever expiring. Remove startup wipe for consistency with
other ephemeral collections. Change gossip validity from 60min
to 5min (sync path accepts 24 hours).

Add signed error broadcast sync (0x23) following the same pattern
as apprunning (0x21) and appinstalling (0x22). Sender filters to
24-hour window. Remove unsigned HTTP fetch
(getPeerAppsInstallingErrorMessages) that was trusting one peer's
aggregate data without verification.

Enable spawner to check network-wide error count per app hash.
If 5+ distinct nodes have reported install failures for the same
hash, skip the app. Combined with the 7-day in-memory
spawnErrorsLongerAppCache, this means: broken specs are skipped
network-wide within minutes, retried once per day as errors
expire from TTL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add orchestrator tests for apprunning sync requests, location
readiness with appRunningSyncComplete flag, block-count fallback,
and degradation reset. Add peerCodec tests for all four sync
message types (0x20-0x23) with timestamp roundtrip verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change eligibility threshold from 60s (testing) to 7200s (2 hours)
for all 4 sync types. A peer needs at least 2 hours of uptime to
have accumulated a full cycle of apprunning broadcasts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MorningLightMountain713 MorningLightMountain713 changed the title fix: prevent spawner from installing expired/cancelled apps feat: event-driven app state sync & spawner startup May 1, 2026
MorningLightMountain713 and others added 9 commits May 1, 2026 15:02
Change apprunning gossip store validity from 125 min to 5 min —
consistent with installing and errors. The broadcast relay check
already limits to ~5 minutes so no legitimate gossip message
should be older.

Change peer uptime threshold from 7200s to 7500s to match the
apprunning TTL exactly. A peer needs a full TTL cycle (2h5m) to
have complete data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Filter apprunning broadcasts to 125 minutes on sender side,
consistent with installing and errors. Prevents sending records
that expired but haven't been cleaned up by MongoDB's TTL thread.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
reindexGlobalAppsInformation was wiping the entire error locations
collection on every reindex while leaving the signed broadcasts
untouched, causing a growing mismatch (854 vs 301 on sandwich).
expireGlobalApplications and updateAppSpecifications only removed
from error locations, not broadcasts. All three now operate on both
collections consistently. Explorer rescan also drops both.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spawner was starting after syncing from just 1 peer — any single
appStateSync peer responding with done:true set appRunningSyncComplete
globally, bypassing the 125-minute block-count fallback entirely.

Redesigned sync readiness:
- Merged duplicate getEligibleTempSyncPeers/getEligibleAppRunningSyncPeers
  into getEligibleSyncPeers with missedPongs===0 check and shuffle
- Orchestrator tracks asked peers per cycle and sync completions per type
- Requires all 3 sync types (apprunning, appinstalling, apperrors) to
  complete from 3 peers before setting stateSyncComplete
- Falls back to block-count timer if <3 eligible peers exist
- 2-minute timeout falls back if syncs don't complete
- Replaced globalState.appRunningSyncComplete with orchestrator-internal
  state, wired via setOnSyncComplete callback
- Renamed isLocationReady to isStateSyncReady
- Extracted peer uptime threshold to config (appSyncMinPeerUptime),
  set to 60s for testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set to 1 in default config for testing on 2-node networks, 3 in test
config for unit tests. Also adds appSyncMinPeerUptime and other sync
config values to the test config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The signed broadcasts collection was upserting by {ip} only, so v1
gossip messages (one per app) from the same IP overwrote each other.
Only the last app's broadcast survived — losing data for sync
responses. Changed to upsert by {ip, data.name} for v1 messages,
keeping {ip} for v2 messages which already contain all apps.
Updated unique index to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
peerNotification was sending v1 (per-app) for single-app nodes and
v2 (apps array) for multi-app nodes, with duplicate local storage.
appInstaller was also sending v1. Simplified both to always send v2,
removing the v1/v2 branching on the sending side. Receivers still
handle v1 from old nodes on the network.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aggregation function that produces the same output shape as
appLocation() but reads from the signed broadcasts collection.
Handles both v1 (per-app) and v2 (multi-app) documents via $facet,
deduplicates by {name, ip} taking the latest broadcastedAt. Adds
data.apps.name index for v2 app queries. Also reverts the v1
cleanup logic — v1 docs coexist with v2 and TTL out naturally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orage

Major refactoring of app state broadcasting:

1. Split peerNotification into broadcast-only + recovery:
   - Stopped-app recovery logic (recreateMissingContainers,
     handleMissingMasterSlaveContainer, checkStoppedApps) moved to
     stoppedAppsRecovery.js where it belongs
   - peerNotification.js now only handles broadcasting + timer
   - Breaks circular dep: appInstaller no longer imports peerNotification
   - appInstaller uses setOnInstallComplete callback wired by serviceManager

2. Always send v2 apprunning broadcasts:
   - peerNotification always sends v2 with full apps array
   - appInstaller triggers full checkAndNotifyPeersOfRunningApps via callback
   - Removes v1/v2 branching on the sending side

3. Fix v1 broadcast storage overwriting:
   - v1 broadcasts (from old nodes) now upsert by {ip, data.name} not {ip}
   - Prevents losing per-app data when multiple v1 messages arrive
   - Updated unique index to {ip, data.name}

4. Broadcast timer self-resets:
   - Moved interval from orchestrator to peerNotification
   - resetBroadcastInterval called in finally block
   - Any call to checkAndNotifyPeersOfRunningApps resets the 1h timer

5. appLocationFromBroadcasts aggregation view:
   - New function produces same output shape as appLocation()
   - Handles both v1 and v2 via $facet, dedupes by {name, ip}
   - Added data.apps.name index for v2 app queries
   - Prepares for replacing locations collection with broadcast collection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

❌ Patch coverage is 35.02890% with 562 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.08%. Comparing base (8412550) to head (f2eb3f3).
⚠️ Report is 32 commits behind head on development.

Files with missing lines Patch % Lines
...k/src/services/appLifecycle/stoppedAppsRecovery.js 5.26% 126 Missing ⚠️
ZelBack/src/services/fluxCommunication.js 3.66% 105 Missing ⚠️
ZelBack/src/services/appMessaging/messageStore.js 20.38% 82 Missing ⚠️
...ck/src/services/fluxCommunicationMessagesSender.js 2.38% 82 Missing ⚠️
...ck/src/services/appMessaging/appHashSyncService.js 48.71% 60 Missing ⚠️
ZelBack/src/services/utils/FluxPeerManager.js 26.00% 37 Missing ⚠️
...Back/src/services/appMessaging/peerNotification.js 13.04% 20 Missing ⚠️
ZelBack/src/services/appLifecycle/appSpawner.js 34.78% 15 Missing ⚠️
...elBack/src/services/appDatabase/registryManager.js 7.14% 13 Missing ⚠️
...k/src/services/appMessaging/appSyncOrchestrator.js 92.39% 13 Missing ⚠️
... and 4 more
Additional details and impacted files
@@               Coverage Diff               @@
##           development    #1724      +/-   ##
===============================================
+ Coverage        55.02%   55.08%   +0.05%     
===============================================
  Files              135      139       +4     
  Lines            27734    28442     +708     
===============================================
+ Hits             15260    15666     +406     
- Misses           12474    12776     +302     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

MorningLightMountain713 and others added 2 commits May 1, 2026 22:05
broadcastMessageToAll now returns the signed object. peerNotification
stores it locally via storeSignedAppRunningBroadcast, closing the gap
where a node's own running apps appeared in the locations collection
but not in the signed broadcasts collection until gossip relayed the
message back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batch store path (used by sync responses) was doing unconditional
$set on location collections, which could overwrite newer data from
the gossip path or the node's own broadcast. Use aggregation pipeline
updates with $cond to only write fields when the incoming broadcastedAt
is newer than what's already in the DB.

Affects all three batch stores: running, installing, and errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MorningLightMountain713 and others added 10 commits May 2, 2026 09:14
…node

When a v2 broadcast arrives with fewer apps than previously stored, the
location collection kept orphaned entries for removed apps. Now both the
gossip path and batch sync path remove location entries for apps no
longer in the v2 broadcast's app list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The batch sync upserts filter by {name, ip} but the collection only had
an index on {name}. For popular apps running on ~100 nodes, each upsert
examined ~99 docs to find 1. The compound index reduces this to a single
key lookup — explain shows 99 docs examined → 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a v2 broadcast arrives for an IP, any v1 signed docs for apps no
longer in the v2's app list are now removed. Applies to both the gossip
path (storeSignedAppRunningBroadcast) and batch sync path. Ensures the
signed broadcast collection stays consistent with the location collection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cleanup query filters by ip first with $nin on name. The previous
{name, ip} order couldn't use the index prefix for ip-first lookups,
causing 1131 key scans per IP. Reversing to {ip, name} reduces cleanup
to 4 key scans per IP. The upsert query {name, ip} still uses this
index efficiently (MongoDB handles field order).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The signed broadcast stores on the gossip path used the full TTL
(125 min / 15 min / 24 hours) while the location stores used 5 min.
Stale gossip (>5 min) would be stored in the signed collection but
rejected from the location collection, causing inconsistency. Now
both use 5 min on the gossip path. The batch sync path retains full
TTL validity since sync is a point-in-time snapshot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gossip arrives in unpredictable order — a stale v2 relay can trigger
cleanup that removes valid fresher v1 data. Remove all cleanup from
the gossip path (storeSignedAppRunningBroadcast, storeAppRunningMessage).

Batch sync cleanup is safe because it processes a consistent snapshot.
Add broadcastedAt condition to cleanup deletes so concurrent gossip
with fresher data survives. Merge upserts and cleanup into a single
bulkWrite per collection for atomicity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes to eliminate orphaned entries between collections:

1. break → continue in storeAppRunningMessage loop: for v2 messages
   with multiple apps, skip apps that already have current data but
   keep processing the rest. Previously broke out of the entire loop.

2. storeAppRunningMessage returns { stored, rebroadcast } instead of
   true/false. The gossip handler only calls storeSignedAppRunningBroadcast
   when stored is true, ensuring both collections accept or reject together.

3. Remove redundant 5-minute gossip validity check from
   storeSignedAppRunningBroadcast — it's now gated on the location
   store's acceptance, eliminating the timing edge where one store
   accepts at the boundary and the other rejects milliseconds later.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sigterm handler was mutating broadcastedAt on location records to
force 7-minute TTL expiry. This broke the data contract — broadcastedAt
is derived from signed data and should never change. Stale gossip could
also overwrite the sigterm by passing the "is newer" check against the
fake broadcastedAt value.

Switch all 6 ephemeral collections to expireAt-based TTL (expireAt:0).
expireAt is operational metadata we control, not part of the signed
payload. Sigterm now sets expireAt = now + 7min on both locations and
signed broadcasts without touching broadcastedAt.

Also: split gossip validity (5min) from record expiry into named
constants, add missing expireAt to error stores, fix empty-apps v2
handler to clean up signed broadcasts with broadcastedAt guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nodeStatusMonitor and storeAppRemovedMessage deleted from
zelappslocation without touching fluxapprunningbroadcasts, leaving
orphaned signed broadcasts (~44 per 20-minute monitor cycle).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- storeAppRemovedMessage: $addToSet excludedApps on v2 broadcast docs
  so the derived view skips removed apps without mutating signed data
- storeSignedAppRunningBroadcast + batch sync: $unset excludedApps
  when a newer broadcast upserts (clears stale exclusions)
- appLocationFromBroadcasts: filter out excluded apps after v2 unwind
- reindexGlobalAppsLocation: also drop running broadcasts collection
- explorer rescan: also drop running + installing broadcasts
- Export handleMissingMasterSlaveContainer from stoppedAppsRecovery
- Fix all 10 CI test failures, add excludedApps tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MorningLightMountain713
Copy link
Copy Markdown
Contributor Author

Superseded by #1726 which includes the event log approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant