[WIP] Node Manager: desired-state, reconciliation & multi-step tasks#3948

Draft
Apollon77 wants to merge 59 commits into main from node-manager

@Apollon77

Collaborator

Umbrella / integration branch for the Node Manager feature. Long-lived WIP PR — each phase merges into node-manager via its own sub-PR; this PR is the CI backstop for the integrated branch against main. Do not merge until all phases land and it's de-WIP'd.

Design doc: docs/superpowers/specs/2026-06-14-node-manager-desired-state-design.md

What this builds

A controller-side layer holding the intended state of fabric nodes (certs/keys/IPK, group keys, bindings, ACLs, group membership), with offline-tolerant reconciliation and multi-step orchestration (e.g. group-key rotation). Modeled on the JointFabric Datastore cluster (0x0752) and generalized.

Phase checklist

  • Phase 1 — Tier-1 desired-state model (@matter/node): ManagedItem/StatusEntry, ItemKind registry, capacity admission + typed errors, persistent DesiredStateBehavior on ClientNode (Node Manager Phase 1: Tier-1 desired-state model #3946)
  • Phase 1b — live drift detection via subscriptions (watchPaths)
  • Phase 2 — Reconciler + Task layer (new @matter/node-manager pkg): triggers, settle delay, verify-barrier, priority ordering, capacity reads, concrete ItemKinds; RotateGroupKey/MoveNodeToGroup; changeset rollback; 2-node rotation harness
  • Phase 3 — JFDS 0x0752 facade
  • Phase 4 — policy/optimizer (ACL merge, CAT grants, world-reconcile)
  • Phase 5 — developer API (Bindings/Groups/Scenes)

Carry-forwards tracked for Phase 2

  • Capacity cache → make ephemeral (refresh on connect / subscription re-establish), not persisted; cache limit only, derive used fresh at admission.
  • itemMapKey separator unescaped — escape/document before non-identifier keys.
  • Public-API surface: confirm public vs internal for the Tier-1 barrel exports when the Reconciler package forces the boundary.
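
The capacity carry-forward above can be sketched roughly as follows. This is a minimal illustration, not code from the branch: `EphemeralLimitCache`, `admit`, and `CapacityExceededError` are hypothetical names. It assumes that only the device-reported limit is cached (in memory) and that the used count is derived from the current intents on every admission check.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the @matter/node API.
class CapacityExceededError extends Error {
    readonly kind: string;
    readonly limit: number;

    constructor(kind: string, limit: number) {
        super(`Capacity exceeded for ${kind} (limit ${limit})`);
        this.kind = kind;
        this.limit = limit;
    }
}

// Holds device-reported limits in memory only; nothing here is persisted.
class EphemeralLimitCache {
    private limits: Record<string, number> = {};

    // Call on connect or when the subscription is re-established.
    refresh(kind: string, limit: number): void {
        this.limits[kind] = limit;
    }

    // Call when the peer goes away so stale limits are never reused.
    clear(): void {
        this.limits = {};
    }

    limitOf(kind: string): number | undefined {
        return this.limits[kind];
    }
}

// "used" is not cached: it is counted from the current intent keys on each call.
function admit(kind: string, limit: number, currentKeys: readonly string[], newKey: string): void {
    if (currentKeys.indexOf(newKey) !== -1) {
        return; // replacing an existing item consumes no additional slot
    }
    if (currentKeys.length + 1 > limit) {
        throw new CapacityExceededError(kind, limit);
    }
}
```

What admission should do while a limit has not been read yet is deliberately left out of this sketch.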

🤖 Generated with Claude Code

Apollon77 and others added 17 commits June 19, 2026 13:52
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…edStateBehavior

Drop the second `, unknown` type param from both Observable declarations in
DesiredStateBehavior.Events to match the codebase convention. Convert the two
test handlers that return number (Array.push) to block bodies so they type-check
under the stricter void return. Remove the inline WHAT comment above static schema.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add export * from "./desired-state/index.js" to system behavior barrel
- Add DesiredStateBehavior to ClientNode.RootEndpoint.with(...)
- New DesiredStatePersistenceTest: verifies registration on ClientNode and
  intent persistence across a node restart via shared Environment storage
- Update PEER1_STATE in ClientNodeTest to include desiredState initial state

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add test coverage for GroupCapacityExceededError mapping and remove
the unused unknownKindMapped fixture entry from the cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-emit a peer's BasicInformation softwareVersion change as a node-level
lifecycle signal, wired into Peers BasicInformation instrumentation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure status x mode branch-table function, internal to the package
(consumed by the reconciler via #-import, not part of the public API).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
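
A pure "status x mode" branch table of the kind this commit describes could look like the sketch below. The statuses, modes, and actions are hypothetical placeholders chosen for illustration; the actual sets are defined in the branch and may differ.

```typescript
// Hypothetical statuses, modes, and actions; the real sets live in the branch.
type ItemStatus = "pending" | "committed" | "deletePending";
type PassMode = "apply" | "verify";
type PlannedAction = "apply" | "remove" | "verify" | "skip";

// One row per status, one column per mode. Adding a status without a row is a
// compile-time error because the Record types are exhaustive.
const BRANCH_TABLE: Record<ItemStatus, Record<PassMode, PlannedAction>> = {
    pending: { apply: "apply", verify: "apply" },
    committed: { apply: "skip", verify: "verify" },
    deletePending: { apply: "remove", verify: "remove" },
};

// Pure: no I/O and no hidden state, so it can be unit-tested as a plain lookup.
function decide(status: ItemStatus, mode: PassMode): PlannedAction {
    return BRANCH_TABLE[status][mode];
}
```

Keeping the decision in a table rather than nested conditionals makes every status/mode combination visible at a glance and forces each new status to declare its behavior for every mode.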
Opt-in ServerNode-root behavior driving reachable peers toward intended
state: pure planActions decision + executeActions executor, six triggers
(settle, sweep, peers add/del, subscription-active, intent-change,
software-version-change), per-peer in-flight guard, capacity refresh, and
asyncDispose cleanup. Reachability gated on an active sustained subscription.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

- Wire per-peer trigger handlers directly on the peer ObserverGroup instead
  of via this.callback, so they are torn down on peer removal (no reactor
  leak on peer churn).
- #reachable mirrors NetworkClient.subscriptionActive: a sustained
  subscription counts as reachable only once active, not merely created.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconciler engine: @matter/node-manager package, ReconcilerBehavior
(opt-in ServerNode-root, planActions decision + executeActions executor,
six triggers, per-peer in-flight guard, capacity refresh, asyncDispose),
plus @matter/node ephemeral capacity cache and softwareVersionChanged signal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apollon77 and others added 12 commits June 22, 2026 13:43
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rity bands

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng, dispose race

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ard test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… tradeoff

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… unread

Pre-flight admission stays meaningful instead of failing open; the device
write remains the authoritative gate for over-capacity (RESOURCE_EXHAUSTED).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apollon77 and others added 5 commits June 22, 2026 18:13
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sing binding endpoint

Also document itemMapKey separator escaping and make its tests format-agnostic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…caping

Simpler than escape-then-join; the separator never appears in identifier/number
keys. Tests use the itemMapKey() helper rather than hardcoding the key format.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
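
One possible reading of "simpler than escape-then-join" is a helper that joins key parts with a fixed separator and refuses any part that contains it. The sketch below is an assumption, not the branch's `itemMapKey()`: the separator character, the signature, and the name `itemMapKeySketch` are all made up for illustration.

```typescript
// Assumption: the separator and the signature are illustrative only.
const KEY_SEPARATOR = "|";

function itemMapKeySketch(kind: string, ...parts: Array<string | number>): string {
    const all = [kind, ...parts.map(String)];
    for (const part of all) {
        // Identifier and number parts never contain the separator, so this
        // check only fires for unexpected free-form input.
        if (part.indexOf(KEY_SEPARATOR) !== -1) {
            throw new Error(`Key part "${part}" contains the separator "${KEY_SEPARATOR}"`);
        }
    }
    return all.join(KEY_SEPARATOR);
}
```

Tests written against such a helper can stay format-agnostic by comparing keys produced by the helper itself (equal inputs give equal keys, different inputs give different keys) instead of asserting on a literal key string.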
Apollon77 and others added 10 commits June 23, 2026 11:29
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re groupKey capacity

Trigger observers fired the async reconcile with a naked `void`, so a rejection
from a detached pass (e.g. a capacity command initiating an exchange against a
peer torn down mid-flight) surfaced as an unhandled rejection. Route triggers
through #fireTrigger, which logs and swallows. This also covers command-based
verify. With it, groupKey.capacity() (KeySetReadAllIndices) is safe to restore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…antics

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d.apply

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A verify pass that detects drift now re-applies in the same pass instead of
re-pending, so reconcile(verify) converges deterministically without relying on
a follow-up trigger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the InFlightGuard + fire-and-forget trigger wiring with a per-ClientNode
Mutex. Triggers synchronously enqueue a coalesced reconcile request (verify /
capacity-refresh flags OR-merge); the mutex serializes passes per node, owns the
work, and logs task rejections (no voided or silently swallowed promises).
Explicit reconcile() runs via mutex.produce so it is awaitable and never overlaps
a triggered pass, making convergence deterministic. A request arriving mid-pass
coalesces into exactly one follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
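
The coalescing behavior described in this commit (flags OR-merge, and requests arriving mid-pass collapse into exactly one follow-up) can be illustrated with a deliberately simplified, synchronous stand-in. The branch serializes asynchronous passes with a per-ClientNode Mutex; `CoalescingReconciler` and `ReconcileRequest` here are hypothetical names and do not reproduce that implementation.

```typescript
// Hypothetical sketch of request coalescing; not the branch's Mutex-based code.
interface ReconcileRequest {
    verify: boolean;
    refreshCapacity: boolean;
}

class CoalescingReconciler {
    private pending: ReconcileRequest | undefined;
    private running = false;
    private readonly runPass: (request: ReconcileRequest) => void;

    constructor(runPass: (request: ReconcileRequest) => void) {
        this.runPass = runPass;
    }

    // Triggers call this. Flags OR-merge into the single pending request.
    request(flags: Partial<ReconcileRequest> = {}): void {
        this.pending = {
            verify: (this.pending?.verify ?? false) || (flags.verify ?? false),
            refreshCapacity: (this.pending?.refreshCapacity ?? false) || (flags.refreshCapacity ?? false),
        };
        if (!this.running) {
            this.drain();
        }
    }

    // Runs passes one at a time until no request is left.
    private drain(): void {
        this.running = true;
        try {
            while (this.pending !== undefined) {
                const next = this.pending;
                this.pending = undefined;
                this.runPass(next);
            }
        } finally {
            this.running = false;
        }
    }
}
```

If two triggers fire while a pass is running, the sketch records one merged request and runs a single follow-up pass carrying both flags, rather than two extra passes.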
Remove the now-unreachable repend ReconcileAction (verify drift applies directly);
re-clear pending after mutex close in #unwirePeer; fix a stale test comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…live device read

Capacity counts come from the subscription-maintained state (stateOf), not a
forced Matter read (getStateOf). Live reads stay only on the RMW path of data we
modify (apply/remove), before writing. groupKey drops capacity() entirely: the
key-set count has no subscribed attribute (only the KeySetReadAllIndices command),
so the device's RESOURCE_EXHAUSTED on KeySetWrite is its over-capacity gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apollon77 and others added 3 commits June 23, 2026 20:00
Implements GroupMembershipItemKind — the fifth reconciler kind — which
provisions peer endpoint group membership via the Groups cluster
AddGroup/RemoveGroup/GetGroupMembership commands. Non-success statuses
in response payloads are re-thrown as StatusResponseError so the
engine's retry/drop machinery works uniformly. Capacity is read from
the subscription-cached GroupKeyManagement state (no live I/O).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
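
Re-throwing a non-success status from a response payload, as described above, might be shaped like this. The sketch uses its own minimal stand-ins: `CommandStatusError`, `GroupResponse`, and `assertSuccess` are hypothetical, whereas the branch uses matter.js's `StatusResponseError` and the Groups cluster types. The two status values are the Matter Interaction Model codes for SUCCESS and RESOURCE_EXHAUSTED.

```typescript
// Hypothetical stand-ins for the real error class and cluster response types.
const Status = {
    Success: 0x00,
    ResourceExhausted: 0x89,
} as const;

class CommandStatusError extends Error {
    readonly code: number;

    constructor(command: string, code: number) {
        super(`${command} returned status 0x${code.toString(16)}`);
        this.code = code;
    }
}

interface GroupResponse {
    status: number;
    groupId: number;
}

// The command itself succeeded at the transport level, but the payload may
// still carry a failure status. Convert that into a thrown error so callers
// handle both failure paths the same way.
function assertSuccess(command: string, response: GroupResponse): GroupResponse {
    if (response.status !== Status.Success) {
        throw new CommandStatusError(command, response.status);
    }
    return response;
}
```

With this shape, retry/drop logic only needs to inspect thrown errors and never has to special-case response payloads.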
Register GroupMembershipItemKind in ReconcilerBehavior.initialize() after
GroupKeyMapItemKind (priority order: keyset < group < membership). Add
three-scenario integration test (apply, behind-back re-apply, removeIntent).

API-drift notes:
- DesiredStateBehavior uses removeIntent(), not deleteIntent() (brief was wrong).
- GroupsServer.removeGroup calls assertRemoteActor, so the behind-back removal
  test mutates groupTable directly (same idiom as GroupKeyIntegrationTest).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
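
The priority order mentioned above (keyset < group < membership) could be applied to a batch of pending items as sketched below. This is illustrative only: the numeric priorities, `PendingItem`, and `orderForApply` are hypothetical, and mapping "keyset", "group", and "membership" to the kind names `groupKey`, `groupKeyMap`, and `endpointGroupMembership` is an assumption.

```typescript
// Hypothetical priorities: a lower number is applied earlier.
const APPLY_PRIORITY: Record<string, number> = {
    groupKey: 0,
    groupKeyMap: 1,
    endpointGroupMembership: 2,
};

interface PendingItem {
    kind: string;
    key: string;
}

function priorityOf(kind: string): number {
    if (!Object.prototype.hasOwnProperty.call(APPLY_PRIORITY, kind)) {
        throw new Error(`Unregistered item kind: ${kind}`);
    }
    return APPLY_PRIORITY[kind];
}

// Sorts by kind priority; items of the same kind keep their original order.
function orderForApply(items: readonly PendingItem[]): PendingItem[] {
    return items
        .map((item, index) => ({ item, index }))
        .sort((a, b) => priorityOf(a.item.kind) - priorityOf(b.item.kind) || a.index - b.index)
        .map(entry => entry.item);
}
```

Ordering this way means a key set is written before the group mapping that references it, and the mapping before the endpoint membership that depends on both.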
Apollon77 and others added 9 commits June 25, 2026 17:00
…, driver)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…letion

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… park/resume)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uccess

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the AddNodeToGroup task that provisions a peer endpoint into a group
(groupKey + groupKeyMap + endpointGroupMembership), registered as a builtin.

Fix the gate to park (not fail) when a peer is unreachable at gate entry:
TaskContextImpl#evaluate now skips the unguarded reconcile for unreachable
peers so the predicate stays unsatisfied and the gate waits for the
reachability wake.

Defer the persisted-task resume pass until the node is online so a builtin
task resuming on a fresh node does not act before initialization completes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
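
The park-instead-of-fail gate fix can be pictured with the sketch below. It is a simplified, synchronous illustration with hypothetical names (`evaluateGate`, `PeerView`, `GateOutcome`), not the `evaluate` method from the branch.

```typescript
// Hypothetical outcome names for illustration.
type GateOutcome = "satisfied" | "waiting" | "parked";

interface PeerView {
    reachable: boolean;
}

function evaluateGate(peer: PeerView, reconcile: () => void, predicate: () => boolean): GateOutcome {
    if (!peer.reachable) {
        // Skip the reconcile: it would fail against an unreachable peer and
        // fail the task with it. The predicate stays unsatisfied and the gate
        // is re-evaluated when the peer becomes reachable again.
        return "parked";
    }
    reconcile();
    return predicate() ? "satisfied" : "waiting";
}
```

The point of the fix is that unreachability at gate entry is treated as a reason to wait, not as an error.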
…tImpl->RunningTaskContext

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	packages/node/test/behaviors/thermostat/AtomicWriteHandlerTest.ts
#	packages/node/test/endpoint/EndpointVariableServiceTest.ts
#	packages/node/test/node/ServerNodeTest.ts