
feat(node-manager): Task runtime core + AddNodeToGroup (Task layer increment 1)#3984

Merged
Apollon77 merged 8 commits into node-manager from node-manager-task1
Jun 25, 2026


@Apollon77

Collaborator

First increment of the imperative, persisted, resumable Task layer on top of the reconciler. A Task orchestrates a multi-step, multi-node change by writing tier-1 desired-state intent and waiting on convergence gates — it never touches nodes directly; only the Reconciler does. Durable state is minimal, so tasks are restart-safe and offline-tolerant.
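To make this division of labour concrete, here is a minimal TypeScript sketch. Everything in it is hypothetical: `DesiredStore`, `Reconciler`, and `awaitGate` are illustrative stand-ins, not the node-manager API, and a single in-memory map replaces the per-peer DesiredStateBehavior items. The task side only writes intent and registers a gate; only the reconciler writes to the (simulated) device, and each commit re-evaluates the gate predicate.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the node-manager API.
type ItemState = "pending" | "committed";

interface Item {
    value: number;
    state: ItemState;
}

// Tier-1 desired state: the only thing a task is allowed to write.
class DesiredStore {
    readonly items = new Map<string, Item>();
    private readonly watchers = new Set<() => void>();

    // Upsert: re-issuing an identical intent leaves the existing item untouched.
    setIntent(key: string, value: number): void {
        if (this.items.get(key)?.value === value) return;
        this.items.set(key, { value, state: "pending" });
        this.changed();
    }

    watch(fn: () => void): () => void {
        this.watchers.add(fn);
        return () => {
            this.watchers.delete(fn);
        };
    }

    changed(): void {
        for (const fn of [...this.watchers]) fn();
    }
}

// The only component that touches the (simulated) device.
class Reconciler {
    readonly device = new Map<string, number>();
    private readonly store: DesiredStore;

    constructor(store: DesiredStore) {
        this.store = store;
    }

    reconcile(): void {
        let progressed = false;
        for (const [key, item] of this.store.items) {
            if (item.state !== "pending") continue;
            this.device.set(key, item.value);
            item.state = "committed";
            progressed = true;
        }
        if (progressed) this.store.changed();
    }
}

// A gate re-checks its predicate on every store change until it holds, then fires once.
function awaitGate(store: DesiredStore, until: (items: Map<string, Item>) => boolean, onSatisfied: () => void): void {
    let settled = false;
    let stop = (): void => {};
    const check = (): void => {
        if (settled || !until(store.items)) return;
        settled = true;
        stop();
        onSatisfied();
    };
    stop = store.watch(check);
    check();
}

// Usage: the task writes intent and waits; the reconciler converges.
const store = new DesiredStore();
const reconciler = new Reconciler(store);
let gateResolved = false;

store.setIntent("groupKey", 42);
awaitGate(store, items => items.get("groupKey")?.state === "committed", () => {
    gateResolved = true;
});
console.log(gateResolved); // false: nothing has converged yet
reconciler.reconcile();
console.log(gateResolved, reconciler.device.get("groupKey")); // true 42
```

Because the gate only reads desired-state items, re-running it after a restart is safe: it simply re-reads and either resolves immediately or keeps waiting.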

What

  • TaskManagerBehavior — opt-in on the ServerNode root, attached via dynamic behaviors.require(ReconcilerBehavior) (no static .with; TaskManager pulls the Reconciler in itself). Registry (type → Task ctor), run(type, params, {externalId?}) (synchronous handle, internal deterministic id → idempotent re-issue), get/tasks/cancel, nonvolatile persistence ({type, params, phaseIndex, state, externalId?, addLog}), startup + on-register + on-online resume.
  • TaskContext — two node-affecting verbs (setIntent/removeIntent, creates recorded into an add-log) + awaitGate(nodes, until) / awaitCommitted(items). Gates verify-reconcile the targeted peers, evaluate a predicate over their DesiredStateBehavior items, and suspend via per-gate ObserverGroup subscriptions (cross-peer → ObserverGroup, not reactTo) until it holds — parking while a node is unreachable, resuming on the reachability wake. Coalesced, abort-safe, no voided promises.
  • Resume — re-instantiate non-terminal tasks by type, re-enter phaseIndex, re-run (phases re-entrant: setIntent upserts, gates re-read).
  • Cancel = best-effort revert — reverse the add-log to deletePending, gate on removal → cancelled; offline peer ⇒ parks; terminal revert error ⇒ cancelFailed (reserved). Aborts an in-flight gate cleanly (cancelled, not failed).
  • AddNodeToGroup — concrete task: provisions groupKey + groupKeyMap + endpointGroupMembership intents (converge; priority bands order keyset→group→membership), gates until all committed.
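A rough sketch of the bookkeeping behind `run`, the persisted record, and cancel-as-revert, under explicitly stated assumptions: `TaskManagerSketch` and its methods are hypothetical; the deterministic id is assumed to be derived from the type plus a canonical (key-sorted) serialization of the params; an in-memory `Map` stands in for nonvolatile storage; and the `"cancelling"` state is an invented intermediate. The real revert additionally gates on removal before reporting `cancelled`, which is omitted here.

```typescript
// Hypothetical sketch; not the node-manager TaskManagerBehavior API.
type TaskState = "running" | "completed" | "cancelling" | "cancelled" | "failed";

interface AddLogEntry {
    peer: string;
    key: string;
}

// Mirrors the persisted fields listed above: {type, params, phaseIndex, state, externalId?, addLog}.
interface PersistedTask {
    type: string;
    params: Record<string, unknown>;
    phaseIndex: number;
    state: TaskState;
    externalId?: string;
    addLog: AddLogEntry[];
}

// Key-sorted JSON so equal params always produce the same id (assumed derivation).
function canonical(value: unknown): string {
    if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
    if (value !== null && typeof value === "object") {
        const entries = Object.entries(value as Record<string, unknown>).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
        return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonical(v)}`).join(",")}}`;
    }
    return JSON.stringify(value);
}

class TaskManagerSketch {
    readonly store = new Map<string, PersistedTask>(); // stands in for nonvolatile storage
    readonly intents = new Map<string, "pending" | "deletePending">(); // "peer/key" -> intent state

    // Synchronous handle; re-issuing the same type + params returns the existing task.
    run(type: string, params: Record<string, unknown>, options: { externalId?: string } = {}): string {
        const id = `${type}:${canonical(params)}`;
        if (!this.store.has(id)) {
            this.store.set(id, { type, params, phaseIndex: 0, state: "running", externalId: options.externalId, addLog: [] });
        }
        return id;
    }

    // Upsert an intent; only creations are recorded in the add-log.
    setIntent(id: string, peer: string, key: string): void {
        const task = this.store.get(id);
        if (task === undefined) throw new Error(`unknown task ${id}`);
        const ref = `${peer}/${key}`;
        if (!this.intents.has(ref)) task.addLog.push({ peer, key });
        this.intents.set(ref, "pending");
    }

    // Best-effort revert: walk the add-log newest-first, marking each item deletePending.
    cancel(id: string): AddLogEntry[] {
        const task = this.store.get(id);
        if (task === undefined || task.state !== "running") return [];
        task.state = "cancelling";
        const reverted = [...task.addLog].reverse();
        for (const { peer, key } of reverted) this.intents.set(`${peer}/${key}`, "deletePending");
        return reverted;
    }
}

const manager = new TaskManagerSketch();
const a = manager.run("AddNodeToGroup", { peer: "n1", endpoint: 1, groupId: 7 });
const b = manager.run("AddNodeToGroup", { groupId: 7, endpoint: 1, peer: "n1" }); // same params, other key order
console.log(a === b, manager.store.size); // true 1
for (const key of ["groupKey", "groupKeyMap", "endpointGroupMembership"]) manager.setIntent(a, "n1", key);
console.log(manager.cancel(a).map(e => e.key).join(", ")); // endpointGroupMembership, groupKeyMap, groupKey
```

Walking the add-log newest-first undoes membership before the group and keyset it depends on, which is presumably why the revert reverses the log rather than replaying it in creation order.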

Scope

Increment 1 of 3. Add-log only — the full prior-state changeset + automatic hard-failure rollback + multi-node pre-flight admission + RemoveNodeFromGroup are increment 2; ipk kind + RotateGroupKey + 2-node rotation harness are increment 3.

Tests

  • Unit: manager (require/run/dedup/external-id/persist), gates (resolve/wait/park-resume with concurrency), lifecycle (resume across restart, cancel reverse-revert, in-flight cancel).
  • Integration (commissioned OnOffLightSwitchDevice.with(GroupsServer)): happy → member; park/resume; cancel → removed; restart-resume (real close + recreate-by-id → resumes to completed). All assert real device groupTable state.

Gates

build --clean ✓ · format ✓ · lint ✓ · node-manager 92/92 · node 1231/1231 (no regression). Per-task + whole-branch review: zero Critical/Important.

Base = node-manager (sub-PR; CI runs on umbrella #3948 → main).

🤖 Generated with Claude Code

Apollon77 and others added 7 commits June 25, 2026 13:36
…, driver)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…letion

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… park/resume)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uccess

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the AddNodeToGroup task that provisions a peer endpoint into a group
(groupKey + groupKeyMap + endpointGroupMembership), registered as a builtin.

Fix the gate to park (not fail) when a peer is unreachable at gate entry:
TaskContextImpl#evaluate now skips the unguarded reconcile for unreachable
peers so the predicate stays unsatisfied and the gate waits for the
reachability wake.

Defer the persisted-task resume pass until the node is online so a builtin
task resuming on a fresh node does not act before initialization completes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
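Both fixes in this commit can be pictured with a small sketch. `GateSketch`, `Peer`, and `OnlineResume` are hypothetical stand-ins, not `TaskContextImpl` or the manager's real resume pass: the gate skips reconciling unreachable peers so its predicate stays false (parking rather than failing) and re-evaluates on a reachability wake, while resume work queued before the node is online runs only once it goes online.

```typescript
// Hypothetical sketch; not TaskContextImpl or the real resume pass.
interface Peer {
    id: string;
    reachable: boolean;
    committed: boolean;
}

class GateSketch {
    private waiting = true;
    private readonly peers: Peer[];
    private readonly onResolved: () => void;

    constructor(peers: Peer[], onResolved: () => void) {
        this.peers = peers;
        this.onResolved = onResolved;
    }

    // Called at gate entry and again on every wake (item change or reachability change).
    evaluate(reconcile: (peer: Peer) => void): void {
        if (!this.waiting) return;
        for (const peer of this.peers) {
            // Skip reconciling unreachable peers: the predicate stays unsatisfied,
            // so the gate parks instead of failing.
            if (peer.reachable) reconcile(peer);
        }
        if (this.peers.every(p => p.committed)) {
            this.waiting = false;
            this.onResolved();
        }
    }
}

// Resume work registered before the node is online is deferred until it is.
class OnlineResume {
    private online = false;
    private readonly deferred: Array<() => void> = [];

    resume(task: () => void): void {
        if (this.online) task();
        else this.deferred.push(task);
    }

    goOnline(): void {
        this.online = true;
        for (const task of this.deferred.splice(0)) task();
    }
}

const peer: Peer = { id: "n1", reachable: false, committed: false };
let resolved = false;
const gate = new GateSketch([peer], () => {
    resolved = true;
});
const reconcile = (p: Peer): void => {
    p.committed = true;
};
gate.evaluate(reconcile); // peer unreachable: gate parks
console.log(resolved); // false
peer.reachable = true; // reachability wake
gate.evaluate(reconcile);
console.log(resolved); // true

const resumeLog: string[] = [];
const node = new OnlineResume();
node.resume(() => resumeLog.push("AddNodeToGroup"));
console.log(resumeLog.length); // 0: not online yet
node.goOnline();
console.log(resumeLog.length); // 1
```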
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mergify

mergify Bot commented Jun 25, 2026

Contributor

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

…tImpl->RunningTaskContext

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apollon77 merged commit f8b505f into node-manager Jun 25, 2026
2 checks passed
Apollon77 deleted the node-manager-task1 branch June 25, 2026 15:00