
mesh: capability round-trip + gossip merge + late-join replay #206

Merged
mrjeeves merged 1 commit into main from claude/charming-turing-PMIK8 on May 28, 2026

@mrjeeves
Owner

Symptom

After PRs #203 + #205 landed, the LLM was finally joining the mesh — but the user-facing network features were silent. Tested against a paired peer:

  • Connections card: every peer rendered as "no LLM, no ASR, no hardware". The chips matrix was empty regardless of what the peer actually had locally.
  • Sidebar: peer-hosted conversations never appeared. The catalog gossip channel was firing but nothing landed in peer.catalog.
  • Remote inference: routing the picker to a peer worked at the UI level, but the peer's handler 404'd with "no local LLM available" every time.
  • Settings sync: edited a tool permission on device A, it never appeared on device B even after multiple rounds. Same for prompts.
  • Late join: a peer that handshaked 30+s after the other was already up saw nothing — no catalog, no prompts, no permissions — until the existing device made an unrelated local edit.
  • Auto-gossip toggle: visually reverted to off on every settings panel reload.

Root cause

Eight gaps that the Phase B–D migration (#203) marked as "Phase C-6 wires this for real" or "Phase D" and never actually wired. The migration successfully replaced the Trystero transport layer but left the LLM-specific feature surface in a "shell" state.

1. Capabilities stripped by the daemon shoulder

myownmesh_core::protocol::CapabilityAdvert is {tags, app_version, max_connections, extra}. The LLM was pushing the structured Capabilities blob ({llms, asr, diarize, hardware, inputs, outputs, accepting, app_version, features}) directly — serde silently drops the unknown structured fields on deserialize. Peers saw {tags: [], app_version, max_connections: null, extra: null}. Cascade: nobody could route inference (canServeInference checks cap.llms.length), the transcribe peer picker was empty (canServeTranscribe checks cap.asr.length), the Connections card had nothing to chip.

Fix: pack the full Capabilities blob into CapabilityAdvert.extra before pushing. New peerCapabilitiesFromAdvert unpacks it on receive in daemonPeerToEntry, validating each field and falling back to empty defaults so a peer that didn't use the extra slot (older build) still renders cleanly.
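The pack/unpack round-trip can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Capabilities` shape is reduced to four fields, `llms` and `asr` are assumed to be plain string arrays, and `packCapabilities` and `asStringArray` are hypothetical helper names. Only `peerCapabilitiesFromAdvert` and the `CapabilityAdvert` field names come from the description above.

```typescript
// Reduced, hypothetical shapes; the real Capabilities blob has more fields.
interface Capabilities {
  llms: string[];
  asr: string[];
  accepting: boolean;
  app_version: string;
}

interface CapabilityAdvert {
  tags: string[];
  app_version: string;
  max_connections: number | null;
  extra: unknown; // opaque to the daemon, shipped verbatim
}

// Returns the value if it is an array of strings, otherwise an empty default.
function asStringArray(v: unknown): string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string") ? (v as string[]) : [];
}

// Pack (hypothetical helper): the structured blob rides in `extra`,
// so it is no longer dropped as unknown top-level fields.
function packCapabilities(caps: Capabilities): CapabilityAdvert {
  return { tags: [], app_version: caps.app_version, max_connections: null, extra: caps };
}

// Unpack: validate field by field; missing or malformed values become empty defaults,
// so an older peer that never filled `extra` still yields a usable entry.
function peerCapabilitiesFromAdvert(advert: CapabilityAdvert): Capabilities {
  const raw: Record<string, unknown> =
    typeof advert.extra === "object" && advert.extra !== null
      ? (advert.extra as Record<string, unknown>)
      : {};
  const accepting = raw["accepting"];
  return {
    llms: asStringArray(raw["llms"]),
    asr: asStringArray(raw["asr"]),
    accepting: typeof accepting === "boolean" ? accepting : false, // default is an assumption
    app_version: advert.app_version, // the outer field wins over any inner copy
  };
}
```

An advert with `extra: null` unpacks to empty lists, which is the "older build still renders cleanly" case.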

2. Local inference handler 404'd every remote call

localCapabilitiesForHandler() hardcoded llms: [] — the mesh-inference.ts handler couldn't find a model match and returned streamRpcEnd("no local LLM available") for every inbound infer.

Fix: cache the last-pushed Capabilities in lastLocalCapabilities (populated by pushCapabilities); the handler now reads the live LLM list and picks by (family, mode) exactly the way the legacy mesh-client did.
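A sketch of the cache-then-pick flow, under stated assumptions: the `LlmEntry` fields (`name`, `family`, `mode`) and the `pickLocalModel` helper are hypothetical, and the real `pushCapabilities` also sends the advert to the daemon, which is omitted here.

```typescript
// Hypothetical entry shape; the PR only says models are picked by (family, mode).
interface LlmEntry {
  name: string;
  family: string;
  mode: string;
}

interface Capabilities {
  llms: LlmEntry[];
}

// Last capabilities pushed to the mesh; null until the first push.
let lastLocalCapabilities: Capabilities | null = null;

function pushCapabilities(caps: Capabilities): void {
  lastLocalCapabilities = caps; // cache for the inbound handler
  // (the real client would also ship the advert to the daemon here)
}

// Previously this returned a hardcoded empty list; now it reads the cache.
function localCapabilitiesForHandler(): Capabilities {
  return lastLocalCapabilities ?? { llms: [] };
}

// Hypothetical picker: first local model matching the requested family and mode.
function pickLocalModel(family: string, mode: string): LlmEntry | null {
  const match = localCapabilitiesForHandler().llms.filter(
    (m) => m.family === family && m.mode === mode,
  );
  return match.length > 0 ? match[0] : null;
}
```

Before the first push the picker returns `null`, which corresponds to the "no local LLM available" reply; after a push it sees the live list.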

3. Local mutations never broadcast

agentPermissions.setBroadcaster(fn) and agentPrompts.setBroadcaster(fn) are the hooks each store fires on every persistPatch / persistList. The legacy client wired them; the new client never did. Editing a permission or saving a prompt was silent on the wire.

Fix: install both broadcasters in startImpl() and release them on stop() via the featureReleases array. Gated on autoGossipEnabled inside the callback.
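The install/release pattern might look like the sketch below. `GossipStore` and `MeshClientSketch` are hypothetical stand-ins for the stores and the client; only `setBroadcaster`, `featureReleases`, and `autoGossipEnabled` are names taken from the description, and the snapshot is reduced to a string.

```typescript
type Broadcaster<T> = (snapshot: T) => void;

// Hypothetical stand-in for agentPermissions / agentPrompts.
class GossipStore<T> {
  private broadcaster: Broadcaster<T> | null = null;

  setBroadcaster(fn: Broadcaster<T> | null): void {
    this.broadcaster = fn;
  }

  // Stand-in for persistPatch / persistList: fires the hook on every local edit.
  persist(snapshot: T): void {
    if (this.broadcaster !== null) this.broadcaster(snapshot);
  }
}

class MeshClientSketch {
  autoGossipEnabled = true;
  readonly sent: string[] = []; // records what would go out on the wire
  private featureReleases: Array<() => void> = [];
  private store: GossipStore<string>;

  constructor(store: GossipStore<string>) {
    this.store = store;
  }

  start(): void {
    this.store.setBroadcaster((snap) => {
      if (!this.autoGossipEnabled) return; // gossip off: nothing outbound
      this.sent.push(snap);
    });
    // Remember how to undo the install so stop() leaves no dangling hook.
    this.featureReleases.push(() => this.store.setBroadcaster(null));
  }

  stop(): void {
    for (const release of this.featureReleases) release();
    this.featureReleases = [];
  }
}
```

Checking the flag inside the callback, rather than at install time, means toggling gossip takes effect without reinstalling the hook.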

4. Permissions wire shape was wrong

publishPermissions was shipping the daemon's roster list ({authorized: [{device_id, label}], ts}) on the permissions/snapshot channel. The actual feature is per-tool agent gates (shell, write_file). Wrong data, on the right channel, with the right name — even if the merge had been wired, the incoming data would have been useless.

Prompts had the same problem at lower stakes: publishPrompts lossy-mapped each prompt to {id, label, body}, dropping tools, user_prompt, and updated_at.

Fix: ship {tools: {shell, write_file}, ts} matching agentPermissions.mergeIncoming. Full Prompt shape (id, name, system_prompt, tools, user_prompt, updated_at) for prompts. New publishPermissionsSnapshot / publishPromptsSnapshot helpers ship pre-formed snapshots without re-reading config from disk on every mutation.
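Written as types, the corrected wire shapes could look like this. The field names come from the description; the field types (string policies, numeric timestamps, `tools` as a string array on `Prompt`) are assumptions, and `isPermissionsSnapshot` is a hypothetical guard added to show that the old roster payload no longer fits.

```typescript
// Policy values are an assumption; the PR only mentions "accept_all" by name.
type ToolPolicy = string;

interface PermissionsSnapshot {
  tools: { shell: ToolPolicy; write_file: ToolPolicy };
  ts: number;
}

// Full prompt shape as listed in the fix; field types are assumed.
interface Prompt {
  id: string;
  name: string;
  system_prompt: string;
  tools: string[];
  user_prompt: string;
  updated_at: number;
}

interface PromptsSnapshot {
  prompts: Prompt[];
}

// Hypothetical guard: accepts the per-tool gate shape, rejects anything else.
function isPermissionsSnapshot(v: unknown): v is PermissionsSnapshot {
  if (typeof v !== "object" || v === null) return false;
  const rec = v as Record<string, unknown>;
  const tools = rec["tools"];
  if (typeof tools !== "object" || tools === null) return false;
  const gates = tools as Record<string, unknown>;
  return (
    typeof gates["shell"] === "string" &&
    typeof gates["write_file"] === "string" &&
    typeof rec["ts"] === "number"
  );
}
```

The old `{authorized: [...], ts}` payload fails the guard because it carries no `tools` object at all.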

5. Inbound snapshots were logged, not merged

The subscribePermissions / subscribePrompts hooks ran appendDiag("info", "permissions snapshot from ...: N entries") and stopped. The actual mergeIncoming call was never made.

Fix: hooks now call agentPermissions.mergeIncoming(snap.tools, activeNetworkId) / agentPrompts.mergeIncoming(snap.prompts, activeNetworkId) and log only when the merge changed something. Gated on autoGossipEnabled (isolation contract).
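A rough sketch of the inbound hook with both gates applied. The store below is a hypothetical stand-in that simply overwrites on difference and reports whether anything changed; the real `agentPermissions.mergeIncoming` presumably has its own merge rules, and the boolean return value is an assumption inferred from "log only when the merge changed something".

```typescript
interface ToolGates {
  shell: string;
  write_file: string;
}

// Hypothetical stand-in for agentPermissions, keyed by saved-network config id.
class PermissionsStoreSketch {
  private byNetwork: Record<string, ToolGates> = {};

  // Returns true only when the stored policy actually changed (assumed contract).
  mergeIncoming(tools: ToolGates, networkId: string): boolean {
    const cur = this.byNetwork[networkId];
    if (cur !== undefined && cur.shell === tools.shell && cur.write_file === tools.write_file) {
      return false;
    }
    this.byNetwork[networkId] = { shell: tools.shell, write_file: tools.write_file };
    return true;
  }

  get(networkId: string): ToolGates | undefined {
    return this.byNetwork[networkId];
  }
}

interface InboundState {
  autoGossipEnabled: boolean;
  activeConfigNetworkId: string | null;
  log: string[];
}

function onPermissionsSnapshot(
  state: InboundState,
  store: PermissionsStoreSketch,
  snap: { tools: ToolGates },
  from: string,
): void {
  if (!state.autoGossipEnabled) return; // isolation: peers cannot mutate local policy
  const networkId = state.activeConfigNetworkId;
  if (networkId === null) return; // no saved-network slot to merge into
  if (store.mergeIncoming(snap.tools, networkId)) {
    state.log.push(`permissions merged from ${from}`); // log only on real change
  }
}
```

Scoping the merge by config id is what keeps a snapshot for one saved network from touching another network's policy.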

New activeConfigNetworkId field tracks the LLM-side config id (distinct from this.network which is the wire-level network_id) so the merge scopes correctly to the right saved-network slot.

6. Auto-gossip toggle reset to false every launch

setAutoGossip updated an in-memory autoGossipEnabled = false field. The UI binds to active?.auto_gossip from config, so the toggle visually reverted on every reloadFromConfig. The toggle was never persisted to disk. Hydration on start() was missing.

Fix: hydrate from activeNetwork(cfg)?.auto_gossip ?? true on start (matches the legacy default-on behavior). setAutoGossip persists via updateNetwork(active.id, { auto_gossip }).
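The default-on hydration hinges on `??` rather than `||`, so an explicitly saved `false` survives a restart. A minimal sketch, assuming a simplified `Config` shape with an `active_network_id` field and a pure, config-returning `updateNetwork` (both hypothetical; the real helper has a different signature and writes to disk):

```typescript
interface NetworkConfig {
  id: string;
  auto_gossip?: boolean;
}

// Hypothetical, simplified config layout.
interface Config {
  active_network_id: string | null;
  networks: NetworkConfig[];
}

function activeNetwork(cfg: Config): NetworkConfig | undefined {
  const hits = cfg.networks.filter((n) => n.id === cfg.active_network_id);
  return hits.length > 0 ? hits[0] : undefined;
}

// Missing network or missing field falls back to true; a stored false is kept.
function hydrateAutoGossip(cfg: Config): boolean {
  return activeNetwork(cfg)?.auto_gossip ?? true;
}

// Pure stand-in for persisting the toggle: returns the updated config.
function updateNetwork(cfg: Config, id: string, patch: Partial<NetworkConfig>): Config {
  return {
    ...cfg,
    networks: cfg.networks.map((n) => (n.id === id ? { ...n, ...patch } : n)),
  };
}
```

With `||` in place of `??`, a saved `false` would be replaced by the default and the toggle would come back on after every restart.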

7. No periodic refresh + no late-joiner replay

The daemon's typed channels don't replay past publishes. A peer joining 30s after the existing device's start() would see an empty peer.catalog, no prompts, no permissions until the existing device made an unrelated local edit.

Fix: 60s setInterval re-publishes catalog (+ gossip-gated perms/prompts). A shipCatchUpGossipToNewlyActive() path fires from reconcile() — newly-active peers get a one-shot catch-up broadcast; tracked in a gossipedOnceTo set that prunes stale entries (a flap active → shelved → active gets the catch-up again). Initial active peers at start() time are seeded so the first reconcile() doesn't duplicate the initial broadcast.
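The prune-then-diff bookkeeping behind the one-shot catch-up can be sketched as a small pure function. `peersNeedingCatchUp` is a hypothetical name, and peers are assumed to be identified by string ids; in the real client this logic would sit inside `shipCatchUpGossipToNewlyActive()`.

```typescript
// Returns the peers that just became active and have not had a catch-up yet.
// Mutates gossipedOnceTo: prunes peers that are no longer active, adds the new ones.
function peersNeedingCatchUp(
  activePeerIds: ReadonlySet<string>,
  gossipedOnceTo: Set<string>,
): string[] {
  // Prune stale entries so a peer that flaps gets a fresh catch-up later.
  gossipedOnceTo.forEach((id) => {
    if (!activePeerIds.has(id)) gossipedOnceTo.delete(id);
  });

  const fresh: string[] = [];
  activePeerIds.forEach((id) => {
    if (!gossipedOnceTo.has(id)) {
      gossipedOnceTo.add(id);
      fresh.push(id);
    }
  });
  return fresh;
}
```

Seeding the set with the peers already active at `start()` makes the first `reconcile()` return an empty list, so the initial broadcast is not repeated.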

8. noteCatalogChanged fired one publish per mutation

App-side bulk operations (folder move-N-files, multi-rename) each call refreshConversations(), which calls noteCatalogChanged(). Without debounce, a 20-file move = 20 catalog broadcasts.

Fix: 500ms setTimeout coalesce, single broadcast at the trailing edge.
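A trailing-edge coalesce of this kind might be written as below. `makeTrailingDebounce` is a hypothetical helper, and it assumes each call restarts the 500ms window; the PR does not say whether the real timer restarts or simply fires once after the first call. The scheduler is injectable only so the behaviour can be checked without real timers.

```typescript
type Schedule = (fn: () => void, ms: number) => unknown;
type Cancel = (handle: unknown) => void;

// Wraps `publish` so a burst of calls produces one publish after the burst ends.
function makeTrailingDebounce(
  publish: () => void,
  delayMs: number = 500,
  schedule: Schedule = (fn, ms) => setTimeout(fn, ms),
  cancel: Cancel = (h) => clearTimeout(h as ReturnType<typeof setTimeout>),
): () => void {
  let pending: { handle: unknown } | null = null;
  return () => {
    // Assumption: every call restarts the window.
    if (pending !== null) cancel(pending.handle);
    const handle = schedule(() => {
      pending = null;
      publish(); // single broadcast at the trailing edge
    }, delayMs);
    pending = { handle };
  };
}
```

A `noteCatalogChanged` built this way turns a 20-file move into one catalog broadcast, fired once the burst has been quiet for the delay.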

Files

  • src/mesh-daemon.svelte.ts (+368 / -37): peerCapabilitiesFromAdvert helper, pack/unpack on pushCapabilities/daemonPeerToEntry, lastLocalCapabilities cache feeding localCapabilitiesForHandler, broadcaster install/release, inbound merge hooks, activeConfigNetworkId tracking, autoGossipEnabled hydration + persistence in setAutoGossip, 60s periodic refresh interval, shipCatchUpGossipToNewlyActive path, 500ms noteCatalogChanged debounce.
  • src/mesh-gossip.ts (+68 / -36): corrected permissions wire shape ({tools: {shell, write_file}, ts}), full Prompt[] shape for prompts, publishPermissionsSnapshot / publishPromptsSnapshot variants for the broadcaster callbacks, dropped the obsolete roster-list flow.

Validation

  • pnpm run check — 164 files, 0 errors, 0 warnings.
  • pnpm run build — clean.
  • Two devices: connect, approve each other; Connections card on both shows the other's LLM / ASR / hardware chips (was empty before; needs hardware).
  • Two devices: device A has model gemma3:4b pulled, device B routes inference to A via the bar's peer picker — A's handler picks the model and streams chunks back (was 404'd before).
  • Two devices: edit "shell" permission to accept_all on A; B's Settings → Permissions reflects it within a few seconds (was silent before).
  • Two devices: edit a prompt's body on A; B's Settings → Prompts reflects it (was silent before).
  • Toggle auto-gossip off on A; edit on A does NOT propagate to B; edit on B does NOT propagate to A.
  • Restart A: auto-gossip toggle remembers its on/off state (was always-off before).
  • Late join: with A already up for 30+ minutes (past the initial publish), B joins fresh — B's sidebar shows A's hosted conversations and A's Connections card lights up B's LLM/ASR chips (was empty until A made an edit before).
  • Move 10 conversations into a folder: a single catalog/announce broadcast (was 10 before).

Notes

  • The extra-field shape is opaque to the daemon by design (serde_json::Value) — it ships verbatim. Older LLM builds that haven't started using the extra slot will render as "empty capabilities" on newer builds; once both ends are on this PR the round-trip is symmetric.
  • The catalog refresh interval is 60s to match legacy. The catch-up gossip on becoming-active fires sooner — usually within the reconcile-after-peer-event window.
  • Rust side untouched. cargo check can't run in the sandbox (missing gdk-3.0); the Rust IPC surface from #203, "mesh: migrate to myownmesh daemon (phases A–D)", is already adequate.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk


Generated by Claude Code

After the Phase B–D daemon migration (PR #203/#205) the LLM was
joined to the mesh but the network-feature surface (remote
inference, hardware advertisement, settings sync, late-joiner
catch-up) wasn't actually working. Eight gaps nominally landed
as "Phase C-6 / D" in #203 but in practice were left as TODOs.

**1. Capabilities stripped by the daemon shoulder.**

The daemon's `CapabilityAdvert` is `{tags, app_version,
max_connections, extra}`. The LLM was pushing the structured
`Capabilities` blob (`{llms, asr, diarize, hardware, inputs,
outputs, accepting, app_version, features}`) directly, which
serde silently dropped on deserialize — peers always saw each
other as "no LLMs / no ASR / no hardware", which broke every
piece of LLM-side capability-keyed routing (remote inference
peer picker, transcribe peer picker, the LLM/ASR chips in
Connections).

Fix: pack the full `Capabilities` into `CapabilityAdvert.extra`
before pushing; unpack in `daemonPeerToEntry` via a new
`peerCapabilitiesFromAdvert` helper that validates each field
and falls back to empty defaults. `CapabilityAdvert.app_version`
takes precedence over the inner copy since the daemon promotes
that field in `hello` for cosmetic display.

**2. Local inference handler 404'd every remote call.**

`localCapabilitiesForHandler()` hard-returned `llms: []` (marked
as "Phase C-6 wires this for real"), so even when a peer routed
inference to us we hit `streamRpcEnd("no local LLM available")`
and never reached Ollama.

Fix: cache the last-pushed `Capabilities` in
`lastLocalCapabilities` (populated by `pushCapabilities`); the
handler now sees the live LLM list and can pick a model by
(family, mode) exactly the way the legacy mesh-client did.

**3. Local mutations never broadcast.**

`agentPermissions.setBroadcaster(...)` and
`agentPrompts.setBroadcaster(...)` are the hooks both stores fire
on every local edit (`persistPatch` / `persistList`). The legacy
client wired them; the new client never did, so editing a tool
permission or saving a prompt was silent on the wire.

Fix: install both broadcasters in `startImpl()` and release them
via the `featureReleases` array on `stop()`. Both are gated on
`autoGossipEnabled` inside the callback so the network's
isolation contract (auto-gossip off → no outbound) holds.

**4. Permissions wire shape was wrong.**

`publishPermissions` was shipping the daemon's *roster list*
(`{authorized: [{device_id, label}], ts}`) on the
`permissions/snapshot` channel — meaningless for the actual
feature, which is per-tool agent gates (shell, write_file).
Even if the merge had been wired, the incoming data would have
been useless.

Fix: ship `{tools: {shell, write_file}, ts}` matching the shape
`agentPermissions.mergeIncoming` consumes. New
`publishPermissionsSnapshot(client, snap)` helper lets the
`setBroadcaster` callback ship a pre-formed snapshot without
re-reading config from disk on every mutation.

Prompts had the same problem at lower stakes — `publishPrompts`
was lossy-mapping each prompt to `{id, label, body}`, dropping
`tools`, `user_prompt`, and `updated_at`. Now ships the full
`Prompt` shape so `agentPrompts.mergeIncoming` can do per-id
LWW correctly.
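
For illustration, per-id last-writer-wins over the full `Prompt` shape could look like the sketch below. `mergePromptsLww` is a hypothetical function, `updated_at` is assumed to be a comparable number, and ties are assumed to keep the local copy; the real `agentPrompts.mergeIncoming` may differ on all three points.

```typescript
interface Prompt {
  id: string;
  name: string;
  system_prompt: string;
  tools: string[];
  user_prompt: string;
  updated_at: number; // assumed numeric timestamp
}

// For each id, keep whichever copy has the newer updated_at; ties keep local.
function mergePromptsLww(
  local: Prompt[],
  incoming: Prompt[],
): { merged: Prompt[]; changed: boolean } {
  const byId: Record<string, Prompt> = {};
  const order: string[] = []; // preserves first-seen order of ids
  for (const p of local) {
    if (!(p.id in byId)) order.push(p.id);
    byId[p.id] = p;
  }
  let changed = false;
  for (const p of incoming) {
    const cur = byId[p.id];
    if (cur === undefined) {
      order.push(p.id);
      byId[p.id] = p;
      changed = true;
    } else if (p.updated_at > cur.updated_at) {
      byId[p.id] = p;
      changed = true;
    }
  }
  return { merged: order.map((id) => byId[id]), changed };
}
```

Without `updated_at` on the wire, as in the old `{id, label, body}` mapping, there is nothing to compare and a merge like this cannot decide which copy wins.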

**5. Inbound snapshots were logged, not merged.**

The `subscribePermissions` / `subscribePrompts` hooks fired
`appendDiag("info", "permissions snapshot from ...: N entries")`
and stopped. The actual merge into `agentPermissions` /
`agentPrompts` (which is what makes a peer's edit visible
locally) was never called.

Fix: hooks now call `agentPermissions.mergeIncoming(snap.tools,
activeNetworkId)` / `agentPrompts.mergeIncoming(snap.prompts,
activeNetworkId)` and log only when the merge actually changed
something. Gated on `autoGossipEnabled` (isolation contract:
when gossip is off, peer pressure can't mutate our policy).

New `activeConfigNetworkId` field tracks the LLM-side config id
(distinct from `this.network` which is the wire-level
`network_id`) so the merge scopes correctly — a snapshot
arriving on network A doesn't accidentally overwrite network
B's saved policy.

**6. Auto-gossip toggle reset to false every launch.**

`setAutoGossip` updated an in-memory `autoGossipEnabled = false`
field; the UI binds to `active?.auto_gossip` from config (so the
toggle visually reverted on every `reloadFromConfig`); the toggle
was never persisted. The hydration on `start()` was missing too —
even users who'd previously enabled gossip saw it off after
restart.

Fix: hydrate `autoGossipEnabled` from `activeNetwork(cfg)
?.auto_gossip ?? true` on start (matches the legacy default).
`setAutoGossip` persists via `updateNetwork(active.id, {
auto_gossip })`. Toggle now sticks across restarts.

**7. No periodic refresh + no late-joiner replay.**

The daemon's typed channels don't replay past publishes — a peer
who handshakes 30s after our initial publish sees an empty
`peer.catalog`, no prompts, no permissions until our next local
mutation. The legacy client ran a 60s catalog refresh tick + a
once-per-newly-active-peer catch-up broadcast; both were missing.

Fix: 60s `setInterval` re-publishing catalog (+ gossip-gated
perms/prompts). A `shipCatchUpGossipToNewlyActive()` hook fires
from `reconcile()` whenever the peer snapshot changes — newly
active peers get a one-shot catch-up broadcast, tracked in a
`gossipedOnceTo` set that prunes stale entries (so a flap
active → shelved → active gets the catch-up again).

Initial peers (active at start time) get seeded into
`gossipedOnceTo` so the initial broadcast on `start()` isn't
duplicated by the first `reconcile()`.

**8. `noteCatalogChanged` fired one publish per mutation.**

App-side bulk operations (folder move-N-files, multi-rename)
each call `refreshConversations()` which calls
`noteCatalogChanged()`. Without debounce, a 20-file move = 20
catalog broadcasts.

Fix: 500ms `setTimeout` coalesce in `noteCatalogChanged` — same
shape as the legacy client. Single broadcast at the trailing
edge of the burst.

---

Files:

- `src/mesh-daemon.svelte.ts`: +368 / -37. New helper
  (`peerCapabilitiesFromAdvert`), pack/unpack wiring on
  `pushCapabilities`/`daemonPeerToEntry`, `lastLocalCapabilities`
  cache feeding `localCapabilitiesForHandler`, broadcaster
  wiring + release, inbound merge hooks, `activeConfigNetworkId`
  field, autoGossipEnabled hydration + persistence, periodic
  refresh interval, catch-up gossip path, catalog debounce.

- `src/mesh-gossip.ts`: +68 / -36. Fixed permissions wire shape
  (`{tools}` not roster), full Prompt[] in prompts wire,
  `publishPermissionsSnapshot` / `publishPromptsSnapshot`
  variants for `setBroadcaster` callers, dropped the obsolete
  roster-list flow.

**Validation:**
- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `pnpm run build`: clean.
- Rust unchanged — Tauri build env (gdk-3.0) isn't installed in
  the sandbox so `cargo check` can't run; no `.rs` files touched.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk
@mrjeeves mrjeeves merged commit fd268fa into main May 28, 2026
4 checks passed
@mrjeeves mrjeeves deleted the claude/charming-turing-PMIK8 branch May 28, 2026 06:25
mrjeeves added a commit that referenced this pull request May 28, 2026
…on (#207)

The migration off Trystero onto the standalone myownmesh daemon
(PRs #201 / #203 / #204 / #205 / #206) shipped the code but left
every doc still describing the world before the move:

- README claimed mesh discovery went "via Trystero over public
  Nostr relays" and that agent permissions persisted under
  `Config.agent_permissions.by_device[<device_id>]`.
- ARCHITECTURE.md's mesh-module section described `mesh-client.svelte.ts`
  (deleted), Trystero room ownership (gone), and a TS module table
  that didn't list any of the files Phase C–D actually shipped
  (`mesh-daemon.svelte.ts`, `mesh-gossip.ts`, `mesh-inference.ts`,
  `mesh-file.ts`, `mesh-move.ts`, `mesh-transcribe.ts`,
  `mesh-governance.ts`).
- CONNECTION-ENGINE.md was a 535-line spec for the 4-layer
  connection engine that no longer lives in this repo — every
  paragraph referenced `src/mesh-client.svelte.ts` or
  `mesh-scheduler-worker.ts`, neither of which exists.
- DOCS.md's Cloud Mesh section walked the user through Trystero
  rooms, the legacy on-the-wire `MeshMessage` JSON envelope
  (`infer_request` / `infer_chunk` / `move_offer` / `file_offer`),
  and a config example missing every field the per-network
  schema gained (`label`, `kind`, `topology`, `auto_approve`,
  `auto_gossip`, `agent_permissions`, `prompts`).
- PROGRESS.md was a historical bug-fix doc for a Trystero
  subscription-state quirk that no longer applies — the engine
  isn't here anymore.

What this commit changes:

**README.md**: replace Trystero claim with the bundled
`myownmesh` daemon model; correct the agent-permissions storage
path to the per-network shape (`Config.cloud_mesh.networks[*].
agent_permissions`) and mention the `auto_gossip` gate.

**ARCHITECTURE.md**: rewrite the one-picture diagram to show
the daemon sidecar alongside Ollama; rewrite the mesh intro
paragraph; rewrite the `mesh/` Rust module row to describe
`daemon.rs`, `daemon_commands.rs`, the detect-and-share socket
order, and the relationship to `myownmesh_core`; rewrite the
TS module table to list every `mesh-*.ts` file actually in the
tree with its current role; refresh the CloudMesh sub-tab
inventory (Status / Settings / Connections / Graph / Governance
/ Activity / HTTP); refresh the persistence section to show
`daemon.sock` + the per-network config layout.

**CONNECTION-ENGINE.md**: rewrite as a short pointer. The
4-layer engine + 7-tier reconnect ladder live in MyOwnMesh now;
this doc explains what the LLM still owns on top (the layer-4
LLM-specific protocol), how the LLM talks to the daemon
(detect-and-share IPC), and lists the LLM-side RPC methods +
typed channels currently in use (`infer`, `transcribe`,
`file_offer` / `file_send` + `file_chunks/<id>`, `session_*` /
`move_*`, `catalog/announce`, `permissions/snapshot`,
`prompts/snapshot`).

**DOCS.md Cloud Mesh section**: replace the Trystero transport
paragraph with the daemon's detect-and-share model; refresh
every What-the-mesh-does-for-you row to match current behavior
(click-to-open, click-through Pull, file transfer wire shape,
permissions+prompts gossip with the auto_gossip gate, Graph
view, Governance view, no Phase-1/Phase-2 split); replace the
JSON-over-data-channel wire-protocol box with the daemon
RPC + typed-channel surface; refresh the example config to
include `label`, `kind`, `topology`, `auto_approve`,
`auto_gossip`, `agent_permissions`, `prompts`.

**PROGRESS.md**: deleted. The Trystero subscription-state bug
it documents doesn't apply post-daemon. Two `// see PROGRESS.md`
breadcrumbs in `src-tauri/src/asr/mod.rs` and
`src-tauri/src/diarize/cluster.rs` updated to free-standing
explanations.

Validation:
- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `grep -rn "Trystero\|trystero\|mesh-client\.svelte" --include="*.md" .`
  returns nothing.
- `grep -rn "PROGRESS.md" .` returns nothing.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk

Co-authored-by: Claude <noreply@anthropic.com>