mesh: capability round-trip + gossip merge + late-join replay by mrjeeves · Pull Request #206 · mrjeeves/MyOwnLLM

mrjeeves · 2026-05-28T06:09:25Z

Symptom

After PRs #203 + #205 landed, the LLM was finally joining the mesh — but the user-facing network features were silent. Tested against a paired peer:

Connections card: every peer rendered as "no LLM, no ASR, no hardware". The chips matrix was empty regardless of what the peer actually had locally.
Sidebar: peer-hosted conversations never appeared. The catalog gossip channel was firing but nothing landed in peer.catalog.
Remote inference: routing the picker to a peer worked at the UI level, but the peer's handler 404'd with "no local LLM available" every time.
Settings sync: edited a tool permission on device A, it never appeared on device B even after multiple rounds. Same for prompts.
Late join: a peer that handshaked 30+s after the other was already up saw nothing — no catalog, no prompts, no permissions — until the existing device made an unrelated local edit.
Auto-gossip toggle: visually reverted to off on every settings panel reload.

Root cause

Eight gaps that the Phase B–D migration (#203) marked as "Phase C-6 wires this for real" or "Phase D" and never actually wired. The migration successfully replaced the Trystero transport layer but left the LLM-specific feature surface in a "shell" state.

1. Capabilities stripped by the daemon shoulder

myownmesh_core::protocol::CapabilityAdvert is {tags, app_version, max_connections, extra}. The LLM was pushing the structured Capabilities blob ({llms, asr, diarize, hardware, inputs, outputs, accepting, app_version, features}) directly — serde silently drops the unknown structured fields on deserialize. Peers saw {tags: [], app_version, max_connections: null, extra: null}. Cascade: nobody could route inference (canServeInference checks cap.llms.length), the transcribe peer picker was empty (canServeTranscribe checks cap.asr.length), the Connections card had nothing to chip.

Fix: pack the full Capabilities blob into CapabilityAdvert.extra before pushing. New peerCapabilitiesFromAdvert unpacks it on receive in daemonPeerToEntry, validating each field and falling back to empty defaults so a peer that didn't use the extra slot (older build) still renders cleanly.
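The pack/unpack round-trip described above could look roughly like the sketch below. The interfaces are simplified stand-ins rather than the real types: `LlmEntry`, `packCapabilities`, `isRecord`, `parseLlms` and `parseStrings` are hypothetical names, and the real `Capabilities` carries more fields than are modeled here. It also keeps the outer `app_version` over the inner copy, as the commit message describes.

```typescript
// Simplified, hypothetical shapes. The real Capabilities has more fields
// (diarize, hardware, inputs, outputs, features) and its entry types may differ.
interface LlmEntry {
  family: string;
  mode: string;
}

interface Capabilities {
  llms: LlmEntry[];
  asr: string[];
  accepting: boolean;
  app_version: string;
}

// Mirrors the four fields the PR lists for the daemon-side advert.
interface CapabilityAdvert {
  tags: string[];
  app_version: string;
  max_connections: number | null;
  extra: unknown; // opaque JSON as far as the daemon is concerned
}

// Hypothetical helper: tuck the whole structured blob into `extra`.
function packCapabilities(caps: Capabilities): CapabilityAdvert {
  return { tags: [], app_version: caps.app_version, max_connections: null, extra: caps };
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Keep only well-formed entries; anything else is dropped.
function parseLlms(v: unknown): LlmEntry[] {
  if (!Array.isArray(v)) return [];
  const out: LlmEntry[] = [];
  for (const item of v) {
    if (isRecord(item) && typeof item.family === "string" && typeof item.mode === "string") {
      out.push({ family: item.family, mode: item.mode });
    }
  }
  return out;
}

function parseStrings(v: unknown): string[] {
  return Array.isArray(v) ? v.filter((x): x is string => typeof x === "string") : [];
}

// Unpack on receive; a missing or malformed `extra` yields empty defaults.
function peerCapabilitiesFromAdvert(advert: CapabilityAdvert): Capabilities {
  const extra: Record<string, unknown> = isRecord(advert.extra) ? advert.extra : {};
  return {
    llms: parseLlms(extra.llms),
    asr: parseStrings(extra.asr),
    accepting: extra.accepting === true,
    app_version: advert.app_version, // outer field wins over the inner copy
  };
}
```

An advert from an older build, with `extra: null`, comes back as empty lists instead of throwing, which is what lets the Connections card render such a peer cleanly.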

2. Local inference handler 404'd every remote call

localCapabilitiesForHandler() hardcoded llms: [] — the mesh-inference.ts handler couldn't find a model match and returned streamRpcEnd("no local LLM available") for every inbound infer.

Fix: cache the last-pushed Capabilities in lastLocalCapabilities (populated by pushCapabilities); the handler now reads the live LLM list and picks by (family, mode) exactly the way the legacy mesh-client did.
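A minimal sketch of the cache-then-pick flow, assuming each advertised LLM entry carries a `family`, a `mode`, and a concrete model name. The `model` field and the `pickLocalModel` helper are hypothetical; only `lastLocalCapabilities`, `pushCapabilities` and `localCapabilitiesForHandler` are names from the PR.

```typescript
// Hypothetical entry shape; the real Capabilities type has more fields.
interface LlmEntry {
  family: string;
  mode: string;
  model: string; // assumed: the concrete local model name to run
}

interface Capabilities {
  llms: LlmEntry[];
}

// Populated on every push so the inbound handler sees the live list.
let lastLocalCapabilities: Capabilities | null = null;

function pushCapabilities(caps: Capabilities): void {
  lastLocalCapabilities = caps;
  // The real code would also send the advert to the daemon here.
}

function localCapabilitiesForHandler(): Capabilities {
  // Before the first push there is nothing to serve.
  return lastLocalCapabilities ?? { llms: [] };
}

// Hypothetical picker: first local entry matching (family, mode), else null.
function pickLocalModel(family: string, mode: string): string | null {
  const hit = localCapabilitiesForHandler().llms.find(
    (l) => l.family === family && l.mode === mode,
  );
  return hit ? hit.model : null;
}
```

With the old hardcoded `llms: []`, a picker like this could only ever return `null`, which matches the "no local LLM available" symptom.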

3. Local mutations never broadcast

agentPermissions.setBroadcaster(fn) and agentPrompts.setBroadcaster(fn) are the hooks each store fires on every persistPatch / persistList. The legacy client wired them; the new client never did. Editing a permission or saving a prompt was silent on the wire.

Fix: install both broadcasters in startImpl() and release them on stop() via the featureReleases array. Gated on autoGossipEnabled inside the callback.
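The install/release pattern might be sketched as follows. `Store` is a hypothetical stand-in for `agentPermissions` / `agentPrompts`, `sent` stands in for the actual publish call, and accepting `null` to clear a broadcaster is an assumption about the real `setBroadcaster` signature.

```typescript
type Broadcaster<T> = (snapshot: T) => void;

// Hypothetical stand-in for a store such as agentPermissions.
class Store<T> {
  private broadcaster: Broadcaster<T> | null = null;

  setBroadcaster(fn: Broadcaster<T> | null): void {
    this.broadcaster = fn;
  }

  // Called on every local edit; fires the broadcaster if one is installed.
  persist(snapshot: T): void {
    if (this.broadcaster) this.broadcaster(snapshot);
  }
}

class MeshClientSketch<T> {
  autoGossipEnabled = true;
  sent: T[] = []; // stands in for publishing on the wire
  private featureReleases: Array<() => void> = [];

  start(store: Store<T>): void {
    store.setBroadcaster((snap) => {
      // Gate inside the callback so toggling takes effect without re-wiring.
      if (!this.autoGossipEnabled) return;
      this.sent.push(snap);
    });
    this.featureReleases.push(() => store.setBroadcaster(null));
  }

  stop(): void {
    // Run and clear every release registered during start().
    for (const release of this.featureReleases.splice(0)) release();
  }
}
```

Checking `autoGossipEnabled` inside the callback, rather than at install time, means flipping the toggle mid-session needs no re-subscription.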

4. Permissions wire shape was wrong

publishPermissions was shipping the daemon's roster list ({authorized: [{device_id, label}], ts}) on the permissions/snapshot channel. The actual feature is per-tool agent gates (shell, write_file). Wrong data, on the right channel, with the right name — even if the merge had been wired, the incoming data would have been useless.

Prompts had the same problem at lower stakes — lossy-mapped each prompt to {id, label, body}, dropping tools, user_prompt, updated_at.

Fix: ship {tools: {shell, write_file}, ts} matching agentPermissions.mergeIncoming. Full Prompt shape (id, name, system_prompt, tools, user_prompt, updated_at) for prompts. New publishPermissionsSnapshot / publishPromptsSnapshot helpers ship pre-formed snapshots without re-reading config from disk on every mutation.
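For illustration, the corrected wire shapes could be typed as below. Field names come from the PR text; the field types are assumptions (gates modeled as booleans, `tools` as a string list, timestamps as numbers, and a `ts` on the prompts snapshot). `isPermissionsSnapshot` and `toPromptsSnapshot` are hypothetical helpers, not functions from the codebase.

```typescript
// Assumed types: the PR names the fields but not their value types.
interface PermissionsSnapshot {
  tools: { shell: boolean; write_file: boolean };
  ts: number;
}

interface Prompt {
  id: string;
  name: string;
  system_prompt: string;
  tools: string[];
  user_prompt: string;
  updated_at: number;
}

interface PromptsSnapshot {
  prompts: Prompt[];
  ts: number; // assumed to mirror the permissions snapshot
}

// Hypothetical guard: accepts the per-tool shape, rejects anything else,
// including the old roster-list payload ({authorized: [...], ts}).
function isPermissionsSnapshot(v: unknown): v is PermissionsSnapshot {
  if (typeof v !== "object" || v === null) return false;
  const snap = v as { tools?: unknown; ts?: unknown };
  if (typeof snap.ts !== "number") return false;
  if (typeof snap.tools !== "object" || snap.tools === null) return false;
  const tools = snap.tools as { shell?: unknown; write_file?: unknown };
  return typeof tools.shell === "boolean" && typeof tools.write_file === "boolean";
}

// Hypothetical builder: ships every Prompt field instead of a lossy subset.
function toPromptsSnapshot(prompts: Prompt[], now: number): PromptsSnapshot {
  return { prompts: prompts.map((p) => ({ ...p })), ts: now };
}
```

A receiver that validated inbound payloads this way would have rejected the roster-list shape outright instead of handing meaningless data to the merge.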

5. Inbound snapshots were logged, not merged

The subscribePermissions / subscribePrompts hooks ran appendDiag("info", "permissions snapshot from ...: N entries") and stopped. The actual mergeIncoming call was never made.

Fix: hooks now call agentPermissions.mergeIncoming(snap.tools, activeNetworkId) / agentPrompts.mergeIncoming(snap.prompts, activeNetworkId) and log only when the merge changed something. Gated on autoGossipEnabled (isolation contract).

New activeConfigNetworkId field tracks the LLM-side config id (distinct from this.network which is the wire-level network_id) so the merge scopes correctly to the right saved-network slot.
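A sketch of how a gated, network-scoped merge might behave. It assumes per-id last-writer-wins on `updated_at` for prompts; the internals of the real `mergeIncoming` are not shown in the PR. `MergeState`, `mergePrompts` and `onPromptsSnapshot` are hypothetical names.

```typescript
// Trimmed Prompt shape; only the fields the merge needs.
interface Prompt {
  id: string;
  name: string;
  updated_at: number;
}

interface MergeState {
  autoGossipEnabled: boolean;
  // Saved prompts per config-level network id.
  promptsByNetwork: Map<string, Map<string, Prompt>>;
}

// Per-id last-writer-wins; returns true only if something actually changed.
function mergePrompts(local: Map<string, Prompt>, incoming: Prompt[]): boolean {
  let changed = false;
  for (const p of incoming) {
    const current = local.get(p.id);
    if (!current || p.updated_at > current.updated_at) {
      local.set(p.id, p);
      changed = true;
    }
  }
  return changed;
}

function onPromptsSnapshot(
  state: MergeState,
  activeConfigNetworkId: string,
  incoming: Prompt[],
  log: (msg: string) => void,
): void {
  // Isolation contract: with gossip off, peers cannot mutate local state.
  if (!state.autoGossipEnabled) return;
  let slot = state.promptsByNetwork.get(activeConfigNetworkId);
  if (!slot) {
    slot = new Map<string, Prompt>();
    state.promptsByNetwork.set(activeConfigNetworkId, slot);
  }
  if (mergePrompts(slot, incoming)) {
    log(`prompts snapshot merged into ${activeConfigNetworkId}`);
  }
}
```

Keying the slot by the config-level id is what keeps a snapshot that arrives on one network from touching another network's saved prompts.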

6. Auto-gossip toggle reset to false every launch

setAutoGossip updated an in-memory autoGossipEnabled = false field. The UI binds to active?.auto_gossip from config, so the toggle visually reverted on every reloadFromConfig. The toggle was never persisted to disk. Hydration on start() was missing.

Fix: hydrate from activeNetwork(cfg)?.auto_gossip ?? true on start (matches the legacy default-on behavior). setAutoGossip persists via updateNetwork(active.id, { auto_gossip }).
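The hydrate-and-persist pair can be sketched like this. The config layout here (`active_id`, a flat `networks` array) is a simplified assumption, and `hydrateAutoGossip` is a hypothetical helper. The point is that `??` falls back to `true` only when the field is absent, so an explicit `false` survives a restart.

```typescript
// Simplified, assumed config layout for illustration only.
interface NetworkConfig {
  id: string;
  auto_gossip?: boolean;
}

interface Config {
  active_id: string;
  networks: NetworkConfig[];
}

function activeNetwork(cfg: Config): NetworkConfig | undefined {
  return cfg.networks.find((n) => n.id === cfg.active_id);
}

// On start: default-on when the field has never been written.
function hydrateAutoGossip(cfg: Config): boolean {
  return activeNetwork(cfg)?.auto_gossip ?? true;
}

// On toggle: write the value into the active network's saved slot.
function setAutoGossip(cfg: Config, enabled: boolean): Config {
  const active = activeNetwork(cfg);
  if (!active) return cfg;
  return {
    ...cfg,
    networks: cfg.networks.map((n) =>
      n.id === active.id ? { ...n, auto_gossip: enabled } : n,
    ),
  };
}
```

Using `||` instead of `??` here would turn a saved `false` back into `true` on every launch, so the choice of operator matters.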

7. No periodic refresh + no late-joiner replay

The daemon's typed channels don't replay past publishes. A peer joining 30s after the existing device's start() would see an empty peer.catalog, no prompts, no permissions until the existing device made an unrelated local edit.

Fix: 60s setInterval re-publishes catalog (+ gossip-gated perms/prompts). A shipCatchUpGossipToNewlyActive() path fires from reconcile() — newly-active peers get a one-shot catch-up broadcast; tracked in a gossipedOnceTo set that prunes stale entries (a flap active → shelved → active gets the catch-up again). Initial active peers at start() time are seeded so the first reconcile() doesn't duplicate the initial broadcast.
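The once-per-newly-active-peer bookkeeping might be sketched as below. `CatchUpTracker` and its method names are hypothetical; only the `gossipedOnceTo` set and the seed / prune / catch-up behavior come from the PR text. `reconcile` returns the peers that should receive a catch-up broadcast.

```typescript
class CatchUpTracker {
  private gossipedOnceTo = new Set<string>();

  // Peers already active at start() are covered by the initial broadcast.
  seed(activePeerIds: string[]): void {
    activePeerIds.forEach((id) => this.gossipedOnceTo.add(id));
  }

  // Returns the peers that became active since the last call.
  reconcile(activePeerIds: string[]): string[] {
    const active = new Set(activePeerIds);

    // Prune peers that are no longer active, so a later re-activation
    // (active -> shelved -> active) counts as new again.
    Array.from(this.gossipedOnceTo).forEach((id) => {
      if (!active.has(id)) this.gossipedOnceTo.delete(id);
    });

    const newlyActive: string[] = [];
    activePeerIds.forEach((id) => {
      if (!this.gossipedOnceTo.has(id)) {
        this.gossipedOnceTo.add(id);
        newlyActive.push(id);
      }
    });
    return newlyActive;
  }
}
```

A caller would publish the catalog (plus gossip-gated permissions and prompts) whenever `reconcile` returns a non-empty list.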

8. `noteCatalogChanged` fired one publish per mutation

App-side bulk operations (folder move-N-files, multi-rename) call refreshConversations() → noteCatalogChanged(). Without debounce, a 20-file move = 20 catalog broadcasts.

Fix: 500ms setTimeout coalesce, single broadcast at the trailing edge.
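One way to get a single trailing-edge broadcast is a resettable timer, sketched below. Whether the real `noteCatalogChanged` resets the timer on each call or fires a fixed 500ms after the first call is not stated in the PR; this sketch resets. `makeCatalogNotifier` and the injectable `Timers` interface are hypothetical and exist only to make the behavior testable without waiting.

```typescript
type TimerHandle = unknown;

// Injectable timer interface so the debounce can be exercised synchronously.
interface Timers {
  set(fn: () => void, ms: number): TimerHandle;
  clear(handle: TimerHandle): void;
}

const realTimers: Timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// Returns a noteCatalogChanged-style function that coalesces bursts.
function makeCatalogNotifier(
  publish: () => void,
  delayMs: number = 500,
  timers: Timers = realTimers,
): () => void {
  let pending: TimerHandle | null = null;
  return function noteCatalogChanged(): void {
    // Each call cancels the previous timer, so only the last one fires.
    if (pending !== null) timers.clear(pending);
    pending = timers.set(() => {
      pending = null;
      publish();
    }, delayMs);
  };
}
```

With this shape, a 20-file move calls the notifier 20 times but publishes once, after the burst goes quiet.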

Files

src/mesh-daemon.svelte.ts (+368 / -37): peerCapabilitiesFromAdvert helper, pack/unpack on pushCapabilities/daemonPeerToEntry, lastLocalCapabilities cache feeding localCapabilitiesForHandler, broadcaster install/release, inbound merge hooks, activeConfigNetworkId tracking, autoGossipEnabled hydration + persistence in setAutoGossip, 60s periodic refresh interval, shipCatchUpGossipToNewlyActive path, 500ms noteCatalogChanged debounce.
src/mesh-gossip.ts (+68 / -36): corrected permissions wire shape ({tools: {shell, write_file}, ts}), full Prompt[] shape for prompts, publishPermissionsSnapshot / publishPromptsSnapshot variants for the broadcaster callbacks, dropped the obsolete roster-list flow.

Validation

Notes

The extra-field shape is opaque to the daemon by design (serde_json::Value) — it ships verbatim. Older LLM builds that haven't started using the extra slot will render as "empty capabilities" on newer builds; once both ends are on this PR the round-trip is symmetric.
The catalog refresh interval is 60s to match legacy. The catch-up gossip on becoming-active fires sooner — usually within the reconcile-after-peer-event window.
Rust side untouched. cargo check can't run in the sandbox (missing gdk-3.0); the Rust IPC surface from #203 (mesh: migrate to myownmesh daemon, phases A–D) is already adequate.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk

Generated by Claude Code

After the Phase B–D daemon migration (PR #203/#205) the LLM was joined to the mesh but the network-feature surface (remote inference, hardware advertisement, settings sync, late-joiner catch-up) wasn't actually working. Eight gaps nominally landed as "Phase C-6 / D" in #203 but in practice were left as TODOs.

**1. Capabilities stripped by the daemon shoulder.** The daemon's `CapabilityAdvert` is `{tags, app_version, max_connections, extra}`. The LLM was pushing the structured `Capabilities` blob (`{llms, asr, diarize, hardware, inputs, outputs, accepting, app_version, features}`) directly, which serde silently dropped on deserialize. Peers always saw each other as "no LLMs / no ASR / no hardware", which broke every piece of LLM-side capability-keyed routing (remote inference peer picker, transcribe peer picker, the LLM/ASR chips in Connections). Fix: pack the full `Capabilities` into `CapabilityAdvert.extra` before pushing; unpack in `daemonPeerToEntry` via a new `peerCapabilitiesFromAdvert` helper that validates each field and falls back to empty defaults. `CapabilityAdvert.app_version` takes precedence over the inner copy since the daemon promotes that field in `hello` for cosmetic display.

**2. Local inference handler 404'd every remote call.** `localCapabilitiesForHandler()` hard-returned `llms: []` (marked as "Phase C-6 wires this for real"), so even when a peer routed inference to us we hit `streamRpcEnd("no local LLM available")` and never reached Ollama. Fix: cache the last-pushed `Capabilities` in `lastLocalCapabilities` (populated by `pushCapabilities`); the handler now sees the live LLM list and can pick a model by (family, mode) exactly the way the legacy mesh-client did.

**3. Local mutations never broadcast.** `agentPermissions.setBroadcaster(...)` and `agentPrompts.setBroadcaster(...)` are the hooks both stores fire on every local edit (`persistPatch` / `persistList`). The legacy client wired them; the new client never did, so editing a tool permission or saving a prompt was silent on the wire. Fix: install both broadcasters in `startImpl()` and release them via the `featureReleases` array on `stop()`. Both are gated on `autoGossipEnabled` inside the callback so the network's isolation contract (auto-gossip off → no outbound) holds.

**4. Permissions wire shape was wrong.** `publishPermissions` was shipping the daemon's *roster list* (`{authorized: [{device_id, label}], ts}`) on the `permissions/snapshot` channel, which is meaningless for the actual feature: per-tool agent gates (shell, write_file). Even if the merge had been wired, the incoming data would have been useless. Fix: ship `{tools: {shell, write_file}, ts}` matching the shape `agentPermissions.mergeIncoming` consumes. New `publishPermissionsSnapshot(client, snap)` helper lets the `setBroadcaster` callback ship a pre-formed snapshot without re-reading config from disk on every mutation. Prompts had the same problem at lower stakes: `publishPrompts` was lossy-mapping each prompt to `{id, label, body}`, dropping `tools`, `user_prompt`, and `updated_at`. Now ships the full `Prompt` shape so `agentPrompts.mergeIncoming` can do per-id LWW correctly.

**5. Inbound snapshots were logged, not merged.** The `subscribePermissions` / `subscribePrompts` hooks fired `appendDiag("info", "permissions snapshot from ...: N entries")` and stopped. The actual merge into `agentPermissions` / `agentPrompts` (which is what makes a peer's edit visible locally) was never called. Fix: hooks now call `agentPermissions.mergeIncoming(snap.tools, activeNetworkId)` / `agentPrompts.mergeIncoming(snap.prompts, activeNetworkId)` and log only when the merge actually changed something. Gated on `autoGossipEnabled` (isolation contract: when gossip is off, peer pressure can't mutate our policy). New `activeConfigNetworkId` field tracks the LLM-side config id (distinct from `this.network`, which is the wire-level `network_id`) so the merge scopes correctly: a snapshot arriving on network A doesn't accidentally overwrite network B's saved policy.

**6. Auto-gossip toggle reset to false every launch.** `setAutoGossip` updated an in-memory `autoGossipEnabled = false` field; the UI binds to `active?.auto_gossip` from config (so the toggle visually reverted on every `reloadFromConfig`); the toggle was never persisted. The hydration on `start()` was missing too, so even users who'd previously enabled gossip saw it off after restart. Fix: hydrate `autoGossipEnabled` from `activeNetwork(cfg)?.auto_gossip ?? true` on start (matches the legacy default). `setAutoGossip` persists via `updateNetwork(active.id, { auto_gossip })`. Toggle now sticks across restarts.

**7. No periodic refresh + no late-joiner replay.** The daemon's typed channels don't replay past publishes, so a peer who handshakes 30s after our initial publish sees an empty `peer.catalog`, no prompts, no permissions until our next local mutation. The legacy client ran a 60s catalog refresh tick + a once-per-newly-active-peer catch-up broadcast; both were missing. Fix: 60s `setInterval` re-publishing catalog (+ gossip-gated perms/prompts). A `shipCatchUpGossipToNewlyActive()` hook fires from `reconcile()` whenever the peer snapshot changes. Newly active peers get a one-shot catch-up broadcast, tracked in a `gossipedOnceTo` set that prunes stale entries (so a flap active → shelved → active gets the catch-up again). Initial peers (active at start time) get seeded into `gossipedOnceTo` so the initial broadcast on `start()` isn't duplicated by the first `reconcile()`.

**8. `noteCatalogChanged` fired one publish per mutation.** App-side bulk operations (folder move-N-files, multi-rename) each call `refreshConversations()`, which calls `noteCatalogChanged()`. Without debounce, a 20-file move = 20 catalog broadcasts. Fix: 500ms `setTimeout` coalesce in `noteCatalogChanged`, the same shape as the legacy client. Single broadcast at the trailing edge of the burst.

---

Files:

- `src/mesh-daemon.svelte.ts`: +368 / -37. New helper (`peerCapabilitiesFromAdvert`), pack/unpack wiring on `pushCapabilities`/`daemonPeerToEntry`, `lastLocalCapabilities` cache feeding `localCapabilitiesForHandler`, broadcaster wiring + release, inbound merge hooks, `activeConfigNetworkId` field, autoGossipEnabled hydration + persistence, periodic refresh interval, catch-up gossip path, catalog debounce.
- `src/mesh-gossip.ts`: +68 / -36. Fixed permissions wire shape (`{tools}` not roster), full Prompt[] in prompts wire, `publishPermissionsSnapshot` / `publishPromptsSnapshot` variants for `setBroadcaster` callers, dropped the obsolete roster-list flow.

**Validation:**

- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `pnpm run build`: clean.
- Rust unchanged. The Tauri build env (gdk-3.0) isn't installed in the sandbox so `cargo check` can't run; no `.rs` files touched.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk

…on (#207)

The migration off Trystero onto the standalone myownmesh daemon (PRs #201 / #203 / #204 / #205 / #206) shipped the code but left every doc still describing the world before the move:

- README claimed mesh discovery went "via Trystero over public Nostr relays" and that agent permissions persisted under `Config.agent_permissions.by_device[<device_id>]`.
- ARCHITECTURE.md's mesh-module section described `mesh-client.svelte.ts` (deleted), Trystero room ownership (gone), and a TS module table that didn't list any of the files Phase C–D actually shipped (`mesh-daemon.svelte.ts`, `mesh-gossip.ts`, `mesh-inference.ts`, `mesh-file.ts`, `mesh-move.ts`, `mesh-transcribe.ts`, `mesh-governance.ts`).
- CONNECTION-ENGINE.md was a 535-line spec for the 4-layer connection engine that no longer lives in this repo. Every paragraph referenced `src/mesh-client.svelte.ts` or `mesh-scheduler-worker.ts`, neither of which exists.
- DOCS.md's Cloud Mesh section walked the user through Trystero rooms, the legacy on-the-wire `MeshMessage` JSON envelope (`infer_request` / `infer_chunk` / `move_offer` / `file_offer`), and a config example missing every field the per-network schema gained (`label`, `kind`, `topology`, `auto_approve`, `auto_gossip`, `agent_permissions`, `prompts`).
- PROGRESS.md was a historical bug-fix doc for a Trystero subscription-state quirk that no longer applies, since the engine isn't here anymore.

What this commit changes:

**README.md**: replace Trystero claim with the bundled `myownmesh` daemon model; correct the agent-permissions storage path to the per-network shape (`Config.cloud_mesh.networks[*].agent_permissions`) and mention the `auto_gossip` gate.

**ARCHITECTURE.md**: rewrite the one-picture diagram to show the daemon sidecar alongside Ollama; rewrite the mesh intro paragraph; rewrite the `mesh/` Rust module row to describe `daemon.rs`, `daemon_commands.rs`, the detect-and-share socket order, and the relationship to `myownmesh_core`; rewrite the TS module table to list every `mesh-*.ts` file actually in the tree with its current role; refresh the CloudMesh sub-tab inventory (Status / Settings / Connections / Graph / Governance / Activity / HTTP); refresh the persistence section to show `daemon.sock` + the per-network config layout.

**CONNECTION-ENGINE.md**: rewrite as a short pointer. The 4-layer engine + 7-tier reconnect ladder live in MyOwnMesh now; this doc explains what the LLM still owns on top (the layer-4 LLM-specific protocol), how the LLM talks to the daemon (detect-and-share IPC), and lists the LLM-side RPC methods + typed channels currently in use (`infer`, `transcribe`, `file_offer` / `file_send` + `file_chunks/<id>`, `session_*` / `move_*`, `catalog/announce`, `permissions/snapshot`, `prompts/snapshot`).

**DOCS.md Cloud Mesh section**: replace the Trystero transport paragraph with the daemon's detect-and-share model; refresh every What-the-mesh-does-for-you row to match current behavior (click-to-open, click-through Pull, file transfer wire shape, permissions+prompts gossip with the auto_gossip gate, Graph view, Governance view, no Phase-1/Phase-2 split); replace the JSON-over-data-channel wire-protocol box with the daemon RPC + typed-channel surface; refresh the example config to include `label`, `kind`, `topology`, `auto_approve`, `auto_gossip`, `agent_permissions`, `prompts`.

**PROGRESS.md**: deleted. The Trystero subscription-state bug it documents doesn't apply post-daemon. Two `// see PROGRESS.md` breadcrumbs in `src-tauri/src/asr/mod.rs` and `src-tauri/src/diarize/cluster.rs` updated to free-standing explanations.

Validation:

- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `grep -rn "Trystero\|trystero\|mesh-client\.svelte" --include="*.md" .` returns nothing.
- `grep -rn "PROGRESS.md" .` returns nothing.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk

Co-authored-by: Claude <noreply@anthropic.com>

mrjeeves merged commit fd268fa into main May 28, 2026
4 checks passed

mrjeeves deleted the claude/charming-turing-PMIK8 branch May 28, 2026 06:25

mrjeeves mentioned this pull request May 28, 2026

docs: refresh README, ARCHITECTURE, DOCS, CONNECTION-ENGINE post-daemon #207

Merged

6 tasks

mrjeeves merged 1 commit into main from claude/charming-turing-PMIK8
