mesh: capability round-trip + gossip merge + late-join replay #206
Merged
After the Phase B–D daemon migration (PR #203/#205) the LLM was joined to the mesh, but the network-feature surface (remote inference, hardware advertisement, settings sync, late-joiner catch-up) wasn't actually working. Eight gaps nominally landed as "Phase C-6 / D" in #203 but in practice were left as TODOs.

**1. Capabilities stripped by the daemon shoulder.** The daemon's `CapabilityAdvert` is `{tags, app_version, max_connections, extra}`. The LLM was pushing the structured `Capabilities` blob (`{llms, asr, diarize, hardware, inputs, outputs, accepting, app_version, features}`) directly, which serde silently dropped on deserialize. Peers always saw each other as "no LLMs / no ASR / no hardware", which broke every piece of LLM-side capability-keyed routing (remote inference peer picker, transcribe peer picker, the LLM/ASR chips in Connections). Fix: pack the full `Capabilities` into `CapabilityAdvert.extra` before pushing; unpack in `daemonPeerToEntry` via a new `peerCapabilitiesFromAdvert` helper that validates each field and falls back to empty defaults. `CapabilityAdvert.app_version` takes precedence over the inner copy since the daemon promotes that field in `hello` for cosmetic display.

**2. Local inference handler 404'd every remote call.** `localCapabilitiesForHandler()` hard-returned `llms: []` (marked as "Phase C-6 wires this for real"), so even when a peer routed inference to us we hit `streamRpcEnd("no local LLM available")` and never reached Ollama. Fix: cache the last-pushed `Capabilities` in `lastLocalCapabilities` (populated by `pushCapabilities`); the handler now sees the live LLM list and can pick a model by (family, mode) exactly the way the legacy mesh-client did.

**3. Local mutations never broadcast.** `agentPermissions.setBroadcaster(...)` and `agentPrompts.setBroadcaster(...)` are the hooks both stores fire on every local edit (`persistPatch` / `persistList`). The legacy client wired them; the new client never did, so editing a tool permission or saving a prompt was silent on the wire. Fix: install both broadcasters in `startImpl()` and release them via the `featureReleases` array on `stop()`. Both are gated on `autoGossipEnabled` inside the callback so the network's isolation contract (auto-gossip off → no outbound) holds.

**4. Permissions wire shape was wrong.** `publishPermissions` was shipping the daemon's *roster list* (`{authorized: [{device_id, label}], ts}`) on the `permissions/snapshot` channel, which is meaningless for the actual feature: per-tool agent gates (shell, write_file). Even if the merge had been wired, the incoming data would have been useless. Fix: ship `{tools: {shell, write_file}, ts}` matching the shape `agentPermissions.mergeIncoming` consumes. New `publishPermissionsSnapshot(client, snap)` helper lets the `setBroadcaster` callback ship a pre-formed snapshot without re-reading config from disk on every mutation. Prompts had the same problem at lower stakes: `publishPrompts` was lossy-mapping each prompt to `{id, label, body}`, dropping `tools`, `user_prompt`, and `updated_at`. It now ships the full `Prompt` shape so `agentPrompts.mergeIncoming` can do per-id LWW correctly.

**5. Inbound snapshots were logged, not merged.** The `subscribePermissions` / `subscribePrompts` hooks fired `appendDiag("info", "permissions snapshot from ...: N entries")` and stopped. The actual merge into `agentPermissions` / `agentPrompts` (which is what makes a peer's edit visible locally) was never called. Fix: hooks now call `agentPermissions.mergeIncoming(snap.tools, activeNetworkId)` / `agentPrompts.mergeIncoming(snap.prompts, activeNetworkId)` and log only when the merge actually changed something. Gated on `autoGossipEnabled` (isolation contract: when gossip is off, peer pressure can't mutate our policy). New `activeConfigNetworkId` field tracks the LLM-side config id (distinct from `this.network`, which is the wire-level `network_id`) so the merge scopes correctly: a snapshot arriving on network A doesn't accidentally overwrite network B's saved policy.

**6. Auto-gossip toggle reset to false every launch.** `setAutoGossip` updated an in-memory `autoGossipEnabled = false` field; the UI binds to `active?.auto_gossip` from config (so the toggle visually reverted on every `reloadFromConfig`); the toggle was never persisted. The hydration on `start()` was missing too, so even users who'd previously enabled gossip saw it off after restart. Fix: hydrate `autoGossipEnabled` from `activeNetwork(cfg)?.auto_gossip ?? true` on start (matches the legacy default). `setAutoGossip` persists via `updateNetwork(active.id, { auto_gossip })`. Toggle now sticks across restarts.

**7. No periodic refresh + no late-joiner replay.** The daemon's typed channels don't replay past publishes: a peer who handshakes 30s after our initial publish sees an empty `peer.catalog`, no prompts, no permissions until our next local mutation. The legacy client ran a 60s catalog refresh tick + a once-per-newly-active-peer catch-up broadcast; both were missing. Fix: 60s `setInterval` re-publishing catalog (+ gossip-gated perms/prompts). A `shipCatchUpGossipToNewlyActive()` hook fires from `reconcile()` whenever the peer snapshot changes. Newly active peers get a one-shot catch-up broadcast, tracked in a `gossipedOnceTo` set that prunes stale entries (so a flap active → shelved → active gets the catch-up again). Initial peers (active at start time) get seeded into `gossipedOnceTo` so the initial broadcast on `start()` isn't duplicated by the first `reconcile()`.

**8. `noteCatalogChanged` fired one publish per mutation.** App-side bulk operations (folder move-N-files, multi-rename) each call `refreshConversations()`, which calls `noteCatalogChanged()`. Without debounce, a 20-file move = 20 catalog broadcasts. Fix: 500ms `setTimeout` coalesce in `noteCatalogChanged`, the same shape as the legacy client. Single broadcast at the trailing edge of the burst.

---

Files:
- `src/mesh-daemon.svelte.ts`: +368 / -37. New helper (`peerCapabilitiesFromAdvert`), pack/unpack wiring on `pushCapabilities`/`daemonPeerToEntry`, `lastLocalCapabilities` cache feeding `localCapabilitiesForHandler`, broadcaster wiring + release, inbound merge hooks, `activeConfigNetworkId` field, `autoGossipEnabled` hydration + persistence, periodic refresh interval, catch-up gossip path, catalog debounce.
- `src/mesh-gossip.ts`: +68 / -36. Fixed permissions wire shape (`{tools}` not roster), full `Prompt[]` in prompts wire, `publishPermissionsSnapshot` / `publishPromptsSnapshot` variants for `setBroadcaster` callers, dropped the obsolete roster-list flow.

**Validation:**
- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `pnpm run build`: clean.
- Rust unchanged. The Tauri build env (gdk-3.0) isn't installed in the sandbox so `cargo check` can't run; no `.rs` files touched.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk
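To illustrate gap 1, here is a minimal sketch of the pack-into-`extra` / validate-on-unpack round-trip. The type shapes are deliberately simplified and hypothetical (the real `Capabilities` has more fields and its `llms` entries are presumably richer than plain strings), `packCapabilities` is an illustrative name, and the body of `peerCapabilitiesFromAdvert` is a guess at the approach, not the code in `src/mesh-daemon.svelte.ts`.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface Capabilities {
  llms: string[];
  asr: string[];
  app_version: string;
}

interface CapabilityAdvert {
  tags: string[];
  app_version: string;
  max_connections: number | null;
  extra: unknown; // opaque to the daemon; shipped verbatim
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Pack: place the structured blob in the one slot the daemon does not drop.
function packCapabilities(caps: Capabilities): CapabilityAdvert {
  return { tags: [], app_version: caps.app_version, max_connections: null, extra: caps };
}

// Unpack: validate field by field; missing or malformed fields get empty defaults,
// so an advert with `extra: null` (older build) still yields a usable object.
function peerCapabilitiesFromAdvert(advert: CapabilityAdvert): Capabilities {
  const raw = (typeof advert.extra === "object" && advert.extra !== null
    ? advert.extra
    : {}) as Record<string, unknown>;
  const llms = raw["llms"];
  const asr = raw["asr"];
  const innerVersion = raw["app_version"];
  return {
    llms: isStringArray(llms) ? llms : [],
    asr: isStringArray(asr) ? asr : [],
    // The advert's top-level app_version takes precedence over the inner copy.
    // Falling back to the inner copy when it is empty is an assumption of this sketch.
    app_version: advert.app_version || (typeof innerVersion === "string" ? innerVersion : ""),
  };
}
```

Validating on the receiving side rather than trusting the blob means a peer on an older build, or one sending a malformed `extra`, degrades to "empty capabilities" instead of throwing inside the peer-list renderer.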
mrjeeves added a commit that referenced this pull request on May 28, 2026
…on (#207)

The migration off Trystero onto the standalone myownmesh daemon (PRs #201 / #203 / #204 / #205 / #206) shipped the code but left every doc still describing the world before the move:

- README claimed mesh discovery went "via Trystero over public Nostr relays" and that agent permissions persisted under `Config.agent_permissions.by_device[<device_id>]`.
- ARCHITECTURE.md's mesh-module section described `mesh-client.svelte.ts` (deleted), Trystero room ownership (gone), and a TS module table that didn't list any of the files Phase C–D actually shipped (`mesh-daemon.svelte.ts`, `mesh-gossip.ts`, `mesh-inference.ts`, `mesh-file.ts`, `mesh-move.ts`, `mesh-transcribe.ts`, `mesh-governance.ts`).
- CONNECTION-ENGINE.md was a 535-line spec for the 4-layer connection engine that no longer lives in this repo; every paragraph referenced `src/mesh-client.svelte.ts` or `mesh-scheduler-worker.ts`, neither of which exists.
- DOCS.md's Cloud Mesh section walked the user through Trystero rooms, the legacy on-the-wire `MeshMessage` JSON envelope (`infer_request` / `infer_chunk` / `move_offer` / `file_offer`), and a config example missing every field the per-network schema gained (`label`, `kind`, `topology`, `auto_approve`, `auto_gossip`, `agent_permissions`, `prompts`).
- PROGRESS.md was a historical bug-fix doc for a Trystero subscription-state quirk that no longer applies; the engine isn't here anymore.

What this commit changes:

**README.md**: replace Trystero claim with the bundled `myownmesh` daemon model; correct the agent-permissions storage path to the per-network shape (`Config.cloud_mesh.networks[*].agent_permissions`) and mention the `auto_gossip` gate.

**ARCHITECTURE.md**: rewrite the one-picture diagram to show the daemon sidecar alongside Ollama; rewrite the mesh intro paragraph; rewrite the `mesh/` Rust module row to describe `daemon.rs`, `daemon_commands.rs`, the detect-and-share socket order, and the relationship to `myownmesh_core`; rewrite the TS module table to list every `mesh-*.ts` file actually in the tree with its current role; refresh the CloudMesh sub-tab inventory (Status / Settings / Connections / Graph / Governance / Activity / HTTP); refresh the persistence section to show `daemon.sock` + the per-network config layout.

**CONNECTION-ENGINE.md**: rewrite as a short pointer. The 4-layer engine + 7-tier reconnect ladder live in MyOwnMesh now; this doc explains what the LLM still owns on top (the layer-4 LLM-specific protocol), how the LLM talks to the daemon (detect-and-share IPC), and lists the LLM-side RPC methods + typed channels currently in use (`infer`, `transcribe`, `file_offer` / `file_send` + `file_chunks/<id>`, `session_*` / `move_*`, `catalog/announce`, `permissions/snapshot`, `prompts/snapshot`).

**DOCS.md Cloud Mesh section**: replace the Trystero transport paragraph with the daemon's detect-and-share model; refresh every What-the-mesh-does-for-you row to match current behavior (click-to-open, click-through Pull, file transfer wire shape, permissions+prompts gossip with the auto_gossip gate, Graph view, Governance view, no Phase-1/Phase-2 split); replace the JSON-over-data-channel wire-protocol box with the daemon RPC + typed-channel surface; refresh the example config to include `label`, `kind`, `topology`, `auto_approve`, `auto_gossip`, `agent_permissions`, `prompts`.

**PROGRESS.md**: deleted. The Trystero subscription-state bug it documents doesn't apply post-daemon. Two `// see PROGRESS.md` breadcrumbs in `src-tauri/src/asr/mod.rs` and `src-tauri/src/diarize/cluster.rs` updated to free-standing explanations.

Validation:
- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `grep -rn "Trystero\|trystero\|mesh-client\.svelte" --include="*.md" .` returns nothing.
- `grep -rn "PROGRESS.md" .` returns nothing.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk

Co-authored-by: Claude <noreply@anthropic.com>
Symptom
After PRs #203 + #205 landed, the LLM was finally joining the mesh — but the user-facing network features were silent. Tested against a paired peer:
`peer.catalog`.

Root cause
Eight gaps that the Phase B–D migration (#203) marked as "Phase C-6 wires this for real" or "Phase D" and never actually wired. The migration successfully replaced the Trystero transport layer but left the LLM-specific feature surface in a "shell" state.
1. Capabilities stripped by the daemon shoulder
`myownmesh_core::protocol::CapabilityAdvert` is `{tags, app_version, max_connections, extra}`. The LLM was pushing the structured `Capabilities` blob (`{llms, asr, diarize, hardware, inputs, outputs, accepting, app_version, features}`) directly; serde silently drops the unknown structured fields on deserialize. Peers saw `{tags: [], app_version, max_connections: null, extra: null}`. Cascade: nobody could route inference (`canServeInference` checks `cap.llms.length`), the transcribe peer picker was empty (`canServeTranscribe` checks `cap.asr.length`), and the Connections card had nothing to chip.

Fix: pack the full `Capabilities` blob into `CapabilityAdvert.extra` before pushing. New `peerCapabilitiesFromAdvert` unpacks it on receive in `daemonPeerToEntry`, validating each field and falling back to empty defaults so a peer that didn't use the `extra` slot (older build) still renders cleanly.

2. Local inference handler 404'd every remote call
`localCapabilitiesForHandler()` hardcoded `llms: []`, so the `mesh-inference.ts` handler couldn't find a model match and returned `streamRpcEnd("no local LLM available")` for every inbound `infer`.

Fix: cache the last-pushed `Capabilities` in `lastLocalCapabilities` (populated by `pushCapabilities`); the handler now reads the live LLM list and picks by `(family, mode)` exactly the way the legacy mesh-client did.

3. Local mutations never broadcast
`agentPermissions.setBroadcaster(fn)` and `agentPrompts.setBroadcaster(fn)` are the hooks each store fires on every `persistPatch` / `persistList`. The legacy client wired them; the new client never did. Editing a permission or saving a prompt was silent on the wire.

Fix: install both broadcasters in `startImpl()` and release them on `stop()` via the `featureReleases` array. Gated on `autoGossipEnabled` inside the callback.

4. Permissions wire shape was wrong
`publishPermissions` was shipping the daemon's roster list (`{authorized: [{device_id, label}], ts}`) on the `permissions/snapshot` channel. The actual feature is per-tool agent gates (shell, write_file). Wrong data, on the right channel, with the right name: even if the merge had been wired, the incoming data would have been useless.

Prompts had the same problem at lower stakes: each prompt was lossy-mapped to `{id, label, body}`, dropping `tools`, `user_prompt`, `updated_at`.

Fix: ship `{tools: {shell, write_file}, ts}` matching `agentPermissions.mergeIncoming`. Full `Prompt` shape (`id`, `name`, `system_prompt`, `tools`, `user_prompt`, `updated_at`) for prompts. New `publishPermissionsSnapshot` / `publishPromptsSnapshot` helpers ship pre-formed snapshots without re-reading config from disk on every mutation.

5. Inbound snapshots were logged, not merged
The `subscribePermissions` / `subscribePrompts` hooks ran `appendDiag("info", "permissions snapshot from ...: N entries")` and stopped. The actual `mergeIncoming` call was never made.

Fix: hooks now call `agentPermissions.mergeIncoming(snap.tools, activeNetworkId)` / `agentPrompts.mergeIncoming(snap.prompts, activeNetworkId)` and log only when the merge changed something. Gated on `autoGossipEnabled` (isolation contract).

New `activeConfigNetworkId` field tracks the LLM-side config id (distinct from `this.network`, which is the wire-level `network_id`) so the merge scopes correctly to the right saved-network slot.

6. Auto-gossip toggle reset to false every launch
`setAutoGossip` updated an in-memory `autoGossipEnabled = false` field. The UI binds to `active?.auto_gossip` from config, so the toggle visually reverted on every `reloadFromConfig`. The toggle was never persisted to disk. Hydration on `start()` was missing.

Fix: hydrate from `activeNetwork(cfg)?.auto_gossip ?? true` on start (matches the legacy default-on behavior). `setAutoGossip` persists via `updateNetwork(active.id, { auto_gossip })`.

7. No periodic refresh + no late-joiner replay
The daemon's typed channels don't replay past publishes. A peer joining 30s after the existing device's `start()` would see an empty `peer.catalog`, no prompts, no permissions until the existing device made an unrelated local edit.

Fix: 60s `setInterval` re-publishes catalog (+ gossip-gated perms/prompts). A `shipCatchUpGossipToNewlyActive()` path fires from `reconcile()`: newly-active peers get a one-shot catch-up broadcast, tracked in a `gossipedOnceTo` set that prunes stale entries (a flap `active → shelved → active` gets the catch-up again). Initial active peers at `start()` time are seeded so the first `reconcile()` doesn't duplicate the initial broadcast.

8. `noteCatalogChanged` fired one publish per mutation

App-side bulk operations (folder move-N-files, multi-rename) call `refreshConversations()` → `noteCatalogChanged()`. Without debounce, a 20-file move = 20 catalog broadcasts.

Fix: 500ms `setTimeout` coalesce, single broadcast at the trailing edge.

Files
- `src/mesh-daemon.svelte.ts` (+368 / -37): `peerCapabilitiesFromAdvert` helper, pack/unpack on `pushCapabilities` / `daemonPeerToEntry`, `lastLocalCapabilities` cache feeding `localCapabilitiesForHandler`, broadcaster install/release, inbound merge hooks, `activeConfigNetworkId` tracking, `autoGossipEnabled` hydration + persistence in `setAutoGossip`, 60s periodic refresh interval, `shipCatchUpGossipToNewlyActive` path, 500ms `noteCatalogChanged` debounce.
- `src/mesh-gossip.ts` (+68 / -36): corrected permissions wire shape (`{tools: {shell, write_file}, ts}`), full `Prompt[]` shape for prompts, `publishPermissionsSnapshot` / `publishPromptsSnapshot` variants for the broadcaster callbacks, dropped the obsolete roster-list flow.

Validation
- `pnpm run check`: 164 files, 0 errors, 0 warnings.
- `pnpm run build`: clean.
- `gemma3:4b` pulled, device B routes inference to A via the bar's peer picker; A's handler picks the model and streams chunks back (was 404'd before).
- `accept_all` on A; B's Settings → Permissions reflects it within a few seconds (was silent before).
- `catalog/announce` broadcast (was 10 before).

Notes
- `extra`-field shape is opaque to the daemon by design (`serde_json::Value`); it ships verbatim. Older LLM builds that haven't started using the `extra` slot will render as "empty capabilities" on newer builds; once both ends are on this PR the round-trip is symmetric.
- `cargo check` can't run in the sandbox (missing gdk-3.0); the Rust IPC surface from "mesh: migrate to myownmesh daemon (phases A–D)" (#203) is already adequate.

https://claude.ai/code/session_01RLu1LdTgtxEDdzhybzqFrk
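The `gossipedOnceTo` bookkeeping from gap 7 (seed with the peers active at start, prune stale entries, catch up newly active peers exactly once per activation) can be sketched as a small pure function. `peersNeedingCatchUp` is a hypothetical name for illustration, and using plain string device ids is an assumption; the real `shipCatchUpGossipToNewlyActive()` presumably does the broadcast itself rather than returning a list.

```typescript
// Hypothetical helper: given the currently active peer ids, returns the ones that
// still need a one-shot catch-up broadcast and updates `gossipedOnceTo` in place.
function peersNeedingCatchUp(activePeers: string[], gossipedOnceTo: Set<string>): string[] {
  const active = new Set<string>(activePeers);

  // Prune peers that are no longer active, so a later re-activation
  // (active -> shelved -> active) counts as new again.
  Array.from(gossipedOnceTo).forEach((id) => {
    if (!active.has(id)) gossipedOnceTo.delete(id);
  });

  // Anything active but not yet recorded gets the catch-up exactly once.
  const fresh: string[] = [];
  active.forEach((id) => {
    if (!gossipedOnceTo.has(id)) {
      gossipedOnceTo.add(id);
      fresh.push(id);
    }
  });
  return fresh;
}

// Usage: peer "a" was active at start() and is seeded, so the first
// reconcile does not repeat the initial broadcast for it.
const gossipedOnceTo = new Set<string>(["a"]);
const firstReconcile = peersNeedingCatchUp(["a", "b"], gossipedOnceTo); // "b" just became active
const afterShelve = peersNeedingCatchUp(["a"], gossipedOnceTo);         // "b" shelved and pruned
const afterFlap = peersNeedingCatchUp(["a", "b"], gossipedOnceTo);      // "b" is new again
```

Keeping the set pruned to the currently active peers is what makes the flap case work: without the prune step, a peer that dropped and came back would be treated as already caught up.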
Generated by Claude Code