signaling: relax announce cadence (storage + reflection do discovery now) #15

Merged: mrjeeves merged 5 commits into main from claude/relax-announce-cadence on May 27, 2026

@mrjeeves (Owner)

Summary

Now that PR #14 landed (stored kind 1077 + reactive reflection on every inbound announce), the dense periodic announce schedule we inherited from the ephemeral-kind era is doing redundant work. This PR resizes the cadence to match what the new mechanisms actually need.

Before: 15 publishes per peer in the first 175 s (the initial publish plus 14 scheduled repeats), then one every 60 s forever:

0s, 5s, 10s, 15s, 20s, 25s, 35s, 45s, 55s, 70s, 85s, 100s, 115s, 145s, 175s, 235s, 295s, …

After — 2 publishes in the first 30 s, then one every 5 min forever:

0s, 30s, 330s, 630s, …

Roughly an 80 % reduction in publish volume per peer per hour (60 → 12 publishes per hour in steady state; about 72 → 13 in the first hour).
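
To sanity-check the two cadences, here is a minimal Rust sketch that expands each schedule from its interval steps (5 s × 5, 10 s × 3, 15 s × 4, 30 s × 2, then 60 s for the old one; +30 s, then 300 s for the new one). The function names are hypothetical and not taken from the codebase; the step breakdown is the one given in this PR's first commit message.

```rust
/// Expand `(step_secs, repeat)` phases into absolute publish times.
/// Starts with an immediate publish at t = 0, walks the phases, then
/// continues at `steady_secs` until `horizon_secs` (exclusive).
fn publish_times(phases: &[(u64, u32)], steady_secs: u64, horizon_secs: u64) -> Vec<u64> {
    assert!(steady_secs > 0, "steady interval must be non-zero");
    if horizon_secs == 0 {
        return Vec::new();
    }
    let mut times: Vec<u64> = vec![0];
    let mut t: u64 = 0;
    for &(step, repeat) in phases {
        for _ in 0..repeat {
            t += step;
            if t >= horizon_secs {
                return times;
            }
            times.push(t);
        }
    }
    loop {
        t += steady_secs;
        if t >= horizon_secs {
            return times;
        }
        times.push(t);
    }
}

/// Old cadence (illustrative): dense early curve, then every 60 s.
fn old_schedule(horizon_secs: u64) -> Vec<u64> {
    publish_times(&[(5, 5), (10, 3), (15, 4), (30, 2)], 60, horizon_secs)
}

/// New cadence (illustrative): one safety-net publish at +30 s, then every 300 s.
fn new_schedule(horizon_secs: u64) -> Vec<u64> {
    publish_times(&[(30, 1)], 300, horizon_secs)
}
```

`old_schedule(3600).len()` and `new_schedule(3600).len()` give the per-peer publish counts for the first hour, including the publish at t = 0.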

Why this is safe

Each thing the old cadence was protecting against now has a better answer:

| Old job | Now covered by |
| --- | --- |
| Be visible to a fresh joiner in their first few seconds | Stored kind 1077: the joiner's REQ with `since=now-300s` replays our last stored announce immediately |
| Compensate for ephemerals being dropped | Moot, since we use a stored kind |
| Detect a peer is gone | App-level WebRTC heartbeats on the data channel (`HEARTBEAT_INTERVAL_MS=30s`) |
| Wake up a steady-state peer who otherwise wouldn't reply for ~60 s | Engine reflects every inbound PeerAnnounced with a fresh Announce (rate-limited to 1 s) |
| Be visible to a freshly-(re)connected relay | `nostr::driver::run_relay_inner` already publishes a one-shot open-announce per relay |
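
The first row relies on a joiner's subscription replaying stored announces. Below is a rough sketch of what such a request could look like, following the NIP-01 message shape `["REQ", <sub_id>, <filter>]`. The helper names and the subscription id are hypothetical, and the filter is deliberately minimal: the real one in this codebase presumably also scopes by room or tag, which is not shown in this PR.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Stored announce kind named in this PR.
const ANNOUNCE_KIND: u32 = 1077;
/// Replay window from the table above (`since=now-300s`).
const REPLAY_WINDOW_SECS: u64 = 300;

/// Current Unix time in seconds (0 if the clock is before the epoch).
fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}

/// Build a REQ message asking for announces stored in the last 300 s.
/// Assumes `sub_id` contains no characters that need JSON escaping.
fn announce_req(sub_id: &str, now_unix_secs: u64) -> String {
    let since = now_unix_secs.saturating_sub(REPLAY_WINDOW_SECS);
    format!(
        r#"["REQ","{}",{{"kinds":[{}],"since":{}}}]"#,
        sub_id, ANNOUNCE_KIND, since
    )
}
```

A caller would send `announce_req("announces", unix_now())` on each relay connection right after joining.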

The only remaining job is refreshing relay storage well inside the retention window. Five minutes is conservative against every public relay I'm aware of (typical retention is hours to days). The 30 s safety-net publish catches a silently-failed first publish at startup.

Test plan

  • cargo test --workspace — all suites green.
  • Manual: bring 3 peers up over the default relay pool, watch the Activity log, verify discovery latency is still snappy (<2 s) and per-peer publish rate dropped.
  • Manual: leave a 2-peer room idle for 10 min, confirm a 3rd joiner still finds both within ~1 s of joining (stored kind + REQ replay carries it).

https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg


Generated by Claude Code

claude added 5 commits May 27, 2026 05:38
…scovery

With stored kind 1077 and engine-side reactive reflection on every
inbound PeerAnnounced, the dense early schedule is redundant — a
late joiner sees every existing peer's last announce via REQ replay,
and existing peers re-publish within ~1s of any inbound announce.
Per-relay open-announce in run_relay_inner covers freshly-
(re)connected relays.

Collapses the post-startup curve from 14 publishes in 175 s
(5s × 5, 10s × 3, 15s × 4, 30s × 2) to one safety-net publish at
+30 s, and raises the steady-state interval from 60 s to 300 s. The
remaining periodic publish only exists to refresh storage well inside
the relay's retention window. Roughly an 80% reduction in publish
volume per peer per hour with no impact on discovery latency.

https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
…tion

Two follow-on connection-quality fixes layered onto the cadence
relaxation.

**Re-offer on stuck Sighted**: when an inbound PeerAnnounced arrives
for a peer we already have a session with, but that session is still
at status Sighted (PC created, data channel never opened) and we're
the Offerer, re-create + re-send the SDP offer. webrtc-rs's
create_offer calls set_local_description internally, kicking off a
fresh ICE gathering cycle on the same PC — no teardown, no PC
recreation, the remote handles the renegotiation transparently. Per-
peer rate-limited via PeerStateData::last_offer_sent_at (2 s floor)
so a REQ-replay burst doesn't fan out to fourteen redundant offers.
Together with PR #14's reactive-announce, this is the "no network
restart needed to rebuild a stuck connection" property: every
announce we hear from a stuck peer prods our handshake forward,
and once the channel opens the gate naturally closes.
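
The gate described above (still Sighted, we are the Offerer, at least 2 s since the last offer) can be sketched as a small pure function. `PeerState` here is a hypothetical stand-in for `PeerStateData`, and the enum variants beyond those named in the message are assumptions; the actual offer creation via webrtc-rs is left out.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status { Sighted, Active, Shelved }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Role { Offerer, Answerer }

/// Per-peer floor between offers (2 s, per the commit message).
const REOFFER_FLOOR: Duration = Duration::from_secs(2);

/// Hypothetical subset of the per-peer state.
struct PeerState {
    status: Status,
    role: Role,
    last_offer_sent_at: Option<Instant>,
}

impl PeerState {
    /// Returns true, and records `now`, when a fresh offer should be sent
    /// in response to an inbound announce from this peer.
    fn should_reoffer(&mut self, now: Instant) -> bool {
        // Only a stuck handshake on the offering side qualifies.
        if self.status != Status::Sighted || self.role != Role::Offerer {
            return false;
        }
        // Rate limit so a REQ-replay burst yields at most one offer per floor.
        if let Some(last) = self.last_offer_sent_at {
            if now.saturating_duration_since(last) < REOFFER_FLOOR {
                return false;
            }
        }
        self.last_offer_sent_at = Some(now);
        true
    }
}
```

Once the data channel opens and the status leaves Sighted, the first check fails and the gate closes without any extra bookkeeping.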

**Selected-pair classification retry**: webrtc-rs's CandidatePair
stats can lag the ICE state callback — particularly on the
controlling (Offerer) agent, which only flips `nominated=true`
after sending USE-CANDIDATE and getting a response. The single-
shot record in handle_ice_state_change therefore sometimes runs
before nomination is reflected in stats, and selected_pair stays
None forever even though packets are flowing — exactly the
"laptop says it's not LAN even though it is" symptom on the
Offerer side of a working pair. The existing ICE poller (already
running every 3 s) now retries record_selected_pair for any
Active/Shelved peer with no pair recorded yet. Cheap, self-
healing within one poll tick.

https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
CI's `cargo fmt --all --check` flagged a multi-line `format!` that
rustfmt prefers to collapse. Functional no-op.

https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
webrtc-rs doesn't always flip `nominated=true` on the controlling
(Offerer) side, even after ICE is solidly Connected and packets are
flowing. Confirmed against a working LAN pair where the laptop
(Offerer) stayed unclassified while the answerer correctly painted
"LAN". The ICE-poll retry from the previous commit can't recover
this case — get_stats() returns nominated=false consistently, not
just transiently.

Pick the Succeeded pair with the largest bytes_received as the
fallback. If multiple Succeeded pairs have zero bytes (briefly the
case right after ICE settles), any of them classifies the same way
for LAN / STUN / TURN purposes since they're all viable paths to
the same peer. Nominated remains the preferred signal where the
agent does set it.
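
A minimal sketch of that selection rule, using a hypothetical `PairStats` struct rather than webrtc-rs's actual stats types. Restricting the nominated lookup to Succeeded pairs is an assumption of this sketch.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PairState { Waiting, InProgress, Succeeded, Failed }

/// Hypothetical, trimmed-down view of one candidate pair's stats.
#[derive(Clone, Debug, PartialEq)]
struct PairStats {
    id: u32,
    state: PairState,
    nominated: bool,
    bytes_received: u64,
}

/// Prefer a nominated Succeeded pair; otherwise fall back to the
/// Succeeded pair that has received the most bytes.
fn select_pair(pairs: &[PairStats]) -> Option<&PairStats> {
    if let Some(p) = pairs
        .iter()
        .find(|p| p.state == PairState::Succeeded && p.nominated)
    {
        return Some(p);
    }
    // On a tie (for example all zero bytes), `max_by_key` returns the
    // last maximal element; any tied pair is an acceptable choice here.
    pairs
        .iter()
        .filter(|p| p.state == PairState::Succeeded)
        .max_by_key(|p| p.bytes_received)
}
```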

https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
Field-confirmed root cause of the "Offerer side classifies LAN
peer as STUN" symptom: on a fast local network the remote's first
trickled ICE candidate (carrying its LAN host address) routinely
arrives 50–500 ms ahead of the answer SDP it's associated with.
webrtc-rs's `add_ice_candidate` returns "remote description is
not set" and the host candidate is silently dropped. ICE then
recovers via peer-reflexive discovery from STUN binding probes,
which succeeds — packets flow — but the agent's selected pair is
now (Host, PeerReflexive) instead of (Host, Host), and the GUI's
LAN/STUN/TURN classifier (correctly) paints it as STUN.
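
For illustration, a classifier consistent with the behaviour described here might look like the sketch below. Only the (Host, Host) → LAN and (Host, PeerReflexive) → STUN outcomes are stated in this message; treating any relay candidate as TURN and everything else as STUN is an assumption, and the type names are hypothetical.

```rust
/// ICE candidate types as defined by RFC 8445.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CandidateType { Host, ServerReflexive, PeerReflexive, Relay }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PathClass { Lan, Stun, Turn }

/// Classify a selected pair by its (local, remote) candidate types.
fn classify(local: CandidateType, remote: CandidateType) -> PathClass {
    use CandidateType::*;
    match (local, remote) {
        // Any relayed leg means traffic goes through a TURN server.
        (Relay, _) | (_, Relay) => PathClass::Turn,
        // Both ends on directly reachable host addresses.
        (Host, Host) => PathClass::Lan,
        // Reflexive on either side: a NAT mapping learned via STUN.
        _ => PathClass::Stun,
    }
}
```

Under this rule a dropped host candidate that is later rediscovered as peer-reflexive turns a LAN pair into a STUN one, which matches the symptom.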

Fix: track `remote_description_set` per peer and queue inbound
ICE candidates in `pending_remote_candidates` until the first
`set_remote_description` succeeds, then drain the queue. The
drain happens inside `apply_remote_sdp` after the SDP is in
place; the per-peer state lock is dropped before each
`add_ice_candidate` await to avoid serializing the rest of the
engine on a webrtc call.

drop_peer naturally resets both fields because it removes the
PeerConnection entry entirely — a reconnect creates a fresh one
with `remote_description_set: false`.
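
The queue-then-drain pattern can be sketched without any WebRTC dependency. The field names follow the commit message; `RemoteIceState`, `drain_after_sdp`, and the use of plain strings for candidates are hypothetical, and the `add` callback stands in for the real `add_ice_candidate` call.

```rust
use std::sync::Mutex;

/// Hypothetical per-peer slice of state covering remote ICE handling.
#[derive(Default)]
struct RemoteIceState {
    remote_description_set: bool,
    pending_remote_candidates: Vec<String>,
}

impl RemoteIceState {
    /// Returns the candidate if it can be applied right away,
    /// otherwise queues it and returns None.
    fn on_candidate(&mut self, candidate: String) -> Option<String> {
        if self.remote_description_set {
            Some(candidate)
        } else {
            self.pending_remote_candidates.push(candidate);
            None
        }
    }

    /// Marks the remote description as applied and hands back the
    /// queued candidates in arrival order, leaving the queue empty.
    fn on_remote_description_set(&mut self) -> Vec<String> {
        self.remote_description_set = true;
        std::mem::take(&mut self.pending_remote_candidates)
    }
}

/// Drain queued candidates after the SDP is in place. The lock guard is a
/// temporary dropped at the end of the `let` statement, so `add` runs
/// without holding the per-peer lock.
fn drain_after_sdp(state: &Mutex<RemoteIceState>, mut add: impl FnMut(String)) {
    let queued = state.lock().unwrap().on_remote_description_set();
    for candidate in queued {
        add(candidate);
    }
}
```

Dropping the whole entry on disconnect resets both fields for free, since `RemoteIceState::default()` starts with the flag false and the queue empty.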

https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
mrjeeves merged commit 3846fb5 into main on May 27, 2026
6 checks passed
mrjeeves deleted the claude/relax-announce-cadence branch on May 27, 2026 at 07:08