signaling: relax announce cadence (storage + reflection do discovery now)#15
Merged
Conversation
…scovery With stored kind 1077 and engine-side reactive reflection on every inbound PeerAnnounced, the dense early schedule is redundant — a late joiner sees every existing peer's last announce via REQ replay, and existing peers re-publish within ~1s of any inbound announce. Per-relay open-announce in run_relay_inner covers freshly- (re)connected relays. Collapses the post-startup curve from 13 publishes in 175 s (5s × 5, 10s × 3, 15s × 4, 30s × 2) to one safety-net publish at +30 s, and bumps steady from 60 s → 300 s. The remaining periodic publish only exists to refresh storage well inside the relay's retention window. Roughly 85–90% reduction in publish volume per peer per hour with no impact on discovery latency. https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
…tion Two follow-on connection-quality fixes layered onto the cadence relaxation. **Re-offer on stuck Sighted**: when an inbound PeerAnnounced arrives for a peer we already have a session with, but that session is still at status Sighted (PC created, data channel never opened) and we're the Offerer, re-create + re-send the SDP offer. webrtc-rs's create_offer calls set_local_description internally, kicking off a fresh ICE gathering cycle on the same PC — no teardown, no PC recreation, the remote handles the renegotiation transparently. Per- peer rate-limited via PeerStateData::last_offer_sent_at (2 s floor) so a REQ-replay burst doesn't fan out to fourteen redundant offers. Together with PR #14's reactive-announce, this is the "no network restart needed to rebuild a stuck connection" property: every announce we hear from a stuck peer prods our handshake forward, and once the channel opens the gate naturally closes. **Selected-pair classification retry**: webrtc-rs's CandidatePair stats can lag the ICE state callback — particularly on the controlling (Offerer) agent, which only flips `nominated=true` after sending USE-CANDIDATE and getting a response. The single- shot record in handle_ice_state_change therefore sometimes runs before nomination is reflected in stats, and selected_pair stays None forever even though packets are flowing — exactly the "laptop says it's not LAN even though it is" symptom on the Offerer side of a working pair. The existing ICE poller (already running every 3 s) now retries record_selected_pair for any Active/Shelved peer with no pair recorded yet. Cheap, self- healing within one poll tick. https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
CI's `cargo fmt --all --check` flagged a multi-line `format!` that rustfmt prefers to collapse. Functional no-op. https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
webrtc-rs doesn't always flip `nominated=true` on the controlling (Offerer) side, even after ICE is solidly Connected and packets are flowing. Confirmed against a working LAN pair where the laptop (Offerer) stayed unclassified while the answerer correctly painted "LAN". The ICE-poll retry from the previous commit can't recover this case — get_stats() returns nominated=false consistently, not just transiently. Pick the Succeeded pair with the largest bytes_received as the fallback. If multiple Succeeded pairs have zero bytes (briefly the case right after ICE settles), any of them classifies the same way for LAN / STUN / TURN purposes since they're all viable paths to the same peer. Nominated remains the preferred signal where the agent does set it. https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
Field-confirmed root cause of the "Offerer side classifies LAN peer as STUN" symptom: on a fast local network the remote's first trickled ICE candidate (carrying its LAN host address) routinely arrives 50–500 ms ahead of the answer SDP it's associated with. webrtc-rs's `add_ice_candidate` returns "remote description is not set" and the host candidate is silently dropped. ICE then recovers via peer-reflexive discovery from STUN binding probes, which succeeds — packets flow — but the agent's selected pair is now (Host, PeerReflexive) instead of (Host, Host), and the GUI's LAN/STUN/TURN classifier (correctly) paints it as STUN. Fix: track `remote_description_set` per peer and queue inbound ICE candidates in `pending_remote_candidates` until the first `set_remote_description` succeeds, then drain the queue. The drain happens inside `apply_remote_sdp` after the SDP is in place; the per-peer state lock is dropped before each `add_ice_candidate` await to avoid serializing the rest of the engine on a webrtc call. drop_peer naturally resets both fields because it removes the PeerConnection entry entirely — a reconnect creates a fresh one with `remote_description_set: false`. https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Now that PR #14 landed (stored kind 1077 + reactive reflection on every inbound announce), the dense periodic announce schedule we inherited from the ephemeral-kind era is doing redundant work. This PR resizes the cadence to match what the new mechanisms actually need.
Before — 13 publishes per peer in the first 175 s, then one every 60 s forever:
After — 2 publishes in the first 30 s, then one every 5 min forever:
Roughly an 85–90 % reduction in publish volume per peer per hour.
Why this is safe
Each thing the old cadence was protecting against now has a better answer:
REQ since=now-300sreplays our last stored announce immediatelyHEARTBEAT_INTERVAL_MS=30s)PeerAnnouncedwith a freshAnnounce(rate-limited 1 s)nostr::driver::run_relay_inneralready publishes a one-shot open-announce per relayThe only remaining job is refreshing relay storage well inside the retention window. Five minutes is conservative against every public relay I'm aware of (typical retention is hours to days). The 30 s safety-net publish catches a silently-failed first publish at startup.
Test plan
cargo test --workspace— all suites green.https://claude.ai/code/session_01Vp4cvRTaLYd3162EwwcCXg
Generated by Claude Code