docs: v0.31.0 relay-mediated connectivity investigation #67
What: Documents the multi-hour investigation into why `ztlp connect z2ls-desktop-... --ns-server ...` fails end-to-end against the Z2LS Windows gateway over the relay-fallback path.

Why: v0.31.0 was nearly mis-released as "broken" when the actual blocker is a missing inbound UDP port-forward at the Tech Rockstars edge router. Need a durable record so the next person doesn't re-litigate this, and so the deployment requirement (gateway behind firewall needs inbound :23095/UDP allowed) is documented somewhere discoverable.

Details:
- Full test matrix: which paths work (LAN, WAN-to-WAN-with-firewall-open) vs which fail (relay-forwarded to gateway behind closed firewall).
- Hex-decoded the "unparseable packets" red herring (Z2LS keepalive GATEWAY_REGISTER frames misrouted by relay).
- Root cause walk-through with packet captures from relay (tcpdump) and gateway (pktmon) proving Z2LS never receives the relay-forwarded HELLO when the source is an internet IP.
- Reproduction recipe for the failing AND working paths.
- Live test infra notes (AWS gateway 16.147.41.195:23997 + client 54.218.127.30) so the working-config bench survives session restart.

Tests: No code change; documentation only.

Validation:
- WAN→WAN ZTLP confirmed working end-to-end between two AWS boxes: client `54.218.127.30` → ZTLP tunnel → gateway `16.147.41.195:23997` → SSH session into AWS host, `whoami → ubuntu`, `hostname → ip-172-26-13-55`. v0.31.0 binaries on all hops.

Follow-up:
- Tech Rockstars adds `WAN :23095/UDP → 10.170.3.111:23095` port-forward.
- v0.32 considers UPnP/NAT-PMP for `ztlp listen --gateway`.
- v0.32 considers relay control-frame reverse-forward filter (relay/lib/ztlp_relay/udp_listener.ex:127 missing the L2 filter that exists at line 208).
- v0.32 considers symmetric-NAT hole-punching hardening.
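The deployment requirement above (inbound UDP must actually reach the gateway host) can be checked independently of ZTLP. The following is a minimal sketch, not part of the PR: `udp_echo_once` and `probe_udp` are hypothetical helper names, and the echo is a plain UDP echo, not a ZTLP frame. The assumption is that the echo side is run briefly on the gateway host on the forwarded port (with the real gateway stopped), and the probe is run from an internet host.

```python
import socket


def udp_echo_once(sock: socket.socket) -> None:
    """Wait for a single datagram on `sock` and send it back to its sender."""
    data, addr = sock.recvfrom(2048)
    sock.sendto(data, addr)


def probe_udp(host: str, port: int, payload: bytes = b"probe",
              timeout: float = 2.0) -> bool:
    """Return True if an echo of `payload` comes back from host:port in time.

    False means either the datagram never arrived (for example, dropped at an
    edge router with no port-forward) or the reply was lost; UDP alone cannot
    tell these two cases apart.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(2048)
        except OSError:
            # Covers timeouts and ICMP-triggered errors such as connection reset.
            return False
        return data == payload
```

A `False` result from the internet side combined with a `True` result from the LAN side would be consistent with the edge-router drop described here, though a packet capture on the gateway remains the stronger evidence.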
Walkthrough
This PR adds documentation of a relay-mediated connectivity investigation for v0.31.0. The investigation confirms v0.31.0 code functions correctly but identifies a missing gateway deployment requirement: inbound UDP port-forwarding at the network edge. The document includes test results, root-cause analysis, reproduction steps, and follow-up actions.

Changes
v0.31.0 Relay Deployment Investigation
🧹 Nitpick comments (2)
docs/v0.31.0-relay-deployment-investigation.md (2)
175-175: ⚡ Quick win
Consider using relative code references instead of hardcoded line numbers.
Line 175 references specific line numbers (`:127`, `:208`) in `relay/lib/ztlp_relay/udp_listener.ex`. As the codebase evolves, these line numbers will drift, requiring manual updates to keep the documentation accurate.
Consider using function/symbol names or short code snippets instead, e.g., "the L1 forwarder in `lookup_by_peer`" rather than "line 127".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/v0.31.0-relay-deployment-investigation.md` at line 175, The doc currently points to hardcoded line numbers in relay/lib/ztlp_relay/udp_listener.ex (":127" and ":208"); update the documentation to use stable, relative references instead: mention the function names and code paths (e.g., "the L1 forwarder in lookup_by_peer" and "the L2 QUIC-bypass path in the function that implements the L2 forward/filter logic") or include a short code snippet of the 3-line filter check instead of line numbers so references remain valid as the file changes; ensure you remove the numeric line annotations and replace them with the function/symbol names (lookup_by_peer and the L2 filter function) and/or the minimal snippet.
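One way to keep this class of nitpick from recurring is a small doc lint that flags `path.ext:NNN` references. The sketch below is illustrative only: `find_line_refs` and the regular expression are assumptions, not existing project tooling. It catches the `udp_listener.ex:127` form but not prose such as "at line 208", and it would also flag `hostname.tld:port` strings, so results need a manual look.

```python
import re

# Matches references such as `udp_listener.ex:127`. The extension must start
# with a letter, so dotted IPv4 addresses like 16.147.41.195:23997 are skipped.
LINE_REF = re.compile(r"\b([\w./-]+\.[A-Za-z]\w*):(\d+)\b")


def find_line_refs(markdown: str) -> list[tuple[int, str, int]]:
    """Return (doc_line, path, referenced_line) for each file:line reference."""
    hits = []
    for n, line in enumerate(markdown.splitlines(), start=1):
        for m in LINE_REF.finditer(line):
            hits.append((n, m.group(1), int(m.group(2))))
    return hits
```

Each hit can then be rewritten to name the function or symbol instead of the line number.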
50-52: ⚡ Quick win
Add language specifiers to fenced code blocks.
Three code blocks lack language identifiers, which reduces readability and prevents proper syntax highlighting:
- Line 50: Log output block
- Line 65: Log output block
- Line 99: Configuration example
📝 Suggested fixes
Line 50:
````diff
-```
+```log
 dispatcher: dropping unparseable packet from 34.218.240.106:23095 len=79 hex=5a 37 0a bc 97 d6 55 92 9c 30 be 37 88 5f f8 de
````
Line 65:
````diff
-```
+```log
 [GatewayFwd] Forwarding HELLO for session ADEAB418F3C2CF5DFAAEF9AB from {{204,16,122,24}, 50905} to gateway {{204,16,122,24}, 23095}
````
Line 99:
````diff
-```
+```text
 WAN UDP :23095 → <internal-gateway-IP>:23095
 ```
````
Also applies to: 65-67, 99-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/v0.31.0-relay-deployment-investigation.md` around lines 50 - 52, Update the three fenced code blocks that currently lack language specifiers by adding the appropriate language after the opening backticks: mark the log lines containing "dispatcher: dropping unparseable packet from 34.218.240.106:23095..." and the "[GatewayFwd] Forwarding HELLO..." block with ```log, and mark the WAN UDP mapping configuration block (the one showing "WAN UDP :23095 → <internal-gateway-IP>:23095") with ```text so they render with correct syntax highlighting and readability.
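To validate this finding against the current file, a short check can list opening fences that carry no language tag. This is a simplified sketch under stated assumptions: `find_bare_fences` is a hypothetical helper, it only recognises triple-backtick fences (not tildes or longer fences), and it treats every fence line inside a block as the closing fence.

```python
def find_bare_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language tag."""
    fence = "`" * 3
    bare = []
    in_block = False
    for n, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(fence):
            continue
        if in_block:
            in_block = False  # this fence closes the current block
        else:
            in_block = True  # this fence opens a new block
            if stripped == fence:
                bare.append(n)
    return bare
```

Run against the investigation doc, an empty result would indicate that all three blocks named in the comment now carry a specifier.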
What
Captures the multi-hour investigation we did this session into why the canonical NS-resolved `ztlp connect` against the Z2LS Windows gateway fails end-to-end on the relay-fallback path, despite v0.31.0 being correct.

Why
v0.31.0 was nearly mis-tagged as "broken" because the symptom (client gets `handshake failed: no HELLO_ACK after retransmits`, relay logs claim forwarding, gateway log silent) looked like a code regression. Two separate "I found the bug" theories were walked back during diagnosis before the real root cause landed.
Need a durable record so:
- the next person doesn't re-litigate this;
- the deployment requirement (gateway behind firewall needs inbound `:23095/UDP` allowed) is documented somewhere discoverable.

Headline finding
The Tech Rockstars office edge router has no inbound UDP port-forward rule for `204.16.122.24:23095 → 10.170.3.111:23095`. Relay-forwarded UDP arrives at the TR WAN IP and is dropped by the edge before reaching Z2LS's NIC. Confirmed by pktmon capture on Z2LS itself: zero inbound packets to `:23095` from any AWS source during the entire test window.
v0.31.0 IS shippable. Direct WAN→WAN ZTLP between two AWS boxes (test gateway on `16.147.41.195:23997`, client on `54.218.127.30`) completed a real SSH session: `whoami → ubuntu`, `hostname → ip-172-26-13-55`. Protocol stack is healthy.

Contents
Tests / Validation
No code change. The doc itself was validated by the live tests it documents — see "Live test infrastructure" section for components still running that can be re-tested any time.
Follow-up
Tracked inline in the doc under "Follow-up actions". Notable: this should NOT become a v0.31.1 patch — there is nothing in v0.31.0 to patch. Code-level follow-ups belong in v0.32.