
docs: v0.31.0 relay-mediated connectivity investigation#67

Open
priceflex wants to merge 1 commit into main from docs/v0.31.0-relay-deployment-investigation

@priceflex (Owner) commented May 27, 2026

What

Captures this session's multi-hour investigation into why the canonical NS-resolved `ztlp connect` against the Z2LS Windows gateway fails end-to-end on the relay-fallback path, even though v0.31.0 itself is correct.

Why

v0.31.0 was nearly mis-tagged as "broken" because the symptom looked like a code regression: the client reports `handshake failed: no HELLO_ACK after retransmits`, the relay logs claim it is forwarding, and the gateway log stays silent. Two separate "I found the bug" theories were walked back during diagnosis before the real root cause was identified.

We need a durable record so that:

  • The next investigator doesn't re-litigate this.
  • The deployment requirement (a gateway behind a firewall needs inbound :23095/UDP allowed) is documented somewhere discoverable.
  • We have a known-good WAN→WAN reproduction for the next time someone needs to answer "is it the network or the code?"

Headline finding

The Tech Rockstars office edge router has no inbound UDP port-forward rule for 204.16.122.24:23095 → 10.170.3.111:23095. Relay-forwarded UDP arrives at the TR WAN IP and is dropped by the edge before reaching Z2LS's NIC. Confirmed by pktmon capture on Z2LS itself: zero inbound packets to :23095 from any AWS source during the entire test window.
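
For hosts where pktmon is not available, a rough stand-in for the gateway-side check is to bind the gateway's UDP port and record which sources actually reach it. The sketch below is an illustration only: `open_probe` and `collect_sources` are hypothetical helper names, not part of ZTLP, and it assumes the real gateway process is stopped so the port is free to bind.

```python
import socket
import time


def open_probe(bind_host: str = "0.0.0.0", port: int = 23095) -> socket.socket:
    """Bind a UDP socket on the port the gateway would normally listen on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_host, port))
    return sock


def collect_sources(sock: socket.socket, window_s: float) -> list:
    """Return the (ip, port) of every datagram that arrives within window_s seconds."""
    sources = []
    deadline = time.monotonic() + window_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            _, addr = sock.recvfrom(2048)
        except socket.timeout:
            # Nothing else arrived before the window closed.
            break
        sources.append(addr)
    return sources
```

Calling `collect_sources(open_probe(), 60.0)` while a client retries through the relay would return an empty list if the edge is dropping the forwarded UDP, which is the same signal the pktmon capture gave. An empty result only shows that nothing reached this socket; it does not say where on the path the packets were dropped.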

v0.31.0 IS shippable. Direct WAN→WAN ZTLP between two AWS boxes (test gateway on 16.147.41.195:23997, client on 54.218.127.30) completed a real SSH session: whoami → ubuntu, hostname → ip-172-26-13-55. Protocol stack is healthy.
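
The reasoning behind these two findings can be written down as a small decision helper. This is a sketch of the triage logic as described in this PR, not code from the repository; the `Observation` fields and the verdict strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    relay_logged_forward: bool  # relay log shows it forwarded the HELLO
    gateway_packets_seen: int   # inbound packets to the gateway port in a capture
    control_path_ok: bool       # WAN->WAN control run completed a real session


def triage(obs: Observation) -> str:
    """Map the three observations to the most likely place to look next."""
    if not obs.control_path_ok:
        # Even the known-good path fails, so the protocol stack is suspect.
        return "suspect-code"
    if obs.relay_logged_forward and obs.gateway_packets_seen == 0:
        # Forwarded by the relay but never seen at the gateway NIC.
        return "suspect-network"
    if not obs.relay_logged_forward:
        return "suspect-relay-or-client"
    # Packets reach the gateway, yet the handshake still fails.
    return "suspect-gateway"
```

For the situation reported here (relay logs forwarding, zero packets in the gateway capture, control path healthy), `triage(Observation(True, 0, True))` yields `"suspect-network"`.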

Contents

  • Full 7-row test matrix (which paths work, which fail, why)
  • Hex-decoded analysis of the "unparseable 79-byte packet" red herring
  • Reproduction recipes for both the failing path AND the working path
  • Live test infrastructure notes (so the AWS test gateway stays useful)
  • Three categories of follow-up actions: immediate (TR network), documentation (deployment docs gap), v0.32 code candidates (UPnP, relay control-frame filter fix, symmetric-NAT punching)
  • Section listing exactly which Hermes claims got walked back during the investigation, for accountability
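
When triaging the "unparseable packet" lines, it can help to pull the fields out of the log mechanically. The sketch below assumes the line format matches the sample quoted in the review further down (`dispatcher: dropping unparseable packet from <ip>:<port> len=<n> hex=<bytes>`); `parse_drop_line` is a hypothetical helper, and it makes no attempt to interpret the frame contents.

```python
import re

# Pattern inferred from a single sample line; other builds may log differently.
DROP_RE = re.compile(
    r"dropping unparseable packet from "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) "
    r"len=(?P<len>\d+) hex=(?P<hex>[0-9a-f]{2}(?: [0-9a-f]{2})*)"
)


def parse_drop_line(line: str):
    """Return source, declared length, and hex preview bytes, or None if no match."""
    m = DROP_RE.search(line)
    if m is None:
        return None
    preview = bytes.fromhex(m.group("hex"))
    declared = int(m.group("len"))
    return {
        "source": (m.group("ip"), int(m.group("port"))),
        "declared_len": declared,
        "preview": preview,
        # The log shows fewer bytes than len= reports, so the hex is only a prefix.
        "preview_truncated": len(preview) < declared,
    }
```

Grouping the parsed records by `source` and `declared_len` is one way to check whether all of the noise comes from a single peer with a fixed frame size, which is what a periodic keepalive would look like.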

Tests / Validation

No code change. The doc itself was validated by the live tests it documents — see "Live test infrastructure" section for components still running that can be re-tested any time.

Follow-up

Tracked inline in the doc under "Follow-up actions". Notable: this should NOT become a v0.31.1 patch — there is nothing in v0.31.0 to patch. Code-level follow-ups belong in v0.32.

Summary by CodeRabbit

  • Documentation
    • Added investigation documentation for v0.31.0 relay connectivity issues, clarifying a deployment configuration requirement (UDP port forwarding) rather than a code regression. Includes troubleshooting steps and operational recommendations.


What:
Documents the multi-hour investigation into why
`ztlp connect z2ls-desktop-... --ns-server ...` fails end-to-end
against the Z2LS Windows gateway over the relay-fallback path.

Why:
v0.31.0 was nearly mis-released as "broken" when the actual blocker
is a missing inbound UDP port-forward at the Tech Rockstars edge
router. Need a durable record so the next person doesn't re-litigate
this — and so the deployment requirement (gateway behind firewall
needs inbound :23095/UDP allowed) is documented somewhere
discoverable.

Details:
- Full test matrix: which paths work (LAN, WAN-to-WAN-with-firewall-open)
  vs which fail (relay-forwarded to gateway behind closed firewall).
- Hex-decoded the "unparseable packets" red herring (Z2LS keepalive
  GATEWAY_REGISTER frames misrouted by relay).
- Root cause walk-through with packet captures from relay (tcpdump)
  and gateway (pktmon) proving Z2LS never receives the relay-forwarded
  HELLO when the source is an internet IP.
- Reproduction recipe for the failing AND working paths.
- Live test infra notes (AWS gateway 16.147.41.195:23997 + client
  54.218.127.30) so the working-config bench survives session restart.

Tests:
No code change; documentation only.

Validation:
- WAN→WAN ZTLP confirmed working end-to-end between two AWS boxes:
  client `54.218.127.30` → ZTLP tunnel → gateway `16.147.41.195:23997`
  → SSH session into AWS host, `whoami → ubuntu`, `hostname →
  ip-172-26-13-55`. v0.31.0 binaries on all hops.

Follow-up:
- Tech Rockstars adds `WAN :23095/UDP → 10.170.3.111:23095` port-forward.
- v0.32 considers UPnP/NAT-PMP for `ztlp listen --gateway`.
- v0.32 considers relay control-frame reverse-forward filter
  (relay/lib/ztlp_relay/udp_listener.ex:127 missing the L2 filter
  that exists at line 208).
- v0.32 considers symmetric-NAT hole-punching hardening.
@coderabbitai (Bot) commented May 27, 2026

Walkthrough

This PR adds documentation of a relay-mediated connectivity investigation for v0.31.0. The investigation confirms v0.31.0 code functions correctly but identifies a missing gateway deployment requirement: inbound UDP port-forwarding at the network edge. The document includes test results, root-cause analysis, reproduction steps, and follow-up actions.

Changes

v0.31.0 Relay Deployment Investigation

| Layer | File(s) | Summary |
|---|---|---|
| Investigation header and executive summary | `docs/v0.31.0-relay-deployment-investigation.md` (lines 1–28) | Metadata and overall verdict stating the relay→gateway connectivity failure is a gateway deployment/firewall requirement (missing inbound UDP), not a v0.31.0 code regression. |
| Test matrix and root-cause analysis | `docs/v0.31.0-relay-deployment-investigation.md` (lines 29–93) | Test matrix across LAN/WAN and relay/no-relay paths with failure modes, root-cause walk-through distinguishing client-side noise from actual inbound UDP blockage at the gateway, and enumeration of verified working v0.31.0 behaviors. |
| Deployment requirement and reproduction steps | `docs/v0.31.0-relay-deployment-investigation.md` (lines 94–142) | Deployment requirement (gateway must allow inbound UDP via port-forwarding, cloud exposure, or successful hole-punch) and step-by-step reproduction recipes for both the failing relay-mediated path with gateway packet capture and the working WAN-to-WAN control case. |
| Infrastructure inventory and follow-up actions | `docs/v0.31.0-relay-deployment-investigation.md` (lines 143–191) | Current infrastructure details with teardown commands, follow-up action items grouped by immediacy (deployment, documentation, v0.32 code candidates, operational guidance), walk-backs of incorrect early assumptions, and corrected debugging order prioritizing gateway-side packet capture. |

Estimated Code Review Effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly Related PRs

  • priceflex/ztlp#61: Both PRs investigate and document the same relay→gateway connectivity failure mode, blocked inbound UDP at the gateway firewall, as the root cause of the broken relay-mediated path.

Poem

🐰 Through relay routes and firewall gates,
We traced the path that meets its fates—
The code was sound, the logic true,
But UDP ports needed breaking through!
No regression here, just deployment care,
The answer lived in firewalls fair.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding documentation of an investigation into v0.31.0 relay-mediated connectivity issues. It is concise, specific, and clearly relates to the primary content of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
docs/v0.31.0-relay-deployment-investigation.md (2)

175-175: ⚡ Quick win

Consider using relative code references instead of hardcoded line numbers.

Line 175 references specific line numbers (:127, :208) in relay/lib/ztlp_relay/udp_listener.ex. As the codebase evolves, these line numbers will drift, requiring manual updates to keep the documentation accurate.

Consider using function/symbol names or short code snippets instead, e.g., "the L1 forwarder in lookup_by_peer" rather than "line 127".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/v0.31.0-relay-deployment-investigation.md` at line 175, The doc
currently points to hardcoded line numbers in
relay/lib/ztlp_relay/udp_listener.ex (":127" and ":208"); update the
documentation to use stable, relative references instead: mention the function
names and code paths (e.g., "the L1 forwarder in lookup_by_peer" and "the L2
QUIC-bypass path in the function that implements the L2 forward/filter logic")
or include a short code snippet of the 3-line filter check instead of line
numbers so references remain valid as the file changes; ensure you remove the
numeric line annotations and replace them with the function/symbol names
(lookup_by_peer and the L2 filter function) and/or the minimal snippet.

50-52: ⚡ Quick win

Add language specifiers to fenced code blocks.

Three code blocks lack language identifiers, which reduces readability and prevents proper syntax highlighting:

  • Line 50: Log output block
  • Line 65: Log output block
  • Line 99: Configuration example
📝 Suggested fixes

Line 50:

````diff
-```
+```log
 dispatcher: dropping unparseable packet from 34.218.240.106:23095 len=79 hex=5a 37 0a bc 97 d6 55 92 9c 30 be 37 88 5f f8 de
````

Line 65:

````diff
-```
+```log
 [GatewayFwd] Forwarding HELLO for session ADEAB418F3C2CF5DFAAEF9AB from {{204,16,122,24}, 50905} to gateway {{204,16,122,24}, 23095}
````

Line 99:

````diff
-```
+```text
     WAN UDP :23095  →  <internal-gateway-IP>:23095
     ```
````

Also applies to: 65-67, 99-102

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/v0.31.0-relay-deployment-investigation.md` around lines 50 - 52, Update
the three fenced code blocks that currently lack language specifiers by adding
the appropriate language after the opening backticks: mark the log lines
containing "dispatcher: dropping unparseable packet from
34.218.240.106:23095..." and the "[GatewayFwd] Forwarding HELLO..." block with
```log, and mark the WAN UDP mapping configuration block (the one showing "WAN
UDP :23095  →  <internal-gateway-IP>:23095") with ```text so they render with
correct syntax highlighting and readability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06c7871e-7150-45a3-88b2-6e17e7148462

📥 Commits

Reviewing files that changed from the base of the PR and between 1e4242e and fc1cf1a.

📒 Files selected for processing (1)
  • docs/v0.31.0-relay-deployment-investigation.md
