Description
Summary
`nemoclaw cluster admin deploy` fails with a container exit (status=EXITED, exit_code=1) when k3s cannot find a default route inside the cluster container. The fatal error from k3s:

```
level=fatal msg="no default routes found in \"/proc/net/route\" or \"/proc/net/ipv6_route\""
```
This has been observed on remote deploys (--remote spark), but the underlying issue could affect any host where Docker does not inject a default route into the container's routing table on a user-defined bridge network.
Reproduction
```
nemoclaw cluster admin deploy --remote spark --name my-spark
```

Full output:

```
✓ Pulling cluster image on remote host
✓ Pulling navigator/cluster:latest (linux/arm64) on remote host
✓ Image d1i0nduu2f6qxk.cloudfront.net/navigator/cluster ready on remote host
✓ Creating cluster network
✓ Preparing cluster volume
✓ Creating cluster container
✓ Starting cluster container
x Waiting for kubeconfig
x Cluster failed: my-spark
Error: × cluster container is not running while waiting for kubeconfig: container exited (status=EXITED, exit_code=1)
│ container logs:
│ Warning: Could not discover Docker DNS ports from iptables
│ UDP_PORT=<not found> TCP_PORT=<not found>
│ DNS proxy setup failed, falling back to public DNS servers
│ Note: this may not work on Docker Desktop (Mac/Windows)
│ Configuring registry mirror for d1i0nduu2f6qxk.cloudfront.net via d1i0nduu2f6qxk.cloudfront.net (https)
│ ...
│ time="2026-03-05T17:37:36Z" level=fatal msg="no default routes found in \"/proc/net/route\" or \"/proc/net/ipv6_route\""
```
Note: the `Warning: Could not discover Docker DNS ports from iptables` message is a related signal — both DNS discovery and route presence depend on the Docker networking stack behaving as expected on bridge networks.
Root Cause Analysis
k3s's embedded Flannel CNI auto-detects the node's network interface by looking up the default route in /proc/net/route (IPv4) and /proc/net/ipv6_route (IPv6). If neither file contains a default route entry (0.0.0.0 destination), Flannel refuses to start and k3s exits fatally.
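Flannel's check can be reproduced outside k3s by scanning `/proc/net/route`, whose `Destination` column is little-endian hex (a default route is the row with `Destination` = `00000000`). A minimal sketch against hypothetical sample data — the real check reads the live file:

```sh
# Hypothetical /proc/net/route contents for a container that only has a
# subnet route (Destination 000011AC = 172.17.0.0) and no default route.
sample='Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 000011AC 00000000 0001 0 0 0 0000FFFF 0 0 0'

# A default route is any data row whose Destination field is 00000000.
if printf '%s\n' "$sample" | awk 'NR > 1 && $2 == "00000000" { found=1 } END { exit !found }'; then
  result="default route present"
else
  result="no default route"   # this is the state that makes k3s exit fatally
fi
echo "$result"
```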
Why the default route is missing
The cluster container runs on a user-defined Docker bridge network (navigator-cluster). Normally, Docker injects a default route through the bridge gateway into the container. However, on certain Docker configurations — particularly Docker Desktop with vpnkit/gvproxy, rootless Docker, or remote Docker engines — this route may not be present or may not be visible via /proc/net/route.
Why the code doesn't handle this
There are two gaps in the current cluster infrastructure:
1. `deploy/docker/cluster-entrypoint.sh` — has no route verification or creation logic before launching k3s. The DNS proxy fallback exists, but no equivalent for routes.
2. `crates/navigator-bootstrap/src/docker.rs` (`ensure_container()`, ~line 307) — the k3s server command does not pass `--flannel-iface` or `--node-ip`, so Flannel relies entirely on default route auto-detection.
Relevant code locations
| File | Lines | What |
|---|---|---|
| `deploy/docker/cluster-entrypoint.sh` | 283 | k3s launch — only adds `--resolv-conf`, no network flags |
| `deploy/docker/cluster-entrypoint.sh` | 42-90 | DNS proxy setup (works), but no route setup |
| `crates/navigator-bootstrap/src/docker.rs` | 296-305 | `HostConfig` — no sysctls, no `cgroupns_mode` |
| `crates/navigator-bootstrap/src/docker.rs` | 307-316 | k3s CMD — no `--flannel-iface`, `--node-ip`, or `--flannel-backend` |
| `crates/navigator-bootstrap/src/docker.rs` | 125-146 | Network creation — default bridge, no custom subnet/gateway |
Proposed Fix
The recommended approach is to add --flannel-iface=eth0 to the k3s server command. This bypasses the default-route-based interface auto-detection entirely, since the container's eth0 is always the bridge network interface.
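As a sketch, the resulting launch arguments would look like this (`--flannel-iface` and `--resolv-conf` are real `k3s server` flags; the exact command shape in `docker.rs` is assumed here, not copied from the code):

```sh
# Assemble the k3s server argument list with the interface override.
# eth0 is the conventional name Docker gives the container's first
# bridge-network interface, so pinning it avoids route auto-detection.
set -- server --resolv-conf=/etc/rancher/k3s/resolv.conf --flannel-iface=eth0
cmd="k3s $*"
echo "$cmd"
# The real entrypoint would then run: exec k3s "$@"
```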
This can be combined with a defensive route check in cluster-entrypoint.sh for additional robustness:
```sh
# Before launching k3s, ensure a default route exists
if ! ip route show default > /dev/null 2>&1 || [ -z "$(ip route show default)" ]; then
  echo "Warning: No default route found, attempting to add one"
  GATEWAY=$(ip route | grep -oP 'via \K[^ ]+' | head -1 || true)
  if [ -z "$GATEWAY" ]; then
    GATEWAY=$(ip -4 addr show eth0 | grep -oP 'inet \K[^/]+' | head -1)
    GATEWAY="${GATEWAY%.*}.1" # assume a .1 gateway on the subnet
  fi
  ip route add default via "$GATEWAY" dev eth0 || echo "Warning: Could not add default route"
fi
```

Implementation summary
- `crates/navigator-bootstrap/src/docker.rs` (~line 307): add `--flannel-iface=eth0` to the k3s server CMD
- `deploy/docker/cluster-entrypoint.sh` (before line 283): add a default route check/creation as a defensive fallback
- `agents/skills/debug-navigator-cluster/SKILL.md`: add this failure mode to the known issues/debug steps
- `architecture/cluster-single-node.md`: document the flannel interface override and route requirements
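The `.1`-gateway fallback from the entrypoint snippet above can be exercised in isolation. `derive_gateway` is a hypothetical helper name, and the `.1` convention is an assumption about Docker's default bridge addressing, not a guarantee:

```sh
# Hypothetical helper mirroring the fallback: when no "via" route exists,
# assume the gateway is the .1 address on the container's subnet.
derive_gateway() {
  printf '%s.1' "${1%.*}"   # strip the last octet, append .1
}

gw=$(derive_gateway 172.18.0.5)
echo "$gw"   # 172.18.0.1
```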
Environment
- Remote host: `spark` (linux/arm64)
- k3s version: v1.29.8-k3s1 (from `Dockerfile.cluster`)
- Docker network: user-defined bridge (`navigator-cluster`)
- Container mode: privileged