feat(linode): ephemeral cache wg-bootstrap end-to-end — Phase C (FJB-99)#116

Open
hstern wants to merge 2 commits into main from feat/fjb-99-phase-c-harness-reactivation

hstern (Owner) commented May 29, 2026

Closes the FJB-99 arc. Each e2e run now provisions its own per-deployment cache; bootstrap loop validated end-to-end against live Linode in 1m35s create-to-tunnel-up.

Bootstrap loop

  • Cache cloud-init curls a pre-signed S3 PUT URL the orchestrator baked in at create time. No awscli on the cache — Debian 13's awscli has a NoneType-iteration bug against Linode Object Storage that breaks both s3 cp and s3api put-object. Pre-signing on the orchestrator (minio-go) sidesteps the buggy path; cache just curls.
  • managedCache.presignedWGPubkeyPutURL signs the PUT with the same scoped bucket creds cloud-init already ships.
  • WaitForWGPubkey switched to minio-go for the GET (Phase B's plain-HTTPS path assumed public-read, which Linode rejects with NotImplemented — minio-go signs).
  • managedBucket now exposes accessKey + secretKey so cache.go can build a signed client without re-fetching.

Harness reactivation

  • Drops the persistent-cache preflight (the FJB-91 workaround). LoadOrGenerateKey creates the orchestrator WG private key on first run.
  • transport.wg.peer.{public_key,endpoint} left empty in rendered e2e config; bootstrap loop fills both.
  • Control-plane wait bumps to 600s under cache-gateway (cache provision + cloud-init + apt + wireguard install + pubkey publish + wgboot.Boot all on the path). Discover-peer bound bumps to 8 min for the same reason.
  • On failure, harness SSHes into the cache and dumps /var/log/fjb-wg-bootstrap.log + wg show + iptables + ip route + cache↔worker connectivity probes. Cuts debug time for cache-cloud-init regressions.

Firewall debug break-glass (FJB-89 regression fix)

  • synthSpecsForTransport under cache-gateway now emits BOTH udp/<wg listen> (load-bearing) AND tcp/22 (break-glass). An earlier FJB-89 revision closed tcp/22 entirely; that made cache cloud-init failures undebuggable. Operators can lock tcp/22 down via allow_inbound once their deployment is stable.
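
As a rough illustration of the dual-spec output, the following Go sketch emits one UDP rule for the WG listener and one TCP rule for SSH. The `ruleSpec` type and `synthCacheGatewaySpecs` function are hypothetical stand-ins; the real `synthSpecsForTransport` signature and spec type are not shown in this PR description.

```go
package main

import "fmt"

// ruleSpec is a hypothetical stand-in for the project's firewall spec type.
type ruleSpec struct {
	Protocol string
	Port     int
	Purpose  string
}

// synthCacheGatewaySpecs returns the two inbound rules described above:
// the WireGuard listener and the SSH break-glass port.
func synthCacheGatewaySpecs(wgListenPort int) []ruleSpec {
	return []ruleSpec{
		{Protocol: "udp", Port: wgListenPort, Purpose: "WireGuard listener (load-bearing)"},
		{Protocol: "tcp", Port: 22, Purpose: "SSH break-glass for debugging"},
	}
}

func main() {
	// 51820 is the default listen port mentioned in the Phase B commit.
	for _, s := range synthCacheGatewaySpecs(51820) {
		fmt.Printf("%s/%d  %s\n", s.Protocol, s.Port, s.Purpose)
	}
}
```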

Out of scope (handed off)

  • Worker dispatch via netstack: cache↔worker VPC connectivity still fails (likely Linode firewall on the worker's VPC interface). Tracked under the FJB-94 fjbagent transition that replaces SSH dispatch entirely. Phase C stops after the bootstrap loop is validated.

Test plan

  • go build ./...
  • go test ./... (incl. updated TestRenderCacheCloudInitWGBootstrap for presigned-curl, TestWaitForWGPubkey_* via fake S3 with sigv4 over httptest, firewall synth tests updated for the tcp/22 + WG dual-spec)
  • golangci-lint run (0 issues)
  • bash test/e2e-linode/run-local.sh --transport=cache-gateway: ALL OK (live Linode; fresh ephemeral cache; 1m35s pubkey discovery)

🤖 Generated with Claude Code

hstern and others added 2 commits May 28, 2026 21:06
…FJB-99)

Completes the bootstrap loop FJB-99 Phase A started: the cache nanode
generates its WG keypair at first boot and publishes the pubkey to S3
as public-read; the orchestrator polls the bucket via plain HTTPS GET
(no S3 SDK — WG public keys are designed to be world-readable) and
reads the cache's public IPv4 from the Linode API to build the WG
peer endpoint. Both feed into wgboot.Config as runtime overrides for
transport.wg.peer.{public_key,endpoint}.

Why HTTPS GET over signed S3: WG pubkeys are intentionally public;
the threat model isn't worse than what's already on the wire during
a handshake. Skipping the S3 SDK keeps the dependency surface lean.

What lands:
- cache cloud-init: the aws s3 cp now sets --acl public-read so the
  orchestrator can fetch the pubkey via plain HTTPS.
- managedCache.WaitForWGPubkey: polls
  <bucket-endpoint>/<bucket>/wg-pubkey.txt with 5s initial delay,
  exponential backoff to 30s, ctx-cancellable. Returns the trimmed
  pubkey on first 200 OK with a non-empty body.
- managedCache.PublicEndpoint: GetInstance on the cache linodeID,
  picks the public (non-RFC1918) IPv4, appends the operator-configured
  listen port or defaults to 51820.
- Linode duck-typed accessors WaitForCacheWGPubkey +
  CachePublicEndpoint surface both through cmd/fj-bellows.
- wgboot.Config: new optional CachePubkey + CacheEndpoint fields.
  planBoot reads from override first, falls back to config peer
  knobs, errors out at boot if neither is set — clear error message
  points operators at the bootstrap loop.
- config.WGPeer.validate: public_key + endpoint are now optional at
  config-load time (the bootstrap loop populates them at runtime).
- cmd/fj-bellows.discoverCachePeer: runs the bootstrap loop ONLY when
  static config knobs are empty. Daemon startup stays fast under
  back-compat static config — matters for the FJB-91 e2e harness
  that uses the persistent test cache.

Tests cover the bucket-poll happy path, timeout surface, public-IPv4
selection, and wgboot's new no-peer rejection. Verified end-to-end
against live Linode via
test/e2e-linode/run-local.sh --transport=cache-gateway
(FJB-91 stack-up scope still passes; the new code path is exercised
but is a no-op since the harness sets static peer config).

Out of scope (FJB-99 Phase C):
- e2e harness reactivation against the ephemeral cache (drops
  persistent-cache preflight, reactivates worker readiness checks).
- Drop the static transport.wg.peer.{public_key,endpoint} config
  knobs entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the FJB-99 arc. Each e2e run now provisions its own per-deployment
cache; the cache cloud-init generates a WG keypair, uploads its pubkey
to S3, and the orchestrator polls the bucket via a signed S3 GET. wgboot
brings up the cache-gateway transport against the fresh cache without
any persistent external infrastructure.

Bootstrap loop (validated 1m35s create-to-tunnel-up against live Linode):
- cache cloud-init curls a pre-signed S3 PUT URL the orchestrator
  baked in at cache-create time. Avoids awscli on the cache entirely:
  awscli on Debian 13 has a NoneType-iteration bug against Linode
  Object Storage that breaks both `s3 cp` and `s3api put-object`.
  Pre-signing on the orchestrator (minio-go) sidesteps the buggy code
  path; the cache just curls the URL with --fail-with-body.
- managedCache.presignedWGPubkeyPutURL signs the PUT with the same
  scoped bucket creds that already reach the cache via cloud-init.
- managedCache.WaitForWGPubkey switched to minio-go for the GET
  (Phase B's plain HTTPS path assumed public-read, which Linode
  rejects with NotImplemented; minio-go signs).
- managedBucket exposes accessKey + secretKey so cache.go can build
  a signed client without re-fetching the API key.

Harness reactivation (FJB-91 stack-up scope removed):
- Drops the persistent-cache preflight (was the FJB-91 workaround).
  fj-bellows provisions the cache from scratch; LoadOrGenerateKey
  creates the orchestrator's WG private key on first run.
- transport.wg.peer.{public_key,endpoint} are left empty in the
  rendered e2e config — the bootstrap loop fills both.
- Control-plane readiness wait bumps to 600s under cache-gateway:
  cache provision + cloud-init + apt update + wireguard install +
  pubkey publish + wgboot.Boot all sit on the path. Discover-peer
  bound bumps to 8 min for the same reason.
- On failure, the harness now SSHes into the cache and dumps
  /var/log/fjb-wg-bootstrap.log + wg show + iptables + ip route +
  cache to worker connectivity probes.

Firewall debug break-glass (FJB-89 regression):
- synthSpecsForTransport under cache-gateway now emits BOTH the
  WG listener (load-bearing) AND tcp/22 (debug break-glass). An
  earlier FJB-89 revision closed tcp/22 entirely, which made cache
  cloud-init failures undebuggable. Operators can lock tcp/22 down
  via allow_inbound once their deployment is stable.

Out of scope (handed off):
- Worker dispatch via netstack: cache to worker connectivity inside
  the VPC still fails (likely Linode firewall on the worker's VPC
  interface). Tracked under the FJB-94 fjbagent transition that
  replaces SSH dispatch entirely. FJB-99 Phase C stops after the
  bootstrap loop is validated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hstern enabled auto-merge (squash) May 29, 2026 12:42