feat(linode): ephemeral cache wg-bootstrap end-to-end — Phase C (FJB-99)#116
Open
hstern wants to merge 2 commits into
…FJB-99)
Completes the bootstrap loop FJB-99 Phase A started: the cache nanode
generates its WG keypair at first boot and publishes the pubkey to S3
as public-read; the orchestrator polls the bucket via plain HTTPS GET
(no S3 SDK — WG public keys are designed to be world-readable) and
reads the cache's public IPv4 from the Linode API to build the WG
peer endpoint. Both feed into wgboot.Config as runtime overrides for
transport.wg.peer.{public_key,endpoint}.
Why HTTPS GET over signed S3: WG pubkeys are intentionally public;
the threat model isn't worse than what's already on the wire during
a handshake. Skipping the S3 SDK keeps the dependency surface lean.
What lands:
- cache cloud-init: the aws s3 cp now sets --acl public-read so the
orchestrator can fetch the pubkey via plain HTTPS.
- managedCache.WaitForWGPubkey: polls
<bucket-endpoint>/<bucket>/wg-pubkey.txt with 5s initial delay,
exponential backoff to 30s, ctx-cancellable. Returns the trimmed
pubkey on first 200 OK with a non-empty body.
- managedCache.PublicEndpoint: GetInstance on the cache linodeID,
picks the public (non-RFC1918) IPv4, appends the operator-configured
listen port or defaults to 51820.
- Linode duck-typed accessors WaitForCacheWGPubkey +
CachePublicEndpoint surface both through cmd/fj-bellows.
- wgboot.Config: new optional CachePubkey + CacheEndpoint fields.
planBoot reads from override first, falls back to config peer
knobs, errors out at boot if neither is set — clear error message
points operators at the bootstrap loop.
- config.WGPeer.validate: public_key + endpoint are now optional at
config-load time (the bootstrap loop populates them at runtime).
- cmd/fj-bellows.discoverCachePeer: runs the bootstrap loop ONLY when
static config knobs are empty. Daemon startup stays fast under
back-compat static config — matters for the FJB-91 e2e harness
that uses the persistent test cache.
Tests cover the bucket-poll happy path, timeout surface, public-IPv4
selection, and wgboot's new no-peer rejection. Verified end-to-end
against live Linode via
test/e2e-linode/run-local.sh --transport=cache-gateway (FJB-91 stack-up
scope still passes; the new code path is exercised but is a no-op
since the harness sets static peer config).
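
The public-IPv4 selection that the tests cover might look roughly like this. It is a sketch under assumptions: publicEndpoint and its string-slice input are hypothetical (the real code reads addresses from the Linode GetInstance response), and only the 51820 default and the non-RFC1918 rule come from the commit message.

```go
package main

import (
	"errors"
	"net"
	"net/netip"
	"strconv"
)

// defaultWGPort is the fallback listen port named in the commit message.
const defaultWGPort = 51820

// publicEndpoint (hypothetical helper) returns "ip:port" for the first
// IPv4 address outside the RFC 1918 private ranges. A listenPort of 0
// means the operator did not configure one.
func publicEndpoint(addrs []string, listenPort int) (string, error) {
	if listenPort == 0 {
		listenPort = defaultWGPort
	}
	for _, s := range addrs {
		ip, err := netip.ParseAddr(s)
		if err != nil || !ip.Is4() {
			continue // skip unparsable and non-IPv4 entries
		}
		if ip.IsPrivate() {
			continue // 10/8, 172.16/12, 192.168/16
		}
		return net.JoinHostPort(ip.String(), strconv.Itoa(listenPort)), nil
	}
	return "", errors.New("no public IPv4 address found")
}
```
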
Out of scope (deferred to FJB-99 Phase C):
- e2e harness reactivation against the ephemeral cache (drops
persistent-cache preflight, reactivates worker readiness checks).
- Drop the static transport.wg.peer.{public_key,endpoint} config
knobs entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the FJB-99 arc. Each e2e run now provisions its own per-deployment
cache; the cache cloud-init generates a WG keypair, uploads its pubkey
to S3, and the orchestrator polls the bucket via a signed S3 GET. wgboot
brings up the cache-gateway transport against the fresh cache without
any persistent external infrastructure.
Bootstrap loop (validated 1m35s create-to-tunnel-up against live Linode):
- cache cloud-init curls a pre-signed S3 PUT URL the orchestrator
baked in at cache-create time. Avoids awscli on the cache entirely:
awscli on Debian 13 has a NoneType-iteration bug against Linode
Object Storage that breaks both `s3 cp` and `s3api put-object`.
Pre-signing on the orchestrator (minio-go) sidesteps the buggy code
path; the cache just curls the URL with --fail-with-body.
- managedCache.presignedWGPubkeyPutURL signs the PUT with the same
scoped bucket creds that already reach the cache via cloud-init.
- managedCache.WaitForWGPubkey switched to minio-go for the GET
(Phase B's plain HTTPS path assumed public-read, which Linode
rejects with NotImplemented; minio-go signs).
- managedBucket exposes accessKey + secretKey so cache.go can build
a signed client without re-fetching the API key.
Harness reactivation (FJB-91 stack-up scope removed):
- Drops the persistent-cache preflight (was the FJB-91 workaround).
fj-bellows provisions the cache from scratch; LoadOrGenerateKey
creates the orchestrator's WG private key on first run.
- transport.wg.peer.{public_key,endpoint} are left empty in the
rendered e2e config — the bootstrap loop fills both.
- Control-plane readiness wait bumps to 600s under cache-gateway:
cache provision + cloud-init + apt update + wireguard install +
pubkey publish + wgboot.Boot all sit on the path. Discover-peer
bound bumps to 8 min for the same reason.
- On failure, the harness now SSHes into the cache and dumps
/var/log/fjb-wg-bootstrap.log + wg show + iptables + ip route +
cache to worker connectivity probes.
Firewall debug break-glass (FJB-89 regression):
- synthSpecsForTransport under cache-gateway now emits BOTH the
WG listener (load-bearing) AND tcp/22 (debug break-glass). An
earlier FJB-89 revision closed tcp/22 entirely, which made cache
cloud-init failures undebuggable. Operators can lock tcp/22 down
via allow_inbound once their deployment is stable.
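
The dual-spec output can be pictured with the sketch below. The firewallSpec type and cacheGatewaySpecs function are hypothetical illustrations, not the signature of synthSpecsForTransport; only the udp WG listener plus tcp/22 pairing is taken from this commit.

```go
package main

import "strconv"

// firewallSpec is a hypothetical inbound-allow rule.
type firewallSpec struct {
	Protocol string
	Port     string
	Purpose  string
}

// cacheGatewaySpecs returns the two inbound rules for the cache under the
// cache-gateway transport: the WireGuard listener and SSH for debugging.
func cacheGatewaySpecs(wgListenPort int) []firewallSpec {
	return []firewallSpec{
		{Protocol: "udp", Port: strconv.Itoa(wgListenPort), Purpose: "wireguard listener (load-bearing)"},
		{Protocol: "tcp", Port: "22", Purpose: "debug break-glass"},
	}
}
```
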
Out of scope (handed off):
- Worker dispatch via netstack: cache to worker connectivity inside
the VPC still fails (likely Linode firewall on the worker's VPC
interface). Tracked under the FJB-94 fjbagent transition that
replaces SSH dispatch entirely. FJB-99 Phase C stops after the
bootstrap loop is validated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the FJB-99 arc. Each e2e run now provisions its own per-deployment cache; bootstrap loop validated end-to-end against live Linode in 1m35s create-to-tunnel-up.
Bootstrap loop
- Cache cloud-init curls a pre-signed S3 PUT URL instead of calling awscli, which on Debian 13 has a NoneType-iteration bug against Linode Object Storage that breaks both `s3 cp` and `s3api put-object`. Pre-signing on the orchestrator (minio-go) sidesteps the buggy path; the cache just curls.
- `managedCache.presignedWGPubkeyPutURL` signs the PUT with the same scoped bucket creds cloud-init already ships.
- `WaitForWGPubkey` switched to minio-go for the GET (Phase B's plain-HTTPS path assumed `public-read`, which Linode rejects with NotImplemented; minio-go signs).
- `managedBucket` now exposes `accessKey` + `secretKey` so cache.go can build a signed client without re-fetching.
Harness reactivation
- Drops the persistent-cache preflight. `LoadOrGenerateKey` creates the orchestrator WG private key on first run.
- `transport.wg.peer.{public_key,endpoint}` left empty in rendered e2e config; bootstrap loop fills both.
- On failure, the harness SSHes into the cache and dumps `/var/log/fjb-wg-bootstrap.log` + `wg show` + `iptables` + `ip route` + cache↔worker connectivity probes. Cuts debug time for cache-cloud-init regressions.
Firewall debug break-glass (FJB-89 regression fix)
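- `synthSpecsForTransport` under cache-gateway now emits BOTH `udp/<wg listen>` (load-bearing) AND `tcp/22` (break-glass). An earlier FJB-89 revision closed tcp/22 entirely; that made cache cloud-init failures undebuggable. Operators can lock tcp/22 down via `allow_inbound` once their deployment is stable.
Out of scope (handed off)
- Worker dispatch via netstack: cache to worker connectivity inside the VPC still fails (likely Linode firewall on the worker's VPC interface). Tracked under the FJB-94 fjbagent transition that replaces SSH dispatch entirely.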
Test plan
- `go build ./...`
- `go test ./...` (incl. updated `TestRenderCacheCloudInitWGBootstrap` for presigned-curl, `TestWaitForWGPubkey_*` via fake S3 with sigv4 over httptest, firewall synth tests updated for the tcp/22 + WG dual-spec)
- `golangci-lint run` (0 issues)
- `bash test/e2e-linode/run-local.sh --transport=cache-gateway` → ALL OK (live Linode; fresh ephemeral cache; 1m35s pubkey discovery)
🤖 Generated with Claude Code