diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md new file mode 100644 index 0000000..0a35e60 --- /dev/null +++ b/tickets/devop-579-network-policy-rollout.md @@ -0,0 +1,166 @@ +# DEVOP-579 — NetworkPolicy egress rollout plan + +**Status:** plan only. Execution is staged across 3 engineer-weeks. Do NOT deploy any NetworkPolicies based on this plan without the rollout owner signing off on the scope of each phase. + +## Goal + +Add `default-deny-egress` NetworkPolicies to every Kubernetes namespace across our 13 clusters, then layer explicit egress allowlists per workload. Closes the "compromised pod can call out to attacker-controlled C2" Shai-Hulud propagation path. + +## Why this is hard (the 3-engineer-week estimate) + +NetworkPolicies are additive allowlists: once any egress policy selects a pod, every destination not explicitly allowed is silently dropped. A `default-deny` policy will therefore break every workload that has a legitimate outbound dependency not yet enumerated, and a rushed rollout has a production-impacting blast radius. The bulk of the work is **discovery**, not deployment. + +## Phase 0 — Pre-flight (week 1, days 1–2) + +- [ ] Confirm CNI on every cluster supports NetworkPolicy (Calico with Felix as the enforcer — yes; Cilium — yes, native; Antrea — yes; flannel without the `--network-policy` flag — no). +- [ ] For any cluster still on flannel-without-NetworkPolicy, plan a CNI migration to Calico or Cilium before proceeding. NetworkPolicy enforcement is unavailable otherwise; this rollout cannot land on those clusters until the CNI migration is done. +- [ ] Enable flow logs on at least one staging cluster. We need ~7 days of baseline traffic to enumerate legitimate egress. Note: these are L3/L4 logs — they give source/destination IP, port, and protocol only, **not** FQDNs. 
CNI-specific enablement: + - Cilium: `cilium hubble enable` (and ensure flow retention covers 7 days; `hubble-relay` persists in-memory by default, so for the baseline window export to a sink the team can query later — Loki, S3, or BigQuery). + - Calico OSS: patch the `default` `FelixConfiguration` CR with `spec.flowLogsFileEnabled: true` (plus `flowLogsFileIncludeLabels: true` so workload identity is queryable). Felix writes per-node JSON flow logs under `/var/log/calico/flowlogs/`; ship those to the same sink as above. OSS file-based flow logs cover allow/deny actions only — for richer flow context, Calico Enterprise / Calico Cloud is required, otherwise prefer the Cilium staging cluster for baseline capture. + - Antrea: enable the `FlowExporter` feature gate on the agent and run `flow-aggregator` to export to the sink. +- [ ] Enable DNS message capture on the same cluster so Phase 1 can map destination IPs back to FQDNs. The capture mechanism must include the DNS *response* (specifically the A / AAAA answer-section IPs), not just the query — without those resolved IPs there is nothing to join the flow-log `dstIP` against. Concretely: + - CoreDNS: enable the `dnstap` plugin with the `full` flag (e.g., `dnstap /var/run/dnstap.sock full` or `dnstap tcp://:6000 full`) and run a dnstap collector (such as `golang-dnstap` or `dnstap-receiver`) that decodes the wire-format messages and ships `(timestamp, client_pod_ip, query_name, response_ips[])` records to the same sink as the flow logs. The query-only `log` plugin is insufficient — it emits client IP, query name, and response code but not the answer-section IPs, so the `dstIP` join cannot be made from `log` output alone. The DNS messages are then joined to the flow logs on `(srcPodIP == client_pod_ip, dstIP ∈ response_ips, timestamp ± window)`. 
+ - On Cilium clusters where we plan to author DNS-aware policies anyway, enable Cilium L7 DNS visibility (`hubble observe --type=dns` works once the DNS proxy is in path) — Hubble records per-pod resolved FQDNs *and* the answer IPs together, which removes the cross-source join entirely. + - Confirm dnstap / Hubble retention covers the full 7-day baseline window before Phase 1 starts. + +## Phase 1 — Discovery (week 1, days 3–5; week 2, days 1–2) + +For each namespace, in priority order (highest-value first): +1. `allora-chain-validators` +2. `allora-chain-rpc` +3. `harbor` +4. `flux-system` +5. ingress-nginx / traefik +6. cert-manager +7. application namespaces (`robonet`, `eliza-allora`, etc.) +8. system namespaces last (`kube-system`, `gke-system`) + +For each: +- [ ] Capture 7 days of egress flow logs from baseline. +- [ ] Enumerate destination CIDRs and ports directly from the flow logs (these are the only fields L3/L4 actually carries). +- [ ] Resolve those destinations to FQDNs by joining the flow logs against the CoreDNS dnstap stream (or Cilium L7 DNS events) enabled in Phase 0, on `(srcPodIP == DNS client IP, dstIP ∈ DNS response answer IPs, timestamp ± window)`. Destinations that have no matching DNS lookup (e.g., hard-coded IP literals, `169.254.169.254`, raw cloud-metadata IPs) get carried through as IP-only entries and are scrutinized in the suspect-destination step below. +- [ ] Group by category: `internal` (other Allora namespaces), `infra` (cloud-provider metadata, DNS, NTP, GKE), `vendor-saas` (Datadog, Slack, etc.), `package-registries` (npm, pypi, docker.io, ghcr.io, quay.io — these become Harbor proxy-cache after DEVOP-589), `customer-traffic` (per-namespace). +- [ ] **Explicitly flag suspect egress destinations** for incident review before they get added to any allowlist. 
Treat the following as suspect by default: + - Generic webhook receivers (`*.webhook.site`, `discord.com/api/webhooks/*`, `hooks.slack.com/services/*` that aren't ours, `*.pipedream.net`, `*.requestbin.com`, etc.). + - Pastebin-family services (`pastebin.com`, `paste.ee`, `hastebin.com`, `gist.githubusercontent.com` raw fetches from non-org accounts, `transfer.sh`, `0x0.st`). + - Tunnel / reverse-proxy services (`*.ngrok.io`, `*.ngrok-free.app`, `*.loca.lt`, `*.trycloudflare.com`, `*.serveo.net`). + - Cloud-instance metadata endpoints from inside a pod (`169.254.169.254`, `metadata.google.internal`, `100.100.100.200`) — these should be blocked outright unless a specific workload demonstrably needs them, and even then via an IRSA / Workload Identity allowlist, not raw IP. + - Anything resolving to a residential/dynamic-DNS provider (`*.duckdns.org`, `*.no-ip.com`, `*.dyndns.org`). + Each flagged destination needs an incident-response review: confirm a legitimate owner, document the use case, and either allowlist with a tight CIDR / FQDN or open a remediation ticket. Do NOT roll suspect destinations into the allowlist by default just because they appear in the 7-day baseline. +- [ ] Document in this repo as `network-policies/discovery/.md` for future audit, including the suspect-destination review notes. + +## Phase 2 — Allowlist authoring (week 2, days 3–5) + +### NetworkPolicy naming convention (pinned — runbook depends on it) + +Every NetworkPolicy this rollout creates uses one of these exact `metadata.name` values, in the namespace it targets. The rollback runbook (Phase 3) and the Kyverno asserter (Phase 4) both grep on these names, so deviations break both. + +| Direction | File name | `metadata.name` | Purpose | +|---|---|---|---| +| Egress | `default-deny-egress.yaml` | `default-deny-egress` | Deny all egress except the baseline allows in `egress-baseline-allow.yaml`. 
| +| Egress | `egress-baseline-allow.yaml` | `egress-baseline-allow` | Cluster-wide always-on allows: DNS to kube-dns / CoreDNS (53/udp, 53/tcp) and NTP (123/udp). Lives in every namespace so clock sync and name resolution survive the default-deny. | +| Egress | `egress-allowlist.yaml` | `egress-allowlist` | Per-namespace workload-specific egress allows derived from Phase 1. | +| Ingress | `default-deny-ingress.yaml` | `default-deny-ingress` | Deny all ingress except the allows in `ingress-allowlist.yaml`. | +| Ingress | `ingress-allowlist.yaml` | `ingress-allowlist` | Per-namespace ingress allows (ingress controller, same-namespace pods, explicit upstreams). | + +Per namespace, write the two egress files (`default-deny-egress.yaml` plus `egress-allowlist.yaml`) under `network-policies///`. The baseline-allow policy is generated from a single template applied to every namespace; do not hand-author it per-namespace. + +### Patterns derived from Phase 1 + +- DNS to kube-dns / CoreDNS (53/udp, 53/tcp): lives in `egress-baseline-allow`, never in per-workload allowlists. +- NTP (123/udp): lives in `egress-baseline-allow`. +- Cluster-internal pod-to-pod within same namespace: allow by default in the per-namespace `egress-allowlist`. +- Outbound to other Allora namespaces: explicit per-namespace allow (no blanket). +- Outbound to public internet: only through a designated egress proxy (or allowlisted by CIDR). + +## Phase 3 — Staged rollout (week 3) + +DEVOP-579 specifies **48-hour soak windows** between rollout stages +(not 24h) so a full business-day cycle plus a quieter overnight cycle +both elapse before the next stage advances. This catches workloads +whose egress only fires on cron/batch schedules. + +- [ ] Days 1–2: apply policies to **1 staging namespace** in **1 staging cluster**. Soak 48h. +- [ ] Days 3–4: apply to all staging namespaces in 1 cluster. Soak 48h. +- [ ] Days 5–6: apply to 1 production namespace (lowest-risk: docs site). Soak 48h. 
+- [ ] Days 7+: roll forward through remaining production namespaces in priority-inverse order (lowest-blast-radius first), keeping a 48h soak between each cluster cohort. + +A stage may only advance if the prior soak completed with zero +NetworkPolicy-attributable incidents. If anything broke, hold the +window open until the root cause is fixed (or the policy is amended) +and restart the 48-hour clock for that stage. + +**Rollback procedure** (must be documented before Day 1): +- Egress emergency: `kubectl delete networkpolicy default-deny-egress egress-baseline-allow egress-allowlist -n `. All three must go to restore unrestricted egress: a pod stays egress-isolated as long as any NetworkPolicy with `Egress` in its `policyTypes` selects it, so deleting `default-deny-egress` alone frees only the pods that neither `egress-baseline-allow` nor `egress-allowlist` selects. +- Ingress emergency (once Phase 3 has rolled the ingress cohort): `kubectl delete networkpolicy default-deny-ingress ingress-allowlist -n `. Both are needed for the same reason: `ingress-allowlist` on its own still isolates the pods it selects. +- All names match the pinned naming convention in Phase 2. If you find a namespace whose `default-deny-egress` policy has a different name, treat it as a policy-drift incident and fix the name before relying on the rollback command. +- Have both commands ready as runbook steps in the on-call channel. + +## Phase 4 — Steady state + +- [ ] Add Kyverno policies (after DEVOP-588 lands) that fail any new non-system namespace which is missing either a `default-deny-egress` or a `default-deny-ingress` NetworkPolicy. Match by `metadata.name` exactly — these are the names pinned in the Phase 2 convention table. +- [ ] Monthly review of `discovery/.md` for changes in legitimate egress (new vendor SaaS, etc.). 
+- [ ] **Document the rollout and steady-state policies in `SECURITY-RUNBOOK.md`** (DEVOP-571): add a NetworkPolicy section covering (a) the default-deny model, (b) where the per-namespace allowlists live in this repo, (c) the rollback commands (`kubectl delete networkpolicy default-deny-egress egress-baseline-allow egress-allowlist -n ` and `kubectl delete networkpolicy default-deny-ingress ingress-allowlist -n `, both required and named per the Phase 2 convention; the allow policies are included because any remaining policy of the same direction keeps the pods it selects isolated), and (d) the on-call escalation path when a workload reports egress or ingress failures. Without this hook into the runbook, on-call has no reference for diagnosing "my pod can't reach X" pages once default-deny is org-wide. + +## Dependencies + +- Harbor proxy-cache projects (DEVOP-589) must land **before** Phase 2, or the allowlists will churn: they would first have to allow direct `ghcr.io` etc., then be rewritten to allow only `harbor.allora-network.io`. +- Kyverno on all clusters (DEVOP-588) is a soft dependency: Phase 4 needs it but Phases 0–3 can proceed. + +## Ingress default-deny — same model, separate rollout cohort + +DEVOP-579 requires default-deny for both egress **and** ingress. The +two share a rollout shape but have different blast radii and +different discovery inputs, so they run as parallel cohorts rather +than as one combined sweep. + +For ingress, mirror Phases 0–4 above with these substitutions: + +- **Phase 1 (discovery)**: capture the *inbound* flow logs per + namespace for 7 days. Categorize sources by `internal` (other + Allora namespaces), `infra` (ingress controllers, load balancers, + health-check probes), `vendor-saas` (webhook callbacks, etc.), and + `public-traffic` (customer-facing routes). Apply the same + suspect-destination flagging in reverse: any inbound source that + resolves to a residential-DNS / tunnel service / cloud-metadata + range is reviewed before allowlisting. 
+- **Phase 2 (allowlist authoring)**: per namespace, write + `network-policies///default-deny-ingress.yaml` + (`metadata.name: default-deny-ingress`) plus + `ingress-allowlist.yaml` (`metadata.name: ingress-allowlist`). Both + names match the pinned convention in the egress Phase 2 table. + Pattern: deny all inbound by default, allow from the ingress + controller's pod selector, allow from same-namespace pods, then + explicit allow rules per legitimate upstream. +- **Phase 3 (staged rollout)**: same 48-hour soak windows. Ingress + blast radius is generally *larger* than that of egress (a misconfigured + ingress policy can take a service offline for real users, not just + internal callouts), so the production cohort starts later and + proceeds more slowly than egress. +- **Phase 4 (steady state)**: Kyverno rule asserting that every + non-system namespace has both a `default-deny-egress` and a + `default-deny-ingress` NetworkPolicy (greps on the pinned + `metadata.name` values from the Phase 2 table). Runbook section + covers both rollback commands. + +Run egress first (it's lower-risk because the failure mode is +"workload can't reach Datadog" rather than "customers can't reach our +API"). Start ingress discovery in parallel during Phase 0–1 of the +egress rollout so the two cohorts can converge on Phase 4 around the +same time. + +## Out of scope for this ticket + +- IDS/anomaly detection on egress flow logs — separate ticket (Falco rules, DEVOP-570). + +## Who runs this + +- Owner: cluster-admin / platform team. +- Reviewer: security team (sign-off on each phase before proceeding to the next). +- Estimated total engineer-time: ~3 engineer-weeks calendar, ~50% utilization (lots of waiting for flow-log baselines to accumulate). + +## Links + +- Linear: https://linear.app/alloralabs/issue/DEVOP-579 +- Cilium NetworkPolicy reference: https://docs.cilium.io/en/stable/security/policy/ +- Calico NetworkPolicy reference: https://docs.tigera.io/calico/latest/network-policy/
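
For reference, the two template-driven egress policies named in the Phase 2 table (`default-deny-egress` and `egress-baseline-allow`) might look roughly like the sketch below. It is not the ticket's actual template. The namespace `example-namespace` is hypothetical, the `kube-system` / `k8s-app: kube-dns` selectors are common defaults that should be checked against each cluster, and leaving the NTP destination unconstrained is an assumption, since the ticket only fixes the port.

```yaml
# Sketch only: names match the Phase 2 table; everything else is an assumption.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: example-namespace   # hypothetical namespace
spec:
  podSelector: {}                # empty selector: every pod in the namespace
  policyTypes:
    - Egress                     # no egress rules listed, so nothing is allowed
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-baseline-allow
  namespace: example-namespace   # hypothetical namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    # DNS to kube-dns / CoreDNS (assumed to run in kube-system with this label)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # NTP on 123/udp to any destination (assumption: no fixed NTP CIDR is given)
    - ports:
        - protocol: UDP
          port: 123
```

A manifest like this can be checked before it reaches a cluster with `kubectl apply --dry-run=client -f <file>`. Note that `egress-baseline-allow` itself lists `Egress` under `policyTypes`, so every pod it selects is egress-isolated by this policy as well as by `default-deny-egress`.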