From 3607f9522861e46ba0c6a990d51f22ed9d2ab40e Mon Sep 17 00:00:00 2001
From: srt0422
Date: Thu, 21 May 2026 17:07:16 -0700
Subject: [PATCH 1/6] docs: add DEVOP-579 NetworkPolicy egress rollout plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

NetworkPolicy egress hardening is a 3-engineer-week project that must
NOT be rushed — `default-deny-egress` silently breaks every workload
that has an un-enumerated outbound dependency. The bulk of the work is
discovery (7 days of baseline flow logs per namespace), not deployment.
This doc captures the staged rollout plan so subsequent loop runs (or
whoever picks up execution) don't redo the planning work.

Covers:
- Phase 0: pre-flight (CNI compat, flow log enablement).
- Phase 1: discovery (per-namespace egress enumeration).
- Phase 2: allowlist authoring.
- Phase 3: staged rollout (1 staging → 1 prod → fan out).
- Phase 4: steady-state (Kyverno schema enforcement, monthly review).

Dependencies:
- DEVOP-589 (Harbor proxy-cache) must land before Phase 2 or the
  allowlists will churn.
- DEVOP-588 (Kyverno on all clusters) is a soft dep for Phase 4.

This PR adds the doc only. No NetworkPolicy is deployed.

Linear: https://linear.app/alloralabs/issue/DEVOP-579

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 tickets/devop-579-network-policy-rollout.md | 86 +++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 tickets/devop-579-network-policy-rollout.md

diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md
new file mode 100644
index 0000000..6510847
--- /dev/null
+++ b/tickets/devop-579-network-policy-rollout.md
@@ -0,0 +1,86 @@
+# DEVOP-579 — NetworkPolicy egress rollout plan
+
+**Status:** plan only. Execution is staged across 3 engineer-weeks. Do NOT deploy any NetworkPolicies based on this plan without the rollout owner signing off on the scope of each phase.
+
+## Goal
+
+Add `default-deny-egress` NetworkPolicies to every Kubernetes namespace across our 13 clusters, then layer explicit egress allowlists per workload. Closes the "compromised pod can call out to attacker-controlled C2" Shai-Hulud propagation path.
+
+## Why this is hard (the 3-engineer-week estimate)
+
+NetworkPolicies are stateless and additive — meaning a `default-deny` policy will silently break every workload that has a legitimate outbound dependency that isn't yet enumerated. Production-impacting blast radius if rushed. The bulk of the work is **discovery**, not deployment.
+
+## Phase 0 — Pre-flight (week 1, days 1–2)
+
+- [ ] Confirm CNI on every cluster supports NetworkPolicy (Calico, Cilium, Antrea — yes; flannel without --network-policy — no).
+- [ ] Stand up `network-policy-engine` (Calico) or use Cilium's native NPL on any cluster that's still on flannel.
+- [ ] Enable flow logs on at least one staging cluster: `cilium hubble enable` or `calicoctl flow logs enable`. We need ~7 days of baseline traffic to enumerate legitimate egress.
+
+## Phase 1 — Discovery (week 1, days 3–5; week 2, days 1–2)
+
+For each namespace, in priority order (highest-value first):
+1. `allora-chain-validators`
+2. `allora-chain-rpc`
+3. `harbor`
+4. `flux-system`
+5. ingress-nginx / traefik
+6. cert-manager
+7. application namespaces (`robonet`, `eliza-allora`, etc.)
+8. system namespaces last (`kube-system`, `gke-system`)
+
+For each:
+- [ ] Capture 7 days of egress flow logs from baseline.
+- [ ] Enumerate destination CIDRs, DNS names, and ports.
+- [ ] Group by category: `internal` (other Allora namespaces), `infra` (cloud-provider metadata, DNS, NTP, GKE), `vendor-saas` (Datadog, Slack, etc.), `package-registries` (npm, pypi, docker.io, ghcr.io, quay.io — these become Harbor proxy-cache after DEVOP-589), `customer-traffic` (per-namespace).
+- [ ] Document in this repo as `network-policies/discovery/.md` for future audit.
+
+## Phase 2 — Allowlist authoring (week 2, days 3–5)
+
+Per namespace, write two files:
+- `network-policies///default-deny.yaml` — applies to all pods in the namespace, blocks all egress except DNS.
+- `network-policies///allowlist.yaml` — explicit egress rules derived from Phase 1.
+
+Patterns to standardize:
+- DNS always allowed to kube-dns / coredns (53/udp, 53/tcp).
+- NTP always allowed (123/udp).
+- Cluster-internal pod-to-pod within same namespace: allow by default.
+- Outbound to other Allora namespaces: explicit per-namespace allow (no blanket).
+- Outbound to public internet: only through a designated egress proxy (or whitelist by CIDR).
+
+## Phase 3 — Staged rollout (week 3)
+
+- [ ] Day 1: apply policies to **1 staging namespace** in **1 staging cluster**. Observe 24h.
+- [ ] Day 2: apply to all staging namespaces in 1 cluster. Observe 24h.
+- [ ] Day 3: apply to 1 production namespace (lowest-risk: docs site). Observe 24h.
+- [ ] Days 4–5: roll forward through remaining production namespaces in priority-inverse order (lowest-blast-radius first).
+
+**Rollback procedure** (must be documented before Day 1):
+- `kubectl delete networkpolicy default-deny -n ` — un-breaks egress instantly.
+- Have this command ready as a runbook step in the on-call channel.
+
+## Phase 4 — Steady state
+
+- [ ] Add NetworkPolicy schemas to Kyverno (after DEVOP-588 lands) so any new namespace without a `default-deny` is auto-flagged.
+- [ ] Monthly review of `discovery/.md` for changes in legitimate egress (new vendor SaaS, etc.).
+
+## Dependencies
+
+- Harbor proxy-cache projects (DEVOP-589) must land **before** Phase 2, or the allowlists will be churn — they'd need to allow direct `ghcr.io` etc., then be rewritten to allow only `harbor.allora-network.io`.
+- Kyverno on all clusters (DEVOP-588) is a soft dependency: Phase 4 needs it but Phases 0–3 can proceed.
+
+## Out of scope for this ticket
+
+- IDS/anomaly detection on egress flow logs — separate ticket (Falco rules, DEVOP-570).
+- Ingress NetworkPolicies — separate hardening pass, not in Shai-Hulud scope.
+
+## Who runs this
+
+- Owner: cluster-admin / platform team.
+- Reviewer: security team (sign-off on each phase before proceeding to the next).
+- Estimated total engineer-time: ~3 engineer-weeks calendar, ~50% utilization (lots of waiting for flow-log baselines to accumulate).
+
+## Links
+
+- Linear: https://linear.app/alloralabs/issue/DEVOP-579
+- Cilium NetworkPolicy reference: https://docs.cilium.io/en/stable/security/policy/
+- Calico NetworkPolicy reference: https://docs.tigera.io/calico/latest/network-policy/

From 7b93a18853671b2b472a01afe3f4ad15c92aa562 Mon Sep 17 00:00:00 2001
From: srt0422
Date: Thu, 21 May 2026 17:07:16 -0700
Subject: [PATCH 2/6] =?UTF-8?q?DEVOP-579:=20address=20cubic=20review=20?=
 =?UTF-8?q?=E2=80=94=20flag=20suspect=20egress,=2048h=20soak,=20runbook=20?=
 =?UTF-8?q?hook,=20ingress=20in=20scope?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four findings from cubic addressed:

1. tickets/devop-579-network-policy-rollout.md:33 (P2) — Phase 1
   discovery checklist now explicitly enumerates suspect egress
   destinations to flag for incident review (webhook receivers,
   pastebins, ngrok/tunnel services, 169.254.169.254 / cloud metadata,
   residential dynamic-DNS). Each flagged destination gets an
   owner-review gate before allowlist inclusion.
2. tickets/devop-579-network-policy-rollout.md:52 (P1) — Phase 3 staged
   rollout soak windows changed from 24h to the 48h spec'd by
   DEVOP-579, and now require a clean soak before advancing.
3. tickets/devop-579-network-policy-rollout.md:64 (P2) — Phase 4
   steady-state now mandates documenting the rollout, allowlist layout,
   rollback command, and on-call escalation path in SECURITY-RUNBOOK.md
   (DEVOP-571).
4. tickets/devop-579-network-policy-rollout.md:74 (P2) — Ingress
   default-deny is no longer out-of-scope. Added a dedicated section
   laying out the parallel ingress cohort (same Phases 0–4 shape with
   ingress-specific discovery, allowlist patterns, slower production
   rollout because ingress blast-radius is higher, and Kyverno
   asserting both directions in Phase 4).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 tickets/devop-579-network-policy-rollout.md | 67 +++++++++++++++++++--
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md
index 6510847..2ba2d4a 100644
--- a/tickets/devop-579-network-policy-rollout.md
+++ b/tickets/devop-579-network-policy-rollout.md
@@ -32,7 +32,14 @@ For each:
 - [ ] Capture 7 days of egress flow logs from baseline.
 - [ ] Enumerate destination CIDRs, DNS names, and ports.
 - [ ] Group by category: `internal` (other Allora namespaces), `infra` (cloud-provider metadata, DNS, NTP, GKE), `vendor-saas` (Datadog, Slack, etc.), `package-registries` (npm, pypi, docker.io, ghcr.io, quay.io — these become Harbor proxy-cache after DEVOP-589), `customer-traffic` (per-namespace).
-- [ ] Document in this repo as `network-policies/discovery/.md` for future audit.
+- [ ] **Explicitly flag suspect egress destinations** for incident review before they get added to any allowlist. Treat the following as suspect by default:
+  - Generic webhook receivers (`*.webhook.site`, `discord.com/api/webhooks/*`, `hooks.slack.com/services/*` that aren't ours, `*.pipedream.net`, `*.requestbin.com`, etc.).
+  - Pastebin-family services (`pastebin.com`, `paste.ee`, `hastebin.com`, `gist.githubusercontent.com` raw fetches from non-org accounts, `transfer.sh`, `0x0.st`).
+  - Tunnel / reverse-proxy services (`*.ngrok.io`, `*.ngrok-free.app`, `*.loca.lt`, `*.trycloudflare.com`, `*.serveo.net`).
+  - Cloud-instance metadata endpoints from inside a pod (`169.254.169.254`, `metadata.google.internal`, `100.100.100.200`) — these should be blocked outright unless a specific workload demonstrably needs them, and even then via an IRSA / Workload Identity allowlist, not raw IP.
+  - Anything resolving to a residential/dynamic-DNS provider (`*.duckdns.org`, `*.no-ip.com`, `*.dyndns.org`).
+  Each flagged destination needs an incident-response review: confirm a legitimate owner, document the use case, and either allowlist with a tight CIDR / FQDN or open a remediation ticket. Do NOT roll suspect destinations into the allowlist by default just because they appear in the 7-day baseline.
+- [ ] Document in this repo as `network-policies/discovery/.md` for future audit, including the suspect-destination review notes.
 
 ## Phase 2 — Allowlist authoring (week 2, days 3–5)
 
@@ -49,10 +56,20 @@ Patterns to standardize:
 
 ## Phase 3 — Staged rollout (week 3)
 
-- [ ] Day 1: apply policies to **1 staging namespace** in **1 staging cluster**. Observe 24h.
-- [ ] Day 2: apply to all staging namespaces in 1 cluster. Observe 24h.
-- [ ] Day 3: apply to 1 production namespace (lowest-risk: docs site). Observe 24h.
-- [ ] Days 4–5: roll forward through remaining production namespaces in priority-inverse order (lowest-blast-radius first).
+DEVOP-579 specifies **48-hour soak windows** between rollout stages
+(not 24h) so a full business-day cycle plus a quieter overnight cycle
+both elapse before the next stage advances. This catches workloads
+whose egress only fires on cron/batch schedules.
+
+- [ ] Days 1–2: apply policies to **1 staging namespace** in **1 staging cluster**. Soak 48h.
+- [ ] Days 3–4: apply to all staging namespaces in 1 cluster. Soak 48h.
+- [ ] Days 5–6: apply to 1 production namespace (lowest-risk: docs site). Soak 48h.
+- [ ] Days 7+: roll forward through remaining production namespaces in priority-inverse order (lowest-blast-radius first), keeping a 48h soak between each cluster cohort.
+
+A stage may only advance if the prior soak completed with zero
+NetworkPolicy-attributable incidents. If anything broke, hold the
+window open until the root cause is fixed (or the policy is amended)
+and restart the 48-hour clock for that stage.
 
 **Rollback procedure** (must be documented before Day 1):
 - `kubectl delete networkpolicy default-deny -n ` — un-breaks egress instantly.
@@ -62,16 +79,54 @@ Patterns to standardize:
 
 - [ ] Add NetworkPolicy schemas to Kyverno (after DEVOP-588 lands) so any new namespace without a `default-deny` is auto-flagged.
 - [ ] Monthly review of `discovery/.md` for changes in legitimate egress (new vendor SaaS, etc.).
+- [ ] **Document the rollout and steady-state policies in `SECURITY-RUNBOOK.md`** (DEVOP-571): add a NetworkPolicy section covering (a) the default-deny model, (b) where the per-namespace allowlists live in this repo, (c) the rollback command (`kubectl delete networkpolicy default-deny -n `), and (d) the on-call escalation path when a workload reports egress failures. Without this hook into the runbook, on-call has no reference for diagnosing "my pod can't reach X" pages once default-deny is org-wide.
 
 ## Dependencies
 
 - Harbor proxy-cache projects (DEVOP-589) must land **before** Phase 2, or the allowlists will be churn — they'd need to allow direct `ghcr.io` etc., then be rewritten to allow only `harbor.allora-network.io`.
 - Kyverno on all clusters (DEVOP-588) is a soft dependency: Phase 4 needs it but Phases 0–3 can proceed.
 
+## Ingress default-deny — same model, separate rollout cohort
+
+DEVOP-579 requires default-deny for both egress **and** ingress. The
+two share a rollout shape but have different blast-radius and
+different discovery inputs, so they run as parallel cohorts rather
+than as one combined sweep.
+
+For ingress, mirror Phases 0–4 above with these substitutions:
+
+- **Phase 1 (discovery)**: capture the *inbound* flow logs per
+  namespace for 7 days. Categorize sources by `internal` (other
+  Allora namespaces), `infra` (ingress controllers, load balancers,
+  health-check probes), `vendor-saas` (webhook callbacks, etc.), and
+  `public-traffic` (customer-facing routes). Apply the same suspect-
+  destination flagging in reverse: any inbound source that resolves
+  to a residential-DNS / tunnel service / cloud-metadata range is
+  reviewed before allowlisting.
+- **Phase 2 (allowlist authoring)**: per namespace, write
+  `network-policies///default-deny-ingress.yaml`
+  plus `ingress-allowlist.yaml`. Pattern: deny all inbound by default,
+  allow from the ingress controller's pod selector, allow from
+  same-namespace pods, then explicit allow rules per legitimate
+  upstream.
+- **Phase 3 (staged rollout)**: same 48-hour soak windows. Ingress
+  blast radius is generally *higher* than egress (a misconfigured
+  ingress policy can take a service offline for real users, not just
+  internal callouts), so the production cohort starts later and
+  proceeds slower than egress.
+- **Phase 4 (steady state)**: Kyverno rule asserting both
+  `default-deny-egress` AND `default-deny-ingress` exist per namespace.
+  Runbook section covers both.
+
+Run egress first (it's lower-risk because the failure mode is
+"workload can't reach Datadog" rather than "customers can't reach our
+API"). Start ingress discovery in parallel during Phase 0–1 of the
+egress rollout so the two cohorts can converge on Phase 4 around the
+same time.
+
 ## Out of scope for this ticket
 
 - IDS/anomaly detection on egress flow logs — separate ticket (Falco rules, DEVOP-570).
-- Ingress NetworkPolicies — separate hardening pass, not in Shai-Hulud scope.
 
 ## Who runs this
 

From 5fb3be492cc5868a6644d57d7c70dadb11185587 Mon Sep 17 00:00:00 2001
From: srt0422
Date: Sat, 30 May 2026 07:30:16 -0700
Subject: [PATCH 3/6] =?UTF-8?q?DEVOP-579:=20address=20@gh-allora=20?=
 =?UTF-8?q?=E2=80=94=20L3/L4=20flow=20logs=20don't=20carry=20FQDNs;=20add?=
 =?UTF-8?q?=20DNS-log=20enablement=20+=20join=20step?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

@gh-allora flagged that Hubble/Calico egress flow logs are L3/L4 only,
so the Phase 1 line "enumerate destination CIDRs, DNS names, and ports"
can't be satisfied from flow logs alone. Confirmed: Hubble flow records
and Calico flow logs surface src/dst IP, port, and protocol — DNS names
require either a CoreDNS query log feed or Cilium's L7 DNS visibility
(which routes pod DNS through the proxy and records resolved FQDNs).

Fix is structural, not cosmetic:

- Phase 0 now has an explicit "enable verbose DNS query logging" step
  alongside flow log enablement, with concrete options for CoreDNS
  (`log` plugin) and Cilium (L7 DNS via `hubble observe --type=dns`),
  plus a retention check so the 7-day baseline is actually queryable
  before Phase 1 starts.
- Phase 1 line 33 is split into two checklist items: enumerate CIDRs +
  ports from flow logs (the only fields they carry), then resolve to
  FQDNs by joining flow records against the Phase 0 DNS logs on
  (srcPodIP, dstIP) within a short window. Destinations with no DNS
  match (hard-coded IPs, 169.254.169.254, raw cloud-metadata) are
  carried through as IP-only and fall into the existing suspect-
  destination review.
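
A minimal sketch of the join described in the second bullet, under stated assumptions: the `Flow` and `DnsAnswer` record shapes, the `resolve_flows` helper, and the 300-second window are all hypothetical (no CNI emits exactly these fields), and the DNS side is assumed to carry answer IPs the way the Cilium L7 DNS events do. A flow is matched to the latest lookup, made by the same pod, that returned the flow's destination IP shortly before the flow started. Anything unmatched stays IP-only and goes to the suspect-destination review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flow:
    """Hypothetical L3/L4 flow record (field names are assumptions)."""
    ts: float          # epoch seconds when the flow was observed
    src_pod_ip: str
    dst_ip: str
    dst_port: int


@dataclass(frozen=True)
class DnsAnswer:
    """Hypothetical DNS record: one row per (query, answer IP) pair."""
    ts: float          # epoch seconds when the lookup was answered
    client_pod_ip: str
    query_name: str
    answer_ip: str


def resolve_flows(flows, dns_answers, window_s=300.0):
    """Return (flow, fqdn) pairs; fqdn is None for IP-only destinations."""
    # Index DNS answers by the join key (client pod IP, answer IP).
    by_key = {}
    for answer in dns_answers:
        key = (answer.client_pod_ip, answer.answer_ip)
        by_key.setdefault(key, []).append(answer)
    for answers in by_key.values():
        answers.sort(key=lambda a: a.ts)

    resolved = []
    for flow in flows:
        fqdn: Optional[str] = None
        for answer in by_key.get((flow.src_pod_ip, flow.dst_ip), []):
            # Accept only lookups answered before the flow, inside the window.
            if 0 <= flow.ts - answer.ts <= window_s:
                fqdn = answer.query_name  # sorted, so the latest match wins
        resolved.append((flow, fqdn))
    return resolved


if __name__ == "__main__":
    dns = [DnsAnswer(990.0, "10.0.0.5", "ghcr.io", "140.82.112.3")]
    flows = [
        Flow(1000.0, "10.0.0.5", "140.82.112.3", 443),
        Flow(1000.0, "10.0.0.5", "169.254.169.254", 80),
    ]
    for flow, fqdn in resolve_flows(flows, dns):
        print(flow.dst_ip, flow.dst_port, fqdn or "IP-only: review")
```

The window is deliberately a parameter: too short and long-lived cached answers go unmatched, too long and a reused IP can be attributed to the wrong name, so the value would need tuning against real TTLs.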

review-fix-loop iteration 1
reviewer(s): gh-allora (human PR thread)
file: tickets/devop-579-network-policy-rollout.md:17,33

Co-Authored-By: Claude Opus 4.7 (review-fix-loop)
---
 tickets/devop-579-network-policy-rollout.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md
index 2ba2d4a..61f5821 100644
--- a/tickets/devop-579-network-policy-rollout.md
+++ b/tickets/devop-579-network-policy-rollout.md
@@ -14,7 +14,11 @@ NetworkPolicies are stateless and additive — meaning a `default-deny` policy w
 
 - [ ] Confirm CNI on every cluster supports NetworkPolicy (Calico, Cilium, Antrea — yes; flannel without --network-policy — no).
 - [ ] Stand up `network-policy-engine` (Calico) or use Cilium's native NPL on any cluster that's still on flannel.
-- [ ] Enable flow logs on at least one staging cluster: `cilium hubble enable` or `calicoctl flow logs enable`. We need ~7 days of baseline traffic to enumerate legitimate egress.
+- [ ] Enable flow logs on at least one staging cluster: `cilium hubble enable` or `calicoctl flow logs enable`. We need ~7 days of baseline traffic to enumerate legitimate egress. Note: these are L3/L4 logs — they give source/destination IP, port, and protocol only, **not** FQDNs.
+- [ ] Enable verbose DNS query logging on the same cluster so Phase 1 can map destination IPs back to FQDNs. Concretely:
+  - CoreDNS: add the `log` plugin to the Corefile (`log { class denial error success }`) and ship CoreDNS logs to the same sink as the flow logs so they can be joined on `(srcPodIP, dstIP, timestamp ± window)`.
+  - On Cilium clusters where we plan to author DNS-aware policies anyway, enable Cilium L7 DNS visibility (`hubble observe --type=dns` works once the DNS proxy is in path) — this gives per-pod resolved FQDNs directly and removes the join step.
+  - Confirm log retention covers the full 7-day baseline window before Phase 1 starts.
 
 ## Phase 1 — Discovery (week 1, days 3–5; week 2, days 1–2)
 
@@ -30,7 +34,8 @@ For each namespace, in priority order (highest-value first):
 
 For each:
 - [ ] Capture 7 days of egress flow logs from baseline.
-- [ ] Enumerate destination CIDRs, DNS names, and ports.
+- [ ] Enumerate destination CIDRs and ports directly from the flow logs (these are the only fields L3/L4 actually carries).
+- [ ] Resolve those destinations to FQDNs by joining the flow logs against the CoreDNS query logs (or Cilium L7 DNS events) enabled in Phase 0, on `(srcPodIP, dstIP)` within a short time window. Destinations that have no matching DNS lookup (e.g., hard-coded IP literals, `169.254.169.254`, raw cloud-metadata IPs) get carried through as IP-only entries and are scrutinized in the suspect-destination step below.
 - [ ] Group by category: `internal` (other Allora namespaces), `infra` (cloud-provider metadata, DNS, NTP, GKE), `vendor-saas` (Datadog, Slack, etc.), `package-registries` (npm, pypi, docker.io, ghcr.io, quay.io — these become Harbor proxy-cache after DEVOP-589), `customer-traffic` (per-namespace).
 - [ ] **Explicitly flag suspect egress destinations** for incident review before they get added to any allowlist. Treat the following as suspect by default:
   - Generic webhook receivers (`*.webhook.site`, `discord.com/api/webhooks/*`, `hooks.slack.com/services/*` that aren't ours, `*.pipedream.net`, `*.requestbin.com`, etc.).

From ef121fa817680663e7c2f4bec16faa3bf6a598b6 Mon Sep 17 00:00:00 2001
From: srt0422
Date: Sat, 30 May 2026 07:33:02 -0700
Subject: [PATCH 4/6] fix(correctness): Phase 0 CNI/flow-log commands were wrong; replace with real per-CNI enablement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two problems in the Phase 0 checklist that would have wasted an
engineer's day before they figured out the doc was wrong:

1. `network-policy-engine (Calico)` and `Cilium's native NPL` are not
   real component names.
   Felix is Calico's per-node policy enforcer; Cilium ships
   NetworkPolicy enforcement built in (no separate "NPL" — NPL means
   NodePort Local in Antrea/Calico, unrelated to NetworkPolicy). The
   flannel-fallback bullet now correctly says the only path forward on
   flannel-without-policy is a CNI migration to Calico or Cilium, since
   flannel itself cannot enforce NetworkPolicies.
2. `calicoctl flow logs enable` is not a calicoctl subcommand. Calico
   OSS flow logs are turned on via the FelixConfiguration CR
   (`spec.flowLogsFileEnabled: true`), and the resulting files land
   under `/var/log/calico/flowlogs/` on each node. Also called out that
   OSS file-based flow logs cover allow/deny only — for richer flow
   context the team needs Calico Enterprise / Calico Cloud, and the
   recommendation is to prefer the Cilium staging cluster for baseline
   capture if the option exists.

Antrea enablement (Flow Exporter feature gate + flow-aggregator) added
for completeness since one of our clusters is on Antrea.

review-fix-loop iteration 1
reviewer(s): review-fix-loop (correctness lens)
file: tickets/devop-579-network-policy-rollout.md:15-17

Co-Authored-By: Claude Opus 4.7 (review-fix-loop)
---
 tickets/devop-579-network-policy-rollout.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md
index 61f5821..6e6a682 100644
--- a/tickets/devop-579-network-policy-rollout.md
+++ b/tickets/devop-579-network-policy-rollout.md
@@ -12,9 +12,12 @@ NetworkPolicies are stateless and additive — meaning a `default-deny` policy w
 
 ## Phase 0 — Pre-flight (week 1, days 1–2)
 
-- [ ] Confirm CNI on every cluster supports NetworkPolicy (Calico, Cilium, Antrea — yes; flannel without --network-policy — no).
-- [ ] Stand up `network-policy-engine` (Calico) or use Cilium's native NPL on any cluster that's still on flannel.
-- [ ] Enable flow logs on at least one staging cluster: `cilium hubble enable` or `calicoctl flow logs enable`. We need ~7 days of baseline traffic to enumerate legitimate egress. Note: these are L3/L4 logs — they give source/destination IP, port, and protocol only, **not** FQDNs.
+- [ ] Confirm CNI on every cluster supports NetworkPolicy (Calico with Felix as the enforcer — yes; Cilium — yes, native; Antrea — yes; flannel without the `--network-policy` flag — no).
+- [ ] For any cluster still on flannel-without-NetworkPolicy, plan a CNI migration to Calico or Cilium before proceeding. NetworkPolicy enforcement is unavailable otherwise; this rollout cannot land on those clusters until the CNI migration is done.
+- [ ] Enable flow logs on at least one staging cluster. We need ~7 days of baseline traffic to enumerate legitimate egress. Note: these are L3/L4 logs — they give source/destination IP, port, and protocol only, **not** FQDNs. CNI-specific enablement:
+  - Cilium: `cilium hubble enable` (and ensure flow retention covers 7 days; `hubble-relay` persists in-memory by default, so for the baseline window export to a sink the team can query later — Loki, S3, or BigQuery).
+  - Calico OSS: patch the `default` `FelixConfiguration` CR with `spec.flowLogsFileEnabled: true` (plus `flowLogsFileIncludeLabels: true` so workload identity is queryable). Felix writes per-node JSON flow logs under `/var/log/calico/flowlogs/`; ship those to the same sink as above. OSS file-based flow logs cover allow/deny actions only — for richer flow context, Calico Enterprise / Calico Cloud is required, otherwise prefer the Cilium staging cluster for baseline capture.
+  - Antrea: enable the `FlowExporter` feature gate on the agent and run `flow-aggregator` to export to the sink.
 - [ ] Enable verbose DNS query logging on the same cluster so Phase 1 can map destination IPs back to FQDNs. Concretely:
   - CoreDNS: add the `log` plugin to the Corefile (`log { class denial error success }`) and ship CoreDNS logs to the same sink as the flow logs so they can be joined on `(srcPodIP, dstIP, timestamp ± window)`.
   - On Cilium clusters where we plan to author DNS-aware policies anyway, enable Cilium L7 DNS visibility (`hubble observe --type=dns` works once the DNS proxy is in path) — this gives per-pod resolved FQDNs directly and removes the join step.

From 5281060c5f4f2c2264b358da133e2d13992a098b Mon Sep 17 00:00:00 2001
From: srt0422
Date: Sat, 30 May 2026 07:34:01 -0700
Subject: [PATCH 5/6] fix(reliability): pin NetworkPolicy naming convention; rollback runbook now matches actual resource names
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The rollback runbook command `kubectl delete networkpolicy default-deny -n `
would no-op (NotFound) once ingress lands, because the ingress section
calls the ingress policy `default-deny-ingress` while the egress section
never pinned the egress resource name. So:

- An engineer authoring `default-deny.yaml` could legitimately name the
  resource `default-deny-egress`, `egress-default-deny`, or anything
  else. The runbook would silently fail to delete it in an incident.
- Once both directions are deployed, the runbook needs both rollback
  commands, not one.
- The Phase 4 Kyverno asserter needs to grep on a deterministic resource
  name to enforce "every namespace has both default-deny policies".

Fix is structural: Phase 2 now contains a pinned naming convention table
that the rollback runbook (Phase 3) and the Kyverno asserter (Phase 4)
both reference by exact `metadata.name`.

As a side effect of pinning, also split the egress baseline allows
(DNS/NTP) into a separate generated policy (`egress-baseline-allow`) so
the per-namespace `egress-allowlist` only contains workload-specific
rules — resolves the Phase 2 ambiguity over which baseline rules live in
default-deny vs allowlist.
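
A rough sketch of what the pinning buys, not the actual Kyverno rule: once the names are fixed, both the "is anything missing?" check and the rollback commands can be derived mechanically. `PINNED_NAMES` copies the five names from this change's convention table; `missing_policies`, `rollback_commands`, and the `example-ns` namespace are hypothetical and exist only for illustration.

```python
# Exact metadata.name values from the pinned convention (3 egress + 2 ingress).
PINNED_NAMES = {
    "egress": ("default-deny-egress", "egress-baseline-allow", "egress-allowlist"),
    "ingress": ("default-deny-ingress", "ingress-allowlist"),
}


def missing_policies(present_names, directions=("egress", "ingress")):
    """List pinned policy names absent from a namespace, in table order."""
    present = set(present_names)
    return [
        name
        for direction in directions
        for name in PINNED_NAMES[direction]
        if name not in present
    ]


def rollback_commands(namespace, directions=("egress", "ingress")):
    """Build the emergency delete command for each requested direction."""
    return [
        f"kubectl delete networkpolicy default-deny-{direction} -n {namespace}"
        for direction in directions
    ]


if __name__ == "__main__":
    # A namespace still carrying the old unpinned name shows up as drift.
    print(missing_policies(["default-deny", "egress-allowlist"]))
    for command in rollback_commands("example-ns"):
        print(command)
```

Matching by exact name means a policy called `default-deny` is reported as missing rather than silently accepted, which is the drift case the rollback runbook cannot handle.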

Changes:
- New Phase 2 naming-convention table mapping filename ↔ metadata.name ↔
  purpose for all five policy kinds (3 egress + 2 ingress).
- Rollback runbook now lists both `default-deny-egress` and
  `default-deny-ingress` commands and calls out drift as an incident.
- Phase 4 SECURITY-RUNBOOK hook now references both rollback commands.
- Phase 4 Kyverno bullet now matches by exact metadata.name from the
  pinned table.
- Ingress section's Phase 2 substitution now references the same table
  for both file name and resource name.

review-fix-loop iteration 1
reviewer(s): review-fix-loop (reliability lens)
file: tickets/devop-579-network-policy-rollout.md:52,80,87,112,122

Co-Authored-By: Claude Opus 4.7 (review-fix-loop)
---
 tickets/devop-579-network-policy-rollout.md | 55 ++++++++++++++-------
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md
index 6e6a682..d16146b 100644
--- a/tickets/devop-579-network-policy-rollout.md
+++ b/tickets/devop-579-network-policy-rollout.md
@@ -51,14 +51,25 @@ For each:
 
 ## Phase 2 — Allowlist authoring (week 2, days 3–5)
 
-Per namespace, write two files:
-- `network-policies///default-deny.yaml` — applies to all pods in the namespace, blocks all egress except DNS.
-- `network-policies///allowlist.yaml` — explicit egress rules derived from Phase 1.
-
-Patterns to standardize:
-- DNS always allowed to kube-dns / coredns (53/udp, 53/tcp).
-- NTP always allowed (123/udp).
-- Cluster-internal pod-to-pod within same namespace: allow by default.
+### NetworkPolicy naming convention (pinned — runbook depends on it)
+
+Every NetworkPolicy this rollout creates uses one of these exact `metadata.name` values, in the namespace it targets. The rollback runbook (Phase 3) and the Kyverno asserter (Phase 4) both grep on these names, so deviations break both.
+
+| Direction | File name | `metadata.name` | Purpose |
+|---|---|---|---|
+| Egress | `default-deny-egress.yaml` | `default-deny-egress` | Deny all egress except the baseline allows in `egress-baseline-allow.yaml`. |
+| Egress | `egress-baseline-allow.yaml` | `egress-baseline-allow` | Cluster-wide always-on allows: DNS to kube-dns / CoreDNS (53/udp, 53/tcp) and NTP (123/udp). Lives in every namespace so clock sync and name resolution survive the default-deny. |
+| Egress | `egress-allowlist.yaml` | `egress-allowlist` | Per-namespace workload-specific egress allows derived from Phase 1. |
+| Ingress | `default-deny-ingress.yaml` | `default-deny-ingress` | Deny all ingress except the baseline allows below. |
+| Ingress | `ingress-allowlist.yaml` | `ingress-allowlist` | Per-namespace ingress allows (ingress controller, same-namespace pods, explicit upstreams). |
+
+Per namespace, write the two egress files (`default-deny-egress.yaml` plus `egress-allowlist.yaml`) under `network-policies///`. The baseline-allow policy is generated from a single template applied to every namespace; do not hand-author it per-namespace.
+
+### Patterns derived from Phase 1
+
+- DNS to kube-dns / CoreDNS (53/udp, 53/tcp): lives in `egress-baseline-allow`, never in per-workload allowlists.
+- NTP (123/udp): lives in `egress-baseline-allow`.
+- Cluster-internal pod-to-pod within same namespace: allow by default in the per-namespace `egress-allowlist`.
 - Outbound to other Allora namespaces: explicit per-namespace allow (no blanket).
 - Outbound to public internet: only through a designated egress proxy (or whitelist by CIDR).
 
@@ -80,14 +91,16 @@ window open until the root cause is fixed (or the policy is amended)
 and restart the 48-hour clock for that stage.
 
 **Rollback procedure** (must be documented before Day 1):
-- `kubectl delete networkpolicy default-deny -n ` — un-breaks egress instantly.
-- Have this command ready as a runbook step in the on-call channel.
+- Egress emergency: `kubectl delete networkpolicy default-deny-egress -n ` — restores all egress instantly. The per-namespace `egress-allowlist` and the `egress-baseline-allow` policies are additive-only and safe to leave in place.
+- Ingress emergency (once Phase 3 has rolled the ingress cohort): `kubectl delete networkpolicy default-deny-ingress -n ` — restores all ingress instantly.
+- Both names match the pinned naming convention in Phase 2. If you find a workload whose `default-deny-egress` policy has a different name, treat it as a policy-drift incident and fix the name before relying on the rollback command.
+- Have both commands ready as runbook steps in the on-call channel.
 
 ## Phase 4 — Steady state
 
-- [ ] Add NetworkPolicy schemas to Kyverno (after DEVOP-588 lands) so any new namespace without a `default-deny` is auto-flagged.
+- [ ] Add Kyverno policies (after DEVOP-588 lands) that fail any new non-system namespace which is missing either a `default-deny-egress` or a `default-deny-ingress` NetworkPolicy. Match by `metadata.name` exactly — these are the names pinned in the Phase 2 convention table.
 - [ ] Monthly review of `discovery/.md` for changes in legitimate egress (new vendor SaaS, etc.).
-- [ ] **Document the rollout and steady-state policies in `SECURITY-RUNBOOK.md`** (DEVOP-571): add a NetworkPolicy section covering (a) the default-deny model, (b) where the per-namespace allowlists live in this repo, (c) the rollback command (`kubectl delete networkpolicy default-deny -n `), and (d) the on-call escalation path when a workload reports egress failures. Without this hook into the runbook, on-call has no reference for diagnosing "my pod can't reach X" pages once default-deny is org-wide.
+- [ ] **Document the rollout and steady-state policies in `SECURITY-RUNBOOK.md`** (DEVOP-571): add a NetworkPolicy section covering (a) the default-deny model, (b) where the per-namespace allowlists live in this repo, (c) the rollback commands (`kubectl delete networkpolicy default-deny-egress -n ` and `kubectl delete networkpolicy default-deny-ingress -n ` — both required, named per the Phase 2 convention), and (d) the on-call escalation path when a workload reports egress or ingress failures. Without this hook into the runbook, on-call has no reference for diagnosing "my pod can't reach X" pages once default-deny is org-wide.

 ## Dependencies

@@ -113,18 +126,22 @@ For ingress, mirror Phases 0–4 above with these substitutions:
   reviewed before allowlisting.
 - **Phase 2 (allowlist authoring)**: per namespace, write
   `network-policies///default-deny-ingress.yaml`
-  plus `ingress-allowlist.yaml`. Pattern: deny all inbound by default,
-  allow from the ingress controller's pod selector, allow from
-  same-namespace pods, then explicit allow rules per legitimate
-  upstream.
+  (`metadata.name: default-deny-ingress`) plus
+  `ingress-allowlist.yaml` (`metadata.name: ingress-allowlist`). Both
+  names match the pinned convention in the egress Phase 2 table.
+  Pattern: deny all inbound by default, allow from the ingress
+  controller's pod selector, allow from same-namespace pods, then
+  explicit allow rules per legitimate upstream.
 - **Phase 3 (staged rollout)**: same 48-hour soak windows. Ingress
   blast radius is generally *higher* than egress (a misconfigured
   ingress policy can take a service offline for real users, not just
   internal callouts), so the production cohort starts later and
   proceeds slower than egress.
-- **Phase 4 (steady state)**: Kyverno rule asserting both
-  `default-deny-egress` AND `default-deny-ingress` exist per namespace.
-  Runbook section covers both.
+- **Phase 4 (steady state)**: Kyverno rule asserting that every
+  non-system namespace has both a `default-deny-egress` and a
+  `default-deny-ingress` NetworkPolicy (greps on the pinned
+  `metadata.name` values from the Phase 2 table). Runbook section
+  covers both rollback commands.

 Run egress first (it's lower-risk because the failure mode is
 "workload can't reach Datadog" rather than "customers can't reach our

From ee15feee983e17847ffb646acc96d9b66e760755 Mon Sep 17 00:00:00 2001
From: srt0422
Date: Sat, 30 May 2026 07:34:55 -0700
Subject: [PATCH 6/6] =?UTF-8?q?fix(correctness):=20CoreDNS=20`log`=20plugi?=
 =?UTF-8?q?n=20doesn't=20carry=20response=20IPs=20=E2=80=94=20switch=20to?=
 =?UTF-8?q?=20dnstap=20(full)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

cubic flagged that my iter-1 Phase 0 DNS-log instruction was broken: the
CoreDNS `log` plugin emits client IP + query name + response code but NOT
the answer-section A/AAAA IPs, so the `(srcPodIP, dstIP)` join described
in Phase 1 has nothing on the DNS side to match `dstIP` against. Confirmed
— `log`'s format is per the CoreDNS docs, and resolved IPs only appear in
the actual DNS message response (the answer section).

Fix is to use the `dnstap` plugin with the `full` flag, which streams
wire-format DNS messages (request + response, including the answer
section) to a Unix socket or TCP collector. A dnstap collector
(`golang-dnstap`, `dnstap-receiver`) decodes those into `(timestamp,
client_pod_ip, query_name, response_ips[])` records that can actually be
joined against flow-log destinations.

The Cilium `hubble observe --type=dns` path was already correct because
Hubble records FQDN and answer IPs together.

Changes:
- Phase 0 DNS-capture bullet now specifies `dnstap ... full` for CoreDNS,
  names the collector requirement, and calls out explicitly that the
  query-only `log` plugin is insufficient (so a future reader who has
  read the old docs doesn't reach for it).
- Phase 1 resolve-to-FQDN bullet now describes the join key accurately:
  `srcPodIP == DNS client IP, dstIP ∈ DNS response answer IPs`, instead
  of pretending `log` output has the answer IPs.

review-fix-loop iteration 2
reviewer(s): cubic-dev-ai (PR thread PRRT_kwDOLZ5Xss6F4Gnj)
file: tickets/devop-579-network-policy-rollout.md:18-21,38

Co-Authored-By: Claude Opus 4.7 (review-fix-loop)
---
 tickets/devop-579-network-policy-rollout.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tickets/devop-579-network-policy-rollout.md b/tickets/devop-579-network-policy-rollout.md
index d16146b..0a35e60 100644
--- a/tickets/devop-579-network-policy-rollout.md
+++ b/tickets/devop-579-network-policy-rollout.md
@@ -18,10 +18,10 @@ NetworkPolicies are stateless and additive — meaning a `default-deny` policy w
   - Cilium: `cilium hubble enable` (and ensure flow retention covers 7 days; `hubble-relay` persists in-memory by default, so for the baseline window export to a sink the team can query later — Loki, S3, or BigQuery).
   - Calico OSS: patch the `default` `FelixConfiguration` CR with `spec.flowLogsFileEnabled: true` (plus `flowLogsFileIncludeLabels: true` so workload identity is queryable). Felix writes per-node JSON flow logs under `/var/log/calico/flowlogs/`; ship those to the same sink as above. OSS file-based flow logs cover allow/deny actions only — for richer flow context, Calico Enterprise / Calico Cloud is required, otherwise prefer the Cilium staging cluster for baseline capture.
   - Antrea: enable the `FlowExporter` feature gate on the agent and run `flow-aggregator` to export to the sink.
-- [ ] Enable verbose DNS query logging on the same cluster so Phase 1 can map destination IPs back to FQDNs. Concretely:
-  - CoreDNS: add the `log` plugin to the Corefile (`log { class denial error success }`) and ship CoreDNS logs to the same sink as the flow logs so they can be joined on `(srcPodIP, dstIP, timestamp ± window)`.
-  - On Cilium clusters where we plan to author DNS-aware policies anyway, enable Cilium L7 DNS visibility (`hubble observe --type=dns` works once the DNS proxy is in path) — this gives per-pod resolved FQDNs directly and removes the join step.
-  - Confirm log retention covers the full 7-day baseline window before Phase 1 starts.
+- [ ] Enable DNS message capture on the same cluster so Phase 1 can map destination IPs back to FQDNs. The capture mechanism must include the DNS *response* (specifically the A / AAAA answer-section IPs), not just the query — without those resolved IPs there is nothing to join the flow-log `dstIP` against. Concretely:
+  - CoreDNS: enable the `dnstap` plugin with the `full` flag (e.g., `dnstap /var/run/dnstap.sock full` or `dnstap tcp://:6000 full`) and run a dnstap collector (such as `golang-dnstap` or `dnstap-receiver`) that decodes the wire-format messages and ships `(timestamp, client_pod_ip, query_name, response_ips[])` records to the same sink as the flow logs. The query-only `log` plugin is insufficient — it emits client IP, query name, and response code but not the answer-section IPs, so the `dstIP` join cannot be made from `log` output alone. The DNS messages are then joined to the flow logs on `(srcPodIP == client_pod_ip, dstIP ∈ response_ips, timestamp ± window)`.
+  - On Cilium clusters where we plan to author DNS-aware policies anyway, enable Cilium L7 DNS visibility (`hubble observe --type=dns` works once the DNS proxy is in path) — Hubble records per-pod resolved FQDNs *and* the answer IPs together, which removes the cross-source join entirely.
+  - Confirm dnstap / Hubble retention covers the full 7-day baseline window before Phase 1 starts.

 ## Phase 1 — Discovery (week 1, days 3–5; week 2, days 1–2)

@@ -38,7 +38,7 @@ For each namespace, in priority order (highest-value first):
 For each:
 - [ ] Capture 7 days of egress flow logs from baseline.
 - [ ] Enumerate destination CIDRs and ports directly from the flow logs (these are the only fields L3/L4 actually carries).
-- [ ] Resolve those destinations to FQDNs by joining the flow logs against the CoreDNS query logs (or Cilium L7 DNS events) enabled in Phase 0, on `(srcPodIP, dstIP)` within a short time window. Destinations that have no matching DNS lookup (e.g., hard-coded IP literals, `169.254.169.254`, raw cloud-metadata IPs) get carried through as IP-only entries and are scrutinized in the suspect-destination step below.
+- [ ] Resolve those destinations to FQDNs by joining the flow logs against the CoreDNS dnstap stream (or Cilium L7 DNS events) enabled in Phase 0, on `(srcPodIP == DNS client IP, dstIP ∈ DNS response answer IPs, timestamp ± window)`. Destinations that have no matching DNS lookup (e.g., hard-coded IP literals, `169.254.169.254`, raw cloud-metadata IPs) get carried through as IP-only entries and are scrutinized in the suspect-destination step below.
 - [ ] Group by category: `internal` (other Allora namespaces), `infra` (cloud-provider metadata, DNS, NTP, GKE), `vendor-saas` (Datadog, Slack, etc.), `package-registries` (npm, pypi, docker.io, ghcr.io, quay.io — these become Harbor proxy-cache after DEVOP-589), `customer-traffic` (per-namespace).
 - [ ] **Explicitly flag suspect egress destinations** for incident review before they get added to any allowlist. Treat the following as suspect by default:
   - Generic webhook receivers (`*.webhook.site`, `discord.com/api/webhooks/*`, `hooks.slack.com/services/*` that aren't ours, `*.pipedream.net`, `*.requestbin.com`, etc.).