feat(bench): add self-heal and ns-flip events, label source-update fields#93

Merged
be0x74a merged 1 commit into main from feat/bench-events
May 8, 2026


@be0x74a be0x74a commented May 8, 2026

Why

PR 2 of the bench multi-PR sequence.

Today the bench measures one event — controller-side propagation of a stamp annotation patched onto the source CR. The fields in the report (`e2e_np_p50_ns`, `e2e_cp_sel_earliest_p50_ns`, …) are unlabeled, so a reader can't tell whether the latency is for create / source-update / self-heal / etc. Adding more event types forces the rename.

What

Three event measurements per topology:

| Event | NP | CP-selector | CP-list | Sample size |
|---|---|---|---|---|
| source-update (existing, renamed) | ✓ | ✓ (earliest+slowest fanout) | ✓ (earliest+slowest fanout) | NP: `min(100, NPRefs)` • CP: 30 stamps |
| self-heal (NEW) | ✓ | ✓ | ✓ | NP: `min(100, NPRefs)` • CP: `min(20, dstSet)` |
| ns-flip cleanup + add (NEW) | | ✓ | | `min(20, SelectorDsts/2)`; skipped when set < 2 |
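The sample-size rules in the table can be sketched in a few lines of Go. This is a minimal illustration, not the harness code: the test plan mentions `TestCapSample`, but the real helper's name and signature are not shown in this PR, so `capSample`, `npSample`, `cpSelfHealSample`, and `nsFlipSample` are hypothetical names.

```go
package main

import "fmt"

// capSample returns min(limit, n). Hypothetical helper: the real function
// behind TestCapSample is not shown in this PR.
func capSample(limit, n int) int {
	if n < limit {
		return n
	}
	return limit
}

// npSample is the NP sample size: min(100, NPRefs).
func npSample(npRefs int) int { return capSample(100, npRefs) }

// cpSelfHealSample is the CP self-heal sample size: min(20, dstSet).
func cpSelfHealSample(dstSet int) int { return capSample(20, dstSet) }

// nsFlipSample returns K = min(20, selectorDsts/2). ok is false when the
// selector destination set has fewer than 2 members, in which case the
// event is skipped.
func nsFlipSample(selectorDsts int) (k int, ok bool) {
	if selectorDsts < 2 {
		return 0, false
	}
	return capSample(20, selectorDsts/2), true
}

func main() {
	// Example inputs: 100 NP refs, 10 CP-list destinations, 50 selector destinations.
	k, ok := nsFlipSample(50)
	fmt.Println(npSample(100), cpSelfHealSample(10), k, ok) // prints: 100 10 20 true
}
```

With 50 selector destinations the ns-flip event touches 20 namespaces and leaves the other 30 untouched as the backdrop.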

Rename, no compat aliases (pre-v1.0, no users):

  • `e2e_np_*_ns` → `e2e_np_source_update_*_ns`
  • `e2e_cp_sel_*_ns` → `e2e_cp_sel_source_update_*_ns`
  • `e2e_cp_list_*_ns` → `e2e_cp_list_source_update_*_ns`
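For anyone migrating an old report by hand, the rename can be sketched as a prefix rewrite. This assumes the event name is inserted directly after the topology prefix, so that `e2e_np_p50_ns` (named in the Why section) becomes `e2e_np_source_update_p50_ns`. `renameField` is a hypothetical helper, not part of the harness, and it is only meant for pre-rename keys.

```go
package main

import (
	"fmt"
	"strings"
)

// topologyPrefixes are the three field prefixes named in the rename list.
var topologyPrefixes = []string{"e2e_np_", "e2e_cp_sel_", "e2e_cp_list_"}

// eventInfix is inserted right after the topology prefix (assumption).
const eventInfix = "source_update_"

// renameField maps a pre-rename report key to its source-update name.
// Keys that match no prefix, or that already carry the infix, are
// returned unchanged. Do not feed it keys for the new events.
func renameField(old string) string {
	for _, p := range topologyPrefixes {
		if !strings.HasPrefix(old, p) {
			continue
		}
		rest := strings.TrimPrefix(old, p)
		if strings.HasPrefix(rest, eventInfix) {
			return old // already renamed
		}
		return p + eventInfix + rest
	}
	return old
}

func main() {
	for _, k := range []string{"e2e_np_p50_ns", "e2e_cp_sel_earliest_p50_ns"} {
		fmt.Printf("%s -> %s\n", k, renameField(k))
	}
}
```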

Self-heal mechanics: capture original destination UID → delete the destination CR → poll until the destination is observed with a different UID. This times the controller's recreate path specifically; follow-up reconciles to align spec are measured by source-update. Per-destination latency, no fanout-aware earliest/slowest (each destination's recreate is independent).
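The self-heal loop described above can be sketched as follows. This is an illustration under stated assumptions, not the harness implementation: `uidGetter` and `deleter` are hypothetical stand-ins for the Kubernetes client calls, the clock is assumed to start just before the delete is issued, and the poll interval is a free parameter.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// uidGetter and deleter are hypothetical stand-ins for the client calls
// that read and delete one destination object.
type uidGetter func(ctx context.Context) (uid string, found bool, err error)
type deleter func(ctx context.Context) error

// timeSelfHeal deletes the destination, then polls until it is observed
// with a UID different from origUID, and returns the elapsed time.
func timeSelfHeal(ctx context.Context, origUID string, del deleter, get uidGetter, interval time.Duration) (time.Duration, error) {
	start := time.Now() // assumption: timing starts just before the delete
	if err := del(ctx); err != nil {
		return 0, fmt.Errorf("delete destination: %w", err)
	}
	for {
		uid, found, err := get(ctx)
		if err != nil {
			return 0, fmt.Errorf("get destination: %w", err)
		}
		// A still-present object with the old UID means the delete has not
		// landed yet; only a new UID counts as a recreate.
		if found && uid != origUID {
			return time.Since(start), nil
		}
		select {
		case <-ctx.Done():
			return 0, fmt.Errorf("destination not recreated: %w", ctx.Err())
		case <-time.After(interval):
		}
	}
}

// demoSelfHeal runs timeSelfHeal against a fake destination that reappears
// with a new UID on the third poll, and reports how many polls were made.
func demoSelfHeal() (int, error) {
	polls := 0
	get := func(context.Context) (string, bool, error) {
		polls++
		if polls < 3 {
			return "", false, nil // deleted, not yet recreated
		}
		return "uid-new", true, nil
	}
	del := func(context.Context) error { return nil }
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_, err := timeSelfHeal(ctx, "uid-old", del, get, time.Millisecond)
	return polls, err
}

func main() {
	polls, err := demoSelfHeal()
	fmt.Println(polls, err)
}
```

Comparing UIDs rather than checking mere existence avoids counting a poll that lands before the delete has taken effect.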

ns-flip mechanics: for K = `min(20, SelectorDsts/2)` namespaces, in sequence: (1) cleanup phase removes the matching label, times destination delete; (2) add phase re-adds the label, times destination create. The `/2` floor keeps the rest of the fanout as a steady backdrop. Two paired distributions.
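A rough sketch of the two-phase sequence, one namespace at a time, collecting one cleanup sample and one add sample per namespace. The `flipTarget` interface and the in-memory `fakeCluster` are hypothetical; the real harness talks to the API server, and where exactly it starts the clock is not shown in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// flipTarget is a hypothetical abstraction over the cluster operations
// the ns-flip event needs.
type flipTarget interface {
	SetMatchLabel(ns string, present bool) error
	DestinationExists(ns string) (bool, error)
}

// timePhase sets or removes the label, then polls until the destination's
// existence matches the label state. Assumption: the clock starts just
// before the label patch.
func timePhase(t flipTarget, ns string, present bool, interval, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	if err := t.SetMatchLabel(ns, present); err != nil {
		return 0, err
	}
	for {
		exists, err := t.DestinationExists(ns)
		if err != nil {
			return 0, err
		}
		if exists == present {
			return time.Since(start), nil
		}
		if time.Since(start) > timeout {
			return 0, fmt.Errorf("namespace %s: destination exists=%v after %s", ns, exists, timeout)
		}
		time.Sleep(interval)
	}
}

// nsFlip runs cleanup then add for each namespace in sequence and returns
// the two paired latency distributions.
func nsFlip(t flipTarget, namespaces []string, interval, timeout time.Duration) ([]time.Duration, []time.Duration, error) {
	var cleanup, add []time.Duration
	for _, ns := range namespaces {
		c, err := timePhase(t, ns, false, interval, timeout)
		if err != nil {
			return nil, nil, err
		}
		a, err := timePhase(t, ns, true, interval, timeout)
		if err != nil {
			return nil, nil, err
		}
		cleanup, add = append(cleanup, c), append(add, a)
	}
	return cleanup, add, nil
}

// fakeCluster reacts instantly: a destination exists iff the label is set.
type fakeCluster map[string]bool

func (f fakeCluster) SetMatchLabel(ns string, present bool) error { f[ns] = present; return nil }
func (f fakeCluster) DestinationExists(ns string) (bool, error)   { return f[ns], nil }

// demoNsFlip flips three namespaces and reports the sample counts and
// whether every namespace ended with its label restored.
func demoNsFlip() (int, int, bool, error) {
	fc := fakeCluster{"ns-a": true, "ns-b": true, "ns-c": true}
	cleanup, add, err := nsFlip(fc, []string{"ns-a", "ns-b", "ns-c"}, time.Millisecond, time.Second)
	restored := fc["ns-a"] && fc["ns-b"] && fc["ns-c"]
	return len(cleanup), len(add), restored, err
}

func main() {
	fmt.Println(demoNsFlip())
}
```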

5s settle between events so the previous event's tail (controller queue drain, cache settling) doesn't leak into the next event's distribution.

Smoke comment script updates jq paths for the rename. Comment table stays source-update-only — self-heal and ns-flip distributions are tracked in `bench.json` but kept out of the comment for signal-to-noise. Note in the comment body explains this.

Local Kind verification (mixed-typical, 86s wall)

Real numbers from the projection-demo Kind cluster (2-vCPU equivalent):

| Path | Event | p50 | p95 | p99 |
|---|---|---|---|---|
| NP single-target | source-update | 13.9ms | 17.7ms | 1.38s ¹ |
| NP single-target | self-heal | 13.7ms | 101.7ms | 109.1ms |
| CP-selector earliest | source-update | 27.1ms | 67.8ms | 71.8ms |
| CP-selector slowest | source-update | 55.0ms | 90.3ms | 91.3ms |
| CP-selector | self-heal | 99.0ms | 108.7ms | 108.7ms |
| CP-selector | ns-flip cleanup | 188.5ms | 207.0ms | 207.0ms |
| CP-selector | ns-flip add | 97.4ms | 119.0ms | 119.0ms |
| CP-list earliest | source-update | 23.1ms | 28.4ms | 37.8ms |
| CP-list slowest | source-update | 23.1ms | 28.4ms | 37.8ms |
| CP-list | self-heal | 100.1ms | 110.3ms | 110.3ms |

¹ One tail outlier in NP source-update p99 — the projection-demo cluster has been up 14h with various workloads. Not specific to this PR.

Healthy numbers across all events. The ns-flip cleanup latency (delete latency, ~190ms) being slower than ns-flip add (create latency, ~97ms) is expected — the controller's selector-changed delete path is more involved than the create path.
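For readers who want to recompute p50/p95/p99 from raw samples in `bench.json`, here is a minimal nearest-rank sketch over nanosecond values. The harness's own percentile method is not shown in this PR, so this may not reproduce the table exactly, and `percentileNs` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentileNs returns the p-th percentile (0 < p <= 100) of samples using
// the nearest-rank method: the value at rank ceil(p/100 * n) in sorted
// order. It returns 0 for an empty slice. The input is not modified.
func percentileNs(samples []int64, p float64) int64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Ten made-up latencies in nanoseconds (1ms .. 10ms), unsorted.
	samples := []int64{7e6, 2e6, 9e6, 1e6, 4e6, 10e6, 3e6, 6e6, 5e6, 8e6}
	for _, p := range []float64{50, 95, 99} {
		fmt.Printf("p%.0f = %dns\n", p, percentileNs(samples, p))
	}
}
```

With sample sizes this small, upper percentiles sit on or near the maximum, so a single slow sample can dominate the tail columns.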

Note: this PR will trigger the bench-smoke workflow

The trigger paths from PR #92 include `test/bench/**`, so the bench-smoke check will fire on this PR. First time we'll see real numbers from a 2-vCPU GHA runner.

Out of scope

  • Per-event controller-side histograms (Approach 2b, v0.4.0)
  • GHA release workflow + bench-history orphan branch (PR 3)

Test plan

  • `make build` clean
  • `make lint` clean (0 issues)
  • `go test ./test/bench/ -count=1 -race` passing including new `TestCapSample` + gating tests
  • `shellcheck .github/scripts/bench-comment.sh` clean
  • mixed-typical local Kind run: all events produce healthy numbers, teardown leaves no residue
  • bench-smoke workflow fires on this PR with sub-min runtime (first 2-vCPU runner test — will validate workflow design)

…elds

The harness already measured one event (timestamp annotation propagation)
but the report fields (e2e_np_p50_ns, e2e_cp_sel_earliest_p50_ns, etc.)
were unlabeled, conflating the metric with its semantics. Adding more
events forces the rename: source-update is now explicit in field names,
self-heal lives next to it for every topology, and ns-flip (cleanup +
add) covers the CP-selector watcher path.

Renames (no compat aliases - pre-v1.0, no users):
- e2e_np_*_ns           -> e2e_np_source_update_*_ns
- e2e_cp_sel_*_ns       -> e2e_cp_sel_source_update_*_ns
- e2e_cp_list_*_ns      -> e2e_cp_list_source_update_*_ns

New events:
- self-heal (NP, CP-sel, CP-list): delete K destinations, time recreation
  per-destination via UID change. K = min(100, NPRefs) for NP, min(20,
  fanout) for CP. Per-destination latency, no fan-out earliest/slowest -
  each recreate is independent.
- ns-flip (CP-selector only): K = min(20, dstSet/2) namespaces. Cleanup
  phase removes the matching label and times destination delete; add
  phase re-adds the label and times destination create. The /2 floor
  leaves the rest of the fanout as a steady backdrop.

Each event ends with a 5s settle before the next, so the previous
event's tail (controller queue drain, cache settling) doesn't leak into
the next distribution.

The smoke comment script keeps the table source-update-only - self-heal
and ns-flip are noise for a per-PR shape-break check. The full report
JSON carries every distribution.

github-actions Bot commented May 8, 2026

Bench smoke — mixed-typical

End-to-end source-update latency from a 2-vCPU GHA runner. Treat absolute numbers as a sanity check, not a perf claim — runner noise is high. The point of this check is to catch shape-break regressions on api/v1 / controller / bench changes before merge. (Self-heal and ns-flip distributions are recorded in bench.json but omitted here for signal-to-noise.)

Profile

100 namespaced Projections + 50 CP-selector destinations + 10 CP-list destinations, layered in one bootstrap.

Results — source-update latency

| Path | Samples | p50 | p95 | p99 |
|---|---|---|---|---|
| NP single-target | 100 | 17.0ms | 21.4ms | 41.1ms |
| CP-selector earliest | 30 | 62.2ms | 105.0ms | 116.1ms |
| CP-selector slowest | 30 | 138.7ms | 187.5ms | 216.6ms |
| CP-list earliest | 30 | 40.6ms | 50.8ms | 60.4ms |
| CP-list slowest | 30 | 40.6ms | 71.4ms | 85.0ms |

Total wall: 89s • Commit: 3bb6515 • Workflow run

@be0x74a be0x74a merged commit f97e494 into main May 8, 2026
16 checks passed
@be0x74a be0x74a deleted the feat/bench-events branch May 8, 2026 11:27