feat(bench): add self-heal and ns-flip events, label source-update fields#93
Merged
…elds

The harness already measured one event (timestamp annotation propagation) but the report fields - e2e_np_p50_ns, e2e_cp_sel_earliest_p50_ns, etc. - were unlabeled, conflating the metric with its semantics. Adding more events forces the rename: source-update is now explicit in field names, self-heal lives next to it for every topology, and ns-flip (cleanup + add) covers the CP-selector watcher path.

Renames (no compat aliases - pre-v1.0, no users):
- e2e_np_*_ns -> e2e_np_source_update_*_ns
- e2e_cp_sel_*_ns -> e2e_cp_sel_source_update_*_ns
- e2e_cp_list_*_ns -> e2e_cp_list_source_update_*_ns

New events:
- self-heal (NP, CP-sel, CP-list): delete K destinations, time recreation per-destination via UID change. K = min(100, NPRefs) for NP, min(20, fanout) for CP. Per-destination latency, no fan-out earliest/slowest - each recreate is independent.
- ns-flip (CP-selector only): K = min(20, dstSet/2) namespaces. Cleanup phase removes the matching label and times destination delete; add phase re-adds the label and times destination create. The /2 floor leaves the rest of the fanout as a steady backdrop.

Each event ends with a 5s settle before the next, so the previous event's tail (controller queue drain, cache settling) doesn't leak into the next distribution.

The smoke comment script keeps the table source-update-only - self-heal and ns-flip are noise for a per-PR shape-break check. The full report JSON carries every distribution.
Bench smoke —
| Path | Samples | p50 | p95 | p99 |
|---|---|---|---|---|
| NP single-target | 100 | 17.0ms | 21.4ms | 41.1ms |
| CP-selector earliest | 30 | 62.2ms | 105.0ms | 116.1ms |
| CP-selector slowest | 30 | 138.7ms | 187.5ms | 216.6ms |
| CP-list earliest | 30 | 40.6ms | 50.8ms | 60.4ms |
| CP-list slowest | 30 | 40.6ms | 71.4ms | 85.0ms |
Total wall: 89s • Commit: 3bb6515 • Workflow run
Why
PR 2 of the bench multi-PR sequence:
Today the bench measures one event — controller-side propagation of a stamp annotation patched onto the source CR. The fields in the report (`e2e_np_p50_ns`, `e2e_cp_sel_earliest_p50_ns`, …) are unlabeled, so a reader can't tell whether the latency is for create / source-update / self-heal / etc. Adding more event types forces the rename.
What
Three event measurements per topology:
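Rename, no compat aliases (pre-v1.0, no users):
- `e2e_np_*_ns` → `e2e_np_source_update_*_ns`
- `e2e_cp_sel_*_ns` → `e2e_cp_sel_source_update_*_ns`
- `e2e_cp_list_*_ns` → `e2e_cp_list_source_update_*_ns`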
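For anyone updating downstream tooling by hand, the rename pattern can be sketched as a small key-mapping helper. This is an illustration only: `renamedField` is a hypothetical function, not something the harness ships, and it assumes every affected key starts with one of the three prefixes named in this PR and ends in `_ns`.

```go
package main

import (
	"fmt"
	"strings"
)

// renamedField maps a pre-rename report key to its source-update name.
// Hypothetical helper: the prefixes come from the rename in this PR,
// everything else is an assumption made for illustration.
func renamedField(old string) (string, bool) {
	if !strings.HasSuffix(old, "_ns") {
		return "", false
	}
	for _, prefix := range []string{"e2e_np_", "e2e_cp_sel_", "e2e_cp_list_"} {
		if !strings.HasPrefix(old, prefix) {
			continue
		}
		rest := strings.TrimPrefix(old, prefix)
		// Leave keys alone if they already carry the new label.
		if strings.HasPrefix(rest, "source_update_") {
			return "", false
		}
		return prefix + "source_update_" + rest, true
	}
	return "", false
}

func main() {
	for _, key := range []string{"e2e_np_p50_ns", "e2e_cp_sel_earliest_p50_ns"} {
		renamed, ok := renamedField(key)
		fmt.Println(key, "->", renamed, ok)
	}
}
```

Since there are no compat aliases, a consumer reading the old keys gets nothing back rather than stale values, so a mapping like this would only be useful for a one-off migration of saved reports.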
Self-heal mechanics: capture original destination UID → delete the destination CR → poll until the destination is observed with a different UID. This times the controller's recreate path specifically; follow-up reconciles to align spec are measured by source-update. Per-destination latency, no fanout-aware earliest/slowest (each destination's recreate is independent).
ns-flip mechanics: for K = `min(20, SelectorDsts/2)` namespaces, in sequence: (1) cleanup phase removes the matching label, times destination delete; (2) add phase re-adds the label, times destination create. The `/2` floor keeps the rest of the fanout as a steady backdrop. Two paired distributions.
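The sample sizes quoted in this PR (`min(100, NPRefs)` and `min(20, fanout)` for self-heal, `min(20, SelectorDsts/2)` for ns-flip) work out as in the sketch below. The function names are hypothetical, and integer division for the `/2` is an assumption about how the harness rounds.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// selfHealKNP: number of NP destinations to delete, capped at 100.
func selfHealKNP(npRefs int) int { return minInt(100, npRefs) }

// selfHealKCP: number of CP destinations to delete, capped at 20.
func selfHealKCP(fanout int) int { return minInt(20, fanout) }

// nsFlipK: namespaces to flip, at most half the selector fanout so the
// remaining destinations stay in place as a steady backdrop.
func nsFlipK(selectorDsts int) int { return minInt(20, selectorDsts/2) }

func main() {
	fmt.Println(selfHealKNP(250), selfHealKCP(12), nsFlipK(30), nsFlipK(100))
	// prints: 100 12 15 20
}
```

Under the integer-division assumption, a selector fanout of 1 gives K = 0, so very small profiles would produce no ns-flip samples at all.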
A 5s settle between events keeps the previous event's tail (controller queue drain, cache settling) from leaking into the next event's distribution.
Smoke comment script updates jq paths for the rename. Comment table stays source-update-only — self-heal and ns-flip distributions are tracked in `bench.json` but kept out of the comment for signal-to-noise. Note in the comment body explains this.
Local Kind verification (mixed-typical, 86s wall)
Real numbers from the projection-demo Kind cluster (2-vCPU equivalent):
¹ One tail outlier in NP source-update p99 — the projection-demo cluster has been up 14h with various workloads. Not specific to this PR.
Healthy numbers across all events. ns-flip cleanup (delete latency, ~190ms) is slower than ns-flip add (create latency, ~97ms), which is expected: the controller's selector-changed delete path is more involved than the create path.
Note: this PR will trigger the bench-smoke workflow
The trigger paths from PR #92 include `test/bench/**`, so the bench-smoke check will fire on this PR. First time we'll see real numbers from a 2-vCPU GHA runner.
Out of scope
Test plan