
fix(bench): synchronously wait for teardown deletes to complete #96

Merged
be0x74a merged 1 commit into main from fix/bench-teardown-sync
May 8, 2026




@be0x74a be0x74a commented May 8, 2026

Why

PR #95 fixed the namespace teardown→bootstrap race by waiting for any `Terminating` namespace inside `ensureNamespace`. The next end-to-end `--profile=full` run hit the same race class on a different resource type:

  • np-typical teardown async-deletes 10 CRDs
  • np-stress's `installCRDs` calls `Create` on the same CRD names → gets `IsAlreadyExists` (still-Terminating) → skips
  • 3s sleep (existing post-install settle)
  • During that sleep the finalizer completes; CRD drops from etcd
  • np-stress's `createSource` for GVK 0 → "the server could not find the requested resource"

Rather than play whack-a-mole (namespaces today via PR #95, CRDs tomorrow, ClusterProjections next week), this PR closes the whole class at the source.

What

`teardown` now synchronously polls until every namespace, CRD, and ClusterProjection it deleted is observed `NotFound`. Bounded at 120s; on timeout it returns silently (next bootstrap will surface genuinely-stuck state).

The PR #95 `ensureNamespace` wait stays in place as defense-in-depth — it covers external-actor deletes that happen during a run, not just the inter-profile teardown race this commit closes.

Trade-off

Teardown wall time grows by however long the slowest finalizer takes. In practice:

  • typical profiles: ~1s extra (namespace finalizers are fast at small N)
  • stress profiles: ~5-15s extra (1000+ destinations cascade-deleting take longer)

Across an 8-profile `full` matrix that's maybe +1-2 min total wall. Acceptable for the predictability win.

Test plan

  • `make build` clean
  • `make lint` 0 issues
  • `go test ./test/bench/ -count=1 -race` passing
  • Bench-smoke on this PR (will fire — touches `test/bench/**`)
  • After merge: re-trigger release-bench against `main`, expect it to walk past np-stress and complete the full matrix

PR #95 fixed the namespace teardown→bootstrap race by waiting for any
Terminating namespace inside ensureNamespace. The next end-to-end run
hit the same race class on a different resource: profile np-stress's
installCRDs got Create→IsAlreadyExists on a still-Terminating CRD from
np-typical's teardown, skipped the Create, slept 3s, and during that
sleep the CRD finalizer completed. The follow-up createSource then saw
"the server could not find the requested resource".

Rather than playing whack-a-mole with one Terminating-aware Ensure per
resource type (namespaces today, CRDs tomorrow, ClusterProjections next
week), centralize the cleanup-completion wait in teardown itself. After
issuing every Delete, teardown now polls until every namespace, CRD,
and ClusterProjection it deleted is observed NotFound. Bounded at 120s;
on timeout the function returns silently (next bootstrap will surface
genuinely stuck state).

The PR #95 ensureNamespace wait stays in place as defense-in-depth — it
covers external-actor deletes that happen during a run, not just the
inter-profile teardown race this commit closes.

github-actions Bot commented May 8, 2026

Bench smoke — mixed-typical

End-to-end source-update latency from a 2-vCPU GHA runner. Treat absolute numbers as a sanity check, not a perf claim — runner noise is high. The point of this check is to catch shape-break regressions on api/v1 / controller / bench changes before merge. (Self-heal and ns-flip distributions are recorded in bench.json but omitted here for signal-to-noise.)

Profile

100 namespaced Projections + 50 CP-selector destinations + 10 CP-list destinations, layered in one bootstrap.

Results — source-update latency

| Path | Samples | p50 | p95 | p99 |
|---|---|---|---|---|
| NP single-target | 100 | 16.4ms | 21.4ms | 36.7ms |
| CP-selector earliest | 30 | 57.5ms | 111.2ms | 125.8ms |
| CP-selector slowest | 30 | 139.3ms | 176.8ms | 191.1ms |
| CP-list earliest | 30 | 38.1ms | 71.2ms | 82.7ms |
| CP-list slowest | 30 | 38.1ms | 77.9ms | 82.8ms |

Total wall: 88s • Commit: 29785f8

@be0x74a be0x74a merged commit d357f36 into main May 8, 2026
16 checks passed
@be0x74a be0x74a deleted the fix/bench-teardown-sync branch May 8, 2026 15:01