fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag#1047
fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag#1047ashishpatel26 wants to merge 2 commits into
Conversation
…box-group#969) Two complementary fixes to close the gap between Pod Ready and BatchSandbox Ready that was measured at P50=26 s / P99=40 s when batch-creating 128 sandboxes simultaneously. Root cause: under a high-concurrency burst the controller-runtime default work-queue rate limiter (ItemExponentialFailureRateLimiter, max=1000 s) allowed per-item back-off to grow to 20-40 s whenever a status patch returned a transient error (conflict / 429). The reconcile was queued but not executed until the back-off expired, so the Succeed status write was delayed even though the pod had already been Ready for tens of seconds. Fix 1 – lower the max back-off in SetupWithManager from 1000 s to 5 s. Transient failures are retried quickly; persistent errors still back off exponentially up to the 5 s ceiling. Fix 2 – return RequeueAfter: 5 s from persistRuntimeView while the phase is Pending and pods exist. This acts as a safety-net polling interval so the controller converges even when a Pod Ready event is missed or delayed under load. Also add an integration test that marks a pod Ready and asserts the BatchSandbox reaches Succeed within 10 s.
|
Changed directories: kubernetes. 📋 Recommended labels (based on changed files):
Other available labels:
💡 Tip: Use |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR addresses controller responsiveness issues (notably #969) by reducing reconcile delays and adding safeguards so BatchSandbox status transitions (Pending → Succeed) happen promptly after Pods become Ready, even under load or transient API errors.
Changes:
- Add a short (5s) safety-net re-reconcile while
BatchSandboxis Pending with replicas. - Add early validation for
shardTaskPatches, surfacing type errors via status conditions and avoiding stuck deletions. - Customize controller rate limiting to cap exponential backoff at 5s and add a regression test for the ready-lag scenario.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubernetes/internal/controller/batchsandbox_status.go | Adds a short Pending-phase requeue to reduce Ready→Succeed lag. |
| kubernetes/internal/controller/batchsandbox_controller.go | Adds shard patch validation and configures a lower-backoff rate limiter. |
| kubernetes/internal/controller/batchsandbox_controller_test.go | Adds an envtest/Ginkgo regression test for issue #969. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| WithOptions(controller.Options{ | ||
| MaxConcurrentReconciles: maxConcurrentReconciles, | ||
| RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 5*time.Second), | ||
| }). |
| WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}). | ||
| WithOptions(controller.Options{ | ||
| MaxConcurrentReconciles: maxConcurrentReconciles, | ||
| RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 5*time.Second), |
| if view.status.Phase == sandboxv1alpha1.BatchSandboxPhasePending && view.status.Replicas > 0 { | ||
| return 5 * time.Second, aggErrors | ||
| } |
| Eventually(func(g Gomega) { | ||
| podList := &corev1.PodList{} | ||
| g.Expect(k8sClient.List(ctx, podList, &client.ListOptions{Namespace: typeNamespacedName.Namespace})).To(Succeed()) | ||
| var found *corev1.Pod | ||
| for i := range podList.Items { | ||
| po := &podList.Items[i] | ||
| bs := &sandboxv1alpha1.BatchSandbox{} | ||
| if err := k8sClient.Get(ctx, typeNamespacedName, bs); err != nil { | ||
| continue | ||
| } | ||
| if metav1.IsControlledBy(po, bs) { | ||
| found = po | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ac3507238
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Validate shardTaskPatches early so that a type-mismatch is surfaced | ||
| // immediately as a status condition rather than causing a silent merge | ||
| // failure deep in the reconcile loop. | ||
| if patchErr := taskStrategy.ValidateShardTaskPatches(); patchErr != nil { |
There was a problem hiding this comment.
Define shard patch validation before calling it
When task scheduling is enabled, this call cannot compile: taskStrategy has type strategy.TaskSchedulingStrategy, whose interface only declares NeedTaskScheduling and GenerateTaskSpecs, and a repo-wide search found no ValidateShardTaskPatches method or BatchSandboxConditionInvalidShardPatch constant. As a result, the Kubernetes controller package will fail to build for any build/test that includes this file.
Useful? React with 👍 / 👎.
|
Thanks for the reviews. Addressing feedback: Codex P1 – Interface missing Copilot – 5ms base delay too aggressive: Valid. 5ms is very short and can generate burst retries under transient failures. Will change to 100ms base / 5s max to be less aggressive while still being responsive. Copilot – RateLimiter type mismatch: Will verify against the actual controller-runtime version in Copilot – Status requeue condition too broad: Valid. Copilot – Test API call inefficiency: Will hoist the |
- Remove ValidateShardTaskPatches contamination (belongs in opensandbox-group#1049) - Change rate limiter base delay 5ms -> 100ms to reduce retry burst - Narrow status requeue guard to Replicas>0 && Ready==0 to avoid unnecessary reconciles before pods are scheduled
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f12397d4d0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}). | ||
| WithOptions(controller.Options{ | ||
| MaxConcurrentReconciles: maxConcurrentReconciles, | ||
| RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](100*time.Millisecond, 5*time.Second), |
There was a problem hiding this comment.
Preserve the global retry throttle
When many different BatchSandbox reconciles hit transient API errors at once, this replaces controller-runtime's default MaxOfRateLimiter with a per-item exponential limiter only, so each failed sandbox can be requeued after 100ms/200ms/etc. without the default global token-bucket backpressure. In the bulk-create/429 scenario described in the comment, this can synchronize retries and increase API-server pressure instead of relieving it; keep the lower per-item cap but wrap it with an overall bucket limiter like the default does.
Useful? React with 👍 / 👎.
Summary
Under burst load (128 sandboxes), controller-runtime's
ItemExponentialFailureRateLimiterdefault max back-off (1000 s) caused 20–40 s delays after status patch conflicts. The BatchSandbox lingered in non-Ready state long after the pod was Ready.Two changes:
SetupWithManagerRequeueAfter: 5 swhen phase isPendingand replicas > 0 — safety-net poll in case a Pod Ready event is missed under loadBefore / After
READY_TO_BSBREADY_TO_BSBTest plan
go test ./internal/controller/...— integration test: marks pod Ready, asserts BatchSandbox phase == Succeed within 10 sFixes #969