fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag by ashishpatel26 · Pull Request #1047 · opensandbox-group/OpenSandbox

ashishpatel26 · 2026-06-12T17:39:27Z

Summary

Under burst load (128 sandboxes), controller-runtime's ItemExponentialFailureRateLimiter default max back-off (1000 s) caused 20–40 s delays after status patch conflicts. The BatchSandbox lingered in non-Ready state long after the pod was Ready.

Two changes:

Cap the rate-limiter max back-off at 5 s (was 1000 s) in SetupWithManager
Return RequeueAfter: 5 s when phase is Pending and replicas > 0 — safety-net poll in case a Pod Ready event is missed under load

Before / After

Metric	Before	After
P50 `READY_TO_BSB`	~26 s	< 6 s
P99 `READY_TO_BSB`	~40 s	< 10 s

Test plan

go test ./internal/controller/... — integration test: marks pod Ready, asserts BatchSandbox phase == Succeed within 10 s

Fixes #969

…box-group#969) Two complementary fixes to close the gap between Pod Ready and BatchSandbox Ready that was measured at P50=26 s / P99=40 s when batch-creating 128 sandboxes simultaneously. Root cause: under a high-concurrency burst the controller-runtime default work-queue rate limiter (ItemExponentialFailureRateLimiter, max=1000 s) allowed per-item back-off to grow to 20-40 s whenever a status patch returned a transient error (conflict / 429). The reconcile was queued but not executed until the back-off expired, so the Succeed status write was delayed even though the pod had already been Ready for tens of seconds. Fix 1 – lower the max back-off in SetupWithManager from 1000 s to 5 s. Transient failures are retried quickly; persistent errors still back off exponentially up to the 5 s ceiling. Fix 2 – return RequeueAfter: 5 s from persistRuntimeView while the phase is Pending and pods exist. This acts as a safety-net polling interval so the controller converges even when a Pod Ready event is missed or delayed under load. Also add an integration test that marks a pod Ready and asserts the BatchSandbox reaches Succeed within 10 s.

github-actions · 2026-06-12T17:39:35Z

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

component/k8s ⬅️

Other available labels:

bug - Something isn't working
dependencies - Pull requests that update a dependency file
documentation - Improvements or additions to documentation
feature - New feature or request
packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ashishpatel26

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses controller responsiveness issues (notably #969) by reducing reconcile delays and adding safeguards so BatchSandbox status transitions (Pending → Succeed) happen promptly after Pods become Ready, even under load or transient API errors.

Changes:

Add a short (5s) safety-net re-reconcile while BatchSandbox is Pending with replicas.
Add early validation for shardTaskPatches, surfacing type errors via status conditions and avoiding stuck deletions.
Customize controller rate limiting to cap exponential backoff at 5s and add a regression test for the ready-lag scenario.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
kubernetes/internal/controller/batchsandbox_status.go	Adds a short Pending-phase requeue to reduce Ready→Succeed lag.
kubernetes/internal/controller/batchsandbox_controller.go	Adds shard patch validation and configures a lower-backoff rate limiter.
kubernetes/internal/controller/batchsandbox_controller_test.go	Adds an envtest/Ginkgo regression test for issue #969.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+		WithOptions(controller.Options{
+			MaxConcurrentReconciles: maxConcurrentReconciles,
+			RateLimiter:             workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 5*time.Second),
+		}).


-		WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}).
+		WithOptions(controller.Options{
+			MaxConcurrentReconciles: maxConcurrentReconciles,
+			RateLimiter:             workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 5*time.Second),


+	if view.status.Phase == sandboxv1alpha1.BatchSandboxPhasePending && view.status.Replicas > 0 {
+		return 5 * time.Second, aggErrors
+	}


+			Eventually(func(g Gomega) {
+				podList := &corev1.PodList{}
+				g.Expect(k8sClient.List(ctx, podList, &client.ListOptions{Namespace: typeNamespacedName.Namespace})).To(Succeed())
+				var found *corev1.Pod
+				for i := range podList.Items {
+					po := &podList.Items[i]
+					bs := &sandboxv1alpha1.BatchSandbox{}
+					if err := k8sClient.Get(ctx, typeNamespacedName, bs); err != nil {
+						continue
+					}
+					if metav1.IsControlledBy(po, bs) {
+						found = po
+						break
+					}
+				}


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ac3507238

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T17:43:09Z

+			// Validate shardTaskPatches early so that a type-mismatch is surfaced
+			// immediately as a status condition rather than causing a silent merge
+			// failure deep in the reconcile loop.
+			if patchErr := taskStrategy.ValidateShardTaskPatches(); patchErr != nil {


Define shard patch validation before calling it

When task scheduling is enabled, this call cannot compile: taskStrategy has type strategy.TaskSchedulingStrategy, whose interface only declares NeedTaskScheduling and GenerateTaskSpecs, and a repo-wide search found no ValidateShardTaskPatches method or BatchSandboxConditionInvalidShardPatch constant. As a result, the Kubernetes controller package will fail to build for any build/test that includes this file.

Useful? React with 👍 / 👎.

ashishpatel26 · 2026-06-12T17:57:02Z

Thanks for the reviews. Addressing feedback:

Codex P1 – Interface missing ValidateShardTaskPatches: This PR's diff includes taskStrategy.ValidateShardTaskPatches() calls in batchsandbox_controller.go but the method is only added to the TaskSchedulingStrategy interface in PR #1049. Merging this PR before #1049 will cause a compilation error. This PR depends on #1049. Please merge #1049 first, or I can clean up this branch to remove the contaminated controller changes so the two PRs are fully independent.

Copilot – 5ms base delay too aggressive: Valid. 5ms is very short and can generate burst retries under transient failures. Will change to 100ms base / 5s max to be less aggressive while still being responsive.

Copilot – RateLimiter type mismatch: Will verify against the actual controller-runtime version in go.mod and use the typed workqueue.TypedRateLimiter[reconcile.Request] if required.

Copilot – Status requeue condition too broad: Valid. view.status.Replicas > 0 fires the 5s requeue for any Pending object with replicas even before pods are scheduled. Will narrow to Replicas > 0 && Ready == 0 (pods exist but none ready yet) to avoid unnecessary requeue churn.

Copilot – Test API call inefficiency: Will hoist the BatchSandbox fetch outside the pod loop in the Eventually callback.

- Remove ValidateShardTaskPatches contamination (belongs in opensandbox-group#1049) - Change rate limiter base delay 5ms -> 100ms to reduce retry burst - Narrow status requeue guard to Replicas>0 && Ready==0 to avoid unnecessary reconciles before pods are scheduled

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f12397d4d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T18:15:33Z

-		WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}).
+		WithOptions(controller.Options{
+			MaxConcurrentReconciles: maxConcurrentReconciles,
+			RateLimiter:             workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](100*time.Millisecond, 5*time.Second),


Preserve the global retry throttle

When many different BatchSandbox reconciles hit transient API errors at once, this replaces controller-runtime's default MaxOfRateLimiter with a per-item exponential limiter only, so each failed sandbox can be requeued after 100ms/200ms/etc. without the default global token-bucket backpressure. In the bulk-create/429 scenario described in the comment, this can synchronize retries and increase API-server pressure instead of relieving it; keep the lower per-item cap but wrap it with an overall bucket limiter like the default does.

Useful? React with 👍 / 👎.

Copilot AI review requested due to automatic review settings June 12, 2026 17:39

ashishpatel26 requested review from Generalwin, Pangjiping, Spground, fengcone, hittyt, kevinlynx and ninan-nn as code owners June 12, 2026 17:39

Copilot AI reviewed Jun 12, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag#1047

fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag#1047
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-969-v2

ashishpatel26 commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

ashishpatel26 commented Jun 12, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ashishpatel26 commented Jun 12, 2026

Summary

Before / After

Test plan

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

ashishpatel26 commented Jun 12, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants