Skip to content

fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag#1047

Open
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-969-v2
Open

fix(batchsandbox): reduce Pod-Ready to BatchSandbox-Ready lag#1047
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-969-v2

Conversation

@ashishpatel26

Copy link
Copy Markdown

Summary

Under burst load (128 sandboxes), controller-runtime's ItemExponentialFailureRateLimiter default max back-off (1000 s) caused 20–40 s delays after status patch conflicts. The BatchSandbox lingered in non-Ready state long after the pod was Ready.

Two changes:

  1. Cap the rate-limiter max back-off at 5 s (was 1000 s) in SetupWithManager
  2. Return RequeueAfter: 5 s when phase is Pending and replicas > 0 — safety-net poll in case a Pod Ready event is missed under load

Before / After

Metric Before After
P50 READY_TO_BSB ~26 s < 6 s
P99 READY_TO_BSB ~40 s < 10 s

Test plan

  • go test ./internal/controller/... — integration test: marks pod Ready, asserts BatchSandbox phase == Succeed within 10 s

Fixes #969

…box-group#969)

Two complementary fixes to close the gap between Pod Ready and
BatchSandbox Ready that was measured at P50=26 s / P99=40 s when
batch-creating 128 sandboxes simultaneously.

Root cause: under a high-concurrency burst the controller-runtime
default work-queue rate limiter (ItemExponentialFailureRateLimiter,
max=1000 s) allowed per-item back-off to grow to 20-40 s whenever
a status patch returned a transient error (conflict / 429).  The
reconcile was queued but not executed until the back-off expired,
so the Succeed status write was delayed even though the pod had
already been Ready for tens of seconds.

Fix 1 – lower the max back-off in SetupWithManager from 1000 s to
5 s.  Transient failures are retried quickly; persistent errors
still back off exponentially up to the 5 s ceiling.

Fix 2 – return RequeueAfter: 5 s from persistRuntimeView while the
phase is Pending and pods exist.  This acts as a safety-net polling
interval so the controller converges even when a Pod Ready event is
missed or delayed under load.

Also add an integration test that marks a pod Ready and asserts the
BatchSandbox reaches Succeed within 10 s.
Copilot AI review requested due to automatic review settings June 12, 2026 17:39
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

  • component/k8s ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • documentation - Improvements or additions to documentation
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ashishpatel26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses controller responsiveness issues (notably #969) by reducing reconcile delays and adding safeguards so BatchSandbox status transitions (Pending → Succeed) happen promptly after Pods become Ready, even under load or transient API errors.

Changes:

  • Add a short (5s) safety-net re-reconcile while BatchSandbox is Pending with replicas.
  • Add early validation for shardTaskPatches, surfacing type errors via status conditions and avoiding stuck deletions.
  • Customize controller rate limiting to cap exponential backoff at 5s and add a regression test for the ready-lag scenario.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
kubernetes/internal/controller/batchsandbox_status.go Adds a short Pending-phase requeue to reduce Ready→Succeed lag.
kubernetes/internal/controller/batchsandbox_controller.go Adds shard patch validation and configures a lower-backoff rate limiter.
kubernetes/internal/controller/batchsandbox_controller_test.go Adds an envtest/Ginkgo regression test for issue #969.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +669 to +672
WithOptions(controller.Options{
MaxConcurrentReconciles: maxConcurrentReconciles,
RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 5*time.Second),
}).
WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}).
WithOptions(controller.Options{
MaxConcurrentReconciles: maxConcurrentReconciles,
RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](5*time.Millisecond, 5*time.Second),
Comment on lines +291 to +293
if view.status.Phase == sandboxv1alpha1.BatchSandboxPhasePending && view.status.Replicas > 0 {
return 5 * time.Second, aggErrors
}
Comment on lines +520 to +534
Eventually(func(g Gomega) {
podList := &corev1.PodList{}
g.Expect(k8sClient.List(ctx, podList, &client.ListOptions{Namespace: typeNamespacedName.Namespace})).To(Succeed())
var found *corev1.Pod
for i := range podList.Items {
po := &podList.Items[i]
bs := &sandboxv1alpha1.BatchSandbox{}
if err := k8sClient.Get(ctx, typeNamespacedName, bs); err != nil {
continue
}
if metav1.IsControlledBy(po, bs) {
found = po
break
}
}

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ac3507238

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Validate shardTaskPatches early so that a type-mismatch is surfaced
// immediately as a status condition rather than causing a silent merge
// failure deep in the reconcile loop.
if patchErr := taskStrategy.ValidateShardTaskPatches(); patchErr != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Define shard patch validation before calling it

When task scheduling is enabled, this call cannot compile: taskStrategy has type strategy.TaskSchedulingStrategy, whose interface only declares NeedTaskScheduling and GenerateTaskSpecs, and a repo-wide search found no ValidateShardTaskPatches method or BatchSandboxConditionInvalidShardPatch constant. As a result, the Kubernetes controller package will fail to build for any build/test that includes this file.

Useful? React with 👍 / 👎.

@ashishpatel26

Copy link
Copy Markdown
Author

Thanks for the reviews. Addressing feedback:

Codex P1 – Interface missing ValidateShardTaskPatches: This PR's diff includes taskStrategy.ValidateShardTaskPatches() calls in batchsandbox_controller.go but the method is only added to the TaskSchedulingStrategy interface in PR #1049. Merging this PR before #1049 will cause a compilation error. This PR depends on #1049. Please merge #1049 first, or I can clean up this branch to remove the contaminated controller changes so the two PRs are fully independent.

Copilot – 5ms base delay too aggressive: Valid. 5ms is very short and can generate burst retries under transient failures. Will change to 100ms base / 5s max to be less aggressive while still being responsive.

Copilot – RateLimiter type mismatch: Will verify against the actual controller-runtime version in go.mod and use the typed workqueue.TypedRateLimiter[reconcile.Request] if required.

Copilot – Status requeue condition too broad: Valid. view.status.Replicas > 0 fires the 5s requeue for any Pending object with replicas even before pods are scheduled. Will narrow to Replicas > 0 && Ready == 0 (pods exist but none ready yet) to avoid unnecessary requeue churn.

Copilot – Test API call inefficiency: Will hoist the BatchSandbox fetch outside the pod loop in the Eventually callback.

- Remove ValidateShardTaskPatches contamination (belongs in opensandbox-group#1049)
- Change rate limiter base delay 5ms -> 100ms to reduce retry burst
- Narrow status requeue guard to Replicas>0 && Ready==0 to avoid
  unnecessary reconciles before pods are scheduled

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f12397d4d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}).
WithOptions(controller.Options{
MaxConcurrentReconciles: maxConcurrentReconciles,
RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](100*time.Millisecond, 5*time.Second),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the global retry throttle

When many different BatchSandbox reconciles hit transient API errors at once, this replaces controller-runtime's default MaxOfRateLimiter with a per-item exponential limiter only, so each failed sandbox can be requeued after 100ms/200ms/etc. without the default global token-bucket backpressure. In the bulk-create/429 scenario described in the comment, this can synchronize retries and increase API-server pressure instead of relieving it; keep the lower per-item cap but wrap it with an overall bucket limiter like the default does.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BatchSandbox Ready 状态明显滞后于 Pod Ready

2 participants