Skip to content

fix(allocator): rebind replacement Pod when allocated Pod is deleted externally#1048

Open
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-954-v2
Open

fix(allocator): rebind replacement Pod when allocated Pod is deleted externally#1048
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-954-v2

Conversation

@ashishpatel26

Copy link
Copy Markdown

Summary

In Pool mode, when an allocated Pod is deleted externally (manual delete, node eviction, OOM kill), the BatchSandbox never recovers.

Root cause: getSandboxRequest computed supplement = replicas - len(allocated) where allocated came from the alloc-status annotation — stale pod names stayed in the annotation after deletion, keeping supplement = 0.

A TODO comment in the source confirms this gap:

// TODO consider supply Pods if Pods is deleted unexpectedly

Fix: build a livePodSet from spec.Pods (the pool's live pod list) and count only pods present in that set as effectively allocated. Stale annotation entries are excluded, making supplement > 0 and triggering re-allocation.

Test plan

  • go test ./internal/controller/... -run TestSchedule — new table case: allocated pod absent from live set → supplement = 1, replacement scheduled

Fixes #954

…externally

When a Pod recorded in the alloc-status annotation no longer exists in the
pool's live Pod set (manual delete, node eviction, OOM kill), exclude it from
the effective-allocation count so that supplement > 0 and the pool schedules
a replacement. Previously, stale annotation entries kept supplement at zero,
leaving the BatchSandbox permanently unavailable.

Fixes opensandbox-group#954
Copilot AI review requested due to automatic review settings June 12, 2026 17:39
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

  • component/k8s ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • documentation - Improvements or additions to documentation
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ashishpatel26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a scheduling edge case where sandboxes could fail to replace pods that were deleted externally but still recorded in allocation annotations/state, by excluding stale allocation entries from the effective allocation count.

Changes:

  • Compute a live-pod name set from the pool’s current pod list and thread it through request construction.
  • Recompute sandbox “effective allocated” based on live pods so supplement triggers replacement when allocations are stale.
  • Add a regression test covering externally deleted allocated pods (issue #954).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
kubernetes/internal/controller/allocator.go Builds/uses livePodSet to ignore stale alloc-status pod names when computing supplement.
kubernetes/internal/controller/allocator_test.go Adds regression test ensuring replacements are scheduled when allocated pods were deleted externally.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +601 to +603
log.Info("Detected stale allocation entry: pod no longer exists, excluding from effective count",
"sandbox", sandbox.Name, "pod", p)
}
Comment on lines +592 to 609
// Count only pods that still exist in the live pool pod set.
// Pods deleted externally (eviction, OOM kill, manual delete) remain in the alloc-status
// annotation but are absent from livePodSet; excluding them makes supplement > 0 so the
// pool schedules a replacement (fixes issue #954).
effectiveAllocated := int32(0)
for _, p := range allocated {
if _, alive := livePodSet[p]; alive {
effectiveAllocated++
} else {
log.Info("Detected stale allocation entry: pod no longer exists, excluding from effective count",
"sandbox", sandbox.Name, "pod", p)
}
}

supplement := int32(0)
if replica-int32(len(allocated)) > 0 {
supplement = replica - int32(len(allocated))
if replica-effectiveAllocated > 0 {
supplement = replica - effectiveAllocated
}

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 546af02017

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +607 to +608
if replica-effectiveAllocated > 0 {
supplement = replica - effectiveAllocated

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop stale pods before allocating replacements

When an allocated pod disappears and a replacement is scheduled, this path only lowers the supplement count; it still returns the original allocated slice, so getLatestAllocated appends the new pod to the stale annotation and SyncSandboxAllocation writes both the deleted pod and the replacement back into the in-memory pool allocation. In the inspected pool reconcile flow, LatestAllocation is then used for allocatedCnt and status, so a one-replica sandbox whose old pod was deleted can permanently look like it has two allocated pods and can over-scale/retain extra idle pods. The stale names should be removed from the allocation being persisted or released from the store when they are detected.

Useful? React with 👍 / 👎.

@ashishpatel26

Copy link
Copy Markdown
Author

Thanks for the reviews. Addressing feedback:

Codex P2 / Copilot – Return filtered liveAllocated slice: Valid. The current code computes effectiveAllocated count correctly but still returns the original (unfiltered) allocated slice to the caller. Downstream code that iterates allocated will see stale pod names. Will replace allocated with a filtered slice containing only pods present in livePodSet.

Copilot – Log verbosity for stale pods: Valid in busy clusters. Will downgrade the per-pod stale log to log.V(1).Info and emit a single summary log at Info level (e.g. "filtered N stale allocated pods") per reconcile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted

2 participants