fix(allocator): rebind replacement Pod when allocated Pod is deleted externally by ashishpatel26 · Pull Request #1048 · opensandbox-group/OpenSandbox

ashishpatel26 · 2026-06-12T17:39:43Z

Summary

In Pool mode, when an allocated Pod is deleted externally (manual delete, node eviction, OOM kill), the BatchSandbox never recovers.

Root cause: getSandboxRequest computed supplement = replicas - len(allocated) where allocated came from the alloc-status annotation — stale pod names stayed in the annotation after deletion, keeping supplement = 0.

A TODO comment in the source confirms this gap:

// TODO consider supply Pods if Pods is deleted unexpectedly

Fix: build a livePodSet from spec.Pods (the pool's live pod list) and count only pods present in that set as effectively allocated. Stale annotation entries are excluded, making supplement > 0 and triggering re-allocation.

Test plan

go test ./internal/controller/... -run TestSchedule — new table case: allocated pod absent from live set → supplement = 1, replacement scheduled

Fixes #954

…externally When a Pod recorded in the alloc-status annotation no longer exists in the pool's live Pod set (manual delete, node eviction, OOM kill), exclude it from the effective-allocation count so that supplement > 0 and the pool schedules a replacement. Previously, stale annotation entries kept supplement at zero, leaving the BatchSandbox permanently unavailable. Fixes opensandbox-group#954

github-actions · 2026-06-12T17:39:54Z

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

component/k8s ⬅️

Other available labels:

bug - Something isn't working
dependencies - Pull requests that update a dependency file
documentation - Improvements or additions to documentation
feature - New feature or request
packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @ashishpatel26

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a scheduling edge case where sandboxes could fail to replace pods that were deleted externally but still recorded in allocation annotations/state, by excluding stale allocation entries from the effective allocation count.

Changes:

Compute a live-pod name set from the pool’s current pod list and thread it through request construction.
Recompute sandbox “effective allocated” based on live pods so supplement triggers replacement when allocations are stale.
Add a regression test covering externally deleted allocated pods (issue #954).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
kubernetes/internal/controller/allocator.go	Builds/uses `livePodSet` to ignore stale alloc-status pod names when computing supplement.
kubernetes/internal/controller/allocator_test.go	Adds regression test ensuring replacements are scheduled when allocated pods were deleted externally.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+			log.Info("Detected stale allocation entry: pod no longer exists, excluding from effective count",
+				"sandbox", sandbox.Name, "pod", p)
+		}


+	// Count only pods that still exist in the live pool pod set.
+	// Pods deleted externally (eviction, OOM kill, manual delete) remain in the alloc-status
+	// annotation but are absent from livePodSet; excluding them makes supplement > 0 so the
+	// pool schedules a replacement (fixes issue #954).
+	effectiveAllocated := int32(0)
+	for _, p := range allocated {
+		if _, alive := livePodSet[p]; alive {
+			effectiveAllocated++
+		} else {
+			log.Info("Detected stale allocation entry: pod no longer exists, excluding from effective count",
+				"sandbox", sandbox.Name, "pod", p)
+		}
+	}
+
 	supplement := int32(0)
-	if replica-int32(len(allocated)) > 0 {
-		supplement = replica - int32(len(allocated))
+	if replica-effectiveAllocated > 0 {
+		supplement = replica - effectiveAllocated
 	}


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 546af02017

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T17:42:57Z

+	if replica-effectiveAllocated > 0 {
+		supplement = replica - effectiveAllocated


Drop stale pods before allocating replacements

When an allocated pod disappears and a replacement is scheduled, this path only lowers the supplement count; it still returns the original allocated slice, so getLatestAllocated appends the new pod to the stale annotation and SyncSandboxAllocation writes both the deleted pod and the replacement back into the in-memory pool allocation. In the inspected pool reconcile flow, LatestAllocation is then used for allocatedCnt and status, so a one-replica sandbox whose old pod was deleted can permanently look like it has two allocated pods and can over-scale/retain extra idle pods. The stale names should be removed from the allocation being persisted or released from the store when they are detected.

Useful? React with 👍 / 👎.

ashishpatel26 · 2026-06-12T17:56:47Z

Thanks for the reviews. Addressing feedback:

Codex P2 / Copilot – Return filtered liveAllocated slice: Valid. The current code computes effectiveAllocated count correctly but still returns the original (unfiltered) allocated slice to the caller. Downstream code that iterates allocated will see stale pod names. Will replace allocated with a filtered slice containing only pods present in livePodSet.

Copilot – Log verbosity for stale pods: Valid in busy clusters. Will downgrade the per-pod stale log to log.V(1).Info and emit a single summary log at Info level (e.g. "filtered N stale allocated pods") per reconcile.

…ation entries

Copilot AI review requested due to automatic review settings June 12, 2026 17:39

ashishpatel26 requested review from Generalwin, Pangjiping, Spground, fengcone, hittyt, kevinlynx and ninan-nn as code owners June 12, 2026 17:39

Copilot AI reviewed Jun 12, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

fix(allocator): return filtered live-pod slice to prevent stale alloc…

6dd36af

…ation entries

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(allocator): rebind replacement Pod when allocated Pod is deleted externally#1048

fix(allocator): rebind replacement Pod when allocated Pod is deleted externally#1048
ashishpatel26 wants to merge 2 commits into
opensandbox-group:mainfrom
ashishpatel26:fix/issue-954-v2

ashishpatel26 commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

ashishpatel26 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if replica-effectiveAllocated > 0 {
		supplement = replica - effectiveAllocated

Conversation

ashishpatel26 commented Jun 12, 2026

Summary

Test plan

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

ashishpatel26 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants