Skip to content

fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted#953

Open
longsuizhi wants to merge 3 commits into
opensandbox-group:mainfrom
longsuizhi:fix/pool-rebind-deleted-pod
Open

fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted#953
longsuizhi wants to merge 3 commits into
opensandbox-group:mainfrom
longsuizhi:fix/pool-rebind-deleted-pod

Conversation

@longsuizhi

@longsuizhi longsuizhi commented May 28, 2026

Copy link
Copy Markdown
Contributor

Fixes #954

Problem

In Pool mode, when a Pod allocated to a BatchSandbox is externally deleted (manual delete, node eviction, OOM Kill, etc.), the BatchSandbox never receives a replacement Pod from the Pool, leaving the sandbox permanently unavailable.

Root Cause

getSandboxRequest in allocator.go computes the supplement from the alloc-status annotation:

supplement = replicas - len(allocated)

After a Pod is deleted, its name remains in the annotation, so len(allocated) stays unchanged, supplement is always 0, and no re-allocation is triggered.

Fix

Add liveness detection in getSandboxRequest:

  1. Build a live Pod set (non-terminal, non-terminating) from the Pool and pass it into getSandboxRequest
  2. Compare Pods in alloc-status against actually existing Pods, categorize into live / dead / terminal
  3. Exclude dead and terminal Pods from effective allocation count (so supplement > 0 triggers re-allocation)
  4. Queue dead Pods into ToRelease to clean up allocation records
  5. Released Pods that no longer exist do not occupy slots, preventing them from blocking subsequent re-allocations
  6. Terminal Pods (Failed/Succeeded) and terminating Pods (DeletionTimestamp set) are queued for release to ensure alloc-status cleanup and finalizer removal
  7. Running but transiently unready Pods are still treated as live to avoid unnecessary sandbox churn
  8. Deduplicate dead/terminal Pods against existing alloc-release entries to prevent duplicate recycle operations

Test Results

All unit tests pass (including 3 new cases):

--- PASS: TestSchedule/allocate_normally_-_2_pods_for_2_sandboxes
--- PASS: TestSchedule/skip_non-running_pods
--- PASS: TestSchedule/partial_allocated_-_allocate_remaining
--- PASS: TestSchedule/with_release_-_pods_to_release
--- PASS: TestSchedule/partial_release_-_only_unreleased_pods_in_ToRelease
--- PASS: TestSchedule/not_enough_pods_-_PodSupplement_>_0
--- PASS: TestSchedule/skip_already_allocated_pod_-_other_sandbox_gets_supplement
--- PASS: TestSchedule/terminating_sandbox_-_queue_unreleased_pods_for_release,_no_supplement
--- PASS: TestSchedule/dead_pod_triggers_re-allocation_-_allocated_pod_deleted_externally (new)
--- PASS: TestSchedule/released_pod_does_not_block_second_re-allocation (new)
--- PASS: TestSchedule/terminal_pod_(Failed)_triggers_re-allocation (new)
--- PASS: TestSchedule/orphan_sandbox_-_pods_in_store_but_sandbox_no_longer_in_spec
PASS
ok  github.com/alibaba/OpenSandbox/sandbox-k8s/internal/controller  1.173s

@CLAassistant

CLAassistant commented May 28, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@longsuizhi longsuizhi changed the title fix: Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod fix(allocator): Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod May 28, 2026
@longsuizhi longsuizhi changed the title fix(allocator): Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted May 28, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e567cc496f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4b8e95d88

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

  • component/k8s ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • documentation - Improvements or additions to documentation
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @longsuizhi

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c9e7537fe

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go Outdated
Comment thread kubernetes/internal/controller/allocator.go Outdated
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch 2 times, most recently from 53be88c to 56f2ed0 Compare May 28, 2026 08:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56f2ed04a4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
Comment thread kubernetes/internal/controller/allocator.go Outdated
Comment thread kubernetes/internal/controller/allocator.go Outdated
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch from 56f2ed0 to ecdd9ea Compare May 28, 2026 08:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecdd9eadef

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go Outdated
Comment thread kubernetes/internal/controller/allocator.go
Comment thread kubernetes/internal/controller/allocator.go Outdated
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch from ecdd9ea to 20956e7 Compare May 28, 2026 09:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20956e73ce

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
当 Pool 中已分配给 BatchSandbox 的 Pod 被外部删除时,alloc-status 注解中仍保留
已删除 Pod 的名称,导致 supplement 计算为 0,无法触发重新分配。

本次修复在 getSandboxRequest 中增加了存活检测:将已删除的 Pod 从有效分配中
排除并加入 ToRelease 队列,使 supplement > 0 从而触发 Pool 重新分配新 Pod。
…ods from live allocation

- P1: Released pods no longer count toward liveAllocated, preventing
  stale released entries from blocking subsequent re-allocations.
- P2: Only Running+Ready pods are added to livePodSet, so terminal
  pods (Failed/Evicted) that still have their object present also
  trigger re-allocation.
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch from 20956e7 to 46d6222 Compare June 1, 2026 09:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46d6222c71

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted

2 participants