fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted #953
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e567cc496f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Reviewed commit: d4b8e95d88
Changed directories: kubernetes.
Reviewed commit: 5c9e7537fe
53be88c to 56f2ed0
Reviewed commit: 56f2ed04a4
56f2ed0 to ecdd9ea
Reviewed commit: ecdd9eadef
ecdd9ea to 20956e7
Reviewed commit: 20956e73ce
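When a Pod in the Pool that is already allocated to a BatchSandbox is deleted externally, the alloc-status annotation still keeps the deleted Pod's name, so supplement is computed as 0 and re-allocation is never triggered. This fix adds a liveness check in getSandboxRequest: deleted Pods are excluded from the valid allocation and added to the ToRelease queue, so that supplement > 0 and the Pool allocates a new Pod.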
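The effect of the liveness check can be illustrated with a minimal sketch. It assumes that supplement is the desired replica count minus the number of allocated Pods that still exist; the exact expression in allocator.go is not shown on this page. The names sandboxRequest and buildRequest are hypothetical and are not the project's API.

```go
package main

import "fmt"

// sandboxRequest is a hypothetical stand-in for the result of getSandboxRequest.
type sandboxRequest struct {
	Supplement int      // number of new Pods to request from the Pool
	ToRelease  []string // allocation records that should be cleaned up
}

// buildRequest counts only allocated Pods that still exist.
// Assumption: supplement = replicas - live allocated Pods.
func buildRequest(replicas int, allocated []string, existing map[string]bool) sandboxRequest {
	var req sandboxRequest
	live := 0
	for _, name := range allocated {
		if existing[name] {
			live++
			continue
		}
		// The Pod named in the annotation is gone: queue its record for release.
		req.ToRelease = append(req.ToRelease, name)
	}
	if replicas > live {
		req.Supplement = replicas - live
	}
	return req
}

func main() {
	allocated := []string{"pod-a", "pod-b"}    // names recorded in the annotation
	existing := map[string]bool{"pod-a": true} // pod-b was deleted externally
	req := buildRequest(2, allocated, existing)
	fmt.Println(req.Supplement, req.ToRelease) // prints: 1 [pod-b]
}
```

Without the existence check, both names would count as allocated and the supplement would stay at 0, which is the failure described above.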
…ods from live allocation
- P1: Released pods no longer count toward liveAllocated, preventing stale released entries from blocking subsequent re-allocations.
- P2: Only Running+Ready pods are added to livePodSet, so terminal pods (Failed/Evicted) that still have their object present also trigger re-allocation.
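The two points in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: podInfo is a hypothetical simplified Pod view, and the function signature is invented for the example. Only the names livePodSet and liveAllocated come from the commit message.

```go
package main

import "fmt"

// podInfo is a hypothetical, simplified view of a Pod object.
type podInfo struct {
	Name  string
	Phase string // for example "Running" or "Failed"
	Ready bool
}

// liveAllocated returns the allocated Pod names that still count as live.
func liveAllocated(allocated []string, released map[string]bool, pods []podInfo) []string {
	// P2: only Running+Ready Pods enter the live set, so a Failed or
	// Evicted Pod whose object still exists is not treated as live.
	livePodSet := make(map[string]bool)
	for _, p := range pods {
		if p.Phase == "Running" && p.Ready {
			livePodSet[p.Name] = true
		}
	}
	var live []string
	for _, name := range allocated {
		// P1: already released Pods never count toward the live allocation.
		if released[name] {
			continue
		}
		if livePodSet[name] {
			live = append(live, name)
		}
	}
	return live
}

func main() {
	allocated := []string{"pod-a", "pod-b", "pod-c", "pod-d"}
	released := map[string]bool{"pod-b": true}
	pods := []podInfo{
		{Name: "pod-a", Phase: "Running", Ready: true},
		{Name: "pod-b", Phase: "Running", Ready: true},
		{Name: "pod-c", Phase: "Failed", Ready: false},
		// pod-d has no object at all: it was deleted.
	}
	fmt.Println(liveAllocated(allocated, released, pods)) // prints: [pod-a]
}
```

In this example only pod-a counts as allocated, so a BatchSandbox that wants more than one Pod would end up with a positive supplement.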
20956e7 to 46d6222
Reviewed commit: 46d6222c71
Fixes #954
Problem
In Pool mode, when a Pod allocated to a BatchSandbox is externally deleted (manual delete, node eviction, OOM Kill, etc.), the BatchSandbox never receives a replacement Pod from the Pool, leaving the sandbox permanently unavailable.
Root Cause
getSandboxRequest in allocator.go computes the supplement from the alloc-status annotation. After a Pod is deleted, its name remains in the annotation, so len(allocated) stays unchanged, supplement is always 0, and no re-allocation is triggered.
Fix
Add liveness detection in getSandboxRequest:
- Check the Pod names in alloc-status against actually existing Pods, and categorize them into live / dead / terminal
- Count only live Pods as allocated (so that supplement > 0 triggers re-allocation)
- Add Pods that are no longer live to ToRelease to clean up allocation records
- Skip Pods that already have alloc-release entries to prevent duplicate recycle operations
Test Results
All unit tests pass (including 3 new cases):