fix: prefer active job pods when selecting by label #920
Welcome to AICR, @immanuwell! Thanks for your first pull request. A maintainer will review this soon.
📝 Walkthrough
This PR enhances pod selection in
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/k8s/pod/find_test.go`:
- Around line 76-114: Add a test case in the table in pkg/k8s/pod/find_test.go
that covers the new branch which skips pods with DeletionTimestamp != nil:
create a corev1.Pod in the objects slice with ObjectMeta.DeletionTimestamp set
(e.g., metav1.Unix(...)) and another active pod (Status.Phase =
corev1.PodRunning) with a newer CreationTimestamp, then set wantName to the
active pod's name; ensure the test case name and wantName assert that the
terminating pod is ignored.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 9bf3cf8d-8c73-4781-bb29-0af563583140
📒 Files selected for processing (3)
- pkg/k8s/pod/consts.go
- pkg/k8s/pod/find.go
- pkg/k8s/pod/find_test.go
{
	name: "prefers youngest active pod over stale failed pod",
	objects: []runtime.Object{
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:              "validator-job-aaa-stale",
				Namespace:         ns,
				Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
				CreationTimestamp: metav1.Unix(10, 0),
			},
			Status: corev1.PodStatus{Phase: corev1.PodFailed},
		},
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:              "validator-job-zzz-current",
				Namespace:         ns,
				Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
				CreationTimestamp: metav1.Unix(20, 0),
			},
			Status: corev1.PodStatus{Phase: corev1.PodRunning},
		},
	},
	wantName: "validator-job-zzz-current",
},
{
	name: "falls back to failed pod when no active pod exists",
	objects: []runtime.Object{
		&corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:              "validator-job-failed",
				Namespace:         ns,
				Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
				CreationTimestamp: metav1.Unix(30, 0),
			},
			Status: corev1.PodStatus{Phase: corev1.PodFailed},
		},
	},
	wantName: "validator-job-failed",
},
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Add a terminating-pod test case for the new selection branch.
The new logic explicitly skips pods with DeletionTimestamp != nil, but this branch is not covered in the table yet.
Proposed test-case addition
 		{
 			name: "falls back to failed pod when no active pod exists",
 			objects: []runtime.Object{
 				&corev1.Pod{
 					ObjectMeta: metav1.ObjectMeta{
 						Name:              "validator-job-failed",
 						Namespace:         ns,
 						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
 						CreationTimestamp: metav1.Unix(30, 0),
 					},
 					Status: corev1.PodStatus{Phase: corev1.PodFailed},
 				},
 			},
 			wantName: "validator-job-failed",
 		},
+		{
+			name: "ignores terminating pods and returns NotFound when none remain",
+			objects: []runtime.Object{
+				&corev1.Pod{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:              "validator-job-terminating",
+						Namespace:         ns,
+						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
+						CreationTimestamp: metav1.Unix(40, 0),
+						DeletionTimestamp: &metav1.Time{Time: metav1.Unix(41, 0).Time},
+					},
+					Status: corev1.PodStatus{Phase: corev1.PodRunning},
+				},
+			},
+			wantErr:  true,
+			wantCode: errors.ErrCodeNotFound,
+		},
 	}
As per coding guidelines: "Explicitly test error conditions and edge cases."
Add the terminating-pod test case that CodeRabbit flagged; it's the one real coverage gap in the PR as written.
Summary
Pick the best pod for a Job instead of whatever pod happens to be first in the list. Tiny fix, but it avoids a stale failed pod winning the race.
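For readers without the diff open, here is a minimal sketch of the selection order this PR describes: skip terminating pods, prefer the newest non-failed pod, and fall back to the newest failed pod. It is not the code in `pkg/k8s/pod/find.go`. `Pod`, `PodPhase`, `newPod`, and `selectPod` are hypothetical stand-ins that use only the standard library, reduced to the fields the rule needs. The real function works on `corev1.Pod` values.

```go
package main

import (
	"fmt"
	"time"
)

// PodPhase and Pod are hypothetical stand-ins for the corev1 types,
// trimmed to the fields the selection rule looks at.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

type Pod struct {
	Name              string
	CreationTimestamp time.Time
	DeletionTimestamp *time.Time // non-nil means the pod is terminating
	Phase             PodPhase
}

// newPod builds an example pod created at createdSec (Unix seconds).
func newPod(name string, createdSec int64, phase PodPhase, terminating bool) Pod {
	p := Pod{Name: name, CreationTimestamp: time.Unix(createdSec, 0), Phase: phase}
	if terminating {
		t := time.Unix(createdSec+1, 0)
		p.DeletionTimestamp = &t
	}
	return p
}

// selectPod skips terminating pods, prefers the newest non-failed pod,
// and falls back to the newest failed pod. The bool is false when no
// candidate is left.
func selectPod(pods []Pod) (Pod, bool) {
	var active, failed *Pod
	for i := range pods {
		p := &pods[i]
		if p.DeletionTimestamp != nil {
			continue // terminating pods are never candidates
		}
		slot := &active
		if p.Phase == PodFailed {
			slot = &failed
		}
		// Keep the newest pod seen so far in each bucket.
		if *slot == nil || p.CreationTimestamp.After((*slot).CreationTimestamp) {
			*slot = p
		}
	}
	switch {
	case active != nil:
		return *active, true
	case failed != nil:
		return *failed, true
	}
	return Pod{}, false
}

func main() {
	pods := []Pod{
		newPod("validator-job-aaa-stale", 10, PodFailed, false),
		newPod("validator-job-zzz-current", 20, PodRunning, false),
	}
	if p, ok := selectPod(pods); ok {
		fmt.Println(p.Name) // prints "validator-job-zzz-current"
	}
}
```

Keeping two buckets means a failed pod is returned only when no non-failed candidate exists, so failed-job result extraction still has a pod to read from.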
Motivation / Context
Kubernetes can have multiple pods with the same `batch.kubernetes.io/job-name` label while Jobs replace pods or old pods are still around. Returning `pods.Items[0]` is a footgun: a failed/old pod can beat the current one, bruh.
Repro before the fix:
`go test ./pkg/k8s/pod -run TestGetPodForJob/prefers -count=1`
Fixes: N/A
Related: #881
Type of Change
Component(s) Affected
- cmd/aicr, pkg/cli
- cmd/aicrd, pkg/api, pkg/server
- pkg/recipe
- pkg/bundler, pkg/component/*
- pkg/collector, pkg/snapshotter
- pkg/validator
- pkg/errors, pkg/k8s
- docs/, examples/
Implementation Notes
`GetPodForJob` now skips terminating pods, prefers the newest non-failed pod, and falls back to the newest failed pod. That keeps failed-job result extraction working, no drama.
Testing
Also tried `make qualify`; it currently stops at repo-wide coverage 74.6% vs threshold 75%. The affected package is 79.3%, and the remaining qualify targets pass when run directly.
Risk Assessment
Rollout notes: N/A
Checklist
- `make test` with `-race`
- `make lint`
- `git commit -S`