Skip to content

fix: prefer active job pods when selecting by label#920

Open
immanuwell wants to merge 1 commit into
NVIDIA:mainfrom
immanuwell:fix/job-pod-selection
Open

fix: prefer active job pods when selecting by label#920
immanuwell wants to merge 1 commit into
NVIDIA:mainfrom
immanuwell:fix/job-pod-selection

Conversation

@immanuwell
Copy link
Copy Markdown

@immanuwell immanuwell commented May 15, 2026

Summary

Pick the best pod for a Job instead of whatever pod happens to be first in the list. Tiny fix, but it avoids a stale failed pod winning the race.

Motivation / Context

Kubernetes can have multiple pods with the same batch.kubernetes.io/job-name label while Jobs replace pods or old pods are still around. Returning pods.Items[0] is a footgun: a failed/old pod can beat the current one, bruh.

Repro before the fix:

go test ./pkg/k8s/pod -run TestGetPodForJob/prefers -count=1

Fixes: N/A
Related: #881

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

GetPodForJob now skips terminating pods, prefers the newest non-failed pod, and falls back to the newest failed pod. That keeps failed-job result extraction working, no drama.

Testing

go test ./pkg/k8s/pod -run TestGetPodForJob -count=1
go test ./pkg/k8s/pod -count=1
go test -race ./pkg/k8s/pod -count=1
go test -cover ./pkg/k8s/pod
GOFLAGS="-mod=vendor" golangci-lint -c .golangci.yaml run --timeout=10m ./pkg/k8s/pod
make test
make lint
make e2e
make scan
make license-check

Also tried make qualify; it currently stops at repo-wide coverage 74.6% vs threshold 75%. The affected package is 79.3%, and the remaining qualify targets pass when run directly.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@immanuwell immanuwell requested a review from a team as a code owner May 15, 2026 14:22
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown
Contributor

Welcome to AICR, @immanuwell! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR enhances pod selection in GetPodForJob by replacing first-match logic with intelligent selection based on pod phase and recency. The implementation now prefers the newest non-failed pod when available, falls back to the newest failed pod otherwise, and returns ErrCodeNotFound when no non-terminating pods exist. Helper functions selectPodForJob and isNewerPod encapsulate the selection and timestamp comparison logic. A new logging constant keyJob supports the implementation, and new test cases validate the selection behavior across multiple scenarios.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: improving pod selection to prefer active job pods over stale ones.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the motivation, implementation approach, testing performed, and risk assessment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/k8s/pod/find_test.go`:
- Around line 76-114: Add a test case in the table in pkg/k8s/pod/find_test.go
that covers the new branch which skips pods with DeletionTimestamp != nil:
create a corev1.Pod in the objects slice with ObjectMeta.DeletionTimestamp set
(e.g., metav1.Unix(...)) and another active pod (Status.Phase =
corev1.PodRunning) with a newer CreationTimestamp, then set wantName to the
active pod's name; ensure the test case name and wantName assert that the
terminating pod is ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9bf3cf8d-8c73-4781-bb29-0af563583140

📥 Commits

Reviewing files that changed from the base of the PR and between 9776d9f and df56d97.

📒 Files selected for processing (3)
  • pkg/k8s/pod/consts.go
  • pkg/k8s/pod/find.go
  • pkg/k8s/pod/find_test.go

Comment thread pkg/k8s/pod/find_test.go
Comment on lines +76 to +114
{
name: "prefers youngest active pod over stale failed pod",
objects: []runtime.Object{
&corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "validator-job-aaa-stale",
Namespace: ns,
Labels: map[string]string{"batch.kubernetes.io/job-name": jobName},
CreationTimestamp: metav1.Unix(10, 0),
},
Status: corev1.PodStatus{Phase: corev1.PodFailed},
},
&corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "validator-job-zzz-current",
Namespace: ns,
Labels: map[string]string{"batch.kubernetes.io/job-name": jobName},
CreationTimestamp: metav1.Unix(20, 0),
},
Status: corev1.PodStatus{Phase: corev1.PodRunning},
},
},
wantName: "validator-job-zzz-current",
},
{
name: "falls back to failed pod when no active pod exists",
objects: []runtime.Object{
&corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "validator-job-failed",
Namespace: ns,
Labels: map[string]string{"batch.kubernetes.io/job-name": jobName},
CreationTimestamp: metav1.Unix(30, 0),
},
Status: corev1.PodStatus{Phase: corev1.PodFailed},
},
},
wantName: "validator-job-failed",
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a terminating-pod test case for the new selection branch.

The new logic explicitly skips pods with DeletionTimestamp != nil, but this branch is not covered in the table yet.

Proposed test-case addition
 		{
 			name: "falls back to failed pod when no active pod exists",
 			objects: []runtime.Object{
 				&corev1.Pod{
 					ObjectMeta: metav1.ObjectMeta{
 						Name:              "validator-job-failed",
 						Namespace:         ns,
 						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
 						CreationTimestamp: metav1.Unix(30, 0),
 					},
 					Status: corev1.PodStatus{Phase: corev1.PodFailed},
 				},
 			},
 			wantName: "validator-job-failed",
 		},
+		{
+			name: "ignores terminating pods and returns NotFound when none remain",
+			objects: []runtime.Object{
+				&corev1.Pod{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:              "validator-job-terminating",
+						Namespace:         ns,
+						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
+						CreationTimestamp: metav1.Unix(40, 0),
+						DeletionTimestamp: &metav1.Time{Time: metav1.Unix(41, 0).Time},
+					},
+					Status: corev1.PodStatus{Phase: corev1.PodRunning},
+				},
+			},
+			wantErr:  true,
+			wantCode: errors.ErrCodeNotFound,
+		},
 	}

As per coding guidelines: "Explicitly test error conditions and edge cases."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/k8s/pod/find_test.go` around lines 76 - 114, Add a test case in the table
in pkg/k8s/pod/find_test.go that covers the new branch which skips pods with
DeletionTimestamp != nil: create a corev1.Pod in the objects slice with
ObjectMeta.DeletionTimestamp set (e.g., metav1.Unix(...)) and another active pod
(Status.Phase = corev1.PodRunning) with a newer CreationTimestamp, then set
wantName to the active pod's name; ensure the test case name and wantName assert
that the terminating pod is ignored.

@xdu31
Copy link
Copy Markdown
Contributor

xdu31 commented May 15, 2026

Add the terminating-pod test case that CodeRabbit flagged — it's the one real coverage gap in the PR as written.

@xdu31 xdu31 self-requested a review May 15, 2026 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants