fix: prefer active job pods when selecting by label by immanuwell · Pull Request #920 · NVIDIA/aicr

immanuwell · 2026-05-15T14:22:11Z

Summary

Pick the best pod for a Job instead of whatever pod happens to be first in the list. Tiny fix, but it avoids a stale failed pod winning the race.

Motivation / Context

Kubernetes can have multiple pods with the same batch.kubernetes.io/job-name label while Jobs replace pods or old pods are still around. Returning pods.Items[0] is a footgun: a failed/old pod can beat the current one, bruh.

Repro before the fix:

go test ./pkg/k8s/pod -run TestGetPodForJob/prefers -count=1

Fixes: N/A
Related: #881

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: ____________

Implementation Notes

GetPodForJob now skips terminating pods, prefers the newest non-failed pod, and falls back to the newest failed pod. That keeps failed-job result extraction working, no drama.

Testing

go test ./pkg/k8s/pod -run TestGetPodForJob -count=1
go test ./pkg/k8s/pod -count=1
go test -race ./pkg/k8s/pod -count=1
go test -cover ./pkg/k8s/pod
GOFLAGS="-mod=vendor" golangci-lint -c .golangci.yaml run --timeout=10m ./pkg/k8s/pod
make test
make lint
make e2e
make scan
make license-check

Also tried make qualify; it currently stops at repo-wide coverage 74.6% vs threshold 75%. The affected package is 79.3%, and the remaining qualify targets pass when run directly.

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

copy-pr-bot · 2026-05-15T14:22:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-15T14:22:27Z

Welcome to AICR, @immanuwell! Thanks for your first pull request.

Before review, please ensure:

All commits are signed off per the DCO
CI checks pass (tests, lint, security scan)
The PR description explains the why behind your changes

A maintainer will review this soon.

coderabbitai · 2026-05-15T14:25:23Z

📝 Walkthrough

Walkthrough

This PR enhances pod selection in GetPodForJob by replacing first-match logic with intelligent selection based on pod phase and recency. The implementation now prefers the newest non-failed pod when available, falls back to the newest failed pod otherwise, and returns ErrCodeNotFound when no non-terminating pods exist. Helper functions selectPodForJob and isNewerPod encapsulate the selection and timestamp comparison logic. A new logging constant keyJob supports the implementation, and new test cases validate the selection behavior across multiple scenarios.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: improving pod selection to prefer active job pods over stale ones.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the motivation, implementation approach, testing performed, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/k8s/pod/find_test.go`:
- Around line 76-114: Add a test case in the table in pkg/k8s/pod/find_test.go
that covers the new branch which skips pods with DeletionTimestamp != nil:
create a corev1.Pod in the objects slice with ObjectMeta.DeletionTimestamp set
(e.g., metav1.Unix(...)) and another active pod (Status.Phase =
corev1.PodRunning) with a newer CreationTimestamp, then set wantName to the
active pod's name; ensure the test case name and wantName assert that the
terminating pod is ignored.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9bf3cf8d-8c73-4781-bb29-0af563583140

📥 Commits

Reviewing files that changed from the base of the PR and between 9776d9f and df56d97.

📒 Files selected for processing (3)

pkg/k8s/pod/consts.go
pkg/k8s/pod/find.go
pkg/k8s/pod/find_test.go

coderabbitai · 2026-05-15T14:25:27Z

+		{
+			name: "prefers youngest active pod over stale failed pod",
+			objects: []runtime.Object{
+				&corev1.Pod{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:              "validator-job-aaa-stale",
+						Namespace:         ns,
+						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
+						CreationTimestamp: metav1.Unix(10, 0),
+					},
+					Status: corev1.PodStatus{Phase: corev1.PodFailed},
+				},
+				&corev1.Pod{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:              "validator-job-zzz-current",
+						Namespace:         ns,
+						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
+						CreationTimestamp: metav1.Unix(20, 0),
+					},
+					Status: corev1.PodStatus{Phase: corev1.PodRunning},
+				},
+			},
+			wantName: "validator-job-zzz-current",
+		},
+		{
+			name: "falls back to failed pod when no active pod exists",
+			objects: []runtime.Object{
+				&corev1.Pod{
+					ObjectMeta: metav1.ObjectMeta{
+						Name:              "validator-job-failed",
+						Namespace:         ns,
+						Labels:            map[string]string{"batch.kubernetes.io/job-name": jobName},
+						CreationTimestamp: metav1.Unix(30, 0),
+					},
+					Status: corev1.PodStatus{Phase: corev1.PodFailed},
+				},
+			},
+			wantName: "validator-job-failed",
+		},


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a terminating-pod test case for the new selection branch.

The new logic explicitly skips pods with DeletionTimestamp != nil, but this branch is not covered in the table yet.

Proposed test-case addition

{ name: "falls back to failed pod when no active pod exists", objects: []runtime.Object{ &corev1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: "validator-job-failed", Namespace: ns, Labels: map[string]string{"batch.kubernetes.io/job-name": jobName}, CreationTimestamp: metav1.Unix(30, 0), }, Status: corev1.PodStatus{Phase: corev1.PodFailed}, }, }, wantName: "validator-job-failed", }, + { + name: "ignores terminating pods and returns NotFound when none remain", + objects: []runtime.Object{ + &corev1.Pod{ + ObjectMeta: metav1.ObjectMeta{ + Name: "validator-job-terminating", + Namespace: ns, + Labels: map[string]string{"batch.kubernetes.io/job-name": jobName}, + CreationTimestamp: metav1.Unix(40, 0), + DeletionTimestamp: &metav1.Time{Time: metav1.Unix(41, 0).Time}, + }, + Status: corev1.PodStatus{Phase: corev1.PodRunning}, + }, + }, + wantErr: true, + wantCode: errors.ErrCodeNotFound, + }, }

As per coding guidelines: "Explicitly test error conditions and edge cases."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/k8s/pod/find_test.go` around lines 76 - 114, Add a test case in the table in pkg/k8s/pod/find_test.go that covers the new branch which skips pods with DeletionTimestamp != nil: create a corev1.Pod in the objects slice with ObjectMeta.DeletionTimestamp set (e.g., metav1.Unix(...)) and another active pod (Status.Phase = corev1.PodRunning) with a newer CreationTimestamp, then set wantName to the active pod's name; ensure the test case name and wantName assert that the terminating pod is ignored.

xdu31 · 2026-05-15T20:31:55Z

Add the terminating-pod test case that CodeRabbit flagged — it's the one real coverage gap in the PR as written.

fix: prefer active job pods when selecting by label

df56d97

immanuwell requested a review from a team as a code owner May 15, 2026 14:22

github-actions Bot added the size/M label May 15, 2026

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

mchmarny assigned immanuwell May 15, 2026

xdu31 self-requested a review May 15, 2026 22:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prefer active job pods when selecting by label#920

fix: prefer active job pods when selecting by label#920
immanuwell wants to merge 1 commit into
NVIDIA:mainfrom
immanuwell:fix/job-pod-selection

immanuwell commented May 15, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 15, 2026

Uh oh!

xdu31 commented May 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

immanuwell commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

copy-pr-bot Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

xdu31 commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

immanuwell commented May 15, 2026 •

edited

Loading

xdu31 commented May 15, 2026 •

edited

Loading