
chore(recipes): drop NCCL perf checks from gb200-eks-inference #678

Merged
mchmarny merged 1 commit into NVIDIA:main from njhensley:chore/remove-nccl-from-gb200-eks-inference on Apr 24, 2026

@njhensley
Member

Summary

Remove the validation.performance block (NCCL NET and NVLS all-reduce bandwidth checks) from the gb200-eks-inference overlay.

Motivation / Context

The NCCL fabric health checks (nccl-all-reduce-bw-net, nccl-all-reduce-bw-nvls) are not needed by default for GB200/EKS inference deployments. Single-node serving hits the WorkerCount < 2 skip path; multi-node serving that actually crosses the fabric can opt in via its own overlay rather than paying for these checks on every inference recipe. The training sibling (gb200-eks-training) keeps them.
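The skip path mentioned above can be sketched roughly as follows. This is a minimal illustration rather than the validator's actual code: only the `WorkerCount < 2` condition comes from the description, and the names `minFabricWorkers` and `fabricCheckDecision` are hypothetical.

```go
package main

import "fmt"

// minFabricWorkers mirrors the "WorkerCount < 2" condition in the PR text.
// The constant name is hypothetical.
const minFabricWorkers = 2

// fabricCheckDecision is a hypothetical helper, not the project's API.
// It reports whether multi-node NCCL bandwidth checks would run for a
// given worker count, along with a human-readable reason.
func fabricCheckDecision(workerCount int) (run bool, reason string) {
	if workerCount < minFabricWorkers {
		return false, fmt.Sprintf("skipped: %d worker(s), need at least %d",
			workerCount, minFabricWorkers)
	}
	return true, "run: traffic crosses the fabric"
}

func main() {
	for _, n := range []int{1, 2, 8} {
		run, reason := fabricCheckDecision(n)
		fmt.Printf("workers=%d run=%t (%s)\n", n, run, reason)
	}
}
```

Under this reading, a single-node inference deployment never exercises the checks, so keeping them in the default overlay mostly adds cost for multi-node deployments that did not ask for them.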

Fixes: N/A
Related: N/A

Type of Change

  • Refactoring (no functional changes)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

  • Dropped the validation.performance block and its preamble comment from recipes/overlays/gb200-eks-inference.yaml.
  • The downstream gb200-eks-ubuntu-inference overlay inherits from this one and had no NCCL block of its own, so it picks up the removal automatically.
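The inheritance behaviour described in these notes can be illustrated with a small sketch. It assumes a "nearest declaration wins" resolution rule, which the PR does not spell out; the `Overlay` type and `EffectivePerformance` method are hypothetical stand-ins for whatever pkg/recipe actually does.

```go
package main

import "fmt"

// Overlay is a simplified, hypothetical model of a recipe overlay.
// A nil Performance slice means the overlay declares no
// validation.performance block of its own.
type Overlay struct {
	Name        string
	Base        *Overlay // parent overlay; nil for a root overlay
	Performance []string // names of performance checks declared here
}

// EffectivePerformance walks up the inheritance chain and returns the
// nearest declared performance block, or nil if none is declared.
func (o *Overlay) EffectivePerformance() []string {
	for cur := o; cur != nil; cur = cur.Base {
		if cur.Performance != nil {
			return cur.Performance
		}
	}
	return nil
}

func main() {
	inference := &Overlay{
		Name:        "gb200-eks-inference",
		Performance: []string{"nccl-all-reduce-bw-net", "nccl-all-reduce-bw-nvls"},
	}
	ubuntu := &Overlay{Name: "gb200-eks-ubuntu-inference", Base: inference}

	fmt.Println("before:", ubuntu.EffectivePerformance())

	// Model this PR: the parent no longer declares the block.
	inference.Performance = nil
	fmt.Println("after: ", ubuntu.EffectivePerformance())
}
```

Because the child declares nothing itself in this model, removing the parent's block is enough to remove the checks from both overlays.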

Testing

YAML-only change to a single overlay. No Go code touched; the registry/overlay schema still validates.

Risk Assessment

  • Low — Isolated data change, easily reverted by restoring the block.

Rollout notes: N/A — users who still want NCCL gating on GB200/EKS inference can add the block back in a private overlay.

Checklist

  • I did not skip/disable tests to make CI green
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Removes the validation.performance block (nccl-all-reduce-bw-net and
nccl-all-reduce-bw-nvls) from the gb200-eks-inference overlay. Inference
deployments on this platform don't need the NCCL fabric health gate by
default — single-node serving would skip them and multi-node serving can
opt in via its own overlay. The inheriting gb200-eks-ubuntu-inference
overlay picks up the removal automatically.
@njhensley added the enhancement (New feature or request) label on Apr 24, 2026
@njhensley requested a review from a team as a code owner on April 24, 2026 at 23:02
@njhensley self-assigned this on Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9e1e2638-30c9-48e8-b2fa-c575928c12bc

📥 Commits

Reviewing files that changed from the base of the PR and between f032582 and 575e6da.

📒 Files selected for processing (1)
  • recipes/overlays/gb200-eks-inference.yaml
💤 Files with no reviewable changes (1)
  • recipes/overlays/gb200-eks-inference.yaml

Walkthrough

The pull request removes performance validation checks from the GB200 EKS inference recipe configuration. Specifically, it eliminates NCCL fabric bandwidth performance validation entries, including checks for network (EFA) and NVLS/NVL72 paths along with their associated constraint thresholds. This represents a removal of 17 lines from the recipe's validation specification without adding new configuration.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: removing NCCL performance checks from the GB200 EKS inference recipe overlay. |
| Description check | ✅ Passed | The pull request description is comprehensive and directly related to the changeset, providing clear motivation, context, implementation details, and risk assessment. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@mchmarny enabled auto-merge (squash) on April 24, 2026 at 23:03
@mchmarny merged commit 7db4275 into NVIDIA:main on Apr 24, 2026
71 of 72 checks passed