chore(recipes): drop NCCL perf checks from gb200-eks-inference #678
Removes the validation.performance block (nccl-all-reduce-bw-net and nccl-all-reduce-bw-nvls) from the gb200-eks-inference overlay. Inference deployments on this platform don't need the NCCL fabric health gate by default — single-node serving would skip them and multi-node serving can opt in via its own overlay. The inheriting gb200-eks-ubuntu-inference overlay picks up the removal automatically.
Walkthrough
The pull request removes performance validation checks from the GB200 EKS inference recipe configuration. Specifically, it eliminates NCCL fabric bandwidth performance validation entries, including checks for network (EFA) and NVLS/NVL72 paths along with their associated constraint thresholds. This represents a removal of 17 lines from the recipe's validation specification without adding new configuration.
Estimated code review effort: 1 (Trivial), ~5 minutes
Pre-merge checks: 4 passed
Summary
Remove the validation.performance block (NCCL NET and NVLS all-reduce bandwidth checks) from the gb200-eks-inference overlay.
Motivation / Context
The NCCL fabric health checks (nccl-all-reduce-bw-net, nccl-all-reduce-bw-nvls) are not needed by default for GB200/EKS inference deployments. Single-node serving hits the WorkerCount < 2 skip path; multi-node serving that actually crosses the fabric can opt in via its own overlay rather than paying for these checks on every inference recipe. The training sibling (gb200-eks-training) keeps them.
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
pkg/recipe)
Implementation Notes
Removed the validation.performance block and its preamble comment from recipes/overlays/gb200-eks-inference.yaml.
The gb200-eks-ubuntu-inference overlay inherits from this one and had no NCCL block of its own, so it picks up the removal automatically.
Testing
YAML-only change to a single overlay. No Go code touched; the registry/overlay schema still validates.
Risk Assessment
Rollout notes: N/A — users who still want NCCL gating on GB200/EKS inference can add the block back in a private overlay.
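For readers who want to keep the gate, a private overlay might look roughly like the sketch below. Only validation.performance and the two check names come from this PR. The inheritance key (base), the list key (checks), and the overall layout are assumptions for illustration and should be checked against the real overlay schema. Thresholds are omitted on purpose, since this PR does not state them.

```yaml
# Hypothetical private overlay that re-enables NCCL gating for
# multi-node GB200/EKS inference. Key names other than
# validation.performance are assumptions, not the confirmed schema.
base: gb200-eks-inference          # assumed inheritance key
validation:
  performance:
    checks:                        # assumed list key
      - nccl-all-reduce-bw-net     # NCCL all-reduce bandwidth over the network path
      - nccl-all-reduce-bw-nvls    # NCCL all-reduce bandwidth over the NVLS path
    # Constraint thresholds intentionally left out; this PR does not list
    # the values. The gb200-eks-training overlay still carries its own block.
```

Because single-node serving hits the WorkerCount < 2 skip path anyway, an overlay like this would only matter for deployments that actually span multiple nodes.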
Checklist
git commit -S)