Skip to content

feat(validation): scope strict perf floor to accelerator-bound recipes#1009

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:feat/1005-addressable-perf-floor-eks-gke
May 23, 2026
Merged

feat(validation): scope strict perf floor to accelerator-bound recipes#1009
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:feat/1005-addressable-perf-floor-eks-gke

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented May 21, 2026

Summary

Refine pkg/recipe/validation_phase_floor_test.go::requiresPerformance to exempt accelerator-unbound classifications (Criteria.Accelerator unset or CriteriaAcceleratorAny). The perf threshold is accelerator-specific (H100 EKS ≥ 300 GB/s NCCL, GB200 EKS ≥ 40 / 500 GB/s NET / NVLS, RTX Pro 6000 ≥ TBD) so no meaningful constraint exists at the intent layer; concrete-leaf descendants continue to carry their own threshold via per-phase replace.

Drops AICR_VALIDATION_FLOOR_STRICT=1 warnings from 10 → 5 by exempting the 5 accelerator-unbound intermediates (aks-training, eks-training, gke-cos-training, lke-training, oke-training).

Motivation / Context

Closes #1005. Original PR description claimed 10 → 4 by also adding an inference-perf placeholder to h100-eks-ubuntu-inference-nim. Codex's review (P2) flagged that the inference-perf check in validators/performance/inference_perf_constraint.go:195-197 short-circuits with "skipped - dynamo-platform not in recipe components" whenever the resolved recipe lacks dynamo-platform. NIM recipes declare k8s-nim-operator instead, so the placeholder would satisfy the floor-test letter but never actually run. The NIM block was reverted; the validator extension is tracked in #1010.

Fixes: #1005
Related: #969, #1001, #1006, #1007, #1008, #1010

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes) — test-only refinement
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe) — floor test refinement
  • Bundlers
  • Collectors / snapshotter
  • Validator (pkg/validator) — floor contract change
  • Core libraries
  • Docs/examples

Implementation Notes

  • requiresPerformance ordering — accelerator-unset check sits after the Kind check (Kind is already returned-false earlier) and before the intent-specific branches. The Accelerator field uses CriteriaAcceleratorType; empty string covers the YAML-unset case and CriteriaAcceleratorAny covers the explicit accelerator: any case (used in wildcard overlays, though those are already filtered out at enumerateGateableOverlays).
  • TestClassifyOverlay updated with the new accelerator field on existing cases (all CriteriaAcceleratorH100 since their resolved-criteria origin all bind H100) and three new cases covering the exempted paths.
  • No data changes in this PR. The gb200-eks-ubuntu-inference-dynamo overlay already has a real inference-perf block (added out-of-band), and the NIM equivalent waits on feat(validator): support NIM workloads in inference-perf so NIM recipes can ship a real performance gate #1010.

Remaining strict-mode failures after this PR (5)

Overlay Blocker Tracker
gb200-oke-ubuntu-inference-dynamo OCI testbed #1007
h100-aks-ubuntu-inference-dynamo Azure testbed #1006
h100-eks-ubuntu-inference-nim NIM validator support #1010
rtx-pro-6000-lke-training Linode testbed #1008
rtx-pro-6000-lke-ubuntu-training Linode testbed #1008

Testing

# Default mode — 56/56 subtests pass, knownGaps unchanged
go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor -v

# Strict mode — failures drop from 10 to 5
AICR_VALIDATION_FLOOR_STRICT=1 go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor

# Unit tests (with new accelerator-unset cases)
go test ./pkg/recipe/... -run TestClassifyOverlay -v

# Full gate
make qualify   # PASS

Risk Assessment

  • Low — Test-only refinement; no production code path touched. The exemption codifies an existing convention (perf thresholds live at the accelerator-bound leaf) that was previously enforced only by the warn-only mode.
  • Medium
  • High

Rollout notes: No migration. The floor-test contract is internal CI machinery; no user-facing API surface changes.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — 3 new cases in TestClassifyOverlay
  • I updated docs if user-facing behavior changed — floor-test contract is internal, no user-facing change
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ea8ca5c6-854e-4884-a78d-ee6dec725f85

📥 Commits

Reviewing files that changed from the base of the PR and between bb5ffbe and 60a0903.

📒 Files selected for processing (1)
  • pkg/recipe/validation_phase_floor_test.go

📝 Walkthrough

Walkthrough

The change adds an Accelerator field to overlay classification and surfaces it in classification.String(). requiresPerformance() now returns false when Accelerator is empty or equals CriteriaAcceleratorAny (in addition to IsKind cases). classifyOverlay copies Criteria.Accelerator into classification.Accelerator. TestClassifyOverlay is expanded with an accelerator dimension and updated expectations so performance is required only for concrete, non-any accelerators.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly summarizes the main change: scoping strict performance floor validation to accelerator-bound recipes, which is the core refactoring in this PR.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, covering the rationale for the accelerator-unbound exemption, test updates, and validation impact.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 May 21, 2026 17:25
@yuanchen8911 yuanchen8911 force-pushed the feat/1005-addressable-perf-floor-eks-gke branch from 02944c0 to bb5ffbe Compare May 21, 2026 17:47
@yuanchen8911 yuanchen8911 changed the title feat(validation): close addressable perf-phase gaps on EKS and GKE feat(validation): exempt accelerator-unbound intermediates from strict performance floor May 21, 2026
@yuanchen8911 yuanchen8911 marked this pull request as draft May 21, 2026 17:51
@yuanchen8911 yuanchen8911 changed the title feat(validation): exempt accelerator-unbound intermediates from strict performance floor feat(validation): scope strict perf floor to accelerator-bound recipes May 21, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 21, 2026 18:14
@yuanchen8911 yuanchen8911 requested a review from lockwobr May 22, 2026 15:42
…VIDIA#1005)

Refine pkg/recipe/validation_phase_floor_test.go::requiresPerformance
to exempt accelerator-unbound classifications (Criteria.Accelerator
unset or CriteriaAcceleratorAny). The perf threshold is
accelerator-specific (H100 EKS >= 300 GB/s NCCL, GB200 EKS >= 40 / 500
GB/s NET / NVLS, etc.) so no meaningful constraint exists at the
intent layer. Concrete-leaf descendants continue to carry their own
threshold via per-phase replace.

Drops strict-mode warnings: 10 -> 5 by exempting the 5
accelerator-unbound intermediates (aks-training, eks-training,
gke-cos-training, lke-training, oke-training).

NIM perf block deferred
-----------------------

An earlier revision of this PR added an inference-perf block to
recipes/overlays/h100-eks-ubuntu-inference-nim.yaml mirroring the
H100 Dynamo placeholder. Codex's review flagged that the
inference-perf check in
validators/performance/inference_perf_constraint.go:195-197
short-circuits with "skipped - dynamo-platform not in recipe
components" whenever the resolved recipe lacks dynamo-platform. NIM
recipes declare k8s-nim-operator instead, so the placeholder would
satisfy the floor-test letter but never actually run. Reverted that
edit; tracking the validator extension in NVIDIA#1010.

Remaining 5 strict-mode failures
--------------------------------

- gb200-oke-ubuntu-inference-dynamo (OCI testbed, NVIDIA#1007)
- h100-aks-ubuntu-inference-dynamo (Azure testbed, NVIDIA#1006)
- h100-eks-ubuntu-inference-nim (validator support, NVIDIA#1010)
- rtx-pro-6000-lke-training (Linode testbed, NVIDIA#1008)
- rtx-pro-6000-lke-ubuntu-training (Linode testbed, NVIDIA#1008)
@yuanchen8911 yuanchen8911 force-pushed the feat/1005-addressable-perf-floor-eks-gke branch from bb5ffbe to 60a0903 Compare May 22, 2026 15:44
@mchmarny mchmarny enabled auto-merge (squash) May 23, 2026 11:12
@mchmarny mchmarny merged commit 8435e1f into NVIDIA:main May 23, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(validation): close addressable performance-phase gaps on EKS and GKE

2 participants