feat(recipe): deliver deployment-phase floor at per-accelerator wildcards by yuanchen8911 · Pull Request #1001 · NVIDIA/aicr

yuanchen8911 · 2026-05-21T16:40:12Z

Summary

Push the deployment-phase floor — both the 4 standard checks AND the `Deployment.gpu-operator.version` constraint — to per-accelerator wildcard overlays (`h100-any.yaml`, `gb200-any.yaml`, `rtx-pro-6000-any.yaml`). Every accelerator-bound query now inherits a constraint-backed deployment phase. Tighten `TestOverlayValidationPhaseFloor` to require the constraint, not just phase presence; exempt accelerator-unbound classifications (whose threshold can't be meaningful without an accelerator).

Closes the gap Codex flagged on v1: declaring `gpu-operator-version` at the service-root let accelerator-bound descendants like `gb200-oke-training` inherit the check name without a `Deployment.gpu-operator.version` constraint, so the check ran as a no-op.

Scope (explicit)

In scope: deployment-phase floor for accelerator-bound recipes — every recipe a user can actually deploy (specifies `--accelerator` or has an `accelerator:` criterion).

Explicitly out of scope: accelerator-unbound intermediates (e.g., `eks`, `eks-training`, `aks-inference`). These resolve to incomplete recipes — running `aicr recipe --service eks --intent training` without `--accelerator` produces a bundle with no accelerator-specific GPU operator config (no driver version pin, no chart values per accelerator). The deployment-phase `Deployment.gpu-operator.version` constraint is accelerator-specific (H100 ≥ v24.6.0, GB200 ≥ v25.10.0, RTX Pro 6000 ≥ v25.10.0); a generic version pin at the intent layer would either be over-permissive (lowest common floor) or wrong for some accelerators. The floor test exempts these classifications via `requiresDeployment()` checking `isAcceleratorBound()`. Issue #969 has been updated to reflect this narrowing.

Motivation / Context

Fixes #969 — every accelerator-bound recipe now ships with a working version-pin assertion, not just a check name.

Add CI guard to enforce per-intent validation phase coverage on recipe overlays #970 — CI guard (closed; tightened here)
feat(recipe): per-field union merge for validation phase checks #1000 — per-field union merge follow-up (will eventually let concrete leaves declare just the incremental delta over the wildcard)
Codex P2 review on v1 (resolved by v2; explicit reply in PR comments)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (`cmd/aicr`, `pkg/cli`)
API server (`cmd/aicrd`, `pkg/api`, `pkg/server`)
Recipe engine / data (`pkg/recipe`) — floor test refinement
Bundlers (`pkg/bundler`, `pkg/component/*`)
Collectors / snapshotter
Validator (`pkg/validator`) — accelerator-wildcard validation phase
Core libraries
Docs/examples

Implementation Notes

Design choice — wildcards, not service-roots: The resolver applies overlays in order `base → wildcards → base chain`. Per-phase replace means whichever layer declares `deployment:` last wins. v1 added the block at service-root, which ran later than per-accelerator wildcards and shadowed any wildcard-level constraint. v2 inverts: service-roots own the service-scoped phase (`conformance`); per-accelerator wildcards own the accelerator-scoped phase (`deployment` with version pin). Concrete leaves with their own `deployment:` block continue to win via per-phase replace; their constraint values match the wildcard floor so the override is a no-op today.

Why accelerator-unbound classifications are exempt: An intent-level intermediate without an `accelerator:` criterion (e.g., `eks-training`) is a degenerate query — the resulting recipe has no accelerator-specific GPU-operator config either, so a deployment-phase version pin would have no meaningful value at the intent layer. Concrete leaves carry the constraint via wildcard composition. Issue #969 narrowed to make this explicit.

Floor-test tightening:

New `isAcceleratorBound()` helper drives both `requiresDeployment()` (new) and `requiresPerformance()` (refined).
New `hasGPUOperatorVersionConstraint()` helper asserts `Deployment.gpu-operator.version` is present on the resolved phase.
`TestClassifyOverlay` expanded with `Accelerator` field, `wantRequiresDeploy` expectation, and three accelerator-unbound exemption cases.

Files Changed

`recipes/overlays/h100-any.yaml`, `gb200-any.yaml`, `rtx-pro-6000-any.yaml` — new wildcards, full deployment block + version pin.
`pkg/recipe/validation_phase_floor_test.go` — `Accelerator` field on `classification`, deployment-required gating, constraint assertion, expanded `TestClassifyOverlay`.

Testing

```bash

Default mode — all 56 subtests pass, knownGaps stays empty

go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor -v

Classification cases (10 subtests, all pass)

go test ./pkg/recipe/... -run TestClassifyOverlay -v

Codex's verification command — now resolves with the constraint

go run ./cmd/aicr recipe --service oke --accelerator gb200 --intent training --format yaml

yields:

validation.deployment.constraints:

- name: Deployment.gpu-operator.version

value: '>= v25.10.0'

Spot-checks

go run ./cmd/aicr recipe --service lke --accelerator rtx-pro-6000 --intent training --format yaml
go run ./cmd/aicr recipe --service kind --accelerator h100 --intent inference --format yaml

Full gate

make qualify # PASS
```

Strict mode (`AICR_VALIDATION_FLOOR_STRICT=1`) status at this PR head: still fails on 5 known performance-phase gaps — none of them deployment-side, none of them introduced or made worse by this PR. They are tracked in dedicated follow-up issues:

`gb200-oke-ubuntu-inference-dynamo` → feat(validation): add performance-phase constraints to OKE overlays once OCI testbed lands #1007 (OCI testbed)
`h100-aks-ubuntu-inference-dynamo` → feat(validation): add performance-phase constraints to AKS overlays once Azure testbed lands #1006 (Azure testbed)
`h100-eks-ubuntu-inference-nim` → feat(validator): support NIM workloads in inference-perf so NIM recipes can ship a real performance gate #1010 (NIM validator support)
`rtx-pro-6000-lke-training` → feat(validation): add performance-phase constraints to LKE overlays once Linode testbed lands #1008 (Linode testbed)
`rtx-pro-6000-lke-ubuntu-training` → feat(validation): add performance-phase constraints to LKE overlays once Linode testbed lands #1008 (Linode testbed)

PR #1009 closes the accelerator-unbound intermediate side of strict-mode perf. Once both #1001 and #1009 merge, strict-mode failures drop to the 5 above; remaining work is the dedicated follow-ups.

Risk Assessment

Low — Data-only edits + a test refinement. The 9 leaf overlays with their own `deployment:` block are unaffected (per-phase replace, identical constraint values). New wildcards target accelerator-bound queries which were previously a no-op `gpu-operator-version` check.
Medium
High

Rollout notes: No migration. Resolved recipes for previously-uncovered accelerator-bound queries (e.g., all `gb200-oke-`, `h100-aks--inference`, `h100-kind-`, `rtx-pro-6000-lke-`) now ship with a real version assertion. Operators running `aicr validate --phase deployment` against these recipes will see the constraint evaluated against the deployed gpu-operator version.

Checklist

Tests pass locally (`make test` with `-race`)
Linter passes (`make lint`)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — 3 new `TestClassifyOverlay` cases + tightened `TestOverlayValidationPhaseFloor` assertions
I updated docs if user-facing behavior changed — internal floor-test contract; no user-facing API surface affected
Changes follow existing patterns in the codebase
Commits are cryptographically signed (`git commit -S`)

coderabbitai · 2026-05-21T16:43:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 264b8585-141d-45e3-9ea2-c72413ae2415

📥 Commits

Reviewing files that changed from the base of the PR and between 1aafaf0 and 38f08af.

📒 Files selected for processing (4)

pkg/recipe/validation_phase_floor_test.go
recipes/overlays/gb200-any.yaml
recipes/overlays/h100-any.yaml
recipes/overlays/rtx-pro-6000-any.yaml

📝 Walkthrough

Walkthrough

This PR extends recipe classification with accelerator awareness and establishes deployment-phase validation baselines for three new accelerator-specific recipe overlays. The test infrastructure gains an Accelerator field and helpers to require deployment-phase gpu-operator-version gating for accelerator-bound overlays. Three overlays (GB200, H100, RTX Pro 6000) provide a standardized deployment-phase floor including operator-version constraints. The test matrix is rewritten to cover accelerator combinations and the knownGaps allowlist is cleared.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: moving the deployment-phase floor to per-accelerator wildcard overlays.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, covering scope, implementation, testing, and context.
Linked Issues check	✅ Passed	All objectives from linked issue `#969` are met: per-accelerator wildcards added with deployment checks and version constraints, floor test tightened with new helpers to require deployment only for accelerator-bound cases, and accelerator-unbound intermediates explicitly exempted.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with linked issue `#969` scope: three wildcard overlay files added and test file updated; no unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ards (NVIDIA#969) Push the deployment-phase floor — both the 4 standard checks AND the gpu-operator version constraint — to per-accelerator wildcard overlays so every accelerator-bound query inherits a constraint-backed phase. Closes the gap surfaced by Codex's review: PR NVIDIA#1001 v1 added the check name at the service-root, but per-phase replace semantics meant descendants without their own deployment block (e.g., gb200-oke-training) inherited the check name without a Deployment.gpu-operator.version constraint. The gpu-operator-version check is a no-op without that constraint, so the recipe shipped with the check name but no version assertion. Changes - recipes/overlays/{h100,gb200,rtx-pro-6000}-any.yaml: new wildcard overlays (service: any, accelerator: <X>) carrying the full deployment block (4 checks + version constraint). Concrete leaves that declare their own deployment block continue to override via per-phase replace; constraint values match the wildcard floor. - recipes/overlays/{aks,eks,gke-cos,kind,lke,oke}.yaml: revert the service-root deployment block added in v1. Service-roots own the service-scoped phase (conformance); accelerator-scoped phases live on the per-accelerator wildcards. The applied-overlay order (wildcards then base chain) means a service-root deployment block would shadow the wildcard's, so v1's service-root block was the wrong location. - pkg/recipe/validation_phase_floor_test.go: assert Deployment.gpu-operator.version constraint is present whenever the deployment phase is required; exempt accelerator-unbound classifications from requiring deployment + performance phases (intermediates without an accelerator: criterion have no meaningful accelerator-specific threshold and concrete leaves carry the constraint via wildcard composition). Add isAcceleratorBound, requiresDeployment, and hasGPUOperatorVersionConstraint helpers. Update TestClassifyOverlay to cover the accelerator-unbound exempt path. Verification - TestOverlayValidationPhaseFloor passes for all 56 gateable overlays in both default and strict modes; knownGaps stays empty. - aicr recipe --service oke --accelerator gb200 --intent training (the Codex-reported regression case) now resolves with Deployment.gpu-operator.version >= v25.10.0. - Same query against RTX Pro 6000 LKE and H100 Kind also resolves with their respective version constraints. - make qualify passes. Follow-up NVIDIA#1000 will switch validation.{phase}.checks/constraints to per-field union merge so concrete leaves can declare just the incremental additions instead of repeating the wildcard's full block.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/validation_phase_floor_test.go`:
- Around line 121-136: The helper hasGPUOperatorVersionConstraint currently only
checks for a Deployment.gpu-operator.version constraint; change it to require
both the deployment check named "gpu-operator-version" and the constraint
"Deployment.gpu-operator.version" are present on the ValidationPhase (i.e.,
iterate p.Checks and p.Constraints and return true only if both are found).
Update the same helper/duplicate occurrence elsewhere in this file (the other
hasGPUOperatorVersionConstraint implementation) so tests assert both the
presence of the check and its version constraint.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: acbde46d-9041-478c-9d9c-5dded44874d5

📥 Commits

Reviewing files that changed from the base of the PR and between 116cd6e and 9595676.

📒 Files selected for processing (4)

pkg/recipe/validation_phase_floor_test.go
recipes/overlays/gb200-any.yaml
recipes/overlays/h100-any.yaml
recipes/overlays/rtx-pro-6000-any.yaml

…ards (NVIDIA#969) Push the deployment-phase floor — both the 4 standard checks AND the gpu-operator version constraint — to per-accelerator wildcard overlays so every accelerator-bound query inherits a constraint-backed phase. Closes the gap surfaced by Codex's review: PR NVIDIA#1001 v1 added the check name at the service-root, but per-phase replace semantics meant descendants without their own deployment block (e.g., gb200-oke-training) inherited the check name without a Deployment.gpu-operator.version constraint. The gpu-operator-version check is a no-op without that constraint, so the recipe shipped with the check name but no version assertion. Changes - recipes/overlays/{h100,gb200,rtx-pro-6000}-any.yaml: new wildcard overlays (service: any, accelerator: <X>) carrying the full deployment block (4 checks + version constraint). Concrete leaves that declare their own deployment block continue to override via per-phase replace; constraint values match the wildcard floor. - recipes/overlays/{aks,eks,gke-cos,kind,lke,oke}.yaml: revert the service-root deployment block added in v1. Service-roots own the service-scoped phase (conformance); accelerator-scoped phases live on the per-accelerator wildcards. The applied-overlay order (wildcards then base chain) means a service-root deployment block would shadow the wildcard's, so v1's service-root block was the wrong location. - pkg/recipe/validation_phase_floor_test.go: assert Deployment.gpu-operator.version constraint is present whenever the deployment phase is required; exempt accelerator-unbound classifications from requiring deployment + performance phases (intermediates without an accelerator: criterion have no meaningful accelerator-specific threshold and concrete leaves carry the constraint via wildcard composition). Add isAcceleratorBound, requiresDeployment, and hasGPUOperatorVersionConstraint helpers. Update TestClassifyOverlay to cover the accelerator-unbound exempt path. Verification - TestOverlayValidationPhaseFloor passes for all 56 gateable overlays in both default and strict modes; knownGaps stays empty. - aicr recipe --service oke --accelerator gb200 --intent training (the Codex-reported regression case) now resolves with Deployment.gpu-operator.version >= v25.10.0. - Same query against RTX Pro 6000 LKE and H100 Kind also resolves with their respective version constraints. - make qualify passes. Follow-up NVIDIA#1000 will switch validation.{phase}.checks/constraints to per-field union merge so concrete leaves can declare just the incremental additions instead of repeating the wildcard's full block.

yuanchen8911 requested review from a team as code owners May 21, 2026 16:40

yuanchen8911 added enhancement New feature or request area/recipes area/validator labels May 21, 2026

github-actions Bot added size/M and removed area/validator labels May 21, 2026

yuanchen8911 requested review from mchmarny and xdu31 May 21, 2026 17:25

yuanchen8911 force-pushed the fix/969-deployment-phase-floor-service-root branch from 116cd6e to 9595676 Compare May 21, 2026 17:43

github-actions Bot added size/L and removed size/M labels May 21, 2026

yuanchen8911 changed the title ~~feat(recipe): add deployment-phase floor at service-root overlays~~ feat(recipe): deliver deployment-phase floor at per-accelerator wildcards May 21, 2026

yuanchen8911 added the area/validator label May 21, 2026

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

Comment thread pkg/recipe/validation_phase_floor_test.go

yuanchen8911 marked this pull request as draft May 21, 2026 17:51

yuanchen8911 force-pushed the fix/969-deployment-phase-floor-service-root branch from 9595676 to 1aafaf0 Compare May 21, 2026 18:00

github-actions Bot removed the area/validator label May 21, 2026

yuanchen8911 marked this pull request as ready for review May 21, 2026 18:13

mchmarny assigned yuanchen8911 May 22, 2026

yuanchen8911 force-pushed the fix/969-deployment-phase-floor-service-root branch from 1aafaf0 to 38f08af Compare May 22, 2026 15:39

yuanchen8911 requested a review from lockwobr May 22, 2026 15:42

Merge branch 'main' into fix/969-deployment-phase-floor-service-root

4d15392

mchmarny enabled auto-merge (squash) May 23, 2026 11:13

mchmarny approved these changes May 23, 2026

View reviewed changes

mchmarny disabled auto-merge May 23, 2026 11:25

mchmarny self-requested a review May 23, 2026 11:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(recipe): deliver deployment-phase floor at per-accelerator wildcards#1001

feat(recipe): deliver deployment-phase floor at per-accelerator wildcards#1001
yuanchen8911 wants to merge 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/969-deployment-phase-floor-service-root

yuanchen8911 commented May 21, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yuanchen8911 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope (explicit)

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Files Changed

Testing

Default mode — all 56 subtests pass, knownGaps stays empty

Classification cases (10 subtests, all pass)

Codex's verification command — now resolves with the constraint

yields:

validation.deployment.constraints:

- name: Deployment.gpu-operator.version

value: '>= v25.10.0'

Spot-checks

Full gate

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented May 21, 2026 •

edited

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading