Skip to content

feat(recipe): deliver deployment-phase floor at per-accelerator wildcards#1001

Open
yuanchen8911 wants to merge 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/969-deployment-phase-floor-service-root
Open

feat(recipe): deliver deployment-phase floor at per-accelerator wildcards#1001
yuanchen8911 wants to merge 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/969-deployment-phase-floor-service-root

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented May 21, 2026

Summary

Push the deployment-phase floor — both the 4 standard checks AND the `Deployment.gpu-operator.version` constraint — to per-accelerator wildcard overlays (`h100-any.yaml`, `gb200-any.yaml`, `rtx-pro-6000-any.yaml`). Every accelerator-bound query now inherits a constraint-backed deployment phase. Tighten `TestOverlayValidationPhaseFloor` to require the constraint, not just phase presence; exempt accelerator-unbound classifications (whose threshold can't be meaningful without an accelerator).

Closes the gap Codex flagged on v1: declaring `gpu-operator-version` at the service-root let accelerator-bound descendants like `gb200-oke-training` inherit the check name without a `Deployment.gpu-operator.version` constraint, so the check ran as a no-op.

Scope (explicit)

In scope: deployment-phase floor for accelerator-bound recipes — every recipe a user can actually deploy (specifies `--accelerator` or has an `accelerator:` criterion).

Explicitly out of scope: accelerator-unbound intermediates (e.g., `eks`, `eks-training`, `aks-inference`). These resolve to incomplete recipes — running `aicr recipe --service eks --intent training` without `--accelerator` produces a bundle with no accelerator-specific GPU operator config (no driver version pin, no chart values per accelerator). The deployment-phase `Deployment.gpu-operator.version` constraint is accelerator-specific (H100 ≥ v24.6.0, GB200 ≥ v25.10.0, RTX Pro 6000 ≥ v25.10.0); a generic version pin at the intent layer would either be over-permissive (lowest common floor) or wrong for some accelerators. The floor test exempts these classifications via `requiresDeployment()` checking `isAcceleratorBound()`. Issue #969 has been updated to reflect this narrowing.

Motivation / Context

Fixes #969 — every accelerator-bound recipe now ships with a working version-pin assertion, not just a check name.

Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (`cmd/aicr`, `pkg/cli`)
  • API server (`cmd/aicrd`, `pkg/api`, `pkg/server`)
  • Recipe engine / data (`pkg/recipe`) — floor test refinement
  • Bundlers (`pkg/bundler`, `pkg/component/*`)
  • Collectors / snapshotter
  • Validator (`pkg/validator`) — accelerator-wildcard validation phase
  • Core libraries
  • Docs/examples

Implementation Notes

Design choice — wildcards, not service-roots: The resolver applies overlays in order `base → wildcards → base chain`. Per-phase replace means whichever layer declares `deployment:` last wins. v1 added the block at service-root, which ran later than per-accelerator wildcards and shadowed any wildcard-level constraint. v2 inverts: service-roots own the service-scoped phase (`conformance`); per-accelerator wildcards own the accelerator-scoped phase (`deployment` with version pin). Concrete leaves with their own `deployment:` block continue to win via per-phase replace; their constraint values match the wildcard floor so the override is a no-op today.

Why accelerator-unbound classifications are exempt: An intent-level intermediate without an `accelerator:` criterion (e.g., `eks-training`) is a degenerate query — the resulting recipe has no accelerator-specific GPU-operator config either, so a deployment-phase version pin would have no meaningful value at the intent layer. Concrete leaves carry the constraint via wildcard composition. Issue #969 narrowed to make this explicit.

Floor-test tightening:

  • New `isAcceleratorBound()` helper drives both `requiresDeployment()` (new) and `requiresPerformance()` (refined).
  • New `hasGPUOperatorVersionConstraint()` helper asserts `Deployment.gpu-operator.version` is present on the resolved phase.
  • `TestClassifyOverlay` expanded with `Accelerator` field, `wantRequiresDeploy` expectation, and three accelerator-unbound exemption cases.

Files Changed

  • `recipes/overlays/h100-any.yaml`, `gb200-any.yaml`, `rtx-pro-6000-any.yaml` — new wildcards, full deployment block + version pin.
  • `pkg/recipe/validation_phase_floor_test.go` — `Accelerator` field on `classification`, deployment-required gating, constraint assertion, expanded `TestClassifyOverlay`.

Testing

```bash

Default mode — all 56 subtests pass, knownGaps stays empty

go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor -v

Classification cases (10 subtests, all pass)

go test ./pkg/recipe/... -run TestClassifyOverlay -v

Codex's verification command — now resolves with the constraint

go run ./cmd/aicr recipe --service oke --accelerator gb200 --intent training --format yaml

yields:

validation.deployment.constraints:

- name: Deployment.gpu-operator.version

value: '>= v25.10.0'

Spot-checks

go run ./cmd/aicr recipe --service lke --accelerator rtx-pro-6000 --intent training --format yaml
go run ./cmd/aicr recipe --service kind --accelerator h100 --intent inference --format yaml

Full gate

make qualify # PASS
```

Strict mode (`AICR_VALIDATION_FLOOR_STRICT=1`) status at this PR head: still fails on 5 known performance-phase gaps — none of them deployment-side, none of them introduced or made worse by this PR. They are tracked in dedicated follow-up issues:

PR #1009 closes the accelerator-unbound intermediate side of strict-mode perf. Once both #1001 and #1009 merge, strict-mode failures drop to the 5 above; remaining work is the dedicated follow-ups.

Risk Assessment

  • Low — Data-only edits + a test refinement. The 9 leaf overlays with their own `deployment:` block are unaffected (per-phase replace, identical constraint values). New wildcards target accelerator-bound queries which were previously a no-op `gpu-operator-version` check.
  • Medium
  • High

Rollout notes: No migration. Resolved recipes for previously-uncovered accelerator-bound queries (e.g., all `gb200-oke-`, `h100-aks--inference`, `h100-kind-`, `rtx-pro-6000-lke-`) now ship with a real version assertion. Operators running `aicr validate --phase deployment` against these recipes will see the constraint evaluated against the deployed gpu-operator version.

Checklist

  • Tests pass locally (`make test` with `-race`)
  • Linter passes (`make lint`)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — 3 new `TestClassifyOverlay` cases + tightened `TestOverlayValidationPhaseFloor` assertions
  • I updated docs if user-facing behavior changed — internal floor-test contract; no user-facing API surface affected
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (`git commit -S`)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 264b8585-141d-45e3-9ea2-c72413ae2415

📥 Commits

Reviewing files that changed from the base of the PR and between 1aafaf0 and 38f08af.

📒 Files selected for processing (4)
  • pkg/recipe/validation_phase_floor_test.go
  • recipes/overlays/gb200-any.yaml
  • recipes/overlays/h100-any.yaml
  • recipes/overlays/rtx-pro-6000-any.yaml

📝 Walkthrough

Walkthrough

This PR extends recipe classification with accelerator awareness and establishes deployment-phase validation baselines for three new accelerator-specific recipe overlays. The test infrastructure gains an Accelerator field and helpers to require deployment-phase gpu-operator-version gating for accelerator-bound overlays. Three overlays (GB200, H100, RTX Pro 6000) provide a standardized deployment-phase floor including operator-version constraints. The test matrix is rewritten to cover accelerator combinations and the knownGaps allowlist is cleared.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: moving the deployment-phase floor to per-accelerator wildcard overlays.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, covering scope, implementation, testing, and context.
Linked Issues check ✅ Passed All objectives from linked issue #969 are met: per-accelerator wildcards added with deployment checks and version constraints, floor test tightened with new helpers to require deployment only for accelerator-bound cases, and accelerator-unbound intermediates explicitly exempted.
Out of Scope Changes check ✅ Passed All changes are directly aligned with linked issue #969 scope: three wildcard overlay files added and test file updated; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 May 21, 2026 17:25
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 21, 2026
…ards (NVIDIA#969)

Push the deployment-phase floor — both the 4 standard checks AND the
gpu-operator version constraint — to per-accelerator wildcard overlays
so every accelerator-bound query inherits a constraint-backed phase.
Closes the gap surfaced by Codex's review: PR NVIDIA#1001 v1 added the
check name at the service-root, but per-phase replace semantics meant
descendants without their own deployment block (e.g., gb200-oke-training)
inherited the check name without a Deployment.gpu-operator.version
constraint. The gpu-operator-version check is a no-op without that
constraint, so the recipe shipped with the check name but no version
assertion.

Changes
- recipes/overlays/{h100,gb200,rtx-pro-6000}-any.yaml: new wildcard
  overlays (service: any, accelerator: <X>) carrying the full
  deployment block (4 checks + version constraint). Concrete leaves
  that declare their own deployment block continue to override via
  per-phase replace; constraint values match the wildcard floor.
- recipes/overlays/{aks,eks,gke-cos,kind,lke,oke}.yaml: revert the
  service-root deployment block added in v1. Service-roots own the
  service-scoped phase (conformance); accelerator-scoped phases live
  on the per-accelerator wildcards. The applied-overlay order
  (wildcards then base chain) means a service-root deployment block
  would shadow the wildcard's, so v1's service-root block was the
  wrong location.
- pkg/recipe/validation_phase_floor_test.go: assert
  Deployment.gpu-operator.version constraint is present whenever the
  deployment phase is required; exempt accelerator-unbound
  classifications from requiring deployment + performance phases
  (intermediates without an accelerator: criterion have no meaningful
  accelerator-specific threshold and concrete leaves carry the
  constraint via wildcard composition). Add isAcceleratorBound,
  requiresDeployment, and hasGPUOperatorVersionConstraint helpers.
  Update TestClassifyOverlay to cover the accelerator-unbound exempt
  path.

Verification
- TestOverlayValidationPhaseFloor passes for all 56 gateable overlays
  in both default and strict modes; knownGaps stays empty.
- aicr recipe --service oke --accelerator gb200 --intent training
  (the Codex-reported regression case) now resolves with
  Deployment.gpu-operator.version >= v25.10.0.
- Same query against RTX Pro 6000 LKE and H100 Kind also resolves
  with their respective version constraints.
- make qualify passes.

Follow-up NVIDIA#1000 will switch validation.{phase}.checks/constraints to
per-field union merge so concrete leaves can declare just the
incremental additions instead of repeating the wildcard's full block.
@yuanchen8911 yuanchen8911 force-pushed the fix/969-deployment-phase-floor-service-root branch from 116cd6e to 9595676 Compare May 21, 2026 17:43
@github-actions github-actions Bot added size/L and removed size/M labels May 21, 2026
@yuanchen8911 yuanchen8911 changed the title feat(recipe): add deployment-phase floor at service-root overlays feat(recipe): deliver deployment-phase floor at per-accelerator wildcards May 21, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/validation_phase_floor_test.go`:
- Around line 121-136: The helper hasGPUOperatorVersionConstraint currently only
checks for a Deployment.gpu-operator.version constraint; change it to require
both the deployment check named "gpu-operator-version" and the constraint
"Deployment.gpu-operator.version" are present on the ValidationPhase (i.e.,
iterate p.Checks and p.Constraints and return true only if both are found).
Update the same helper/duplicate occurrence elsewhere in this file (the other
hasGPUOperatorVersionConstraint implementation) so tests assert both the
presence of the check and its version constraint.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: acbde46d-9041-478c-9d9c-5dded44874d5

📥 Commits

Reviewing files that changed from the base of the PR and between 116cd6e and 9595676.

📒 Files selected for processing (4)
  • pkg/recipe/validation_phase_floor_test.go
  • recipes/overlays/gb200-any.yaml
  • recipes/overlays/h100-any.yaml
  • recipes/overlays/rtx-pro-6000-any.yaml

Comment thread pkg/recipe/validation_phase_floor_test.go
@yuanchen8911 yuanchen8911 marked this pull request as draft May 21, 2026 17:51
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 21, 2026
…ards (NVIDIA#969)

Push the deployment-phase floor — both the 4 standard checks AND the
gpu-operator version constraint — to per-accelerator wildcard overlays
so every accelerator-bound query inherits a constraint-backed phase.
Closes the gap surfaced by Codex's review: PR NVIDIA#1001 v1 added the
check name at the service-root, but per-phase replace semantics meant
descendants without their own deployment block (e.g., gb200-oke-training)
inherited the check name without a Deployment.gpu-operator.version
constraint. The gpu-operator-version check is a no-op without that
constraint, so the recipe shipped with the check name but no version
assertion.

Changes
- recipes/overlays/{h100,gb200,rtx-pro-6000}-any.yaml: new wildcard
  overlays (service: any, accelerator: <X>) carrying the full
  deployment block (4 checks + version constraint). Concrete leaves
  that declare their own deployment block continue to override via
  per-phase replace; constraint values match the wildcard floor.
- recipes/overlays/{aks,eks,gke-cos,kind,lke,oke}.yaml: revert the
  service-root deployment block added in v1. Service-roots own the
  service-scoped phase (conformance); accelerator-scoped phases live
  on the per-accelerator wildcards. The applied-overlay order
  (wildcards then base chain) means a service-root deployment block
  would shadow the wildcard's, so v1's service-root block was the
  wrong location.
- pkg/recipe/validation_phase_floor_test.go: assert
  Deployment.gpu-operator.version constraint is present whenever the
  deployment phase is required; exempt accelerator-unbound
  classifications from requiring deployment + performance phases
  (intermediates without an accelerator: criterion have no meaningful
  accelerator-specific threshold and concrete leaves carry the
  constraint via wildcard composition). Add isAcceleratorBound,
  requiresDeployment, and hasGPUOperatorVersionConstraint helpers.
  Update TestClassifyOverlay to cover the accelerator-unbound exempt
  path.

Verification
- TestOverlayValidationPhaseFloor passes for all 56 gateable overlays
  in both default and strict modes; knownGaps stays empty.
- aicr recipe --service oke --accelerator gb200 --intent training
  (the Codex-reported regression case) now resolves with
  Deployment.gpu-operator.version >= v25.10.0.
- Same query against RTX Pro 6000 LKE and H100 Kind also resolves
  with their respective version constraints.
- make qualify passes.

Follow-up NVIDIA#1000 will switch validation.{phase}.checks/constraints to
per-field union merge so concrete leaves can declare just the
incremental additions instead of repeating the wildcard's full block.
@yuanchen8911 yuanchen8911 force-pushed the fix/969-deployment-phase-floor-service-root branch from 9595676 to 1aafaf0 Compare May 21, 2026 18:00
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 21, 2026 18:13
…ards (NVIDIA#969)

Push the deployment-phase floor — both the 4 standard checks AND the
gpu-operator version constraint — to per-accelerator wildcard overlays
so every accelerator-bound query inherits a constraint-backed phase.
Closes the gap surfaced by Codex's review: PR NVIDIA#1001 v1 added the
check name at the service-root, but per-phase replace semantics meant
descendants without their own deployment block (e.g., gb200-oke-training)
inherited the check name without a Deployment.gpu-operator.version
constraint. The gpu-operator-version check is a no-op without that
constraint, so the recipe shipped with the check name but no version
assertion.

Changes
- recipes/overlays/{h100,gb200,rtx-pro-6000}-any.yaml: new wildcard
  overlays (service: any, accelerator: <X>) carrying the full
  deployment block (4 checks + version constraint). Concrete leaves
  that declare their own deployment block continue to override via
  per-phase replace; constraint values match the wildcard floor.
- recipes/overlays/{aks,eks,gke-cos,kind,lke,oke}.yaml: revert the
  service-root deployment block added in v1. Service-roots own the
  service-scoped phase (conformance); accelerator-scoped phases live
  on the per-accelerator wildcards. The applied-overlay order
  (wildcards then base chain) means a service-root deployment block
  would shadow the wildcard's, so v1's service-root block was the
  wrong location.
- pkg/recipe/validation_phase_floor_test.go: assert
  Deployment.gpu-operator.version constraint is present whenever the
  deployment phase is required; exempt accelerator-unbound
  classifications from requiring deployment + performance phases
  (intermediates without an accelerator: criterion have no meaningful
  accelerator-specific threshold and concrete leaves carry the
  constraint via wildcard composition). Add isAcceleratorBound,
  requiresDeployment, and hasGPUOperatorVersionConstraint helpers.
  Update TestClassifyOverlay to cover the accelerator-unbound exempt
  path.

Verification
- TestOverlayValidationPhaseFloor passes for all 56 gateable overlays
  in both default and strict modes; knownGaps stays empty.
- aicr recipe --service oke --accelerator gb200 --intent training
  (the Codex-reported regression case) now resolves with
  Deployment.gpu-operator.version >= v25.10.0.
- Same query against RTX Pro 6000 LKE and H100 Kind also resolves
  with their respective version constraints.
- make qualify passes.

Follow-up NVIDIA#1000 will switch validation.{phase}.checks/constraints to
per-field union merge so concrete leaves can declare just the
incremental additions instead of repeating the wildcard's full block.
@yuanchen8911 yuanchen8911 force-pushed the fix/969-deployment-phase-floor-service-root branch from 1aafaf0 to 38f08af Compare May 22, 2026 15:39
@yuanchen8911 yuanchen8911 requested a review from lockwobr May 22, 2026 15:42
@mchmarny mchmarny enabled auto-merge (squash) May 23, 2026 11:13
@mchmarny mchmarny disabled auto-merge May 23, 2026 11:25
@mchmarny mchmarny self-requested a review May 23, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes

2 participants