feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001
yuanchen8911 wants to merge 2 commits into
Walkthrough
This PR extends recipe classification with accelerator awareness and establishes deployment-phase validation baselines for three new accelerator-specific recipe overlays. The test infrastructure gains an Accelerator field and helpers to require deployment-phase gpu-operator-version gating for accelerator-bound overlays. Three overlays (GB200, H100, RTX Pro 6000) provide a standardized deployment-phase floor including operator-version constraints. The test matrix is rewritten to cover accelerator combinations and the knownGaps allowlist is cleared.
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed
…ards (NVIDIA#969)

Push the deployment-phase floor (both the 4 standard checks AND the gpu-operator version constraint) to per-accelerator wildcard overlays so every accelerator-bound query inherits a constraint-backed phase.

Closes the gap surfaced by Codex's review: PR NVIDIA#1001 v1 added the check name at the service-root, but per-phase replace semantics meant descendants without their own deployment block (e.g., gb200-oke-training) inherited the check name without a Deployment.gpu-operator.version constraint. The gpu-operator-version check is a no-op without that constraint, so the recipe shipped with the check name but no version assertion.

Changes

- recipes/overlays/{h100,gb200,rtx-pro-6000}-any.yaml: new wildcard overlays (service: any, accelerator: <X>) carrying the full deployment block (4 checks + version constraint). Concrete leaves that declare their own deployment block continue to override via per-phase replace; constraint values match the wildcard floor.
- recipes/overlays/{aks,eks,gke-cos,kind,lke,oke}.yaml: revert the service-root deployment block added in v1. Service-roots own the service-scoped phase (conformance); accelerator-scoped phases live on the per-accelerator wildcards. The applied-overlay order (wildcards then base chain) means a service-root deployment block would shadow the wildcard's, so v1's service-root block was the wrong location.
- pkg/recipe/validation_phase_floor_test.go: assert Deployment.gpu-operator.version constraint is present whenever the deployment phase is required; exempt accelerator-unbound classifications from requiring deployment + performance phases (intermediates without an accelerator: criterion have no meaningful accelerator-specific threshold and concrete leaves carry the constraint via wildcard composition). Add isAcceleratorBound, requiresDeployment, and hasGPUOperatorVersionConstraint helpers. Update TestClassifyOverlay to cover the accelerator-unbound exempt path.

Verification

- TestOverlayValidationPhaseFloor passes for all 56 gateable overlays in both default and strict modes; knownGaps stays empty.
- aicr recipe --service oke --accelerator gb200 --intent training (the Codex-reported regression case) now resolves with Deployment.gpu-operator.version >= v25.10.0.
- Same query against RTX Pro 6000 LKE and H100 Kind also resolves with their respective version constraints.
- make qualify passes.

Follow-up NVIDIA#1000 will switch validation.{phase}.checks/constraints to per-field union merge so concrete leaves can declare just the incremental additions instead of repeating the wildcard's full block.
116cd6e to 9595676
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/recipe/validation_phase_floor_test.go`:
- Around line 121-136: The helper hasGPUOperatorVersionConstraint currently only
checks for a Deployment.gpu-operator.version constraint; change it to require
both the deployment check named "gpu-operator-version" and the constraint
"Deployment.gpu-operator.version" are present on the ValidationPhase (i.e.,
iterate p.Checks and p.Constraints and return true only if both are found).
Update the same helper/duplicate occurrence elsewhere in this file (the other
hasGPUOperatorVersionConstraint implementation) so tests assert both the
presence of the check and its version constraint.
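The stricter helper the comment asks for could look roughly like the sketch below. The `ValidationPhase` and `Constraint` types here are hypothetical stand-ins for the ones in `pkg/recipe` (the real field names and types may differ); only the helper name, the check name `gpu-operator-version`, and the constraint name `Deployment.gpu-operator.version` come from the review text.

```go
package main

import "fmt"

// Constraint and ValidationPhase are hypothetical stand-ins for the
// pkg/recipe types; the real definitions may use different fields.
type Constraint struct {
	Name  string
	Value string
}

type ValidationPhase struct {
	Checks      []string
	Constraints []Constraint
}

const (
	gpuOperatorVersionCheck      = "gpu-operator-version"
	gpuOperatorVersionConstraint = "Deployment.gpu-operator.version"
)

// hasGPUOperatorVersionConstraint reports whether the phase declares both
// the check name and the constraint that the check evaluates.
func hasGPUOperatorVersionConstraint(p *ValidationPhase) bool {
	if p == nil {
		return false
	}
	hasCheck, hasConstraint := false, false
	for _, c := range p.Checks {
		if c == gpuOperatorVersionCheck {
			hasCheck = true
			break
		}
	}
	for _, c := range p.Constraints {
		if c.Name == gpuOperatorVersionConstraint {
			hasConstraint = true
			break
		}
	}
	return hasCheck && hasConstraint
}

func main() {
	// A phase with the check name but no constraint is the no-op case.
	checkOnly := &ValidationPhase{Checks: []string{gpuOperatorVersionCheck}}
	both := &ValidationPhase{
		Checks:      []string{gpuOperatorVersionCheck},
		Constraints: []Constraint{{Name: gpuOperatorVersionConstraint, Value: ">= v25.10.0"}},
	}
	fmt.Println(hasGPUOperatorVersionConstraint(checkOnly), hasGPUOperatorVersionConstraint(both))
}
```

Requiring both halves means the floor test rejects a phase that names the check without a constraint, and equally a phase that carries the constraint but never runs the check.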
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: acbde46d-9041-478c-9d9c-5dded44874d5
📒 Files selected for processing (4)
pkg/recipe/validation_phase_floor_test.go
recipes/overlays/gb200-any.yaml
recipes/overlays/h100-any.yaml
recipes/overlays/rtx-pro-6000-any.yaml
9595676 to 1aafaf0
1aafaf0 to 38f08af
Summary
Push the deployment-phase floor — both the 4 standard checks AND the `Deployment.gpu-operator.version` constraint — to per-accelerator wildcard overlays (`h100-any.yaml`, `gb200-any.yaml`, `rtx-pro-6000-any.yaml`). Every accelerator-bound query now inherits a constraint-backed deployment phase. Tighten `TestOverlayValidationPhaseFloor` to require the constraint, not just phase presence; exempt accelerator-unbound classifications (whose threshold can't be meaningful without an accelerator).
Closes the gap Codex flagged on v1: declaring `gpu-operator-version` at the service-root let accelerator-bound descendants like `gb200-oke-training` inherit the check name without a `Deployment.gpu-operator.version` constraint, so the check ran as a no-op.
Scope (explicit)
In scope: deployment-phase floor for accelerator-bound recipes — every recipe a user can actually deploy (specifies `--accelerator` or has an `accelerator:` criterion).
Explicitly out of scope: accelerator-unbound intermediates (e.g., `eks`, `eks-training`, `aks-inference`). These resolve to incomplete recipes — running `aicr recipe --service eks --intent training` without `--accelerator` produces a bundle with no accelerator-specific GPU operator config (no driver version pin, no chart values per accelerator). The deployment-phase `Deployment.gpu-operator.version` constraint is accelerator-specific (H100 ≥ v24.6.0, GB200 ≥ v25.10.0, RTX Pro 6000 ≥ v25.10.0); a generic version pin at the intent layer would either be over-permissive (lowest common floor) or wrong for some accelerators. The floor test exempts these classifications via `requiresDeployment()` checking `isAcceleratorBound()`. Issue #969 has been updated to reflect this narrowing.
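The exemption can be pictured with a minimal sketch. The `classification` struct and its `Name` field are hypothetical; only the helper names `isAcceleratorBound` and `requiresDeployment` and the existence of an `Accelerator` field come from this PR. Treating an empty value or `any` as unbound is an assumption made for illustration, and the real `requiresDeployment()` may consider more than accelerator binding.

```go
package main

import "fmt"

// classification is a hypothetical stand-in for the floor test's overlay
// classification; only the Accelerator field is named in the PR text.
type classification struct {
	Name        string
	Accelerator string // "" when the overlay has no accelerator: criterion
}

// isAcceleratorBound treats an empty or "any" accelerator as unbound.
// That rule is an assumption, not the confirmed implementation.
func isAcceleratorBound(c classification) bool {
	return c.Accelerator != "" && c.Accelerator != "any"
}

// requiresDeployment gates the deployment-phase floor on accelerator
// binding, so intermediates such as eks-training are exempt.
func requiresDeployment(c classification) bool {
	return isAcceleratorBound(c)
}

func main() {
	for _, c := range []classification{
		{Name: "eks-training"},
		{Name: "gb200-any", Accelerator: "gb200"},
		{Name: "gb200-oke-training", Accelerator: "gb200"},
	} {
		fmt.Printf("%-20s requiresDeployment=%v\n", c.Name, requiresDeployment(c))
	}
}
```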
Motivation / Context
Fixes #969 — every accelerator-bound recipe now ships with a working version-pin assertion, not just a check name.
Related:
Type of Change
Component(s) Affected
Implementation Notes
Design choice (wildcards, not service-roots): The resolver applies overlays in the order `base → wildcards → base chain`. Under per-phase replace, whichever layer declares `deployment:` last wins. v1 added the block at the service-root, which is applied later than the per-accelerator wildcards and therefore shadowed any wildcard-level constraint. v2 inverts this: service-roots own the service-scoped phase (`conformance`), and per-accelerator wildcards own the accelerator-scoped phase (`deployment` with version pin). Concrete leaves with their own `deployment:` block continue to win via per-phase replace; their constraint values match the wildcard floor, so the override is a no-op today.
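To make the shadowing concrete, here is a small sketch of per-phase replace. It is not the resolver's actual code: the `Phase` and `Overlay` types and the `resolve` function are hypothetical, and only the overlay names, the check name, and the constraint come from this PR. The other three standard checks are omitted because their names are not listed here.

```go
package main

import "fmt"

// Phase and Overlay are hypothetical, simplified shapes for illustration.
type Phase struct {
	Checks      []string
	Constraints map[string]string
}

// Overlay maps phase names to blocks; an absent key means "not declared".
type Overlay struct {
	Name       string
	Validation map[string]Phase
}

// resolve applies overlays in order. A declared phase replaces the whole
// earlier phase (checks and constraints together), which is the per-phase
// replace behavior described above.
func resolve(overlays []Overlay) map[string]Phase {
	out := map[string]Phase{}
	for _, o := range overlays {
		for name, p := range o.Validation {
			out[name] = p
		}
	}
	return out
}

func main() {
	wildcard := Overlay{Name: "gb200-any", Validation: map[string]Phase{
		"deployment": {
			Checks:      []string{"gpu-operator-version"},
			Constraints: map[string]string{"Deployment.gpu-operator.version": ">= v25.10.0"},
		},
	}}
	// v1 layout: the service-root declares deployment with the check name only.
	v1Root := Overlay{Name: "oke", Validation: map[string]Phase{
		"conformance": {},
		"deployment":  {Checks: []string{"gpu-operator-version"}},
	}}
	// v2 layout: the service-root owns conformance and leaves deployment alone.
	v2Root := Overlay{Name: "oke", Validation: map[string]Phase{
		"conformance": {},
	}}

	v1 := resolve([]Overlay{wildcard, v1Root})
	v2 := resolve([]Overlay{wildcard, v2Root})
	fmt.Println("v1 deployment constraints:", len(v1["deployment"].Constraints))
	fmt.Println("v2 deployment constraints:", len(v2["deployment"].Constraints))
}
```

In the v1 layout the later service-root block replaces the wildcard's block wholesale, so the constraint disappears; in the v2 layout the wildcard's block survives because nothing applied later redeclares `deployment`.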
Why accelerator-unbound classifications are exempt: An intent-level intermediate without an `accelerator:` criterion (e.g., `eks-training`) is a degenerate query. The resulting recipe has no accelerator-specific GPU-operator config either, so a deployment-phase version pin would have no meaningful value at the intent layer. Concrete leaves carry the constraint via wildcard composition. Issue #969 was narrowed to make this explicit.
Floor-test tightening:
Files Changed
Testing
```bash
# Default mode: all 56 subtests pass, knownGaps stays empty
go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor -v

# Classification cases (10 subtests, all pass)
go test ./pkg/recipe/... -run TestClassifyOverlay -v

# Codex's verification command: now resolves with the constraint
go run ./cmd/aicr recipe --service oke --accelerator gb200 --intent training --format yaml
# yields:
#   validation.deployment.constraints:
#     - name: Deployment.gpu-operator.version
#       value: '>= v25.10.0'

# Spot-checks
go run ./cmd/aicr recipe --service lke --accelerator rtx-pro-6000 --intent training --format yaml
go run ./cmd/aicr recipe --service kind --accelerator h100 --intent inference --format yaml

# Full gate
make qualify  # PASS
```
Strict mode (`AICR_VALIDATION_FLOOR_STRICT=1`) status at this PR head: still fails on 5 known performance-phase gaps — none of them deployment-side, none of them introduced or made worse by this PR. They are tracked in dedicated follow-up issues:
PR #1009 closes the accelerator-unbound intermediate side of strict-mode perf. Once both #1001 and #1009 merge, strict-mode failures drop to the 5 above; remaining work is the dedicated follow-ups.
Risk Assessment
Rollout notes: No migration. Resolved recipes for previously-uncovered accelerator-bound queries (e.g., all `gb200-oke-*`, `h100-aks-*-inference`, `h100-kind-*`, `rtx-pro-6000-lke-*`) now ship with a real version assertion. Operators running `aicr validate --phase deployment` against these recipes will see the constraint evaluated against the deployed gpu-operator version.
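As a rough illustration of what evaluating such a floor involves, the sketch below compares a deployed version against a `>= vX.Y.Z` constraint. It is not how `aicr validate` is implemented: `parseVersion` and `satisfiesFloor` are hypothetical helpers, and the sketch assumes plain three-part versions with an optional `v` prefix and handles only the `>=` operator.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v25.10.0" into [25 10 0]. Hypothetical helper that
// accepts only MAJOR.MINOR.PATCH with an optional leading "v".
func parseVersion(s string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(s), "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad version component %q: %w", p, err)
		}
		out[i] = n
	}
	return out, nil
}

// satisfiesFloor reports whether deployed meets a ">= vX.Y.Z" constraint.
// Components are compared numerically, left to right.
func satisfiesFloor(constraint, deployed string) (bool, error) {
	c := strings.TrimSpace(constraint)
	if !strings.HasPrefix(c, ">=") {
		return false, fmt.Errorf("unsupported constraint %q", constraint)
	}
	floor, err := parseVersion(strings.TrimPrefix(c, ">="))
	if err != nil {
		return false, err
	}
	got, err := parseVersion(deployed)
	if err != nil {
		return false, err
	}
	for i := range floor {
		if got[i] != floor[i] {
			return got[i] > floor[i], nil
		}
	}
	return true, nil // equal versions satisfy ">="
}

func main() {
	for _, deployed := range []string{"v25.3.4", "v25.10.0", "v26.1.0"} {
		ok, err := satisfiesFloor(">= v25.10.0", deployed)
		fmt.Println(deployed, ok, err)
	}
}
```

The numeric comparison matters: a plain string comparison would rank `v25.3.4` above `v25.10.0`, which would let an older operator pass the GB200 floor.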
Checklist