Skip to content

Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes #969

@yuanchen8911

Description

@yuanchen8911

Summary

Audit of `recipes/overlays/` showed that accelerator-bound GPU overlays (h100-*, gb200-*, rtx-pro-6000-*) resolved to conformance-only validation, missing the deployment-phase `operator-health` / `expected-resources` / `gpu-operator-version` / `check-nvidia-smi` checks and — critically — the `Deployment.gpu-operator.version` constraint that makes `gpu-operator-version` a real assertion rather than a no-op.

Scope (rescoped):

  • In scope: deployment-phase coverage for accelerator-bound recipes — every recipe a user can actually deploy (specifies `--accelerator` on the CLI, or has an `accelerator:` criterion that selects one).
  • Out of scope: accelerator-unbound intermediates (e.g., `eks`, `eks-training`, `aks-inference`). These are not user-facing deployable recipes — `aicr recipe --service eks --intent training` without `--accelerator` produces a bundle with no accelerator-specific GPU operator config (no driver pin, no chart values per accelerator). The deployment-phase version pin has no meaningful value at the intent layer because the pin is accelerator-specific (H100 ≥ v24.6.0, GB200 ≥ v25.10.0, RTX Pro 6000 ≥ v25.10.0). The original 37-entry audit listed these intermediates, but covering them would require either inventing a service-scoped version pin (no anchor in reality) or shipping accelerator-unspecified deployment checks that don't help operators.
  • Performance-phase work: tracked separately (see "Tracked separately" below).

Background — how validation resolves

The validation block merges per-phase across the `base:` chain (`pkg/recipe/metadata.go:459-477`). Each leaf overlay's effective coverage is the OR of `deployment` / `performance` / `conformance` phases set anywhere in its chain.

Service-root overlays historically defined only `conformance:`. Deployment-phase coverage was therefore absent from every descendant that did not declare its own `deployment:` block — which was most of them.

Resolution

PR #1001 introduces per-accelerator wildcard overlays (`recipes/overlays/h100-any.yaml`, `gb200-any.yaml`, `rtx-pro-6000-any.yaml`) carrying the full deployment block — both the 4 standard checks AND the `Deployment.gpu-operator.version` constraint. The resolver picks these up for every query that binds the matching accelerator. Concrete leaves that declare their own `deployment:` block continue to override via per-phase replace; their constraint values match the wildcard floor so the override is a no-op.

The floor test (`pkg/recipe/validation_phase_floor_test.go::TestOverlayValidationPhaseFloor`) is tightened to assert the `Deployment.gpu-operator.version` constraint is present whenever the deployment phase is required, via new `isAcceleratorBound` / `requiresDeployment` / `hasGPUOperatorVersionConstraint` helpers. Accelerator-unbound classifications are explicitly exempt from deployment + performance gating.

Performance-phase work (tracked separately)

The original audit also surfaced performance-phase gaps. These have different blockers (testbed availability per cloud, validator support per platform) and are tracked in dedicated issues:

Out of scope (track separately)

Related

Metadata

Metadata

Assignees

Labels

area/recipesarea/validatorenhancementNew feature or requesttheme/recipesRecipe expansion, overlays, mixins, and component registrytheme/validationConstraint evaluation, health checks, and conformance evidence
No fields configured for Enhancement.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions