Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes

## Summary

Audit of \`recipes/overlays/\` showed that accelerator-bound GPU overlays (h100-\*, gb200-\*, rtx-pro-6000-\*) resolved to conformance-only validation, missing the deployment-phase \`operator-health\` / \`expected-resources\` / \`gpu-operator-version\` / \`check-nvidia-smi\` checks and — critically — the \`Deployment.gpu-operator.version\` constraint that makes \`gpu-operator-version\` a real assertion rather than a no-op.

**Scope (rescoped):**
- **In scope:** deployment-phase coverage for **accelerator-bound recipes** — every recipe a user can actually deploy (specifies \`--accelerator\` on the CLI, or has an \`accelerator:\` criterion that selects one).
- **Out of scope:** accelerator-unbound intermediates (e.g., \`eks\`, \`eks-training\`, \`aks-inference\`). These are not user-facing deployable recipes — \`aicr recipe --service eks --intent training\` without \`--accelerator\` produces a bundle with no accelerator-specific GPU operator config (no driver pin, no chart values per accelerator). The deployment-phase version pin has no meaningful value at the intent layer because the pin is accelerator-specific (H100 ≥ v24.6.0, GB200 ≥ v25.10.0, RTX Pro 6000 ≥ v25.10.0). The original 37-entry audit listed these intermediates, but covering them would require either inventing a service-scoped version pin (no anchor in reality) or shipping accelerator-unspecified deployment checks that don't help operators.
- **Performance-phase work:** tracked separately (see "Tracked separately" below).

## Background — how validation resolves

The validation block merges per-phase across the \`base:\` chain (\`pkg/recipe/metadata.go:459-477\`). Each leaf overlay's effective coverage is the OR of \`deployment\` / \`performance\` / \`conformance\` phases set anywhere in its chain.

Service-root overlays historically defined only \`conformance:\`. Deployment-phase coverage was therefore absent from every descendant that did not declare its own \`deployment:\` block — which was most of them.

## Resolution

PR #1001 introduces per-accelerator wildcard overlays (\`recipes/overlays/h100-any.yaml\`, \`gb200-any.yaml\`, \`rtx-pro-6000-any.yaml\`) carrying the full deployment block — both the 4 standard checks AND the \`Deployment.gpu-operator.version\` constraint. The resolver picks these up for every query that binds the matching accelerator. Concrete leaves that declare their own \`deployment:\` block continue to override via per-phase replace; their constraint values match the wildcard floor so the override is a no-op.

The floor test (\`pkg/recipe/validation_phase_floor_test.go::TestOverlayValidationPhaseFloor\`) is tightened to assert the \`Deployment.gpu-operator.version\` constraint is present whenever the deployment phase is required, via new \`isAcceleratorBound\` / \`requiresDeployment\` / \`hasGPUOperatorVersionConstraint\` helpers. Accelerator-unbound classifications are explicitly exempt from deployment + performance gating.

## Performance-phase work (tracked separately)

The original audit also surfaced performance-phase gaps. These have different blockers (testbed availability per cloud, validator support per platform) and are tracked in dedicated issues:

- **#1005** — Addressable EKS + GKE performance work: floor-test refinement for accelerator-unbound intermediates (PR #1009).
- **#1006** — AKS performance constraints (blocked on Azure testbed).
- **#1007** — OKE performance constraints (blocked on OCI testbed, 4 GB200 / OKE overlays).
- **#1008** — LKE performance constraints (blocked on Linode testbed, 2 RTX Pro 6000 overlays).
- **#1010** — NIM workload support in \`inference-perf\` validator (so \`h100-eks-ubuntu-inference-nim\` can ship a non-skipping perf gate).

## Out of scope (track separately)

- \`a100\` and \`l40\` are declared in \`pkg/recipe/criteria.go\` but have zero overlays anywhere in \`recipes/overlays/\` — tracked in #1002 (a100) and #1003 (l40).
- \`b200\` has only \`b200-any-training\` (wildcard stub, no concrete service-bound leaves) — tracked in #1004.

## Related

- #970 — CI guard (closed; tightened in this PR's floor-test update).
- #1000 — per-field union merge for validation checks (orthogonal merge-semantics fix; would let concrete leaves declare incremental \`checks:\` deltas instead of repeating the wildcard's full list).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes #969

Summary

Background — how validation resolves

Resolution

Performance-phase work (tracked separately)

Out of scope (track separately)

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Close deployment-phase validation coverage gaps for accelerator-bound GPU recipes #969

Description

Summary

Background — how validation resolves

Resolution

Performance-phase work (tracked separately)

Out of scope (track separately)

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions