feat(recipes): add concrete A100 service-bound overlays #1002

## Summary

The A100 accelerator is declared in `pkg/recipe/criteria.go` (`CriteriaAcceleratorA100 = "a100"`) but has **zero** overlays in `recipes/overlays/`. A user running `aicr recipe --accelerator a100 --service <any>` cannot resolve a usable recipe.

## Motivation / Context

Surfaced as an explicit "out of scope (track separately)" item in #969 — the validation phase coverage audit could not include A100 because no overlays exist to audit. Filing this so the gap has a dedicated tracker.

A100 is among the most broadly deployed datacenter GPUs in NVIDIA's lineup; without overlays here, a large user segment cannot use AICR unless they author overlays by hand.

## A100 cloud SKU availability

| Cloud | SKU |
|---|---|
| AWS EKS | `p4d.24xlarge` (8x A100 40GB), `p4de.24xlarge` (8x A100 80GB) |
| GCP GKE | `a2-highgpu-*` (A100 40GB), `a2-ultragpu-*` (A100 80GB) |
| Azure AKS | `Standard_ND96asr_v4` (A100 40GB), `Standard_ND96amsr_A100_v4` (A100 80GB) |
| OCI OKE | `BM.GPU.A100-v2.8` (A100 80GB), `BM.GPU4.8` (A100 40GB) |

A100 is widely available across all four hyperscalers, so the eventual overlay set should match H100 breadth.

## Suggested scope

The first PR should deliver one usable cloud as the minimum viable scope; extensions are tracked as follow-up PRs against this issue:

**PR 1 (minimum):** A100 on the cloud with the strongest current test coverage (suggest EKS, mirroring `h100-eks-*`):
- `a100-eks-training.yaml`, `a100-eks-inference.yaml`
- `a100-eks-ubuntu-training.yaml`, `a100-eks-ubuntu-inference.yaml`
- One platform variant (`a100-eks-ubuntu-training-kubeflow.yaml`)
- Per-accelerator constraint: `Deployment.gpu-operator.version` floor — A100 stabilized in v22.9; recommend `>= v23.6.0` baseline
- NCCL bandwidth threshold for training (starting estimates: ~200 GB/s intra-node over NVLink; inter-node over EFA at most ~50 GB/s, the 400 Gbps line rate of `p4d`/`p4de`; both need empirical tuning)
- Optional accelerator-wide wildcard `a100-any-training.yaml` for the NCCL threshold (mirroring `gb200-any-training.yaml`)
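
As a sanity check on the inter-node estimate, the conversion from NIC line rate to GB/s can be sketched as below. This is a minimal illustration, not AICR code: the 400 Gbps figure is the advertised network bandwidth of `p4d.24xlarge`, while the helper names and the 0.8 efficiency factor are hypothetical placeholders that would have to be replaced by measured values.

```python
def gbps_to_gigabytes_per_sec(gbps: float) -> float:
    """Convert a line rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return gbps / 8.0


def threshold_from_line_rate(gbps: float, efficiency: float) -> float:
    """Derive a candidate bandwidth threshold (GB/s) from a line rate.

    `efficiency` is the fraction of line rate the workload is expected
    to reach; it is an assumption to be tuned empirically.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return gbps_to_gigabytes_per_sec(gbps) * efficiency


P4D_EFA_GBPS = 400.0      # advertised p4d.24xlarge network bandwidth
ASSUMED_EFFICIENCY = 0.8  # hypothetical placeholder, not a measured value

line_rate = gbps_to_gigabytes_per_sec(P4D_EFA_GBPS)
threshold = threshold_from_line_rate(P4D_EFA_GBPS, ASSUMED_EFFICIENCY)
print(f"line rate: {line_rate:.1f} GB/s, candidate threshold: {threshold:.1f} GB/s")
```

Any inter-node threshold above the line rate cannot be met on that instance type, so the computed line rate gives an upper bound for the value chosen in the overlay.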

**PR 2+:** Same patterns for GKE, AKS, OKE.

Each PR should:
- Add the overlays
- Update `recipes/registry.yaml` if accelerator-specific component pins are needed
- Pass `TestOverlayValidationPhaseFloor` (deployment + conformance inherited from service-root via PR #1001)
- Regenerate the BOM docs (`make bom-docs`) if any chart pin differs
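
Before opening PR 1, a quick local check that every overlay named in the scope above actually exists might look like the sketch below. The file names and the `recipes/overlays` directory come from this issue; the `missing_overlays` helper is hypothetical and is not part of the AICR tooling.

```python
from pathlib import Path

# Overlay files listed under "PR 1 (minimum)" in this issue.
EXPECTED_PR1_OVERLAYS = [
    "a100-eks-training.yaml",
    "a100-eks-inference.yaml",
    "a100-eks-ubuntu-training.yaml",
    "a100-eks-ubuntu-inference.yaml",
    "a100-eks-ubuntu-training-kubeflow.yaml",
]


def missing_overlays(overlay_dir, expected=EXPECTED_PR1_OVERLAYS):
    """Return the expected overlay file names not present in overlay_dir."""
    root = Path(overlay_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_overlays("recipes/overlays")
    if missing:
        print("missing overlays:")
        for name in missing:
            print(f"  {name}")
    else:
        print("all PR 1 overlays present")
```

This only checks that the files exist; it does not validate their contents, which remains the job of `TestOverlayValidationPhaseFloor`.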

## Out of scope (file separately)

- **MIG profile recipes**: A100 40GB supports the 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, and 7g.40gb profiles (A100 80GB doubles the memory per profile, from 1g.10gb up to 7g.80gb); deferred.
- **Single-node inference-only overlays** for cost-optimized SKUs: possibly worth a `g4dn`-class follow-up if demand exists.

## Related

- #969 — validation phase coverage audit (A100 explicitly out-of-scope there)
- H100 overlay set: `recipes/overlays/h100-*.yaml` (reference pattern)
- GB200 overlay set: `recipes/overlays/gb200-*.yaml` (multi-PR build-up reference)
