Summary
The A100 accelerator is declared in pkg/recipe/criteria.go (CriteriaAcceleratorA100 = "a100") but has zero overlays in recipes/overlays/. A user running aicr recipe --accelerator a100 --service <any> cannot resolve a usable recipe.
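The gap can be confirmed from a checkout with a quick glob over the overlay directory. This is a minimal sketch, not AICR's resolver: the helper name overlaysFor is hypothetical, and the "<accelerator>-*.yaml" naming is an assumption inferred from the h100-eks-* and gb200-any-training.yaml names mentioned in this issue.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// overlaysFor returns the files in dir named "<accelerator>-*.yaml".
// Hypothetical helper: the naming convention is inferred from overlay
// names cited in this issue, not taken from AICR's resolution code.
func overlaysFor(dir, accelerator string) ([]string, error) {
	return filepath.Glob(filepath.Join(dir, accelerator+"-*.yaml"))
}

func main() {
	// Run from the repository root; a missing directory simply yields no matches.
	for _, acc := range []string{"a100", "h100", "gb200"} {
		matches, err := overlaysFor("recipes/overlays", acc)
		if err != nil {
			fmt.Println("glob error:", err)
			continue
		}
		fmt.Printf("%-6s %d overlay(s)\n", acc, len(matches))
	}
}
```

If the summary above holds, the a100 row reports zero overlays while the h100 and gb200 rows do not.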
Motivation / Context
Surfaced as an explicit "out of scope (track separately)" item in #969 — the validation phase coverage audit could not include A100 because no overlays exist to audit. Filing this so the gap has a dedicated tracker.
A100 is among the most broadly deployed datacenter GPUs in NVIDIA's lineup; its absence here means a large user segment cannot use AICR without manually authoring overlays.
A100 cloud SKU availability
| Cloud | SKU |
| --- | --- |
| AWS EKS | p4d.24xlarge (8x A100 40GB), p4de.24xlarge (8x A100 80GB) |
| GCP GKE | a2-highgpu-* (A100 40GB), a2-ultragpu-* (A100 80GB) |
| Azure AKS | Standard_ND96asr_v4 (A100 40GB), Standard_ND96amsr_A100_v4 (A100 80GB) |
| OCI OKE | BM.GPU.A100-v2.8 (A100 80GB), BM.GPU4.8 (A100 40GB) |
A100 is widely available across all four hyperscalers, so the eventual overlay set should match H100 breadth.
Suggested scope
Minimum viable for the first PR (deliver one usable cloud), with extensions tracked as follow-up PRs against this issue:
PR 1 (minimum): A100 on the cloud with the strongest current test coverage (suggest EKS, mirroring h100-eks-*):
- a100-eks-training.yaml, a100-eks-inference.yaml
- a100-eks-ubuntu-training.yaml, a100-eks-ubuntu-inference.yaml
- One platform variant (a100-eks-ubuntu-training-kubeflow.yaml)
- Per-accelerator constraint: Deployment.gpu-operator.version floor (A100 stabilized in v22.9; recommend >= v23.6.0 baseline)
- NCCL bandwidth threshold for training (A100 P2P NVLink: ~200 GB/s intra-node, ~100 GB/s inter-node via EFA; needs empirical tuning)
- Optional accelerator-wide wildcard a100-any-training.yaml for the NCCL threshold (mirroring gb200-any-training.yaml)
PR 2+: Same patterns for GKE, AKS, OKE.
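The gpu-operator version floor proposed for PR 1 amounts to an ordered comparison of MAJOR.MINOR.PATCH components. The sketch below illustrates that comparison only; how AICR actually evaluates the Deployment.gpu-operator.version constraint is not shown in this issue, and the helper names parseVersion and meetsFloor are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "v23.6.0" (leading "v" optional) into its three
// numeric components. Hypothetical helper, for illustration only.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("want MAJOR.MINOR.PATCH, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("bad component %q in %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// meetsFloor reports whether version >= floor, comparing major, then
// minor, then patch.
func meetsFloor(version, floor string) (bool, error) {
	v, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	f, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := range v {
		if v[i] != f[i] {
			return v[i] > f[i], nil
		}
	}
	return true, nil // all components equal
}

func main() {
	const floor = "v23.6.0" // baseline recommended in this issue
	// Sample inputs chosen to fall below, on, and above the floor.
	for _, v := range []string{"v22.9.0", "v23.6.0", "v24.3.0"} {
		ok, err := meetsFloor(v, floor)
		fmt.Println(v, ok, err)
	}
}
```

A hand-rolled parser keeps the sketch dependency-free; it rejects two-component strings such as v22.9 and does not handle pre-release suffixes, so a real check would more likely use a semver library.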
Each PR should:
- Update recipes/registry.yaml if accelerator-specific component pins are needed
- Pass TestOverlayValidationPhaseFloor (deployment + conformance inherited from service-root via PR #1001, "feat(recipe): deliver deployment-phase floor at per-accelerator wildcards")
- Run make bom-docs if any chart pin differs
Out of scope (file separately)
- MIG profile recipes: A100 supports 1g.5gb, 2g.10gb, 3g.20gb, 7g.40gb partitioning; deferred.
- Single-node inference-only overlays for cost-optimized SKUs: possibly worth a g4dn-class follow-up if demand exists.
Related
- recipes/overlays/h100-*.yaml (reference pattern)
- recipes/overlays/gb200-*.yaml (multi-PR build-up reference)