feat(validation): add performance-phase constraints to OKE overlays once OCI testbed lands #1007

@yuanchen8911

Summary

Add performance-phase constraints to the OCI / OKE GB200 overlays once an OCI testbed is available to produce empirically-grounded thresholds. This is the largest testbed-blocked cohort.

Affected overlays

| Overlay | Required performance constraint kind |
| --- | --- |
| gb200-oke-training | NCCL all-reduce-bw (nccl-all-reduce-bw-net + nccl-all-reduce-bw-nvls, mirroring gb200-eks-training) |
| gb200-oke-ubuntu-training | same |
| gb200-oke-ubuntu-training-kubeflow | same |
| gb200-oke-ubuntu-inference-dynamo | inference-perf (throughput + TTFT p99, mirroring h100-eks-ubuntu-inference-dynamo) |

These are all the OKE entries flagged by the strict-mode floor today (AICR_VALIDATION_FLOOR_STRICT=1). The deployment and conformance phases are inherited from oke.yaml via PR #1001; only the performance phase is still missing.

Blocker

No OCI testbed available today for:

  • NCCL bandwidth measurements on GB200 / OKE bare-metal shapes (BM.GPU.B200.8 or equivalent NVL72 IMEX domains)
  • inference-perf runs on the same hardware

The GB200 EKS reference (recipes/overlays/gb200-eks-training.yaml:90-126) splits NCCL into -net (EFA) and -nvls (MNNVL across the NVL72 IMEX domain) channels. OCI's network stack is different, so the EKS thresholds (>= 40 GB/s NET, >= 500 GB/s NVLS) are not portable. OCI needs its own empirically grounded numbers.

Design notes

  • NCCL training overlays should follow the GB200 / EKS multi-channel pattern (one constraint per transport) rather than a single nccl-all-reduce-bw constraint. Each transport then has to exercise its own interconnect, so a silent fallback from NVLS to NET cannot masquerade as a pass.
  • Dynamo inference uses the same inference-perf check as h100-eks-ubuntu-inference-dynamo. Thresholds can start at the H100 placeholder floors (inference-throughput >= 5000, inference-ttft-p99 <= 200), on the assumption that GB200 is at least as fast as H100, and then be tightened once OCI baselines exist.
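
To illustrate why one constraint per transport matters, here is a minimal Python sketch. It is not the project's validator: the `evaluate` helper, the "name OP number" string format, and the aggregate `nccl-all-reduce-bw >= 40` threshold are assumptions made for this example, and the 40 / 500 GB/s values are the EKS numbers reused purely as placeholders, since no OCI baselines exist yet.

```python
import operator
import re

# Comparison operators that appear in the constraint strings quoted in this issue.
_OPS = {">=": operator.ge, "<=": operator.le}
_CONSTRAINT = re.compile(r"^\s*([\w-]+)\s*(>=|<=)\s*([\d.]+)\s*$")


def evaluate(constraints, measured):
    """Return {name: passed} for each 'name OP number' constraint string.

    A metric missing from `measured` counts as a failure, so a transport
    that never ran cannot pass silently. (Hypothetical helper.)
    """
    results = {}
    for text in constraints:
        match = _CONSTRAINT.match(text)
        if match is None:
            raise ValueError(f"unparseable constraint: {text!r}")
        name, op, limit = match.group(1), match.group(2), float(match.group(3))
        value = measured.get(name)
        results[name] = value is not None and _OPS[op](value, limit)
    return results


# Placeholder thresholds only: EKS values, not OCI measurements.
per_transport = ["nccl-all-reduce-bw-net >= 40", "nccl-all-reduce-bw-nvls >= 500"]
single = ["nccl-all-reduce-bw >= 40"]  # hypothetical aggregate constraint

# Made-up run in which NVLS silently fell back to NET-level bandwidth (GB/s).
fallback_run = {
    "nccl-all-reduce-bw": 45.0,
    "nccl-all-reduce-bw-net": 45.0,
    "nccl-all-reduce-bw-nvls": 45.0,
}

print(evaluate(single, fallback_run))
print(evaluate(per_transport, fallback_run))
```

In this made-up run the single aggregate constraint passes, while the per-transport set reports the NVLS constraint as failed, which is the behavior the multi-channel pattern is meant to guarantee.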

Done when

  • OCI GB200 testbed produces baseline NCCL bandwidth numbers (NET + NVLS) and inference-perf numbers (throughput tok/s, TTFT p99 ms).
  • All 4 overlays gain a performance.checks block with the appropriate constraint set.
  • The overlays disappear from the AICR_VALIDATION_FLOOR_STRICT=1 floor test output.
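
The last two criteria can be pictured as a simple floor check. The sketch below is an assumption-laden illustration, not the actual AICR_VALIDATION_FLOOR_STRICT=1 test: the function name `overlays_missing_performance` and the dict layout (which only mirrors the "performance.checks" wording above) are hypothetical, and the sample data is invented.

```python
def overlays_missing_performance(overlays):
    """Return sorted names of overlays without a non-empty performance.checks list."""
    missing = []
    for name, spec in overlays.items():
        # Treat an absent or null `performance` block the same as an empty one.
        checks = (spec.get("performance") or {}).get("checks") or []
        if not checks:
            missing.append(name)
    return sorted(missing)


# Invented sample: only the first overlay has gained its constraints.
overlays = {
    "gb200-oke-training": {
        "performance": {
            "checks": ["nccl-all-reduce-bw-net", "nccl-all-reduce-bw-nvls"]
        }
    },
    "gb200-oke-ubuntu-training": {},
    "gb200-oke-ubuntu-training-kubeflow": {"performance": {"checks": []}},
    "gb200-oke-ubuntu-inference-dynamo": {"performance": None},
}

print(overlays_missing_performance(overlays))
```

Under these assumptions the issue is done when the returned list is empty for all four OKE overlays.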
