feat(validation): add performance-phase constraints to LKE overlays once Linode testbed lands #1008

@yuanchen8911

Summary

Add an NCCL all-reduce-bw performance constraint to the RTX Pro 6000 / LKE training overlays once a Linode LKE testbed is available to produce empirically-grounded thresholds.

Affected overlays

| Overlay | Required performance constraint kind |
|---|---|
| rtx-pro-6000-lke-training | nccl-all-reduce-bw |
| rtx-pro-6000-lke-ubuntu-training | nccl-all-reduce-bw |

These are the only LKE overlays flagged by the strict-mode floor today (AICR_VALIDATION_FLOOR_STRICT=1). The inference counterparts (rtx-pro-6000-lke-inference, rtx-pro-6000-lke-ubuntu-inference) are not gated on performance, because single-node RTX Pro 6000 inference is the primary use case there.

Blocker

No Linode LKE testbed is available today for running NCCL benchmarks on multi-node RTX Pro 6000 instances. The constraint value depends on the interconnect Linode actually exposes for these nodes, which could be plain Ethernet (sub-50 GB/s expected) or a higher-bandwidth fabric, so we shouldn't pick a threshold blind.

Design notes

  • RTX Pro 6000 is a workstation-class card; multi-node NCCL throughput on LKE will likely be Ethernet-bound rather than RDMA-bound. The threshold should reflect realistic Linode networking, which is likely much lower than the 300 GB/s seen with H100 on EKS.
  • If LKE multi-node training turns out to be impractical (no high-bandwidth interconnect), an alternative is to mark these overlays as intent: training but single-node-only and exempt them from the multi-node NCCL floor. File a follow-up if that's the conclusion.
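
One possible way to turn testbed baselines into a threshold, sketched in Python. Nothing here comes from the project: `derive_threshold`, the 10% margin, and the sample values are all hypothetical, and the numbers are placeholders, not measurements from any Linode testbed.

```python
from statistics import median


def derive_threshold(samples_gbps, margin=0.10):
    """Return a bandwidth floor in GB/s: the median baseline minus a safety margin.

    Hypothetical helper. The median damps outlier runs, and the margin
    (an assumed 10% by default) leaves headroom for run-to-run noise.
    """
    if not samples_gbps:
        raise ValueError("need at least one baseline sample")
    if not 0 <= margin < 1:
        raise ValueError("margin must be in [0, 1)")
    return round(median(samples_gbps) * (1 - margin), 2)


# Placeholder values only, NOT real all-reduce measurements.
placeholder_runs = [10.0, 12.0, 11.0]
floor_gbps = derive_threshold(placeholder_runs)
print(f"proposed floor: {floor_gbps} GB/s")
```

Using the median rather than the minimum keeps a single bad run from dragging the floor down; the right margin is a judgment call that should be revisited once real numbers exist.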

Done when

  • Linode LKE testbed produces baseline NCCL all-reduce-bw numbers on multi-node RTX Pro 6000 instances.
  • Both overlays gain a performance.checks: [nccl-all-reduce-bw] block with an empirically-tuned constraint.
  • The overlays disappear from the AICR_VALIDATION_FLOOR_STRICT=1 floor test output.
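
As a rough illustration of the exit criteria above, the Python sketch below mimics what a strict-mode floor check might look for. It is not the project's actual floor test: the dictionary layout (an `intent` key and a `performance.checks` list) is an assumed shape based on the names used in this issue, and `missing_performance_check` is a hypothetical helper.

```python
REQUIRED_KIND = "nccl-all-reduce-bw"


def missing_performance_check(overlays, required=REQUIRED_KIND):
    """Return sorted names of training overlays that lack the required check.

    `overlays` maps overlay name -> spec dict (assumed schema, see lead-in).
    Overlays whose intent is not "training" are skipped.
    """
    flagged = []
    for name, spec in overlays.items():
        if spec.get("intent") != "training":
            continue
        checks = spec.get("performance", {}).get("checks", [])
        if required not in checks:
            flagged.append(name)
    return sorted(flagged)


# Illustrative state: one training overlay updated, one not yet.
overlays = {
    "rtx-pro-6000-lke-training": {"intent": "training"},
    "rtx-pro-6000-lke-ubuntu-training": {
        "intent": "training",
        "performance": {"checks": ["nccl-all-reduce-bw"]},
    },
    "rtx-pro-6000-lke-inference": {"intent": "inference"},
}
print(missing_performance_check(overlays))
```

Under this sketch, the issue is done when the returned list is empty for both training overlays.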
