From 0cce7dbd4b6b1acc927866746d1109dcf3ae69bf Mon Sep 17 00:00:00 2001 From: Jan Stephan Date: Fri, 12 Jun 2026 21:09:07 +0200 Subject: [PATCH 1/5] feature: add MI210 validation guide * Add MI210 validation guide Signed-off-by: Jan Stephan --- docs/gpus/mi210.md | 178 ++++++++++++++++++++++++++++++++++++++++ docs/index.md | 1 + docs/sphinx/_toc.yml.in | 3 +- 3 files changed, 181 insertions(+), 1 deletion(-) create mode 100644 docs/gpus/mi210.md diff --git a/docs/gpus/mi210.md b/docs/gpus/mi210.md new file mode 100644 index 0000000..cdb4dc9 --- /dev/null +++ b/docs/gpus/mi210.md @@ -0,0 +1,178 @@ +--- +myst: + html_meta: + "description": "MI210 GPU system acceptance guide: prerequisites, health checks, system validation, and performance benchmarks for HPC and AI deployments." + "keywords": "AMD Instinct MI210, GPU acceptance testing, ROCm, HPC, AI, PCIe GPU, system validation, health checks, BabelStream, rocBLAS, RCCL, TransferBench, CDNA2" +--- +# AMD Instinct MI210 + +The AMD Instinct™ MI210 GPU is a mainstream HPC and AI PCIe-form-factor accelerator. This document provides MI210-specific prerequisites, health checks, validation steps, and performance acceptance criteria. + +## Overview + +The AMD Instinct MI210 brings second-generation CDNA architecture to a standard full-height, full-length, dual-slot PCIe® add-in card aimed at single-server HPC and AI deployments. Each MI210 provides 104 compute units, 64 GB of HBM2e memory at up to 1.6 TB/s of memory bandwidth, and up to three AMD Infinity Fabric™ links that enable direct GPU-to-GPU connectivity in dual- and quad-GPU hive configurations. The card is passively cooled with a 300 W TDP and supports PCIe® Gen4 host connectivity. + +The MI210 is built on AMD CDNA 2 architecture (gfx90a) in a PCIe add-in-card form factor with 104 compute units and 64 GB of HBM2e memory per accelerator. 
Unlike the OAM-based AMD Instinct MI250 and MI250X, MI210 deployments are PCIe-attached; GPU-to-GPU traffic uses AMD Infinity Fabric™ links when an Infinity Fabric bridge is installed, otherwise it traverses host PCIe. + +- **[MI210 Product Page](https://www.amd.com/en/products/accelerators/instinct/mi200/mi210.html)** +- **[MI200 Series Data Sheet](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instinct-mi200-datasheet.pdf)** +- **[MI200 Series Microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi250.html)** + +## System requirements + +### Operating system support + +For the most up-to-date information on supported operating systems and distributions, see the official ROCm documentation: + +[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions) + +```{note} +[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit. +``` + +For BIOS, NUMA, and OS-level tuning that applies to all AMD Instinct hosts, see [BIOS settings](../common/bios-settings.md) and [OS tuning](../common/os-tuning.md). MI210 systems share the general OS and IOMMU guidance documented for other CDNA 2 platforms but might differ in BIOS power and xGMI topology settings; consult your platform vendor's BIOS guide for MI210-specific values. + +### GPU identification + +All MI210 GPUs (PCI vendor:device `1002:740f`) should appear in `lspci` output: + +```bash +sudo lspci -d 1002:740f +``` + +Expected output example: + +```bash +03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +27:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +43:00.0 Display controller: Advanced Micro Devices, Inc. 
[AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +63:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +83:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +a3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +c3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +e3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran/MI200 [Instinct MI210] (rev 02) +``` + +## Acceptance criteria + +The MI210 system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation → Performance Benchmarks. + +### System acceptance process + +1. **[Prerequisites validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met +2. **[Basic health checks](#basic-health-checks)** - Verify hardware detection and basic system health +3. **[System validation](#system-validation)** - Conduct comprehensive stress testing and qualification +4. **[Performance benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance + +The system is accepted when all criteria below are successfully validated. + +### Prerequisites validation + +Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details. 
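One of the prerequisites in this section is a set of kernel boot parameters (`pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` or `intel_iommu=on`). The following is a minimal sketch of how that check could be scripted; it is not part of ROCm or RVS. `has_param` and `check_cmdline` are hypothetical helper names, and the sketch assumes each parameter appears as its own space-separated token on the kernel command line (a combined, comma-separated `pci=` option would need extra handling).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the exact token $2 appears in the
# space-separated command line string $1.
has_param() {
    local param="$2" tok
    local -a toks
    read -r -a toks <<< "$1"
    for tok in "${toks[@]}"; do
        [ "$tok" = "$param" ] && return 0
    done
    return 1
}

# Hypothetical helper: reports each missing parameter and returns non-zero
# if any required parameter is absent.
check_cmdline() {
    local cmdline="$1" status=0 p
    for p in pci=realloc=off pci=bfsort iommu=pt; do
        has_param "$cmdline" "$p" || { echo "missing: $p"; status=1; }
    done
    # Either vendor-specific IOMMU switch satisfies the last requirement.
    if ! has_param "$cmdline" amd_iommu=on && ! has_param "$cmdline" intel_iommu=on; then
        echo "missing: amd_iommu=on (or intel_iommu=on)"
        status=1
    fi
    return "$status"
}

if check_cmdline "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "kernel boot arguments: OK"
else
    echo "kernel boot arguments: FAIL"
fi
```

Passing the command line in as an argument keeps the check easy to exercise against saved `/proc/cmdline` captures from other nodes.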
+ +- ✅ Supported operating system version installed +- ✅ Compatible ROCm version installed +- ✅ BIOS configured per [BIOS settings](../common/bios-settings.md), with MI210-specific values per platform vendor +- ✅ Required kernel parameters present: `pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` (or `intel_iommu=on` on Intel hosts) — see [Kernel Parameters](../common/kernel-parameters.md) +- ✅ Minimum 512G system memory available +- ✅ Latest applicable firmware applied consistently across nodes +- ✅ ROCm Validation Suite (RVS) installed + +### Basic health checks + +These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see [Health Checks](../common/health-checks.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix
**Fail**: Otherwise | +| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Contains `pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` or `intel_iommu=on`
**Fail**: Otherwise | +| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg -T \| grep amdgpu \| grep -i error` | **Pass**: Null
**Fail**: Errors reported | +| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `lsmem \| grep "Total online memory"` | **Pass**: ≥ 512G
**Fail**: Less than 512G | +| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `sudo lspci -d 1002:740f` | **Pass**: All installed MI210 GPUs found (4 per quad-GPU hive; the example output above lists 8)
**Fail**: Otherwise | +| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:740f -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Speed 16GT/s, width `x16`, no `FatalErr+`
**Fail**: Otherwise | +| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified
**Fail**: Otherwise | +| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'` | **Pass**: Null
**Fail**: Otherwise | + +### System validation + +Comprehensive validation ensures system stability under load. For detailed procedures, see [System Validation](../common/system-validation.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Compute/GPU properties](../common/system-validation.md#gpu-properties) | `rvs -c ${RVS_CONF}/gpup_single.conf` | **Pass**: All GPUs listed with no errors
**Fail**: Missing GPUs or errors | +| [GPU stress test (GST)](../common/system-validation.md#gpu-stress-test) | `rvs -c ${RVS_CONF}/MI210/gst_single.conf` | **Pass**: `met: TRUE` in logs
**Fail**: Target GFLOP/s not met | +| [Input energy delay product (IET)](../common/system-validation.md#input-energy-delay-product) | `rvs -c ${RVS_CONF}/MI210/iet_single.conf` | **Pass**: `met: TRUE` for all actions
**Fail**: Otherwise | +| [Memory test (MEM)](../common/system-validation.md#mem) | `rvs -c ${RVS_CONF}/mem.conf -l mem.txt` | **Pass**: All tests passed; bandwidth approximately 1.1 TB/s per GPU
**Fail**: Any test failed or bandwidth well below 1.1 TB/s per GPU | +| [PCIe bandwidth benchmark (PEBB)](../common/system-validation.md#pcie-bandwidth-benchmark) | `rvs -c ${RVS_CONF}/MI210/pebb_single.conf` | **Pass**: All distances and bandwidths displayed
**Fail**: Missing data | +| [PCIe qualification tool (PEQT)](../common/system-validation.md#pcie-qualification-tool) | `rvs -c ${RVS_CONF}/peqt_single.conf` | **Pass**: All actions true
**Fail**: Otherwise | +| [P2P benchmark and qualification tool (PBQT)](../common/system-validation.md#p2p-benchmark-and-qualification-tool) | `rvs -c ${RVS_CONF}/pbqt_single.conf` | **Pass**: `peers:true` lines and non-zero throughput
**Fail**: Otherwise | + +### Performance benchmarks + +Performance validation ensures the system meets MI210 specifications. For detailed procedures, see [Performance Benchmarking](../common/system-validation.md#performance-benchmarking). + +:::{card} Command: `TransferBench a2a` +[TransferBench all-to-all](../common/system-validation.md#transferbench) +^^^ +**Pass:** ≥ 80 GB/s per GPU aggregate ++++ +**Fail:** otherwise +::: + +:::{card} Command: `TransferBench p2p` +[TransferBench peer-to-peer](../common/system-validation.md#transferbench) +^^^ + +| Test | Pass Criteria | +|------|--------------| +| UniDir | ≥ 35 GB/s per same-socket peer-pair | +| BiDir | ≥ 65 GB/s per same-socket peer-pair (combined) | + ++++ +**Fail:** otherwise +::: + +:::{card} Command: `build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>` +[RCCL Allreduce](../common/system-validation.md#rccl-allreduce) +^^^ + +| Config | Pass Criteria | +|--------|--------------| +| `-g 4` (single-socket quad) | ≥ 30 GB/s avg bus bandwidth | +| `-g 8` (dual-socket, cross-socket ring) | ≥ 8 GB/s avg bus bandwidth | + ++++ +**Fail:** otherwise +::: + +:::{card} Command: `rocblas-bench` (see code block below) +[rocBLAS FP32](../common/system-validation.md#rocblas-gemm-benchmarks) +^^^ + +```bash +rocblas-bench -f gemm \ + -r s -m 4000 -n 4000 -k 4000 \ + --lda 4000 --ldb 4000 --ldc 4000 \ + --transposeA N --transposeB T +``` + +**Pass:** ≥ 28,000 GFLOPS ++++ +**Fail:** otherwise +::: + +:::{card} Command: `mpiexec -n 4 wrapper.sh` +[BabelStream](../common/system-validation.md#babelstream) +^^^ + +| Kernel | Threshold (MB/s) | +|--------|-----------------| +| Copy | ≥ 1,230,000 | +| Mul | ≥ 1,225,000 | +| Add | ≥ 1,115,000 | +| Triad | ≥ 1,115,000 | +| Dot | ≥ 1,170,000 | + ++++ +**Fail:** otherwise +::: diff --git a/docs/index.md b/docs/index.md index 1915708..429db8a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -138,6 +138,7 @@ Start by selecting the page for the specific GPU accelerator you are validating. 
- **[AMD Instinct MI350X](gpus/mi350x.md)** - **[AMD Instinct MI325X](gpus/mi325x.md)** - **[AMD Instinct MI300X](gpus/mi300x.md)** +- **[AMD Instinct MI210](gpus/mi210.md)** Follow the GPU page end‑to‑end; it will walk you through verifying system prerequisites, running health checks, executing validation suites and microbenchmarks, and applying acceptance criteria thresholds. diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 3eb17cd..3f9dc0a 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -7,7 +7,8 @@ subtrees: - file: gpus/mi350x.md - file: gpus/mi325x.md - file: gpus/mi300x.md - + - file: gpus/mi210.md + - caption: System Configuration entries: - file: common/prerequisites.md From 3673b63064f8b42ccf910ad2afd663c0fd8e0d3f Mon Sep 17 00:00:00 2001 From: Jan Stephan Date: Fri, 12 Jun 2026 21:24:22 +0200 Subject: [PATCH 2/5] feature: add MI300A validation guide * Add MI300A validation guide Signed-off-by: Jan Stephan --- .wordlist.txt | 4 + docs/gpus/mi300a.md | 172 ++++++++++++++++++++++++++++++++++++++++ docs/index.md | 1 + docs/sphinx/_toc.yml.in | 1 + 4 files changed, 178 insertions(+) create mode 100644 docs/gpus/mi300a.md diff --git a/.wordlist.txt b/.wordlist.txt index 0356d26..9127471 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -17,6 +17,8 @@ ARI APBDIS api APIC +APU +APUs args arp aspm @@ -137,6 +139,7 @@ GFLOPs gfortran gfx GFX +GiB GMI GPUP GPUs @@ -160,6 +163,7 @@ HPE HPL HSA hugepage +hugepages ib ibdiagnet ibstat diff --git a/docs/gpus/mi300a.md b/docs/gpus/mi300a.md new file mode 100644 index 0000000..bf24b10 --- /dev/null +++ b/docs/gpus/mi300a.md @@ -0,0 +1,172 @@ +--- +myst: + html_meta: + "description": "AMD Instinct MI300A acceptance criteria — prerequisites, health checks, system validation, and performance benchmarks for CDNA 3 APU platforms." 
+ "keywords": "MI300A, AMD Instinct, APU, CDNA 3, ROCm, system acceptance, validation, benchmarks, HBM3" +--- + +# AMD Instinct MI300A + +The AMD Instinct™ MI300A is a data-center Accelerated Processing Unit (APU) that integrates AMD "Zen 4" CPU cores and CDNA 3 GPU compute dies on a single package with unified HBM3 memory. This document provides MI300A-specific prerequisites, health checks, validation steps, and performance acceptance criteria. + +## Overview + +The MI300A is built on the CDNA 3 architecture (gfx942) and combines CPU and GPU compute dies sharing a single coherent pool of 128 GB HBM3 per APU. Unlike discrete OAM accelerators, MI300A platforms are vendor-defined; a typical qualified configuration hosts 4 MI300A APUs per node. + +- **[MI300A Product Page](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html)** +- **[MI300A Data Sheet](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300a-data-sheet.pdf)** +- **[AMD Instinct MI300 Series microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi300.html)** + +## System requirements + +### Operating system support + +For the most up-to-date information on supported operating systems and distributions, refer to the official ROCm documentation: + +[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions) + +```{note} +[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit. +``` + +For BIOS, IOMMU, transparent hugepages, NUMA, and OS-level tuning that applies to all AMD Instinct hosts, see [BIOS settings](../common/bios-settings.md), [OS tuning](../common/os-tuning.md), and [Kernel parameters](../common/kernel-parameters.md). MI300A requires a Linux kernel that supports "Zen 4" (≥ 5.18 recommended). 
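The kernel recommendation above (≥ 5.18) can be checked by comparing the running kernel release against a minimum version. The following is a minimal sketch rather than an official ROCm check: `kernel_at_least` is a hypothetical helper, and it relies on GNU `sort -V` for version ordering. Distribution kernels sometimes backport hardware support to older version numbers, so a failed comparison is a prompt to consult the ROCm compatibility matrix, not a definitive rejection.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if kernel release $2 is at least version $1.
kernel_at_least() {
    local min="$1" rel="$2"
    # Keep only the part before the first "-", for example
    # "6.8.0-45-generic" becomes "6.8.0".
    local ver="${rel%%-*}"
    # sort -V orders version strings; the check passes when the minimum
    # sorts first (or is equal to the running version).
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]
}

if kernel_at_least 5.18 "$(uname -r)"; then
    echo "kernel $(uname -r): meets the 5.18 recommendation"
else
    echo "kernel $(uname -r): older than 5.18, check distribution support"
fi
```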
+ +### GPU identification + +All MI300A APUs (PCI vendor:device `1002:74a0`) should appear in `lspci` output: + +```bash +sudo lspci -d 1002:74a0 +``` + +Expected output example: + +```bash +0000:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A] +0001:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A] +0002:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A] +0003:01:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300A] +``` + +## Acceptance criteria + +The MI300A system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation → Performance Benchmarks. + +### System acceptance process + +1. **[Prerequisites validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met +2. **[Basic health checks](#basic-health-checks)** - Verify hardware detection and basic system health +3. **[System validation](#system-validation)** - Conduct comprehensive stress testing and qualification +4. **[Performance benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance + +The system is accepted when all criteria below are successfully validated. + +### Prerequisites validation + +Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details. 
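Two of the kernel settings this guide requires for MI300A, `transparent_hugepage=always` and `numa_balancing=disable`, can also be inspected at runtime through standard Linux sysfs and procfs files. The sketch below is an illustration under that assumption, not a procedure taken from the ROCm documentation; `active_thp_mode` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the bracketed (active) mode from the contents
# of /sys/kernel/mm/transparent_hugepage/enabled,
# for example "[always] madvise never" yields "always".
active_thp_mode() {
    local contents="$1"
    local rest="${contents#*\[}"   # drop everything up to the first "["
    printf '%s\n' "${rest%%\]*}"   # keep everything before the next "]"
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
numa_file=/proc/sys/kernel/numa_balancing

if [ -r "$thp_file" ]; then
    mode="$(active_thp_mode "$(cat "$thp_file")")"
    if [ "$mode" = "always" ]; then
        echo "transparent hugepages: always (OK)"
    else
        echo "transparent hugepages: $mode (expected always)"
    fi
fi

if [ -r "$numa_file" ]; then
    # A value of 0 means automatic NUMA balancing is disabled.
    if [ "$(cat "$numa_file")" = "0" ]; then
        echo "NUMA balancing: disabled (OK)"
    else
        echo "NUMA balancing: enabled (expected disabled)"
    fi
fi
```

Checking the runtime state complements the `/proc/cmdline` check, since a setting can be changed after boot.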
+ +- ✅ Supported operating system version installed with kernel ≥ 5.18 (Zen 4 support) +- ✅ Compatible ROCm version installed (verify: `cat /opt/rocm/.info/version`); see the [ROCm System Requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) for the current supported version matrix +- ✅ BIOS configured per [BIOS settings](../common/bios-settings.md), with MI300A-specific values per platform vendor (IOMMU off, memory interleaving, NPS) +- ✅ Required kernel parameters present: `pci=realloc=off transparent_hugepage=always numa_balancing=disable` +- ✅ Sysctl tunings applied: `vm.compaction_proactiveness=20`, `vm.max_map_count` increased per ROCm guide +- ✅ Environment variables (where applicable): + - `HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0` + - `GPU_MAX_ALLOC_PERCENT` and `GPU_SINGLE_ALLOC_PERCENT` tuned per workload +- ✅ Minimum 4 × 128 GB = 512 GB of unified HBM3 visible to the OS as host memory (note: MI300A's HBM is shared between the CPU and GPU) +- ✅ Latest applicable firmware applied consistently across nodes +- ✅ ROCm Validation Suite (RVS) installed + +### Basic health checks + +These checks ensure fundamental system health and proper APU detection. For detailed procedures, see [Health Checks](../common/health-checks.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix
**Fail**: Otherwise | +| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Contains `pci=realloc=off transparent_hugepage=always numa_balancing=disable`
**Fail**: Any required parameter missing | +| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg -T \| grep amdgpu \| grep -i error` | **Pass**: Null
**Fail**: Errors reported | +| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `lsmem \| grep "Total online memory"` | **Pass**: ≥ 4 × 128 GB = 512 GB unified HBM3 visible to the OS
**Fail**: Less than 512 GB of unified HBM3 visible to the OS | +| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `sudo lspci -d 1002:74a0` | **Pass**: 4 MI300A APUs found
**Fail**: Otherwise | +| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:74a0 -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Speed PCIe Gen 5 (32 GT/s), width `x16`, no `FatalErr+`
**Fail**: Otherwise | +| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified
**Fail**: Otherwise | +| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'` | **Pass**: Null
**Fail**: Otherwise | + +### System validation + +Comprehensive validation ensures system stability under load. For detailed procedures, see [System Validation](../common/system-validation.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Compute/GPU properties](../common/system-validation.md#gpu-properties) | `rvs -c ${RVS_CONF}/gpup_single.conf` | **Pass**: All APUs listed with no errors
**Fail**: Missing APUs or errors | +| [GPU stress test (GST)](../common/system-validation.md#gpu-stress-test) | `rvs -c ${RVS_CONF}/MI300A/gst_single.conf` | **Pass**: `met: TRUE` in logs
**Fail**: Target GFLOP/s not met | +| [Input energy delay product (IET)](../common/system-validation.md#input-energy-delay-product) | `rvs -c ${RVS_CONF}/MI300A/iet_single.conf` | **Pass**: `met: TRUE` for all actions
**Fail**: Otherwise | +| [Memory test (MEM)](../common/system-validation.md#mem) | `rvs -c ${RVS_CONF}/mem.conf -l mem.txt` | **Pass**: All tests passed; bandwidth ≥ 2.0 TB/s per APU
**Fail**: Any test failed or bandwidth below 2.0 TB/s per APU | +| [PCIe bandwidth benchmark (PEBB)](../common/system-validation.md#pcie-bandwidth-benchmark) | `rvs -c ${RVS_CONF}/MI300A/pebb_single.conf` | **Pass**: All distances and bandwidths displayed
**Fail**: Missing data | +| [PCIe qualification tool (PEQT)](../common/system-validation.md#pcie-qualification-tool) | `rvs -c ${RVS_CONF}/peqt_single.conf` | **Pass**: All actions true
**Fail**: Otherwise | +| [P2P benchmark and qualification tool (PBQT)](../common/system-validation.md#p2p-benchmark-and-qualification-tool) | `rvs -c ${RVS_CONF}/pbqt_single.conf` | **Pass**: `peers:true` lines and non-zero throughput across all xGMI peers
**Fail**: Otherwise | + +### Performance benchmarks + +Performance validation ensures the system meets MI300A specifications. For detailed procedures, see [Performance Benchmarking](../common/system-validation.md#performance-benchmarking). + +:::{card} Command: `TransferBench a2a` +[TransferBench all-to-all](../common/system-validation.md#transferbench) +^^^ +**Pass:** ≥ 700 GB/s aggregate ++++ +**Fail:** otherwise +::: + +:::{card} Command: `TransferBench p2p` +[TransferBench peer-to-peer](../common/system-validation.md#transferbench) +^^^ + +| Test | Pass criteria | +|------|--------------| +| UniDir | ≥ 80 GB/s | +| BiDir | ≥ 155 GB/s | + ++++ +**Fail:** otherwise +::: + +:::{card} Command: `build/all_reduce_perf -b 8 -e 8G -f 2 -g 4` +[RCCL Allreduce](../common/system-validation.md#rccl-allreduce) +^^^ +**Pass:** ≥ 230 GB/s busbw (peak, at 8 GiB message size) ++++ +**Fail:** otherwise +::: + +:::{card} Command: `rocblas-bench` (see code block below) +[rocBLAS FP32](../common/system-validation.md#rocblas-gemm-benchmarks) +^^^ + +```bash +rocblas-bench -f gemm \ + -r s -m 4000 -n 4000 -k 4000 \ + --lda 4000 --ldb 4000 --ldc 4000 \ + --transposeA N --transposeB T +``` + +**Pass:** ≥ 60 TFLOPS per APU ++++ +**Fail:** otherwise +::: + +:::{card} Command: `mpiexec -n 4 wrapper.sh` +[BabelStream](../common/system-validation.md#babelstream) +^^^ + +| Kernel | Threshold (MB/s) | +|--------|-----------------| +| Copy | ≥ 2,900,000 | +| Mul | ≥ 3,000,000 | +| Add | ≥ 3,250,000 | +| Triad | ≥ 3,250,000 | +| Dot | ≥ 2,200,000 | + ++++ +**Fail:** otherwise +::: diff --git a/docs/index.md b/docs/index.md index 429db8a..6fa685f 100644 --- a/docs/index.md +++ b/docs/index.md @@ -138,6 +138,7 @@ Start by selecting the page for the specific GPU accelerator you are validating. 
- **[AMD Instinct MI350X](gpus/mi350x.md)** - **[AMD Instinct MI325X](gpus/mi325x.md)** - **[AMD Instinct MI300X](gpus/mi300x.md)** +- **[AMD Instinct MI300A](gpus/mi300a.md)** - **[AMD Instinct MI210](gpus/mi210.md)** Follow the GPU page end‑to‑end; it will walk you through verifying system prerequisites, running health checks, executing validation suites and microbenchmarks, and applying acceptance criteria thresholds. diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 3f9dc0a..e52c489 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -7,6 +7,7 @@ subtrees: - file: gpus/mi350x.md - file: gpus/mi325x.md - file: gpus/mi300x.md + - file: gpus/mi300a.md - file: gpus/mi210.md - caption: System Configuration From 51668ce3904abe95ddb91a3b3a14b52257be2bd6 Mon Sep 17 00:00:00 2001 From: Jan Stephan Date: Fri, 12 Jun 2026 21:29:01 +0200 Subject: [PATCH 3/5] feature: add MI100 validation guide * Add MI100 validation guide Signed-off-by: Jan Stephan Co-authored-by: Michael Benavidez --- docs/gpus/mi100.md | 174 ++++++++++++++++++++++++++++++++++++++++ docs/index.md | 1 + docs/sphinx/_toc.yml.in | 4 +- 3 files changed, 178 insertions(+), 1 deletion(-) create mode 100644 docs/gpus/mi100.md diff --git a/docs/gpus/mi100.md b/docs/gpus/mi100.md new file mode 100644 index 0000000..d6b64ee --- /dev/null +++ b/docs/gpus/mi100.md @@ -0,0 +1,174 @@ +--- +myst: + html_meta: + "description": "AMD Instinct MI100 acceptance criteria — prerequisites, health checks, system validation, and performance benchmarks for CDNA PCIe GPU platforms." + "keywords": "MI100, AMD Instinct, CDNA, ROCm, PCIe, Infinity Fabric, system acceptance, validation, benchmarks, HBM2" +--- + +# AMD Instinct MI100 + +The AMD Instinct™ MI100 is a data-center compute PCIe-form-factor GPU. This document provides MI100-specific prerequisites, health checks, validation steps, and performance acceptance criteria. 
+ +## Overview + +The AMD Instinct MI100 introduces the first-generation CDNA architecture in a standard full-height, full-length, dual-slot PCIe® add-in card aimed at HPC and accelerated computing workloads. Each MI100 provides 120 compute units with Matrix Core technology, 32 GB of HBM2 memory at up to 1.2 TB/s, and AMD Infinity Fabric™ link support for direct GPU-to-GPU connectivity in 2- and 4-GPU hive configurations. The card is passively cooled with a 300 W TDP and supports PCIe® Gen4 host connectivity. + +The MI100 is built on the CDNA architecture (gfx908) with 120 compute units and 32 GB of HBM2 memory per GPU. The MI100 Infinity Fabric™ topology tops out at 4 GPUs per hive, so the validation reference configuration for this document is a single 4-GPU MI100 hive with Infinity Fabric™ bridges providing direct GPU-to-GPU connectivity across all peers. Larger deployments (for example, dual-socket servers with two 4-GPU hives for 8 MI100s total) are common; in those systems, cross-hive traffic traverses the host PCIe fabric and the per-hive criteria below apply to each hive independently. + +- **[MI100 Product Page](https://www.amd.com/en/products/accelerators/instinct/mi100.html)** +- **[MI100 Product Brief](https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/product-briefs/instinct-mi100-brochure.pdf)** +- **[MI100 Microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi100.html)** + +## System requirements + +### Operating system support + +For the most up-to-date information on supported operating systems and distributions, refer to the official ROCm documentation: + +[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions) + +```{note} +[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit. 
+``` + +For BIOS, NUMA, and OS-level tuning that applies to all AMD Instinct hosts, see [BIOS settings](../common/bios-settings.md) and [OS tuning](../common/os-tuning.md). + +### GPU identification + +All MI100 GPUs (PCI vendor:device `1002:738c`) should appear in `lspci` output: + +```bash +sudo lspci -d 1002:738c +``` + +Expected output example (4-GPU MI100 hive): + +```bash +1d:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01) +20:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01) +23:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01) +26:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01) +``` + +## Acceptance criteria + +The MI100 system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation → Performance Benchmarks. + +### System acceptance process + +1. **[Prerequisites validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met +2. **[Basic health checks](#basic-health-checks)** - Verify hardware detection and basic system health +3. **[System validation](#system-validation)** - Conduct comprehensive stress testing and qualification +4. **[Performance benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance + +The system is accepted when all criteria below are successfully validated. + +### Prerequisites validation + +Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details. 
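The system memory prerequisite for MI100 hosts (a minimum of 256G) can be checked numerically rather than by reading `lsmem` output by eye. The following is a minimal sketch under two assumptions: that the "G" in the criteria means GiB, as in the human-readable `lsmem` output, and that `lsmem -b` reports the "Total online memory" summary in bytes. `meets_min_gib` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when byte count $1 is at least $2 GiB.
meets_min_gib() {
    local bytes="$1" min_gib="$2"
    [ "$bytes" -ge $(( min_gib * 1024 * 1024 * 1024 )) ]
}

if command -v lsmem >/dev/null 2>&1; then
    # Strip everything except digits from the value after the colon.
    online_bytes="$(lsmem -b 2>/dev/null | awk -F: '/Total online memory/ { gsub(/[^0-9]/, "", $2); print $2 }')"
    if [ -n "$online_bytes" ] && meets_min_gib "$online_bytes" 256; then
        echo "online memory: OK (>= 256G)"
    else
        echo "online memory: below 256G or could not be read"
    fi
fi
```

The threshold is passed as an argument so the same helper can be reused for guides that specify a different minimum.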
+ +- ✅ Supported operating system version installed +- ✅ Compatible ROCm version installed +- ✅ BIOS configured per [BIOS settings](../common/bios-settings.md), with MI100-specific values per platform vendor +- ✅ Required kernel parameters present: `pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` (or `intel_iommu=on` on Intel hosts) — see [Kernel Parameters](../common/kernel-parameters.md) +- ✅ Minimum 256G system memory available +- ✅ Latest applicable firmware applied consistently across nodes +- ✅ ROCm Validation Suite (RVS) installed + +### Basic health checks + +These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see [Health Checks](../common/health-checks.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix
**Fail**: Otherwise | +| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Contains `pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` or `intel_iommu=on`
**Fail**: Otherwise | +| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg -T \| grep amdgpu \| grep -i error` | **Pass**: Null
**Fail**: Errors reported | +| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `lsmem \| grep "Total online memory"` | **Pass**: ≥ 256G
**Fail**: Less than 256G | +| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `sudo lspci -d 1002:738c` | **Pass**: 4 MI100 GPUs found (per hive)
**Fail**: Otherwise | +| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:738c -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Speed 16GT/s, width `x16`, no `FatalErr+`
**Fail**: Otherwise | +| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified
**Fail**: Otherwise | +| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'` | **Pass**: Null
**Fail**: Otherwise | + +### System validation + +Comprehensive validation ensures system stability under load. For detailed procedures, see [System Validation](../common/system-validation.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Compute/GPU properties](../common/system-validation.md#gpu-properties) | `rvs -c ${RVS_CONF}/gpup_single.conf` | **Pass**: All GPUs listed with no errors
**Fail**: Missing GPUs or errors | +| [GPU stress test (GST)](../common/system-validation.md#gpu-stress-test) | `rvs -c ${RVS_CONF}/MI100/gst_single.conf` | **Pass**: `met: TRUE` in logs
**Fail**: Target GFLOP/s not met | +| [Input energy delay product (IET)](../common/system-validation.md#input-energy-delay-product) | `rvs -c ${RVS_CONF}/MI100/iet_single.conf` | **Pass**: `met: TRUE` for all actions
**Fail**: Otherwise | +| [Memory test (MEM)](../common/system-validation.md#mem) | `rvs -c ${RVS_CONF}/mem.conf -l mem.txt` | **Pass**: All tests passed; bandwidth ≥ 800 GB/s per GPU
**Fail**: Any test failed or low bandwidth | +| [PCIe bandwidth benchmark (PEBB)](../common/system-validation.md#pcie-bandwidth-benchmark) | `rvs -c ${RVS_CONF}/MI100/pebb_single.conf` | **Pass**: All distances and bandwidths displayed
**Fail**: Missing data | +| [PCIe qualification tool (PEQT)](../common/system-validation.md#pcie-qualification-tool) | `rvs -c ${RVS_CONF}/peqt_single.conf` | **Pass**: All actions true
**Fail**: Otherwise | +| [P2P benchmark and qualification tool (PBQT)](../common/system-validation.md#p2p-benchmark-and-qualification-tool) | `rvs -c ${RVS_CONF}/pbqt_single.conf` | **Pass**: `peers:true` lines and non-zero throughput
**Fail**: Otherwise | + +```{note} +The reference configuration for this document is a single 4-GPU MI100 hive with AMD Infinity Fabric™ bridges installed, so intra-hive PBQT and TransferBench numbers reflect xGMI throughput. On systems without bridges, P2P traffic traverses the host PCIe fabric and these thresholds are not expected to be met. +``` + +### Performance benchmarks + +Performance validation ensures the system meets MI100 specifications. For detailed procedures, see [Performance Benchmarking](../common/system-validation.md#performance-benchmarking). + +:::{card} Command: `TransferBench a2a` +[TransferBench all-to-all](../common/system-validation.md#transferbench) +^^^ +**Pass:** ≥ 270 GB/s aggregate ++++ +**Fail:** otherwise +::: + +:::{card} Command: `TransferBench p2p` +[TransferBench peer-to-peer](../common/system-validation.md#transferbench) +^^^ + +| Test | Pass criteria | +|------|--------------| +| UniDir | ≥ 30 GB/s | +| BiDir | ≥ 57 GB/s | + ++++ +**Fail:** otherwise +::: + +:::{card} Command: `build/all_reduce_perf -b 8 -e 8G -f 2 -g 4` +[RCCL Allreduce](../common/system-validation.md#rccl-allreduce) +^^^ +**Pass:** ≥ 72 GB/s busbw (peak, at 8 GiB message size) ++++ +**Fail:** otherwise +::: + +:::{card} Command: `rocblas-bench` (see code block below) +[rocBLAS FP32](../common/system-validation.md#rocblas-gemm-benchmarks) +^^^ + +```bash +rocblas-bench -f gemm \ + -r s -m 4000 -n 4000 -k 4000 \ + --lda 4000 --ldb 4000 --ldc 4000 \ + --transposeA N --transposeB T +``` + +**Pass:** ≥ 28 TFLOPS per GPU ++++ +**Fail:** otherwise +::: + +:::{card} Command: `mpiexec -n 4 wrapper.sh` +[BabelStream](../common/system-validation.md#babelstream) +^^^ + +| Kernel | Threshold (MB/s) | +|--------|-----------------| +| Copy | ≥ 940,000 | +| Mul | ≥ 940,000 | +| Add | ≥ 910,000 | +| Triad | ≥ 910,000 | +| Dot | ≥ 950,000 | + ++++ +**Fail:** otherwise +::: diff --git a/docs/index.md b/docs/index.md index 6fa685f..15942c3 100644 --- a/docs/index.md +++ b/docs/index.md @@ 
-140,6 +140,7 @@ Start by selecting the page for the specific GPU accelerator you are validating. - **[AMD Instinct MI300X](gpus/mi300x.md)** - **[AMD Instinct MI300A](gpus/mi300a.md)** - **[AMD Instinct MI210](gpus/mi210.md)** +- **[AMD Instinct MI100](gpus/mi100.md)** Follow the GPU page end‑to‑end; it will walk you through verifying system prerequisites, running health checks, executing validation suites and microbenchmarks, and applying acceptance criteria thresholds. diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index e52c489..78e915d 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -6,9 +6,11 @@ subtrees: - file: gpus/mi355x.md - file: gpus/mi350x.md - file: gpus/mi325x.md - - file: gpus/mi300x.md + - file: gpus/mi300x.md - file: gpus/mi300a.md - file: gpus/mi210.md + - file: gpus/mi100.md + - caption: System Configuration entries: From a0b1a80cce0c2032b38fe257109ee023d2fc74c6 Mon Sep 17 00:00:00 2001 From: Michael Benavidez Date: Fri, 12 Jun 2026 15:07:49 -0500 Subject: [PATCH 4/5] feature: add MI250 validation guide * Add MI250 validation guide Signed-off-by: Jan Stephan Co-authored-by: Michael Benavidez --- .wordlist.txt | 1 + docs/gpus/mi250.md | 179 ++++++++++++++++++++++++++++++++++++++++ docs/index.md | 1 + docs/sphinx/_toc.yml.in | 4 +- 4 files changed, 183 insertions(+), 2 deletions(-) create mode 100644 docs/gpus/mi250.md diff --git a/.wordlist.txt b/.wordlist.txt index 9127471..3587820 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -213,6 +213,7 @@ lsmem lsof lspci LTS +LUMI lvl makefile maxBytes diff --git a/docs/gpus/mi250.md b/docs/gpus/mi250.md new file mode 100644 index 0000000..a1bce0d --- /dev/null +++ b/docs/gpus/mi250.md @@ -0,0 +1,179 @@ +--- +myst: + html_meta: + "description": "AMD Instinct MI250 acceptance criteria — prerequisites, health checks, system validation, and performance benchmarks for CDNA 2 OAM GPU platforms." 
+ "keywords": "MI250, AMD Instinct, CDNA 2, OAM, ROCm, xGMI, Infinity Fabric, system acceptance, validation, benchmarks, HBM2e" +--- + +# AMD Instinct MI250 / MI250X + +The AMD Instinct™ MI250 is a data-center OAM-form-factor GPU. This document provides MI250-specific prerequisites, health checks, validation steps, and performance acceptance criteria. It also applies to the AMD Instinct™ MI250X, which shares the same CDNA 2 (gfx90a) OAM platform and acceptance criteria; MI250X-specific differences are noted inline. + +## Overview + +The AMD Instinct MI250 brings the second-generation CDNA architecture to an OCP Accelerator Module (OAM) form factor purpose-built for HPC and large-scale AI training. Each MI250 packages two Graphics Compute Dies (GCDs) under a single OAM, each GCD presenting 104 CUs (110 on MI250X) with Matrix Core technology and 64 GB of HBM2e memory at up to 1.6 TB/s, for a combined 128 GB and 3.2 TB/s per OAM. The two GCDs on an OAM are linked by a high-bandwidth on-package AMD Infinity Fabric™ interconnect, and each OAM exposes additional xGMI ports for direct GPU-to-GPU connectivity across a 4-OAM all-to-all mesh. A typical qualified configuration hosts 4 MI250 OAMs (8 GCDs total) per node. + +ROCm tools enumerate each GCD as an independent gfx90a GPU, so a 4-OAM node reports 8 GPUs with 64 GB of HBM2e each. GPUs are connected to each other and to the host CPUs through AMD Infinity Fabric™ (xGMI). + +The MI250X is the higher-performance variant of the same CDNA 2 (gfx90a) OAM platform and is validated using the criteria in this document. It powers exascale and pre-exascale supercomputers such as Frontier and LUMI.
If an MI250X node uses a different topology, for example 8 OAMs (16 GCDs), scale the per-node GCD counts in the commands below accordingly (for example, `-g 16` for RCCL and `mpiexec -n 16` for BabelStream on an 8-OAM node). MI250X also shares the MI250 PCI vendor:device ID (`1002:740c`). + +- **[MI250 Product Page](https://www.amd.com/en/products/accelerators/instinct/mi200/mi250.html)** +- **[MI250X Product Page](https://www.amd.com/en/products/accelerators/instinct/mi200/mi250x.html)** +- **[MI200 Series Microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi250.html)** +- **[MI200 Series Data Sheet](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instinct-mi200-datasheet.pdf)** + +## System requirements + +### Operating system support + +For the most up-to-date information on supported operating systems and distributions, see the official ROCm documentation: + +[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions) + +```{note} +[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit. +``` + +For BIOS, NUMA, and OS-level tuning that applies to all AMD Instinct hosts, see [BIOS settings](../common/bios-settings.md) and [OS tuning](../common/os-tuning.md). + +### GPU identification + +All MI250 GCDs (PCI vendor:device `1002:740c`) should appear in `lspci` output. On a fully populated 4-OAM MI250 platform you should see 8 GCD entries (2 per OAM): + +```bash +sudo lspci -d 1002:740c +``` + +Expected output example: + +```bash +0000:11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +0000:14:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +0000:32:00.0 Display controller: Advanced Micro Devices, Inc.
[AMD/ATI] Aldebaran (rev 01) +0000:35:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +0000:8e:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +0000:93:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +0000:ae:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +0000:b3:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Aldebaran (rev 01) +``` + +The 8 GCDs are paired by OAM (e.g. `11`+`14` are the two GCDs on one OAM, `32`+`35` on the next, and so on). Same-OAM GCD pairs are connected by a high-bandwidth on-package link, while cross-OAM connectivity uses external xGMI ports in a 4-OAM all-to-all mesh. + +## Acceptance criteria + +The MI250 system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation → Performance Benchmarks. + +### System acceptance process + +1. **[Prerequisites validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met +2. **[Basic health checks](#basic-health-checks)** - Verify hardware detection and basic system health +3. **[System validation](#system-validation)** - Conduct comprehensive stress testing and qualification +4. **[Performance benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance + +The system is accepted when all criteria below are successfully validated. + +### Prerequisites validation + +Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details. 
+ +- ✅ Supported operating system version installed +- ✅ Compatible ROCm version installed (verify: `cat /opt/rocm/.info/version`); see the [ROCm System Requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) for the current supported version matrix +- ✅ BIOS configured per [BIOS settings](../common/bios-settings.md), with MI250-specific values per platform vendor +- ✅ Required kernel parameters present: `pci=realloc=off iommu=pt` +- ✅ Minimum 1T system memory available +- ✅ Latest applicable firmware applied consistently across nodes +- ✅ ROCm Validation Suite (RVS) installed + +### Basic health checks + +These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see [Health Checks](../common/health-checks.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix
**Fail**: Otherwise | +| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Contains `pci=realloc=off iommu=pt`
**Fail**: Otherwise | +| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg -T \| grep amdgpu \| grep -i error` | **Pass**: Null
**Fail**: Errors reported | +| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `lsmem \| grep "Total online memory"` | **Pass**: ≥ 1T
**Fail**: Less than 1T | +| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `sudo lspci -d 1002:740c` | **Pass**: 8 MI250 GCDs found
**Fail**: Otherwise | +| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:740c -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Speed PCIe Gen 4 (16 GT/s), width `x16`, no `FatalErr+`
**Fail**: Otherwise | +| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified
**Fail**: Otherwise | +| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -i 'error\|warn\|fail\|exception'` | **Pass**: Null
**Fail**: Otherwise | + +### System validation + +Comprehensive validation ensures system stability under load. For detailed procedures, see [System Validation](../common/system-validation.md). + +| Test | Command | Pass/Fail criteria | +|------|---------|-------------------| +| [Compute/GPU properties](../common/system-validation.md#gpu-properties) | `rvs -c ${RVS_CONF}/gpup_single.conf` | **Pass**: All GCDs listed with no errors
**Fail**: Missing GCDs or errors | +| [GPU stress test (GST)](../common/system-validation.md#gpu-stress-test) | `rvs -c ${RVS_CONF}/MI250/gst_single.conf` | **Pass**: `met: TRUE` in logs
**Fail**: Target GFLOP/s not met | +| [Input energy delay product (IET)](../common/system-validation.md#input-energy-delay-product) | `rvs -c ${RVS_CONF}/MI250/iet_single.conf` | **Pass**: `met: TRUE` for all actions
**Fail**: Otherwise | +| [Memory test (MEM)](../common/system-validation.md#mem) | `rvs -c ${RVS_CONF}/mem.conf -l mem.txt` | **Pass**: All tests passed; bandwidth ≥ 1050 GB/s per GCD
**Fail**: Any test failed or low bandwidth | +| [PCIe bandwidth benchmark (PEBB)](../common/system-validation.md#pcie-bandwidth-benchmark) | `rvs -c ${RVS_CONF}/MI250/pebb_single.conf` | **Pass**: All distances and bandwidths displayed
**Fail**: Missing data | +| [PCIe qualification tool (PEQT)](../common/system-validation.md#pcie-qualification-tool) | `rvs -c ${RVS_CONF}/peqt_single.conf` | **Pass**: All actions true
**Fail**: Otherwise | +| [P2P benchmark and qualification tool (PBQT)](../common/system-validation.md#p2p-benchmark-and-qualification-tool) | `rvs -c ${RVS_CONF}/pbqt_single.conf` | **Pass**: `peers:true` lines and non-zero throughput across all xGMI peers
**Fail**: Otherwise | + +### Performance benchmarks + +Performance validation ensures the system meets MI250 specifications. For detailed procedures, see [Performance Benchmarking](../common/system-validation.md#performance-benchmarking). + +:::{card} Command: `TransferBench a2a` +[TransferBench all-to-all](../common/system-validation.md#transferbench) +^^^ +**Pass:** ≥ 800 GB/s aggregate ++++ +**Fail:** otherwise +::: + +:::{card} Command: `TransferBench p2p` +[TransferBench peer-to-peer](../common/system-validation.md#transferbench) +^^^ + +| Test | Pass criteria | +|------|--------------| +| UniDir | ≥ 30 GB/s | +| BiDir | ≥ 55 GB/s | + ++++ +**Fail:** otherwise +::: + +:::{card} Command: `build/all_reduce_perf -b 8 -e 8G -f 2 -g 8` +[RCCL Allreduce](../common/system-validation.md#rccl-allreduce) +^^^ +**Pass:** ≥ 125 GB/s busbw (peak, at 8 GiB message size) ++++ +**Fail:** otherwise +::: + +:::{card} Command: `rocblas-bench` (see code block below) +[rocBLAS FP32](../common/system-validation.md#rocblas-gemm-benchmarks) +^^^ + +```bash +rocblas-bench -f gemm \ + -r s -m 4000 -n 4000 -k 4000 \ + --lda 4000 --ldb 4000 --ldc 4000 \ + --transposeA N --transposeB T +``` + +**Pass:** ≥ 30 TFLOPS per GCD ++++ +**Fail:** otherwise +::: + +:::{card} Command: `mpiexec -n 8 wrapper.sh` +[BabelStream](../common/system-validation.md#babelstream) +^^^ + +| Kernel | Threshold (MB/s) | +|--------|-----------------| +| Copy | ≥ 1,200,000 | +| Mul | ≥ 1,200,000 | +| Add | ≥ 1,100,000 | +| Triad | ≥ 1,100,000 | +| Dot | ≥ 1,200,000 | + ++++ +**Fail:** otherwise +::: diff --git a/docs/index.md b/docs/index.md index 15942c3..4ef6467 100644 --- a/docs/index.md +++ b/docs/index.md @@ -139,6 +139,7 @@ Start by selecting the page for the specific GPU accelerator you are validating. 
- **[AMD Instinct MI325X](gpus/mi325x.md)** - **[AMD Instinct MI300X](gpus/mi300x.md)** - **[AMD Instinct MI300A](gpus/mi300a.md)** +- **[AMD Instinct MI250 / MI250X](gpus/mi250.md)** - **[AMD Instinct MI210](gpus/mi210.md)** - **[AMD Instinct MI100](gpus/mi100.md)** diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 78e915d..6dff208 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -6,12 +6,12 @@ subtrees: - file: gpus/mi355x.md - file: gpus/mi350x.md - file: gpus/mi325x.md - - file: gpus/mi300x.md + - file: gpus/mi300x.md - file: gpus/mi300a.md + - file: gpus/mi250.md - file: gpus/mi210.md - file: gpus/mi100.md - - caption: System Configuration entries: - file: common/prerequisites.md From d7de47651969f6cc7516e7c04c733255a17d57c6 Mon Sep 17 00:00:00 2001 From: Michael Benavidez Date: Fri, 12 Jun 2026 15:11:54 -0500 Subject: [PATCH 5/5] fix: make project value consistent with url --- docs/conf.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/conf.py b/docs/conf.py index 713f896..c2a05ef 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -13,7 +13,7 @@ # Disable external projects to avoid GitHub API issues external_projects_remote_repository = "" -external_projects_current_project = os.environ.get("SPHINX_PROJECT_SLUG", "system-acceptance-docs") +external_projects_current_project = os.environ.get("SPHINX_PROJECT_SLUG", "system-acceptance") version = "1.0.0" release = version