5 changes: 5 additions & 0 deletions .wordlist.txt
@@ -17,6 +17,8 @@ ARI
APBDIS
api
APIC
APU
APUs
args
arp
aspm
@@ -137,6 +139,7 @@ GFLOPs
gfortran
gfx
GFX
GiB
GMI
GPUP
GPUs
@@ -160,6 +163,7 @@ HPE
HPL
HSA
hugepage
hugepages
ib
ibdiagnet
ibstat
@@ -209,6 +213,7 @@ lsmem
lsof
lspci
LTS
LUMI
lvl
makefile
maxBytes
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -13,7 +13,7 @@

# Disable external projects to avoid GitHub API issues
external_projects_remote_repository = ""
-external_projects_current_project = os.environ.get("SPHINX_PROJECT_SLUG", "system-acceptance-docs")
+external_projects_current_project = os.environ.get("SPHINX_PROJECT_SLUG", "system-acceptance")

version = "1.0.0"
release = version
174 changes: 174 additions & 0 deletions docs/gpus/mi100.md
@@ -0,0 +1,174 @@
---
myst:
html_meta:
"description": "AMD Instinct MI100 acceptance criteria — prerequisites, health checks, system validation, and performance benchmarks for CDNA PCIe GPU platforms."
"keywords": "MI100, AMD Instinct, CDNA, ROCm, PCIe, Infinity Fabric, system acceptance, validation, benchmarks, HBM2"
---

# AMD Instinct MI100

The AMD Instinct™ MI100 is a data-center compute GPU in a PCIe form factor. This document provides MI100-specific prerequisites, health checks, validation steps, and performance acceptance criteria.

## Overview

The AMD Instinct MI100 introduces the first-generation CDNA architecture in a standard full-height, full-length, dual-slot PCIe® add-in card aimed at HPC and accelerated computing workloads. Each MI100 provides 120 compute units with Matrix Core technology, 32 GB of HBM2 memory at up to 1.2 TB/s, and AMD Infinity Fabric™ link support for direct GPU-to-GPU connectivity in 2- and 4-GPU hive configurations. The card is passively cooled with a 300 W TDP and supports PCIe® Gen4 host connectivity.

The MI100 (gfx908) supports at most 4 GPUs per Infinity Fabric™ hive, so the validation reference configuration for this document is a single 4-GPU MI100 hive in which Infinity Fabric™ bridges provide direct GPU-to-GPU connectivity between all peers. Larger deployments are common, for example dual-socket servers with two 4-GPU hives (8 MI100s in total). In those systems, cross-hive traffic traverses the host PCIe fabric, and the per-hive criteria below apply to each hive independently.

- **[MI100 Product Page](https://www.amd.com/en/products/accelerators/instinct/mi100.html)**
- **[MI100 Product Brief](https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/product-briefs/instinct-mi100-brochure.pdf)**
- **[MI100 Microarchitecture](https://instinct.docs.amd.com/latest/gpu-arch/mi100.html)**

## System requirements

### Operating system support

For the most up-to-date information on supported operating systems and distributions, refer to the official ROCm documentation:

[ROCm System Requirements - Supported Distributions](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions)

```{note}
[ROCm docs](https://rocm.docs.amd.com) is the single source of truth for supported versions, distribution compatibility, and required dependencies for the ROCm toolkit.
```

For BIOS, NUMA, and OS-level tuning that applies to all AMD Instinct hosts, see [BIOS settings](../common/bios-settings.md) and [OS tuning](../common/os-tuning.md).

### GPU identification

All MI100 GPUs (PCI vendor:device `1002:738c`) should appear in `lspci` output:

```bash
sudo lspci -d 1002:738c
```

Expected output example (4-GPU MI100 hive):

```text
1d:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
20:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
23:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
26:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Arcturus GL-XL [Instinct MI100] (rev 01)
```
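
The presence check can be scripted. The following is a minimal sketch, assuming `lspci` prints one line per device as in the example above. The function names `count_mi100` and `check_gpu_count` are illustrative and not part of any AMD tool.

```bash
# Count devices in lspci-style output read from stdin.
# Assumption: the input was already filtered with `-d 1002:738c`,
# so every non-empty line is one MI100.
count_mi100() {
  grep -c '[^[:space:]]' || true
}

# Compare the detected count with the expected number of GPUs (default: 4).
check_gpu_count() {
  local expected="${1:-4}" found
  found="$(count_mi100)"
  if [ "$found" -eq "$expected" ]; then
    echo "PASS: found $found MI100 GPUs"
  else
    echo "FAIL: found $found MI100 GPUs, expected $expected"
    return 1
  fi
}

# Usage on a live system:
#   sudo lspci -d 1002:738c | check_gpu_count 4
```

On a host with two hives, the expected count passed to `check_gpu_count` would be 8 rather than 4.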

## Acceptance criteria

The MI100 system acceptance process validates that the platform is correctly configured, stable, and performing to expectations. Follow the sequence: Prerequisites → Basic Health Checks → System Validation → Performance Benchmarks.

### System acceptance process

1. **[Prerequisites validation](#prerequisites-validation)** - Ensure all system requirements and dependencies are met
2. **[Basic health checks](#basic-health-checks)** - Verify hardware detection and basic system health
3. **[System validation](#system-validation)** - Conduct comprehensive stress testing and qualification
4. **[Performance benchmarks](#performance-benchmarks)** - Validate compute, memory, and interconnect performance

The system is accepted when all criteria below are successfully validated.

### Prerequisites validation

Ensure all system requirements are met before proceeding with validation. See the [Prerequisites documentation](../common/prerequisites.md) and [System setup](../common/system-setup.md) for more details.

- ✅ Supported operating system version installed
- ✅ Compatible ROCm version installed
- ✅ BIOS configured per [BIOS settings](../common/bios-settings.md), with MI100-specific values per platform vendor
- ✅ Required kernel parameters present: `pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` (or `intel_iommu=on` on Intel hosts) — see [Kernel Parameters](../common/kernel-parameters.md)
- ✅ Minimum 256G system memory available
- ✅ Latest applicable firmware applied consistently across nodes
- ✅ ROCm Validation Suite (RVS) installed
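
The kernel-parameter item in the checklist above lends itself to a scripted check. This is a minimal sketch, assuming the kernel command line is space-separated. The function name `missing_kernel_params` is hypothetical.

```bash
# Print each required kernel parameter that is missing from the given
# command-line string; print nothing when all are present.
missing_kernel_params() {
  # Pad with spaces so that only whole parameters match
  # (for example, `iommu=pt` must not match inside `amd_iommu=pt`).
  local cmdline=" $1 " param
  for param in pci=realloc=off pci=bfsort iommu=pt; do
    case "$cmdline" in
      *" $param "*) ;;
      *) echo "$param" ;;
    esac
  done
  # Either vendor-specific IOMMU switch satisfies the requirement.
  case "$cmdline" in
    *" amd_iommu=on "* | *" intel_iommu=on "*) ;;
    *) echo "amd_iommu=on|intel_iommu=on" ;;
  esac
}

# Usage on a live system:
#   missing_kernel_params "$(cat /proc/cmdline)"
```

Empty output means the prerequisite is satisfied. Any printed parameter still needs to be added to the boot configuration.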

### Basic health checks

These checks ensure fundamental system health and proper GPU detection. For detailed procedures, see [Health Checks](../common/health-checks.md).

| Test | Command | Pass/Fail criteria |
|------|---------|-------------------|
| [Check OS distribution](../common/health-checks.md#check-os-distribution) | `cat /etc/os-release` | **Pass**: OS version listed in compatibility matrix<br>**Fail**: Otherwise |
| [Check kernel boot arguments](../common/health-checks.md#check-kernel-boot-arguments) | `cat /proc/cmdline` | **Pass**: Contains `pci=realloc=off`, `pci=bfsort`, `iommu=pt`, and `amd_iommu=on` or `intel_iommu=on`<br>**Fail**: Otherwise |
| [Check for driver errors](../common/health-checks.md#check-for-driver-errors) | `sudo dmesg -T \| grep amdgpu \| grep -i error` | **Pass**: No output<br>**Fail**: Errors reported |
| [Check available memory](../common/health-checks.md#check-for-available-system-memory) | `lsmem \| grep "Total online memory"` | **Pass**: ≥ 256G<br>**Fail**: Less than 256G |
| [Check GPU presence](../common/health-checks.md#check-gpu-presence) | `sudo lspci -d 1002:738c` | **Pass**: 4 MI100 GPUs found (per hive)<br>**Fail**: Otherwise |
| [Check GPU link speed and width](../common/health-checks.md#check-gpu-pcie-bus-link-speed-and-width) | `sudo lspci -d 1002:738c -vvv \| grep -e DevSta -e LnkSta` | **Pass**: Speed 16GT/s, width `x16`, no `FatalErr+`<br>**Fail**: Otherwise |
| [Monitor utilization metrics](../common/health-checks.md#monitor-utilization-metrics) | `amd-smi monitor -putm` | **Pass**: Idle metrics as specified<br>**Fail**: Otherwise |
| [Check system kernel logs for errors](../common/health-checks.md#check-system-kernel-logs) | `sudo dmesg -T \| grep -Ei 'error\|warn\|fail\|exception'` | **Pass**: No output<br>**Fail**: Otherwise |
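
The link speed and width criterion from the table can be evaluated automatically. The sketch below assumes the usual `lspci -vvv` layout, in which `LnkSta:` lines carry `Speed` and `Width` fields and `DevSta:` lines carry the error flags. The function name `check_pcie_links` is illustrative.

```bash
# Read `lspci -vvv` output from stdin and apply the MI100 link criteria:
# every LnkSta line must report 16GT/s and x16, and no DevSta line may
# report FatalErr+.
check_pcie_links() {
  LC_ALL=C awk '
    /LnkSta:/ {
      links++
      if ($0 !~ /Speed 16GT\/s/ || $0 !~ /Width x16/) bad++
    }
    # Exclude NonFatalErr+, which also contains the text "FatalErr+".
    /DevSta:/ && /(^|[^n])FatalErr\+/ { bad++ }
    END {
      if (links == 0 || bad > 0) { print "FAIL"; exit 1 }
      print "PASS: " links " link(s) at 16GT/s x16"
    }
  '
}

# Usage on a live system:
#   sudo lspci -d 1002:738c -vvv | check_pcie_links
```

The function also fails when no `LnkSta:` line is found, so an empty `lspci` result is not mistaken for a pass.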

### System validation

Comprehensive validation ensures system stability under load. For detailed procedures, see [System Validation](../common/system-validation.md).

| Test | Command | Pass/Fail criteria |
|------|---------|-------------------|
| [Compute/GPU properties](../common/system-validation.md#gpu-properties) | `rvs -c ${RVS_CONF}/gpup_single.conf` | **Pass**: All GPUs listed with no errors<br>**Fail**: Missing GPUs or errors |
| [GPU stress test (GST)](../common/system-validation.md#gpu-stress-test) | `rvs -c ${RVS_CONF}/MI100/gst_single.conf` | **Pass**: `met: TRUE` in logs<br>**Fail**: Target GFLOP/s not met |
| [Input energy delay product (IET)](../common/system-validation.md#input-energy-delay-product) | `rvs -c ${RVS_CONF}/MI100/iet_single.conf` | **Pass**: `met: TRUE` for all actions<br>**Fail**: Otherwise |
| [Memory test (MEM)](../common/system-validation.md#mem) | `rvs -c ${RVS_CONF}/mem.conf -l mem.txt` | **Pass**: All tests passed; bandwidth ≥ 800 GB/s per GPU<br>**Fail**: Any test failed or low bandwidth |
| [PCIe bandwidth benchmark (PEBB)](../common/system-validation.md#pcie-bandwidth-benchmark) | `rvs -c ${RVS_CONF}/MI100/pebb_single.conf` | **Pass**: All distances and bandwidths displayed<br>**Fail**: Missing data |
| [PCIe qualification tool (PEQT)](../common/system-validation.md#pcie-qualification-tool) | `rvs -c ${RVS_CONF}/peqt_single.conf` | **Pass**: All actions true<br>**Fail**: Otherwise |
| [P2P benchmark and qualification tool (PBQT)](../common/system-validation.md#p2p-benchmark-and-qualification-tool) | `rvs -c ${RVS_CONF}/pbqt_single.conf` | **Pass**: `peers:true` lines and non-zero throughput<br>**Fail**: Otherwise |

```{note}
The reference configuration for this document is a single 4-GPU MI100 hive with AMD Infinity Fabric™ bridges installed, so intra-hive PBQT and TransferBench numbers reflect XGMI throughput. On systems without bridges, P2P traffic traverses the host PCIe fabric and these thresholds will not be met.
```
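
The `met: TRUE` criteria used for GST and IET can be checked mechanically on a captured RVS log. This is a rough sketch that assumes a missed target is logged as `met: FALSE`. That string and the function name `rvs_targets_met` are assumptions, so confirm them against the log format of the installed RVS version.

```bash
# Read an RVS log from stdin and succeed only if at least one target was
# met and none was missed.
# Assumption: missed targets appear as "met: FALSE" in the log.
rvs_targets_met() {
  local log true_count false_count
  log="$(cat)"
  true_count="$(printf '%s\n' "$log" | grep -c 'met: TRUE' || true)"
  false_count="$(printf '%s\n' "$log" | grep -c 'met: FALSE' || true)"
  echo "met: TRUE=$true_count, met: FALSE=$false_count"
  [ "$true_count" -gt 0 ] && [ "$false_count" -eq 0 ]
}

# Usage on a live system:
#   rvs -c "${RVS_CONF}/MI100/gst_single.conf" | rvs_targets_met
```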

### Performance benchmarks

Performance validation ensures the system meets MI100 specifications. For detailed procedures, see [Performance Benchmarking](../common/system-validation.md#performance-benchmarking).

:::{card} Command: `TransferBench a2a`
[TransferBench all-to-all](../common/system-validation.md#transferbench)
^^^
**Pass:** ≥ 270 GB/s aggregate
+++
**Fail:** otherwise
:::

:::{card} Command: `TransferBench p2p`
[TransferBench peer-to-peer](../common/system-validation.md#transferbench)
^^^

| Test | Pass criteria |
|------|--------------|
| UniDir | ≥ 30 GB/s |
| BiDir | ≥ 57 GB/s |

+++
**Fail:** otherwise
:::

:::{card} Command: `build/all_reduce_perf -b 8 -e 8G -f 2 -g 4`
[RCCL Allreduce](../common/system-validation.md#rccl-allreduce)
^^^
**Pass:** ≥ 72 GB/s busbw (peak, at 8 GiB message size)
+++
**Fail:** otherwise
:::
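
For context on the RCCL threshold: the rccl-tests and nccl-tests performance notes derive all-reduce bus bandwidth from algorithm bandwidth as busbw = algbw × 2(n−1)/n for n ranks. The sketch below inverts that relation to show what the 72 GB/s criterion implies for `algbw` on a 4-GPU hive. The helper name `allreduce_algbw` is hypothetical.

```bash
# Convert an all-reduce bus bandwidth (GB/s) into the corresponding
# algorithm bandwidth for n ranks: algbw = busbw * n / (2 * (n - 1)).
allreduce_algbw() {
  LC_ALL=C awk -v busbw="$1" -v n="$2" \
    'BEGIN { printf "%.1f\n", busbw * n / (2 * (n - 1)) }'
}

allreduce_algbw 72 4   # prints 48.0
```

A run that reports 72 GB/s `busbw` with `-g 4` therefore reports about 48 GB/s in the `algbw` column.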

:::{card} Command: `rocblas-bench` (see code block below)
[rocBLAS FP32](../common/system-validation.md#rocblas-gemm-benchmarks)
^^^

```bash
rocblas-bench -f gemm \
-r s -m 4000 -n 4000 -k 4000 \
--lda 4000 --ldb 4000 --ldc 4000 \
--transposeA N --transposeB T
```

**Pass:** ≥ 28 TFLOPS per GPU
+++
**Fail:** otherwise
:::
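
The rocBLAS threshold can be related to kernel time with the standard GEMM operation count of 2·m·n·k floating-point operations. The following sketch converts a measured time in microseconds into TFLOPS. It assumes the time covers a single GEMM call, and the helper name `gemm_tflops` is illustrative.

```bash
# Compute achieved TFLOPS for one GEMM of size m x n x k that took `us`
# microseconds: TFLOPS = 2*m*n*k / (us * 1e-6) / 1e12.
gemm_tflops() {
  LC_ALL=C awk -v m="$1" -v n="$2" -v k="$3" -v us="$4" \
    'BEGIN { printf "%.1f\n", (2 * m * n * k) / (us * 1e-6) / 1e12 }'
}

gemm_tflops 4000 4000 4000 4000   # prints 32.0
```

For the 4000 × 4000 × 4000 case, the 28 TFLOPS criterion corresponds to a kernel time of roughly 4571 µs or less.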

:::{card} Command: `mpiexec -n 4 wrapper.sh`
[BabelStream](../common/system-validation.md#babelstream)
^^^

| Kernel | Threshold (MB/s) |
|--------|-----------------|
| Copy | ≥ 940,000 |
| Mul | ≥ 940,000 |
| Add | ≥ 910,000 |
| Triad | ≥ 910,000 |
| Dot | ≥ 950,000 |

+++
**Fail:** otherwise
:::
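
The BabelStream thresholds can be applied to captured output with a short script. This sketch assumes BabelStream's default text output, where each result line starts with the kernel name followed by the rate in MBytes/sec. The function name `check_babelstream` is hypothetical. Note that 940,000 MB/s corresponds to 940 GB/s.

```bash
# Read BabelStream output from stdin and compare every reported kernel
# rate (MBytes/sec, second column) with the MI100 thresholds.
check_babelstream() {
  LC_ALL=C awk '
    BEGIN {
      min["Copy"] = 940000; min["Mul"] = 940000
      min["Add"] = 910000;  min["Triad"] = 910000
      min["Dot"] = 950000
    }
    # Result lines: known kernel name followed by a positive number.
    ($1 in min) && ($2 + 0 > 0) {
      seen[$1] = 1
      if ($2 + 0 < min[$1]) {
        printf "FAIL: %s %.0f MB/s < %d MB/s\n", $1, $2, min[$1]
        bad = 1
      }
    }
    END {
      for (kname in min)
        if (!(kname in seen)) { printf "FAIL: %s not reported\n", kname; bad = 1 }
      exit (bad ? 1 : 0)
    }
  '
}

# Usage on a live system:
#   mpiexec -n 4 wrapper.sh | check_babelstream
```

Because every matching line is checked, output from all four ranks is validated, not only the first GPU's results.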