Merged
11 changes: 10 additions & 1 deletion .gitignore
@@ -1,6 +1,15 @@
.venv
.vscode
.cline_storage

# AI assistant tool directories (personal, not project source)
.claude/
.cline/
.cline_storage/
.codex/
.cursor/

# Skills are a local tool dependency, not project source
skills/

# documentation artifacts
_build/
22 changes: 11 additions & 11 deletions docs/common/system-validation.md
@@ -47,7 +47,7 @@ The `rvs` has two different types of modules to validate the Compute subsystem.
- [Properties](#gpu-properties)
- [Benchmark / Stress / Qualification](#benchmark-stress-qualification)

MI300X GPU accelerators have many architectural features. Similar to the [Check GPU presence (lspci)](../mi300x/health-checks.md#check-gpu-presence-lspci) section, `rvs` has an option to display all MI300X GPU accelerators present in the SUT. Before
MI300X GPU accelerators have many architectural features. Similar to the [Check GPU presence (lspci)](health-checks.md#check-gpu-presence) section, `rvs` has an option to display all MI300X GPU accelerators present in the SUT. Before
proceeding with the modules below, run the following command to make sure all the GPUs are seen with their correct PCIe properties.

Command:
@@ -380,7 +380,7 @@ grep "bandwidth" mem.txt

#### BABEL

Refer to the [BabelStream section](mi300x-bench-babelstream.md) for instructions on how to run this module to test memory.
Refer to the [BabelStream section](#babelstream) for instructions on how to run this module to test memory.

### IO

@@ -870,10 +870,10 @@ For comprehensive instructions, test scope, and result interpretation, refer to

High level test summary:

- **PCIe Subsystem:** Tests PCIe link status, speed, width, and stress bandwidth (host-to-device, device-to-host, and bidirectional).
- **Memory Subsystem:** Exercises and validates HBM (High Bandwidth Memory) through stress tests such as bandwidth, dual stream, and random access patterns.
- **Compute Subsystem:** Runs compute kernels at various data types and loads, verifying the stability and peak capability of the GPU compute units.
- **Power and Thermal:** Max power and sustained stress kernels help uncover errors that show up under load.
- **PCIe Subsystem:** Tests PCIe link status, speed, width, and stress bandwidth (host-to-device, device-to-host, and bidirectional).
- **Memory Subsystem:** Exercises and validates HBM (High Bandwidth Memory) through stress tests such as bandwidth, dual stream, and random access patterns.
- **Compute Subsystem:** Runs compute kernels at various data types and loads, verifying the stability and peak capability of the GPU compute units.
- **Power and Thermal:** Max power and sustained stress kernels help uncover errors that show up under load.

Extended information:

@@ -929,9 +929,9 @@ Program exiting with return code AGFHC_SUCCESS [0]
This test should be run twice to better exercise the HBM memory ensuring no ECC exceptions are present.
```

- **HBM Bandwidth:** Measures and stresses memory read/write throughput.
- **HBM Data Patterns:** Performs wide pattern tests (dual stream, single/dual stream random, and sequential).
- **Memory Error Detection:** Looks for correctable/uncorrectable errors under load—useful for catching early DIMM or silicon issues.
- **HBM Bandwidth:** Measures and stresses memory read/write throughput.
- **HBM Data Patterns:** Performs wide pattern tests (dual stream, single/dual stream random, and sequential).
- **Memory Error Detection:** Looks for correctable/uncorrectable errors under load—useful for catching early DIMM or silicon issues.

Extended information

@@ -1140,12 +1140,12 @@ Each run generates detailed logs and a summary JSON file (typically named result
"tests": [
{"name": "pcie_link_status", "result": "PASS"},
{"name": "hbm_bw", "result": "PASS"},
...
// ... additional test entries ...
]
}
```

If any result entry shows **FAIL**, that test did not pass.
If any result entry shows **FAIL**, that test did not pass.
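
The FAIL check described above can be automated. The following is a minimal sketch, not part of the test tooling itself: it assumes the summary uses the `tests`, `name`, and `result` fields shown in the excerpt, and the helper name `failed_tests` and the inline sample data are hypothetical, introduced only for illustration.

```python
import json


def failed_tests(summary_text):
    """Return the names of tests whose result is anything other than PASS.

    Assumes the summary layout shown in the excerpt above: a top-level
    "tests" list whose entries carry "name" and "result" fields.
    """
    summary = json.loads(summary_text)
    return [
        entry["name"]
        for entry in summary.get("tests", [])
        if entry.get("result") != "PASS"
    ]


# Hypothetical sample data, for illustration only (not real tool output).
sample = """
{
  "tests": [
    {"name": "pcie_link_status", "result": "PASS"},
    {"name": "hbm_bw", "result": "FAIL"}
  ]
}
"""

print(failed_tests(sample))
```

In practice the text would be read from the summary file produced by a run. Treating every non-PASS value as a failure is a deliberately conservative choice, so an unexpected or missing status is reported rather than silently ignored.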

#### Return Code

4 changes: 3 additions & 1 deletion docs/conf.py
@@ -37,4 +37,6 @@
# Table of contents
external_toc_path = "./sphinx/_toc.yml"

exclude_patterns = ['.venv']
exclude_patterns = ['.venv']
# Add anchors to headings up to level 4
myst_heading_anchors = 4
2 changes: 1 addition & 1 deletion docs/reference/related-documentation.md
@@ -23,7 +23,7 @@ setup the system and run the tests in this guide.
- [RVS user guide](https://github.com/ROCm/ROCmValidationSuite/blob/master/docs/ug1main.md)
- [RVS modules](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/conceptual/rvs-modules.html)
- [TransferBench repository](https://github.com/ROCm/TransferBench)
- [TransferBench how to guide](transferbench:how%20/use-transferbench)
- [TransferBench how to guide](https://rocm.docs.amd.com/projects/TransferBench/en/latest/how%20to/use-transferbench.html)
- [TransferBench example configuration](https://github.com/ROCm/TransferBench/blob/develop/examples/example.cfg)
- [RCCL repository](https://github.com/ROCm/rccl)
- [RCCL Tests repository](https://github.com/ROCm/rccl-tests/tree/develop)