Conversation

@WenqingLan1 (Contributor)

This pull request adds support for NVBench-based GPU micro-benchmarks to SuperBench.

  • Integrated the NVBench submodule
  • Implemented two benchmarks:
    • nvbench-sleep-kernel
    • nvbench-kernel-launch
  • Updated documentation and added example scripts (see the runner sketch below)
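
For illustration only, such a runner could follow the same pattern as the other scripts under examples/benchmarks. This is a hedged sketch rather than the contents of the added example files, and the '--duration_us [25,50,75]' parameter string is an assumption based on the config below:

from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger

if __name__ == '__main__':
    # Build a context for the new sleep-kernel micro-benchmark on the CUDA platform.
    context = BenchmarkRegistry.create_benchmark_context(
        'nvbench-sleep-kernel',
        platform=Platform.CUDA,
        parameters='--duration_us [25,50,75]',
    )
    # Launch it and report name, return code, and parsed results.
    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
        logger.info(
            'benchmark: {}, return code: {}, result: {}'.format(
                benchmark.name, benchmark.return_code, benchmark.result
            )
        )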

Example config:

version: v0.12
superbench:
  enable:
  # nvbench benchmarks
  - nvbench-sleep-kernel:single
  - nvbench-sleep-kernel:list
  - nvbench-sleep-kernel:range
  - nvbench-sleep-kernel:range-step
  - nvbench-kernel-launch
  var:
    default_local_mode: &default_local_mode
      modes:
      - name: local
        proc_num: 4
        prefix: CUDA_VISIBLE_DEVICES={proc_rank}
        parallel: yes
  benchmarks:
    nvbench-sleep-kernel:single:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "50"                   # Single value format
        timeout: 30
    nvbench-sleep-kernel:list:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[25,50,75]"         # List format - no spaces after commas
        timeout: 30
    nvbench-sleep-kernel:range:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[0:5]"           # Range format
        timeout: 30
    nvbench-sleep-kernel:range-step:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[0:50:10]"         # Range with step format
        timeout: 30
    nvbench-kernel-launch:
      <<: *default_local_mode
      timeout: 300
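
The duration_us parameter above accepts four spec formats. For reference, they could be expanded into explicit axis values along the lines of the sketch below (a hypothetical illustration, not the nvbench_base.py implementation; expand_duration_spec is a made-up helper name, and the range forms are assumed to be end-inclusive, as NVBench axis specs typically are):

def expand_duration_spec(spec: str) -> list:
    """Hypothetical helper: expand a duration_us spec into explicit values.

    "50"         -> [50]                     single value
    "[25,50,75]" -> [25, 50, 75]             list (no spaces after commas)
    "[0:5]"      -> [0, 1, 2, 3, 4, 5]       range, assumed end-inclusive
    "[0:50:10]"  -> [0, 10, 20, 30, 40, 50]  range with step
    """
    spec = spec.strip()
    if not (spec.startswith('[') and spec.endswith(']')):
        return [int(spec)]                        # single value
    body = spec[1:-1]
    if ',' in body:
        return [int(v) for v in body.split(',')]  # explicit list
    parts = [int(v) for v in body.split(':')]
    step = parts[2] if len(parts) == 3 else 1
    return list(range(parts[0], parts[1] + 1, step))

assert expand_duration_spec('[0:50:10]') == [0, 10, 20, 30, 40, 50]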

@WenqingLan1 requested a review from a team as a code owner on October 9, 2025.
@WenqingLan1 added the benchmarks and micro-benchmarks labels on Oct 9, 2025.
codecov bot commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 89.11917% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.78%. Comparing base (c99380b) to head (2877feb).

Files with missing lines | Patch % | Missing lines
...rbench/benchmarks/micro_benchmarks/nvbench_base.py | 80.39% | 20 ⚠️
...enchmarks/micro_benchmarks/nvbench_sleep_kernel.py | 98.07% | 1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #750      +/-   ##
==========================================
+ Coverage   85.69%   85.78%   +0.08%     
==========================================
  Files         102      105       +3     
  Lines        7699     7892     +193     
==========================================
+ Hits         6598     6770     +172     
- Misses       1101     1122      +21     
Flag | Coverage Δ
cpu-python3.10-unit-test | 71.41% <88.94%> (+0.43%) ⬆️
cpu-python3.7-unit-test | 70.88% <89.11%> (+0.46%) ⬆️
cuda-unit-test | 83.74% <88.94%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown.

@guoshzhao self-assigned this on Oct 17, 2025.
@polarG requested a review from Copilot on January 23, 2026.

Copilot AI left a comment

Pull request overview

Adds NVBench-based CUDA GPU micro-benchmarks to SuperBench, including build integration, result parsing, tests, examples, and documentation updates.

Changes:

  • Adds NVBench submodule integration and a cuda_nvbench third-party build target.
  • Introduces two new micro-benchmarks (nvbench-sleep-kernel, nvbench-kernel-launch) with parsing + unit tests.
  • Updates Docker images, docs, and CI workflow to support required tooling (notably newer CMake for NVBench).

Reviewed changes

Copilot reviewed 20 out of 23 changed files in this pull request and generated 8 comments.

File | Description
third_party/nvbench | Adds NVBench as a git submodule dependency.
third_party/Makefile | Adds cuda_nvbench build/install target and adjusts recipe indentation.
tests/data/nvbench_sleep_kernel.log | Adds a sample NVBench sleep-kernel output fixture for parsing tests.
tests/data/nvbench_kernel_launch.log | Adds a sample NVBench kernel-launch output fixture for parsing tests.
tests/benchmarks/micro_benchmarks/test_nvbench_sleep_kernel.py | Adds unit tests for sleep-kernel preprocess and parsing.
tests/benchmarks/micro_benchmarks/test_nvbench_kernel_launch.py | Adds unit tests for kernel-launch preprocess and parsing.
superbench/benchmarks/micro_benchmarks/nvbench_sleep_kernel.py | Implements the NVBench sleep-kernel benchmark wrapper + output parser.
superbench/benchmarks/micro_benchmarks/nvbench_kernel_launch.py | Implements the NVBench kernel-launch benchmark wrapper + output parser.
superbench/benchmarks/micro_benchmarks/nvbench_base.py | Adds a shared NVBench benchmark base class (CLI args, parsing helpers).
superbench/benchmarks/micro_benchmarks/nvbench/sleep_kernel.cu | Adds NVBench CUDA benchmark implementing a sleep/busy-wait kernel.
superbench/benchmarks/micro_benchmarks/nvbench/kernel_launch.cu | Adds NVBench CUDA benchmark for empty-kernel launch overhead.
superbench/benchmarks/micro_benchmarks/nvbench/CMakeLists.txt | Adds CMake build for NVBench-based benchmark executables.
superbench/benchmarks/micro_benchmarks/__init__.py | Exports the new NVBench benchmarks from the micro-benchmarks package.
examples/benchmarks/nvbench_sleep_kernel.py | Adds an example runner for the sleep-kernel benchmark.
examples/benchmarks/nvbench_kernel_launch.py | Adds an example runner for the kernel-launch benchmark.
docs/user-tutorial/benchmarks/micro-benchmarks.md | Documents the new NVBench benchmarks and their metrics.
dockerfile/rocm5.0.x.dockerfile | Updates Intel MLC download version used in the ROCm image.
dockerfile/cuda13.0.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image.
dockerfile/cuda12.9.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image.
dockerfile/cuda12.8.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image.
.gitmodules | Registers the third_party/nvbench submodule.
.gitignore | Ignores compile_commands.json.
.github/workflows/codeql-analysis.yml | Upgrades CodeQL actions to v3 and adds CMake setup for the C++ job.


Comment on lines +74 to +85
gpu_section = r'### \[(\d+)\] NVIDIA'
# Regex pattern to handle different time units and flexible spacing
row_pat = (
    r'\|\s*([0-9]+)\s*\|\s*'          # Duration (us)
    r'([0-9]+)x\s*\|\s*'              # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'    # CPU Time (μs, ns, ms, us, s)
    r'([\d.]+%)\s*\|\s*'              # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'    # GPU Time
    r'([\d.]+%)\s*\|\s*'              # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'              # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'       # Batch GPU Time
)

Copilot AI Jan 23, 2026

The parser expects each data row to start with a single |, but the provided fixture rows start with || (e.g., markdown tables). With re.match, this prevents any row from matching and will trigger No valid rows parsed. Update the regex to accept one-or-more leading pipes (e.g., ^\\|+) so both | ... and || ... formats parse correctly.
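
A minimal sketch of that fix, changing only the row anchor; the sample rows below are illustrative, reusing values from the unit-test expectations further down:

import re

# Accept one-or-more leading pipes so both "| ..." and "|| ..." table rows match.
row_pat = re.compile(
    r'^\|+\s*([0-9]+)\s*\|\s*'        # Duration (us)
    r'([0-9]+)x\s*\|\s*'              # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'    # CPU Time
    r'([\d.]+%)\s*\|\s*'              # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'    # GPU Time
    r'([\d.]+%)\s*\|\s*'              # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'              # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'       # Batch GPU Time
)

single = '|  25 | 1000x | 42.123 us | 69.78% | 25.321 us | 0.93% | 17448x | 23.456 us |'
double = '|' + single  # same row rendered with a leading "||"
assert row_pat.match(single) is not None
assert row_pat.match(double) is not None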

Comment on lines +37 to +47
gpu_section = r'### \[(\d+)\] NVIDIA'
# Regex pattern to handle different time units and flexible spacing
row_pat = (
    r'\|\s*([0-9]+)x\s*\|\s*'         # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'    # CPU Time (μs, ns, ms, us, s)
    r'([\d.]+%)\s*\|\s*'              # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'    # GPU Time
    r'([\d.]+%)\s*\|\s*'              # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'              # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'       # Batch GPU Time
)

Copilot AI Jan 23, 2026

Same issue as nvbench_sleep_kernel: the row regex only matches lines starting with a single |, but the fixture output uses ||. This will make parsing fail. Allow one-or-more leading pipes (anchor with ^\\|+) so both formats are supported.

Comment on lines +13 to +27
def parse_time_to_us(raw: str) -> float:
    """Helper: parse '123.45 us', '678.9 ns', '0.12 ms' → float µs."""
    raw = raw.strip()
    if raw.endswith('%'):
        return float(raw[:-1])
    # split "value unit" or "valueunit"
    m = re.match(r'([\d.]+)\s*([mun]?s)?', raw)
    if not m:
        return float(raw)
    val, unit = float(m.group(1)), (m.group(2) or 'us')
    if unit == 'ns':
        return val / 1e3
    if unit == 'ms':
        return val * 1e3
    return val

Copilot AI Jan 23, 2026

parse_time_to_us currently does not convert seconds (s) to microseconds (it falls through and returns val). Since your row regex explicitly allows plain s, this yields incorrect results by a factor of 1e6 when NVBench reports seconds. Add explicit handling for unit == 's' (multiply by 1e6), and consider anchoring the regex to the end of the string to avoid partial matches.
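
One possible correction along those lines (a sketch, not the exact patch): handle 's' explicitly, anchor the unit match to the end of the string, and optionally widen the unit class to accept 'μs' as captured by the row regex above.

import re

def parse_time_to_us(raw: str) -> float:
    """Parse strings like '123.45 us', '678.9 ns', '0.12 ms', or '1.5 s' into microseconds."""
    raw = raw.strip()
    if raw.endswith('%'):
        return float(raw[:-1])
    # Anchor to the end of the string so the unit cannot be partially matched.
    m = re.match(r'([\d.]+)\s*([μmun]?s)?$', raw)
    if not m:
        return float(raw)
    val, unit = float(m.group(1)), (m.group(2) or 'us')
    if unit == 'ns':
        return val / 1e3
    if unit == 'ms':
        return val * 1e3
    if unit == 's':
        return val * 1e6  # seconds -> microseconds
    return val  # 'us' / 'μs' are already microseconds

assert parse_time_to_us('500 ns') == 0.5
assert parse_time_to_us('2 ms') == 2000.0
assert parse_time_to_us('1.5 s') == 1500000.0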

Comment on lines +21 to +22
    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
    .set_timeout(1); // Limit to one second per measurement.

Copilot AI Jan 23, 2026

This hard-codes a 1s timeout at the benchmark definition level, which can override/conflict with the CLI --timeout that SuperBench passes through (and tests/configs expect to control). To make --timeout effective and consistent across NVBench benchmarks, remove the .set_timeout(1) override (or only apply it when no explicit timeout is provided).

Suggested change
-    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
-    .set_timeout(1); // Limit to one second per measurement.
+    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5));

Comment on lines +79 to +84
assert benchmark.result['duration_us_25_cpu_time'][0] == 42.123
# assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
assert benchmark.result['duration_us_25_gpu_time'][0] == 25.321
# assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
# assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
assert benchmark.result['duration_us_25_batch_gpu_time'][0] == 23.456

Copilot AI Jan 23, 2026

These tests compare floats for exact equality, which is fragile due to floating-point representation (e.g., parsing may produce 42.123000000000005). Use unittest.TestCase float assertions (e.g., assertAlmostEqual) with a tolerance to avoid flaky failures.

Suggested change
- assert benchmark.result['duration_us_25_cpu_time'][0] == 42.123
- # assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
- assert benchmark.result['duration_us_25_gpu_time'][0] == 25.321
- # assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
- # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
- assert benchmark.result['duration_us_25_batch_gpu_time'][0] == 23.456
+ self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123, places=6)
+ # assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
+ self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321, places=6)
+ # assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
+ # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
+ self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456, places=6)

Comment on lines +46 to +53
'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel'

Copilot AI Jan 23, 2026

Collapsing __all__ to long comma-separated lines reduces readability and likely violates typical line-length formatting used elsewhere in the project. Consider reverting to one-entry-per-line (or a more structured wrap) to keep diffs smaller and maintenance easier.

Suggested change
- 'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
- 'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
- 'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
- 'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
- 'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
- 'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
- 'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
- 'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel'
+ 'BlasLtBaseBenchmark',
+ 'ComputationCommunicationOverlap',
+ 'CpuMemBwLatencyBenchmark',
+ 'CpuHplBenchmark',
+ 'CpuStreamBenchmark',
+ 'CublasBenchmark',
+ 'CublasLtBenchmark',
+ 'CudaGemmFlopsBenchmark',
+ 'CudaMemBwBenchmark',
+ 'CudaNcclBwBenchmark',
+ 'CudnnBenchmark',
+ 'DiskBenchmark',
+ 'DistInference',
+ 'HipBlasLtBenchmark',
+ 'GPCNetBenchmark',
+ 'GemmFlopsBenchmark',
+ 'GpuBurnBenchmark',
+ 'GpuCopyBwBenchmark',
+ 'GpuStreamBenchmark',
+ 'IBBenchmark',
+ 'IBLoopbackBenchmark',
+ 'KernelLaunch',
+ 'MemBwBenchmark',
+ 'MicroBenchmark',
+ 'MicroBenchmarkWithInvoke',
+ 'ORTInferenceBenchmark',
+ 'RocmGemmFlopsBenchmark',
+ 'RocmMemBwBenchmark',
+ 'ShardingMatmul',
+ 'TCPConnectivityBenchmark',
+ 'TensorRTInferenceBenchmark',
+ 'DirectXGPUEncodingLatency',
+ 'DirectXGPUCopyBw',
+ 'DirectXGPUMemBw',
+ 'DirectXGPUCoreFlops',
+ 'NvBandwidthBenchmark',
+ 'NvbenchKernelLaunch',
+ 'NvbenchSleepKernel',

- DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev sudo
+ DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev sudo build-essential
+ - name: Setup CMake
+   uses: lukka/get-cmake@latest

Copilot AI Jan 23, 2026

Using @latest for third-party GitHub Actions is a supply-chain risk and can lead to non-reproducible CI behavior. Pin this action to a specific tagged version or commit SHA.

Suggested change
- uses: lukka/get-cmake@latest
+ uses: lukka/get-cmake@v3.20.0

&& git -C msccl checkout 87048bd && git -C msccl submodule update --recursive --init
else ifeq ($(shell echo $(CUDA_VER)">=12.8" | bc -l), 1)
# Get commit 87048bd from msscl to support updated nccl and sm_100
# Get commit 87048bd from msscl to support updated nccl and sm_100

Copilot AI Jan 23, 2026

Typo in comment: change msscl to msccl.

Suggested change
- # Get commit 87048bd from msscl to support updated nccl and sm_100
+ # Get commit 87048bd from msccl to support updated nccl and sm_100
