
Conversation

@wujingyue (Collaborator) commented Jan 3, 2026

We've been adding C++ multi-GPU benchmarks organically, e.g., `void benchmarkP2PCommunication()` and others.

Benchmarks tend to run longer and don't need to run as frequently as (correctness) tests. Therefore, it's worth separating benchmarks from tests.

This PR adds a sample multi-GPU benchmark based on https://github.com/google/benchmark, which nvFuser already uses for single-GPU benchmarks. This way:

  1. Tests and benchmarks are registered separately and therefore can be run separately (registration is sketched below).
  2. Benchmarks can leverage tools from Google Benchmark. For example, by default, a benchmark reports the average wall time of a non-warmup iteration and prints the times in a table.
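
For illustration, here is a minimal sketch of how a fixture-based benchmark can be defined and registered with Google Benchmark. The MultiDeviceBenchmark name and the Arg/Iterations values mirror the toy example in this PR, but the fixture body and the work inside the loop are hypothetical placeholders rather than the PR's actual code:

```cpp
#include <benchmark/benchmark.h>

// Hypothetical stand-in for the fixture added in this PR; the real one also
// derives from MultiDeviceFixture and owns a Communicator.
class MultiDeviceBenchmark : public benchmark::Fixture {};

BENCHMARK_DEFINE_F(MultiDeviceBenchmark, Reduction)(benchmark::State& state) {
  const auto arg = state.range(0);  // 4 or 8, from the Arg() calls below
  for (auto _ : state) {
    // Placeholder for the sharded reduction being benchmarked.
    benchmark::DoNotOptimize(arg);
  }
}

// Iterations(10) pins the iteration count so every rank runs the same number
// of collectives and no rank hangs waiting for a peer.
BENCHMARK_REGISTER_F(MultiDeviceBenchmark, Reduction)
    ->Arg(4)
    ->Arg(8)
    ->Iterations(10);
```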

Notes for customization (both are sketched below):

  1. The user can measure GPU time with manual timing.
  2. The user can request multiple metrics to be printed out via custom counters.
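
Both customizations can be sketched with standard Google Benchmark facilities (UseManualTime/SetIterationTime and state.counters) plus CUDA events. Everything below is illustrative rather than code from this PR, and the actual GPU work is elided:

```cpp
#include <benchmark/benchmark.h>
#include <cuda_runtime.h>

// Illustrative only: manual GPU timing via CUDA events plus a custom counter.
static void BM_GpuWorkManualTime(benchmark::State& state) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  double total_gpu_ms = 0.0;
  for (auto _ : state) {
    cudaEventRecord(start);
    // ... launch the GPU work being measured here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_gpu_ms += ms;
    // With UseManualTime(), the reported Time column comes from this call.
    state.SetIterationTime(ms / 1000.0);
  }

  // Custom counters show up as extra columns in the printed results table.
  state.counters["gpu_ms_per_iter"] =
      benchmark::Counter(total_gpu_ms, benchmark::Counter::kAvgIterations);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
BENCHMARK(BM_GpuWorkManualTime)->UseManualTime();
```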

Below is the repro for the toy benchmark:

```
$ mpirun -np 2 -output-filename /tmp/test_multidevice bin/test_multidevice --benchmarks=all

$ cat /tmp/test_multidevice/1/rank.0/stdout
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
MultiDeviceBenchmark/Reduction/4/iterations:10   20128420 ns     16788148 ns           10
MultiDeviceBenchmark/Reduction/8/iterations:10     100694 ns       100708 ns           10
```

@wujingyue marked this pull request as draft on January 3, 2026, 18:37
@wujingyue (Collaborator Author) commented:

!test


github-actions bot commented Jan 3, 2026

Review updated until commit 5da9fb0

Description

  • Refactor multi-device test infrastructure by separating concerns into MultiDeviceFixture base class

  • Add MultiDeviceBenchmark class for Google Benchmark integration with multi-device testing

  • Implement sample MultiDeviceBenchmark::Reduction benchmark for tensor reduction across devices

  • Update build system to include benchmark library dependencies

Changes walkthrough

Relevant files:

Enhancement

multidevice.h (tests/cpp/multidevice.h): Refactor test infrastructure with fixture separation

  • Rename MultiDeviceTest to MultiDeviceFixture base class
  • Create new MultiDeviceTest class inheriting from NVFuserTest and MultiDeviceFixture
  • Add MultiDeviceBenchmark class inheriting from benchmark::Fixture and MultiDeviceFixture
  • Add benchmark and gtest includes
  • +25/-7

multidevice.cpp (tests/cpp/multidevice.cpp): Implement fixture classes and benchmark integration

  • Implement MultiDeviceFixture constructor/destructor
  • Add MultiDeviceBenchmark::TearDown with barrier synchronization
  • Add benchmark detection logic in main() function
  • Update main() to run benchmarks when requested
  • +42/-5

Tests

test_multidevice_sharding.cpp (tests/cpp/test_multidevice_sharding.cpp): Add sample multi-device benchmark test

  • Add MultiDeviceBenchmark::Reduction benchmark test
  • Register benchmark with Arg(4), Arg(8), and Iterations(10)
  • Include benchmark header for Google Benchmark support
  • +36/-0

Configuration changes

CMakeLists.txt (CMakeLists.txt): Add benchmark library dependencies

  • Add benchmark include directory to test build configuration
  • Link benchmark library to test targets
  • +2/-0

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Constructor ordering

    The MultiDeviceFixture constructor is defined before MultiDeviceTest constructor, but MultiDeviceTest inherits from MultiDeviceFixture. This could lead to initialization order issues. The constructors should be properly ordered or MultiDeviceFixture should have a virtual destructor.

```cpp
MultiDeviceFixture::MultiDeviceFixture() {
  // Enable logging in c10d so debug messages can be printed out via
  // `TORCH_DISTRIBUTED_DEBUG`.
  c10d::setDebugLevelFromEnvironment();

  communicator_ = &Communicator::getInstance();
  tensor_options_ =
      at::TensorOptions().dtype(at::kFloat).device(communicator_->device());
  debug_print = getNvFuserEnv("MULTIDEVICE_DEBUG_PRINT") != nullptr;
}

MultiDeviceTest::MultiDeviceTest() {
  disable_skip = getNvFuserEnv("MULTIDEVICE_DISABLE_SKIP") != nullptr;
}
```
    Benchmark iteration consistency

    The comment mentions that iterations must be consistent across processes to prevent hanging, but there's no validation that all processes actually receive the same iteration count. Consider adding runtime validation.
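
One way to add the suggested validation, sketched here with raw MPI for brevity (the PR would presumably go through nvFuser's Communicator instead, whose API is not shown in this review), is to allreduce each rank's iteration count and fail fast on a mismatch:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>

// Illustrative sketch: verify that all ranks agree on the iteration count
// before entering the timed loop, so a mismatch aborts instead of hanging
// inside a collective.
void checkIterationCountMatches(int64_t local_iterations) {
  int64_t min_iters = 0;
  int64_t max_iters = 0;
  MPI_Allreduce(&local_iterations, &min_iters, 1, MPI_INT64_T, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(&local_iterations, &max_iters, 1, MPI_INT64_T, MPI_MAX, MPI_COMM_WORLD);
  if (min_iters != max_iters) {
    std::fprintf(stderr, "Iteration count mismatch across ranks: min=%lld max=%lld\n",
                 static_cast<long long>(min_iters), static_cast<long long>(max_iters));
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
}
```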


    greptile-apps bot commented Jan 3, 2026

    Greptile Summary

    Refactored test infrastructure by extracting MultiDeviceFixture as a separate base class from MultiDeviceTest. This allows MultiDeviceTest to use multiple inheritance (from both NVFuserTest and MultiDeviceFixture), enabling better code reuse and separation of concerns.

    Key changes:

    • Created MultiDeviceFixture containing common multi-device testing utilities (communicator_, tensor_options_, debug_print, shardTensor methods)
    • MultiDeviceTest now inherits from both NVFuserTest and MultiDeviceFixture
    • Moved initialization logic to appropriate constructors based on class responsibilities
    • Removed unused <mutex> include
    • Removed comment about setting random seed (no related code was present)

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • Clean refactoring with proper separation of concerns, maintains backward compatibility for all existing tests that inherit from MultiDeviceTest, and follows C++ multiple inheritance best practices
    • No files require special attention

    Important Files Changed

• tests/cpp/multidevice.h: Extracted MultiDeviceFixture as a separate base class containing common testing utilities, allowing MultiDeviceTest to use multiple inheritance cleanly.
• tests/cpp/multidevice.cpp: Moved initialization logic from MultiDeviceTest to MultiDeviceFixture, removed the unused <mutex> include, cleaned up constructor responsibilities.

    Sequence Diagram

    sequenceDiagram
        participant TestRunner as Test Runner
        participant MDT as MultiDeviceTest
        participant NVFT as NVFuserTest
        participant MDF as MultiDeviceFixture
        participant Comm as Communicator
        
        TestRunner->>MDT: Construct MultiDeviceTest
        MDT->>NVFT: Call NVFuserTest()
        NVFT-->>MDT: Base initialized
        MDT->>MDF: Call MultiDeviceFixture()
        MDF->>Comm: getInstance()
        Comm-->>MDF: communicator instance
        MDF->>MDF: Setup tensor_options_
        MDF->>MDF: Setup debug_print
        MDF-->>MDT: Fixture initialized
        MDT->>MDT: Setup disable_skip
        MDT-->>TestRunner: Test object ready
        
        TestRunner->>MDT: SetUp()
        MDT->>NVFT: NVFuserTest::SetUp()
        NVFT-->>MDT: Base setup complete
        MDT->>MDT: Check communicator availability
        MDT-->>TestRunner: Setup complete
        
        TestRunner->>MDT: Run test
        Note over MDT,MDF: Test uses communicator_,<br/>tensor_options_ from fixture
        MDT-->>TestRunner: Test complete
        
        TestRunner->>MDT: Destroy MultiDeviceTest
        MDT->>MDF: Call ~MultiDeviceFixture()
        MDF->>Comm: barrier() if available
        MDF-->>MDT: Cleanup complete
        MDT-->>TestRunner: Destroyed
    
@wujingyue changed the title from "Create MultiDeviceFixture" to "Add a sample multi-GPU benchmark" on Jan 3, 2026
```cpp
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
```
@wujingyue (Collaborator Author) commented:

    Benchmarks tend to run longer and don't need to run as frequently as tests, so it's worth separating benchmarks from (correctness) tests.

    The question though is how.

1. In this version, the benchmarks are defined in the same set of files as tests, and I'm reusing the same main function, which detects flags like --benchmarks (roughly sketched after this list).
    2. Alternatively, I could write two main functions (one for tests and the other for benchmarks) and link them to different binaries (test_multidevice vs benchmark_multidevice).
    3. Furthermore, I could even split the test files and the benchmark files. It's harder to reuse code this way. For example, a FusionDefinition needs to be DRY'ed in order to be both tested and benchmarked.
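
For reference, a rough sketch of what option (1)'s combined main() could look like. The wantsBenchmarks and MultiDeviceTestEnvironment names come from the hunk quoted above, but the flag parsing and control flow here are simplified assumptions, not this PR's actual implementation:

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>
#include <cstring>

namespace {
// Simplified assumption: treat any --benchmarks flag as a request to run
// benchmarks instead of tests. The real detection logic may differ.
bool wantsBenchmarks(int argc, char** argv) {
  for (int i = 1; i < argc; ++i) {
    if (std::strncmp(argv[i], "--benchmarks", 12) == 0) {
      return true;
    }
  }
  return false;
}
}  // namespace

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  // (The real main() also registers nvfuser::MultiDeviceTestEnvironment here.)

  if (wantsBenchmarks(argc, argv)) {
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
  }
  return RUN_ALL_TESTS();
}
```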

Collaborator commented:

Option (1) might be simplest to use in the short term. Instead of two different commands, only an additional flag is needed. The downside is that tests and benchmarks do not have a clear distinction.

Option (2) is a good balance: it reuses code while maintaining different binaries, but it requires different commands for the validation and benchmarking parts.

For option (3), we could define common fusions in a path outside tests/benchmarks; however, the setup will still likely be repeated. Another downside I see is that there are multiple locations that need to be kept in sync.

Yet another option is to have these in the benchmark file together with validation, and allow arguments to disable either. The GitHub CI could then run only validation, whereas the nightly CI runs everything.

For now, what you have in the PR looks like a good starting point to at least unify how we create benchmarks. I am assuming you intend to modify

to use Google Benchmark as well?

@wujingyue (Collaborator Author) commented:

> I am assuming you intend to modify

    Yes, that'll likely be the first target.

@wujingyue requested a review from Priya2698 on January 3, 2026, 21:26
    wujingyue added a commit that referenced this pull request Jan 6, 2026
    ... to speed up CI and local runs
    
    The way forward could be to reduce `warmup_iters` and `timing_iters`
    and/or make this a benchmark (e.g.
    #5753) that doesn't run by default.
```cpp
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
```
Collaborator commented:

    Does this mean that we only run one of validation or benchmarking?

@wujingyue (Collaborator Author) commented:

    Yes, that has been a Google internal convention -- when the user specifies --benchmarks=all the default main function will run just the benchmarks. But I'm open to other contracts. Multi-GPU tests come with a customized main function so we can do whichever we prefer.

Collaborator commented:

    has both validation and benchmarking. It would be preferable to allow having both done together in a single run.

@wujingyue (Collaborator Author) commented:

> has both validation and benchmarking

    I suspect we are talking about different things.

    Nothing prevents a BENCHMARK_DEFINE_F from using comparison macros like EXPECT_EQ. That'll make a BENCHMARK_DEFINE_F on par with the runBenchmark function you pointed to.

    I'm asking whether a benchmark binary (e.g. multidevice_benchmark) or a combined binary running in benchmark mode (e.g. test_multidevice --benchmarks=all) should also run TEST_Fs (in addition to BENCHMARK_DEFINE_Fs). Wdyt?
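
To make that concrete, a toy sketch of a BENCHMARK_DEFINE_F that also validates its result with a gtest macro; the fixture and the computed value are placeholders, not code from this PR:

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

// Hypothetical fixture; stands in for the real MultiDeviceBenchmark.
class ValidatingBenchmark : public benchmark::Fixture {};

BENCHMARK_DEFINE_F(ValidatingBenchmark, Sum)(benchmark::State& state) {
  for (auto _ : state) {
    const int result = 1 + 1;  // placeholder for the benchmarked computation
    EXPECT_EQ(result, 2);      // gtest assertion mixed into the benchmark body
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK_REGISTER_F(ValidatingBenchmark, Sum)->Iterations(10);
```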

@Priya2698 (Collaborator) commented Jan 8, 2026:

    Got it.

> I'm asking whether a benchmark binary (e.g. multidevice_benchmark) or a combined binary running in benchmark mode (e.g. test_multidevice --benchmarks=all) should also run TEST_Fs (in addition to BENCHMARK_DEFINE_Fs). Wdyt?

    I think we should either run tests or benchmarks. Benchmarks can additionally validate the results, as you mentioned. In this case, my preference would be to link them to different binaries. Test binaries only run tests and benchmark binaries only run benchmarks. This behavior sounds the most predictable to me.

@wujingyue changed the title from "Add a sample multi-GPU benchmark" to "Add a toy multi-GPU benchmark" on Jan 7, 2026
@Priya2698 (Collaborator) left a comment:

    @wujingyue the PR is still in draft, should I review it now?

@wujingyue (Collaborator Author) commented:

> @wujingyue the PR is still in draft, should I review it now?

No, you don't need to. I added you to get some early feedback, but draft means "don't review".

