
Conversation

@wujingyue (Collaborator) commented Jan 3, 2026

We've been adding C++ multi-GPU benchmarks organically, e.g., `void benchmarkP2PCommunication()` and others.

Benchmarks tend to run longer and don't need to run as frequently as (correctness) tests. Therefore, it's worth separating benchmarks from tests.

This PR adds a sample multi-GPU benchmark based on https://github.com/google/benchmark, which nvFuser already uses for single-GPU benchmarks. This way:

  1. Tests and benchmarks are registered separately and therefore can be run separately (registration is sketched below).
  2. Benchmarks can leverage tools from Google Benchmark. For example, by default, a benchmark reports the average wall time of a non-warmup iteration and prints the times in a table.
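
For illustration, here is a minimal sketch of how a fixture-based benchmark can be defined and registered with Google Benchmark. The MultiDeviceBenchmark name and the Arg/Iterations values mirror the toy example in this PR, but the fixture body and the work inside the loop are hypothetical placeholders rather than the PR's actual code:

```cpp
#include <benchmark/benchmark.h>

// Hypothetical stand-in for the fixture added in this PR; the real one also
// derives from MultiDeviceFixture and owns a Communicator.
class MultiDeviceBenchmark : public benchmark::Fixture {};

BENCHMARK_DEFINE_F(MultiDeviceBenchmark, Reduction)(benchmark::State& state) {
  const auto arg = state.range(0);  // 4 or 8, from the Arg() calls below
  for (auto _ : state) {
    // Placeholder for the sharded reduction being benchmarked.
    benchmark::DoNotOptimize(arg);
  }
}

// Iterations(10) pins the iteration count so every rank runs the same number
// of collectives and no rank hangs waiting for a peer.
BENCHMARK_REGISTER_F(MultiDeviceBenchmark, Reduction)
    ->Arg(4)
    ->Arg(8)
    ->Iterations(10);
```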

Notes for customization (both are sketched below):

  1. The user can measure GPU time with manual timing.
  2. The user can request multiple metrics to be printed out via custom counters.
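
Both customizations can be sketched with standard Google Benchmark facilities (UseManualTime/SetIterationTime and state.counters) plus CUDA events. Everything below is illustrative rather than code from this PR, and the actual GPU work is elided:

```cpp
#include <benchmark/benchmark.h>
#include <cuda_runtime.h>

// Illustrative only: manual GPU timing via CUDA events plus a custom counter.
static void BM_GpuWorkManualTime(benchmark::State& state) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  double total_gpu_ms = 0.0;
  for (auto _ : state) {
    cudaEventRecord(start);
    // ... launch the GPU work being measured here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_gpu_ms += ms;
    // With UseManualTime(), the reported Time column comes from this call.
    state.SetIterationTime(ms / 1000.0);
  }

  // Custom counters show up as extra columns in the printed results table.
  state.counters["gpu_ms_per_iter"] =
      benchmark::Counter(total_gpu_ms, benchmark::Counter::kAvgIterations);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
BENCHMARK(BM_GpuWorkManualTime)->UseManualTime();
```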

Below is the repro for the toy benchmark:

```
$ mpirun -np 2 -output-filename /tmp/test_multidevice bin/test_multidevice --benchmarks=all

$ cat /tmp/test_multidevice/1/rank.0/stdout
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
MultiDeviceBenchmark/Reduction/4/iterations:10   20128420 ns     16788148 ns           10
MultiDeviceBenchmark/Reduction/8/iterations:10     100694 ns       100708 ns           10
```

@wujingyue marked this pull request as draft on January 3, 2026, 18:37
@wujingyue (Collaborator Author) commented:

!test


github-actions bot commented Jan 3, 2026

Review updated until commit 5da9fb0

Description

  • Refactor multi-device test infrastructure by separating concerns into MultiDeviceFixture base class

  • Add MultiDeviceBenchmark class for Google Benchmark integration with multi-device testing

  • Implement sample MultiDeviceBenchmark::Reduction benchmark for tensor reduction across devices

  • Update build system to include benchmark library dependencies

Changes walkthrough

Relevant files:

Enhancement

multidevice.h (tests/cpp/multidevice.h): Refactor test infrastructure with fixture separation

  • Rename MultiDeviceTest to MultiDeviceFixture base class
  • Create new MultiDeviceTest class inheriting from NVFuserTest and MultiDeviceFixture
  • Add MultiDeviceBenchmark class inheriting from benchmark::Fixture and MultiDeviceFixture
  • Add benchmark and gtest includes
  • +25/-7

multidevice.cpp (tests/cpp/multidevice.cpp): Implement fixture classes and benchmark integration

  • Implement MultiDeviceFixture constructor/destructor
  • Add MultiDeviceBenchmark::TearDown with barrier synchronization
  • Add benchmark detection logic in main() function
  • Update main() to run benchmarks when requested
  • +42/-5

Tests

test_multidevice_sharding.cpp (tests/cpp/test_multidevice_sharding.cpp): Add sample multi-device benchmark test

  • Add MultiDeviceBenchmark::Reduction benchmark test
  • Register benchmark with Arg(4), Arg(8), and Iterations(10)
  • Include benchmark header for Google Benchmark support
  • +36/-0

Configuration changes

CMakeLists.txt (CMakeLists.txt): Add benchmark library dependencies

  • Add benchmark include directory to test build configuration
  • Link benchmark library to test targets
  • +2/-0

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Constructor ordering

    The MultiDeviceFixture constructor is defined before MultiDeviceTest constructor, but MultiDeviceTest inherits from MultiDeviceFixture. This could lead to initialization order issues. The constructors should be properly ordered or MultiDeviceFixture should have a virtual destructor.

```cpp
MultiDeviceFixture::MultiDeviceFixture() {
  // Enable logging in c10d so debug messages can be printed out via
  // `TORCH_DISTRIBUTED_DEBUG`.
  c10d::setDebugLevelFromEnvironment();

  communicator_ = &Communicator::getInstance();
  tensor_options_ =
      at::TensorOptions().dtype(at::kFloat).device(communicator_->device());
  debug_print = getNvFuserEnv("MULTIDEVICE_DEBUG_PRINT") != nullptr;
}

MultiDeviceTest::MultiDeviceTest() {
  disable_skip = getNvFuserEnv("MULTIDEVICE_DISABLE_SKIP") != nullptr;
}
```
    Benchmark iteration consistency

    The comment mentions that iterations must be consistent across processes to prevent hanging, but there's no validation that all processes actually receive the same iteration count. Consider adding runtime validation.
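
One way to add the suggested validation, sketched here with raw MPI for brevity (the PR would presumably go through nvFuser's Communicator instead, whose API is not shown in this review), is to allreduce each rank's iteration count and fail fast on a mismatch:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>

// Illustrative sketch: verify that all ranks agree on the iteration count
// before entering the timed loop, so a mismatch aborts instead of hanging
// inside a collective.
void checkIterationCountMatches(int64_t local_iterations) {
  int64_t min_iters = 0;
  int64_t max_iters = 0;
  MPI_Allreduce(&local_iterations, &min_iters, 1, MPI_INT64_T, MPI_MIN, MPI_COMM_WORLD);
  MPI_Allreduce(&local_iterations, &max_iters, 1, MPI_INT64_T, MPI_MAX, MPI_COMM_WORLD);
  if (min_iters != max_iters) {
    std::fprintf(stderr, "Iteration count mismatch across ranks: min=%lld max=%lld\n",
                 static_cast<long long>(min_iters), static_cast<long long>(max_iters));
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
}
```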


    greptile-apps bot commented Jan 3, 2026

    Greptile Summary

    Refactored test infrastructure by extracting MultiDeviceFixture as a separate base class from MultiDeviceTest. This allows MultiDeviceTest to use multiple inheritance (from both NVFuserTest and MultiDeviceFixture), enabling better code reuse and separation of concerns.

    Key changes:

    • Created MultiDeviceFixture containing common multi-device testing utilities (communicator_, tensor_options_, debug_print, shardTensor methods)
    • MultiDeviceTest now inherits from both NVFuserTest and MultiDeviceFixture
    • Moved initialization logic to appropriate constructors based on class responsibilities
    • Removed unused <mutex> include
    • Removed comment about setting random seed (no related code was present)

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • Clean refactoring with proper separation of concerns, maintains backward compatibility for all existing tests that inherit from MultiDeviceTest, and follows C++ multiple inheritance best practices
    • No files require special attention

    Important Files Changed

• tests/cpp/multidevice.h: Extracted MultiDeviceFixture as a separate base class containing common testing utilities, allowing MultiDeviceTest to use multiple inheritance cleanly.
• tests/cpp/multidevice.cpp: Moved initialization logic from MultiDeviceTest to MultiDeviceFixture, removed the unused <mutex> include, cleaned up constructor responsibilities.

    Sequence Diagram

    sequenceDiagram
        participant TestRunner as Test Runner
        participant MDT as MultiDeviceTest
        participant NVFT as NVFuserTest
        participant MDF as MultiDeviceFixture
        participant Comm as Communicator
        
        TestRunner->>MDT: Construct MultiDeviceTest
        MDT->>NVFT: Call NVFuserTest()
        NVFT-->>MDT: Base initialized
        MDT->>MDF: Call MultiDeviceFixture()
        MDF->>Comm: getInstance()
        Comm-->>MDF: communicator instance
        MDF->>MDF: Setup tensor_options_
        MDF->>MDF: Setup debug_print
        MDF-->>MDT: Fixture initialized
        MDT->>MDT: Setup disable_skip
        MDT-->>TestRunner: Test object ready
        
        TestRunner->>MDT: SetUp()
        MDT->>NVFT: NVFuserTest::SetUp()
        NVFT-->>MDT: Base setup complete
        MDT->>MDT: Check communicator availability
        MDT-->>TestRunner: Setup complete
        
        TestRunner->>MDT: Run test
        Note over MDT,MDF: Test uses communicator_,<br/>tensor_options_ from fixture
        MDT-->>TestRunner: Test complete
        
        TestRunner->>MDT: Destroy MultiDeviceTest
        MDT->>MDF: Call ~MultiDeviceFixture()
        MDF->>Comm: barrier() if available
        MDF-->>MDT: Cleanup complete
        MDT-->>TestRunner: Destroyed
    
@wujingyue changed the title from "Create MultiDeviceFixture" to "Add a sample multi-GPU benchmark" on Jan 3, 2026
```cpp
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
```
@wujingyue (Collaborator Author) commented:

    Benchmarks tend to run longer and don't need to run as frequently as tests, so it's worth separating benchmarks from (correctness) tests.

    The question though is how.

1. In this version, the benchmarks are defined in the same set of files as tests, and I'm reusing the same main function, which detects flags like --benchmarks (roughly sketched after this list).
    2. Alternatively, I could write two main functions (one for tests and the other for benchmarks) and link them to different binaries (test_multidevice vs benchmark_multidevice).
    3. Furthermore, I could even split the test files and the benchmark files. It's harder to reuse code this way. For example, a FusionDefinition needs to be DRY'ed in order to be both tested and benchmarked.
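
For reference, a rough sketch of what option (1)'s combined main() could look like. The wantsBenchmarks and MultiDeviceTestEnvironment names come from the hunk quoted above, but the flag parsing and control flow here are simplified assumptions, not this PR's actual implementation:

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>
#include <cstring>

namespace {
// Simplified assumption: treat any --benchmarks flag as a request to run
// benchmarks instead of tests. The real detection logic may differ.
bool wantsBenchmarks(int argc, char** argv) {
  for (int i = 1; i < argc; ++i) {
    if (std::strncmp(argv[i], "--benchmarks", 12) == 0) {
      return true;
    }
  }
  return false;
}
}  // namespace

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  // (The real main() also registers nvfuser::MultiDeviceTestEnvironment here.)

  if (wantsBenchmarks(argc, argv)) {
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
  }
  return RUN_ALL_TESTS();
}
```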

Collaborator commented:

Option (1) might be simplest to use in the short term. Instead of two different commands, only an additional flag is needed. The downside is that tests and benchmarks do not have a clear distinction.

Option (2) is a good balance: it reuses code while maintaining different binaries, but it requires different commands for the validation and benchmarking parts.

For option (3), we could define common fusions in a path outside tests/benchmarks; however, the setup will still likely be repeated. Another downside I see is that there are multiple locations that need to be kept in sync.

Yet another option is to have these in the benchmark file together with validation, and allow arguments to disable either. The GitHub CI could then run only validation, whereas the nightly CI runs everything.

For now, what you have in the PR looks like a good starting point to at least unify how we create benchmarks. I am assuming you intend to modify

to use Google Benchmark as well?

@wujingyue (Collaborator Author) commented:

> I am assuming you intend to modify

    Yes, that'll likely be the first target.

@wujingyue requested a review from Priya2698 on January 3, 2026, 21:26
    wujingyue added a commit that referenced this pull request Jan 6, 2026
    ... to speed up CI and local runs
    
    The way forward could be to reduce `warmup_iters` and `timing_iters`
    and/or make this a benchmark (e.g.
    #5753) that doesn't run by default.
```cpp
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
```
Collaborator commented:

    Does this mean that we only run one of validation or benchmarking?

@wujingyue (Collaborator Author) commented:

    Yes, that has been a Google internal convention -- when the user specifies --benchmarks=all the default main function will run just the benchmarks. But I'm open to other contracts. Multi-GPU tests come with a customized main function so we can do whichever we prefer.

Collaborator commented:

    has both validation and benchmarking. It would be preferable to allow having both done together in a single run.

@wujingyue (Collaborator Author) commented:

> has both validation and benchmarking

    I suspect we are talking about different things.

    Nothing prevents a BENCHMARK_DEFINE_F from using comparison macros like EXPECT_EQ. That'll make a BENCHMARK_DEFINE_F on par with the runBenchmark function you pointed to.

    I'm asking whether a benchmark binary (e.g. multidevice_benchmark) or a combined binary running in benchmark mode (e.g. test_multidevice --benchmarks=all) should also run TEST_Fs (in addition to BENCHMARK_DEFINE_Fs). Wdyt?
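
To make that concrete, a toy sketch of a BENCHMARK_DEFINE_F that also validates its result with a gtest macro; the fixture and the computed value are placeholders, not code from this PR:

```cpp
#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

// Hypothetical fixture; stands in for the real MultiDeviceBenchmark.
class ValidatingBenchmark : public benchmark::Fixture {};

BENCHMARK_DEFINE_F(ValidatingBenchmark, Sum)(benchmark::State& state) {
  for (auto _ : state) {
    const int result = 1 + 1;  // placeholder for the benchmarked computation
    EXPECT_EQ(result, 2);      // gtest assertion mixed into the benchmark body
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK_REGISTER_F(ValidatingBenchmark, Sum)->Iterations(10);
```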

@Priya2698 (Collaborator) commented Jan 8, 2026:

    Got it.

> I'm asking whether a benchmark binary (e.g. multidevice_benchmark) or a combined binary running in benchmark mode (e.g. test_multidevice --benchmarks=all) should also run TEST_Fs (in addition to BENCHMARK_DEFINE_Fs). Wdyt?

    I think we should either run tests or benchmarks. Benchmarks can additionally validate the results, as you mentioned. In this case, my preference would be to link them to different binaries. Test binaries only run tests and benchmark binaries only run benchmarks. This behavior sounds the most predictable to me.

@wujingyue changed the title from "Add a sample multi-GPU benchmark" to "Add a toy multi-GPU benchmark" on Jan 7, 2026
@Priya2698 (Collaborator) left a comment:

    @wujingyue the PR is still in draft, should I review it now?

@wujingyue (Collaborator Author) commented:

> @wujingyue the PR is still in draft, should I review it now?

No, you don't need to. I added you to get some early feedback, but draft means "don't review".

