Add a toy multi-GPU benchmark #5753
base: main
Conversation
!test

Review updated until commit 5da9fb0

Description

Relevant files:
- Enhancement
- Tests
- Configuration changes
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Constructor ordering
Greptile Summary

Refactored test infrastructure by extracting a shared `MultiDeviceFixture` (used by `MultiDeviceTest`, as shown in the sequence diagram below).

Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant TestRunner as Test Runner
participant MDT as MultiDeviceTest
participant NVFT as NVFuserTest
participant MDF as MultiDeviceFixture
participant Comm as Communicator
TestRunner->>MDT: Construct MultiDeviceTest
MDT->>NVFT: Call NVFuserTest()
NVFT-->>MDT: Base initialized
MDT->>MDF: Call MultiDeviceFixture()
MDF->>Comm: getInstance()
Comm-->>MDF: communicator instance
MDF->>MDF: Setup tensor_options_
MDF->>MDF: Setup debug_print
MDF-->>MDT: Fixture initialized
MDT->>MDT: Setup disable_skip
MDT-->>TestRunner: Test object ready
TestRunner->>MDT: SetUp()
MDT->>NVFT: NVFuserTest::SetUp()
NVFT-->>MDT: Base setup complete
MDT->>MDT: Check communicator availability
MDT-->>TestRunner: Setup complete
TestRunner->>MDT: Run test
Note over MDT,MDF: Test uses communicator_,<br/>tensor_options_ from fixture
MDT-->>TestRunner: Test complete
TestRunner->>MDT: Destroy MultiDeviceTest
MDT->>MDF: Call ~MultiDeviceFixture()
MDF->>Comm: barrier() if available
MDF-->>MDT: Cleanup complete
MDT-->>TestRunner: Destroyed
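For orientation, here is a rough C++ sketch of the class layout the diagram describes. The member names (`communicator_`, `tensor_options_`) come from the summary above; the exact types, the `is_available()` check, and all signatures are assumptions rather than the PR's actual code.

```cpp
// Sketch only: inferred from the sequence diagram, not copied from the PR.
// Assumes the project's Communicator and NVFuserTest types (headers omitted).
class MultiDeviceFixture {
 protected:
  MultiDeviceFixture() : communicator_(&Communicator::getInstance()) {
    // tensor_options_ and debug_print_ are derived from the communicator here.
  }

  ~MultiDeviceFixture() {
    // Barrier on teardown to keep ranks in sync, if the communicator is usable.
    if (communicator_->is_available()) {
      communicator_->barrier();
    }
  }

  Communicator* communicator_ = nullptr;
  at::TensorOptions tensor_options_;
  bool debug_print_ = false;
};

// MultiDeviceTest keeps the gtest base and mixes in the shared fixture.
// Base subobjects are constructed in declaration order: NVFuserTest first,
// then MultiDeviceFixture -- the "constructor ordering" called out above.
class MultiDeviceTest : public NVFuserTest, public MultiDeviceFixture {
 protected:
  void SetUp() override {
    NVFuserTest::SetUp();
    if (!communicator_->is_available()) {
      GTEST_SKIP() << "This test requires an available communicator.";
    }
  }
};
```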
```
$ mpirun -np 2 -output-filename /tmp/test_multidevice bin/test_multidevice --benchmarks=all
$ cat /tmp/test_multidevice/1/rank.0/stdout
-----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
-----------------------------------------------------------------------------------------
MultiDeviceBenchmark/Reduction/4/iterations:10  20128420 ns     16788148 ns           10
MultiDeviceBenchmark/Reduction/8/iterations:10    100694 ns       100708 ns           10
```
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
Benchmarks tend to run longer and don't need to run as frequently as tests, so it's worth separating benchmarks from (correctness) tests.
The question though is how.
- In this version, the benchmarks are defined in the same set of files as tests, and I'm reusing the same main function, which detects flags like `--benchmarks` (see the sketch after this list).
- Alternatively, I could write two main functions (one for tests and the other for benchmarks) and link them to different binaries (`test_multidevice` vs `benchmark_multidevice`).
- Furthermore, I could even split the test files and the benchmark files. It's harder to reuse code this way; for example, a `FusionDefinition` needs to be DRY'ed in order to be both tested and benchmarked.
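For concreteness, a minimal sketch of what this combined entry point could look like. `wantsBenchmarks` and `MultiDeviceTestEnvironment` appear in the diff above, but the helper's body here is purely illustrative; the PR's actual flag handling may differ.

```cpp
#include <cstring>

#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

// nvfuser::MultiDeviceTestEnvironment comes from the PR's test utilities;
// its header is omitted in this sketch.

namespace {
// Illustrative stand-in: treat the run as a benchmark run if any argument
// starts with --benchmarks (e.g. --benchmarks=all).
bool wantsBenchmarks(int argc, char** argv) {
  for (int i = 1; i < argc; i++) {
    if (std::strncmp(argv[i], "--benchmarks", 12) == 0) {
      return true;
    }
  }
  return false;
}
} // namespace

int main(int argc, char** argv) {
  testing::InitGoogleTest(&argc, argv);
  testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

  if (wantsBenchmarks(argc, argv)) {
    // Benchmark mode: run only the registered benchmarks, skip TEST_Fs.
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    benchmark::Shutdown();
    return 0;
  }

  // Default mode: run the correctness tests.
  return RUN_ALL_TESTS();
}
```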
Option (1) might be the simplest to use in the short term. Instead of two different commands, only an additional flag is needed. The downside is that tests and benchmarks do not have a clear distinction.
Option (2) is a good balance: it reuses code while maintaining different binaries, but it requires different commands for the validation and benchmarking parts.
For option (3), we could define common fusions in a path outside tests/benchmarks; however, the setup will still likely be repeated. Another downside I see is that there are multiple locations which need to be kept in sync.
Yet another option is to keep these in the benchmark file with validation, and allow arguments to disable either. The GitHub CI would only run validation, whereas the nightly CI runs everything.
For now, what you have in the PR looks like a good starting point to at least unify how we create benchmarks. I am assuming you intend to modify:
`at::Tensor runBenchmark(`
> I am assuming you intend to modify
Yes, that'll likely be the first target.
> ... to speed up CI and local runs

The way forward could be to reduce `warmup_iters` and `timing_iters` and/or make this a benchmark (e.g. #5753) that doesn't run by default.
testing::InitGoogleTest(&argc, argv);
testing::AddGlobalTestEnvironment(new nvfuser::MultiDeviceTestEnvironment());

if (wantsBenchmarks(argc, argv)) {
Does this mean that we only run one of validation or benchmarking?
Yes, that has been a Google-internal convention: when the user specifies `--benchmarks=all`, the default main function runs just the benchmarks. But I'm open to other contracts. Multi-GPU tests come with a customized main function, so we can do whichever we prefer.
`at::Tensor runBenchmark(`
> has both validation and benchmarking

I suspect we are talking about different things.
Nothing prevents a `BENCHMARK_DEFINE_F` from using comparison macros like `EXPECT_EQ`. That'll make a `BENCHMARK_DEFINE_F` on par with the `runBenchmark` function you pointed to.
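For example, here is a self-contained (and entirely hypothetical) fixture just to illustrate the point; how a failed expectation is surfaced still depends on how the binary wires up gtest.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

#include <benchmark/benchmark.h>
#include <gtest/gtest.h>

// Hypothetical fixture and workload, unrelated to nvFuser.
class ToyBenchmark : public benchmark::Fixture {};

BENCHMARK_DEFINE_F(ToyBenchmark, SumOfOnes)(benchmark::State& state) {
  const int64_t n = state.range(0);
  const std::vector<int64_t> ones(n, 1);

  for (auto _ : state) {
    int64_t sum = std::accumulate(ones.begin(), ones.end(), int64_t{0});
    benchmark::DoNotOptimize(sum);
    // gtest assertion inside a benchmark body: validates while timing.
    EXPECT_EQ(sum, n);
  }
}

BENCHMARK_REGISTER_F(ToyBenchmark, SumOfOnes)->Arg(1 << 20);
```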
I'm asking whether a benchmark binary (e.g. `multidevice_benchmark`) or a combined binary running in benchmark mode (e.g. `test_multidevice --benchmarks=all`) should also run `TEST_F`s (in addition to `BENCHMARK_DEFINE_F`s). Wdyt?
Got it.
> I'm asking whether a benchmark binary (e.g. `multidevice_benchmark`) or a combined binary running in benchmark mode (e.g. `test_multidevice --benchmarks=all`) should also run `TEST_F`s (in addition to `BENCHMARK_DEFINE_F`s). Wdyt?
I think we should either run tests or benchmarks. Benchmarks can additionally validate the results, as you mentioned. In this case, my preference would be to link them to different binaries. Test binaries only run tests and benchmark binaries only run benchmarks. This behavior sounds the most predictable to me.
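Under that contract, the benchmark binary's entry point could stay close to google-benchmark's stock main. The sketch below assumes a hypothetical `setUpMultiDeviceEnvironment()` helper shared with the test binary; no such helper exists in the PR.

```cpp
#include <benchmark/benchmark.h>

// Hypothetical helpers shared with the test binary; in practice they would
// wrap whatever the gtest environment does today (communicator setup, barrier).
void setUpMultiDeviceEnvironment();
void tearDownMultiDeviceEnvironment();

int main(int argc, char** argv) {
  setUpMultiDeviceEnvironment();

  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  benchmark::Shutdown();

  tearDownMultiDeviceEnvironment();
  return 0;
}
```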
Priya2698 left a comment
@wujingyue the PR is still in draft, should I review it now?
No, you don't need to. I added you to get some early feedback, but draft means don't review.
We've been adding C++ multi-GPU benchmarks organically, e.g.,
- Fuser/benchmarks/cpp/p2p_communication.cpp (line 25 at d395676)
- Fuser/tests/cpp/test_multidevice_lower_communication_cuda.cpp (line 80 at d395676)
Benchmarks tend to run longer and don't need to run as frequently as (correctness) tests. Therefore, it's worth separating benchmarks from tests.
This PR adds a sample multi-GPU benchmark based on https://github.com/google/benchmark, which nvFuser already uses for single-GPU benchmarks. This way, multi-GPU benchmarks can follow the same conventions and reuse the existing multi-device test fixtures.
Notes for customization:
The repro for the toy benchmark is shown in the `mpirun` output above.
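The names in that output (`MultiDeviceBenchmark/Reduction/4/iterations:10`) follow google-benchmark's fixture naming scheme, so the toy benchmark is presumably registered along these lines. This is a sketch only; the fixture body and the reduction itself are placeholders, not the PR's code.

```cpp
#include <cstdint>

#include <benchmark/benchmark.h>

// Placeholder standing in for the PR's multi-device benchmark fixture.
class MultiDeviceBenchmark : public benchmark::Fixture {};

BENCHMARK_DEFINE_F(MultiDeviceBenchmark, Reduction)(benchmark::State& state) {
  const int64_t size = state.range(0);  // 4 or 8, matching the output above
  for (auto _ : state) {
    // The real benchmark would build and run a sharded reduction here.
    benchmark::DoNotOptimize(size);
  }
}

// Arg(4)/Arg(8) produce the "/4" and "/8" variants; Iterations(10) pins the
// iteration count, which is why the names end in "iterations:10".
BENCHMARK_REGISTER_F(MultiDeviceBenchmark, Reduction)
    ->Arg(4)
    ->Arg(8)
    ->Iterations(10);
```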