[RFC] Implement gloo abort for graceful shutdown #388
Aidyn-A wants to merge 7 commits into pytorch:main
|
Sorry for the delay. Are you able to add a test for this change? |
|
Ignore the CI breakage for now. I'm trying to revive the CI for this repository. |
Sure, I will add a test and resolve the merge conflicts soon. |
|
Hey @c-p-i-o how does the PR look to you? Do you think it is ready to merge? Please let me know if you have any comments. |
|
@c-p-i-o has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Hey @c-p-i-o can you please let me know what tests are failing? |
Sorry for the delay here.
|
|
I ended up on this PR while reviewing some NVIDIA framework docs. I am wondering what is blocking this, and whether the previously mentioned issues still exist and are blockers here. |


In pytorch/pytorch#130345 it was requested to implement a ProcessGroupGloo.shutdown() for faster recovery from distributed rank failures. This PR is a first step toward a proper shutdown. The second step would be implementing gloo::abort() within PyTorch's ProcessGroupGloo.
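
To illustrate the general idea of an abortable blocking wait, here is a minimal sketch. It is not the code in this PR and does not use any gloo API: AbortableWaiter, complete(), abort(), and abortReleasesBlockedWait() are hypothetical names chosen for the example. It assumes that "abort" means waking every thread blocked in a wait and having that wait fail with an exception, so the caller can tear down and recover instead of hanging on a dead rank.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical illustration only: none of these names come from gloo.
class AbortableWaiter {
 public:
  // Blocks until complete() or abort() is called.
  // Returns normally on completion and throws if the wait was aborted.
  void wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return done_ || aborted_; });
    if (aborted_) {
      throw std::runtime_error("wait aborted");
    }
  }

  // Marks the pending operation as finished and wakes all waiters.
  void complete() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
  }

  // Requests shutdown: every current and future wait() throws.
  void abort() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      aborted_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool done_ = false;
  bool aborted_ = false;
};

// Returns true if a wait() running on another thread was released by abort().
bool abortReleasesBlockedWait() {
  AbortableWaiter waiter;
  bool sawAbort = false;
  std::thread worker([&] {
    try {
      waiter.wait();
    } catch (const std::runtime_error&) {
      sawAbort = true;
    }
  });
  // Give the worker time to block; correctness does not depend on this,
  // because wait() also throws if abort() happened first.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  waiter.abort();
  worker.join();  // join() makes the worker's write to sawAbort visible here
  return sawAbort;
}
```

In this sketch the abort flag is checked under the same mutex as the completion flag, so a wait cannot miss an abort that races with it. A real transport would additionally need to interrupt blocking socket operations, which this example does not attempt.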