bugfix: add group_id when creating the HccLProcessGroup to support multiple communication process_group.#999
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a static group_id_ to uniquely identify HccLProcessGroup instances, which is a good approach to prevent naming conflicts. However, the current implementation of the counter is not thread-safe and could lead to race conditions if multiple process groups are created concurrently from different threads. My review includes suggestions to use std::atomic to ensure the thread-safety of the group ID generation, which is a critical fix to prevent potential runtime failures.
…ltiple communication process_group.
7788d37 to
db04471
Compare
| // that inidicates the comm name has already exist would occur, | ||
| // so for each time when we create a new process group, we should increase | ||
| // group_id_ and pass it to c10d_npu::ProcessGroupHCCL::Options | ||
| static int32_t group_id_; |
There was a problem hiding this comment.
don't understand the modification. does it means we can not create multiple ProcessGroupHCCL ? How did it work before.
There was a problem hiding this comment.
there won’t be explicit error when creating multiple process groups,but when using a second process group with any of the communication operation,the error occurs。 this error was not triggered because only tp process group was used for qwen3next
…ltiple communication process_group.