Hi,
Currently, I'm experimenting with MSCCL for customized collective ops. I noticed that MSCCL supports running custom collective communication algorithms on heterogeneous accelerators (NVIDIA and AMD GPUs). Does MSCCL perform heterogeneous communication between NVIDIA and AMD GPUs using a single customized operator that is lowered to CUDA and ROCm kernels respectively?
Thanks in advance for your reply.