feat(dag): fallback to CPU transport for TorchTensorType(transport='n… #64239

Open
caosfourn wants to merge 1 commit into ray-project:master from caosfourn:nccl-noncompiled-fallback

@caosfourn

Description

This PR implements a fallback mechanism to CPU/Shared Memory transport for TorchTensorType(transport="nccl") when executed outside of Compiled Graphs (i.e. in traditional non-compiled DAGs).

Currently, specifying the "nccl" or "accelerator" transport outside of Compiled Graphs raises an AssertionError (or crashes), because the communicator group (communicator_id) and communicator context are initialized only by the Compiled Graph compiler.

To support debugging and rapid prototyping in non-compiled mode, this PR adds a check inside TorchTensorType.create_channel(): when no communicator has been set up, it emits a UserWarning and automatically falls back to SharedMemoryType().create_channel().
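The fallback described above can be sketched in isolation. This is a minimal, self-contained illustration and not the code in this PR: `TensorTypeSketch`, `SharedMemoryChannelSketch`, and `AcceleratorChannelSketch` are hypothetical stand-ins for the Ray classes, and the warning text is an assumption.

```python
import warnings
from typing import Optional


class SharedMemoryChannelSketch:
    """Hypothetical stand-in for a host-memory (shared memory) channel."""

    kind = "shared_memory"


class AcceleratorChannelSketch:
    """Hypothetical stand-in for an accelerator (e.g. NCCL) channel."""

    kind = "accelerator"

    def __init__(self, communicator_id: Optional[str]) -> None:
        self.communicator_id = communicator_id


class TensorTypeSketch:
    """Hypothetical model of a tensor type hint; not Ray's TorchTensorType."""

    def __init__(
        self,
        transport: str = "auto",
        communicator_id: Optional[str] = None,
        communicator: Optional[object] = None,
    ) -> None:
        self.transport = transport
        self._communicator_id = communicator_id
        self._communicator = communicator

    def requires_accelerator(self) -> bool:
        # Assumption: these two transport names request an accelerator channel.
        return self.transport in ("nccl", "accelerator")

    def create_channel(self):
        if not self.requires_accelerator():
            return SharedMemoryChannelSketch()
        # No communicator was assigned, so nothing compiled this graph.
        if self._communicator_id is None and self._communicator is None:
            warnings.warn(
                f"transport={self.transport!r} was requested but no communicator "
                "is set up; falling back to shared-memory transport, which may "
                "be slower.",
                UserWarning,
                stacklevel=2,
            )
            return SharedMemoryChannelSketch()
        return AcceleratorChannelSketch(self._communicator_id)
```

In this sketch the accelerator path is only taken when a communicator is present, so a type hint that already carries a communicator ID behaves as before and emits no warning.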

Related issues

Related to #43328

Additional information

Implementation Details:

  1. python/ray/experimental/channel/torch_tensor_type.py:

    • Checks if self._communicator_id and self._communicator are both None when self.requires_accelerator() is true.
    • Emits a warning informing the user about the performance trade-off.
    • Redirects to host-memory SharedMemoryType channel creation.
  2. python/ray/dag/tests/experimental/test_non_compiled_nccl_dag.py:

    • Added a new unit test validating that non-compiled graphs with TorchTensorType(transport="nccl") emit a UserWarning and fall back to a functional CPU execution path without crashing.
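A test of this shape could look like the sketch below. It uses only the standard library and a hypothetical `create_channel_with_fallback` stub rather than Ray itself, so the function name, the `"group-0"` ID, and the warning text are assumptions and not the contents of test_non_compiled_nccl_dag.py.

```python
import warnings


def create_channel_with_fallback(transport, communicator_id=None):
    """Hypothetical stub: returns the name of the transport actually used."""
    needs_accelerator = transport in ("nccl", "accelerator")
    if needs_accelerator and communicator_id is None:
        warnings.warn(
            f"No communicator for transport={transport!r}; using shared memory.",
            UserWarning,
        )
        return "shared_memory"
    return "accelerator" if needs_accelerator else "shared_memory"


def test_fallback_warns_and_uses_shared_memory():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # ensure the warning is not filtered out
        used = create_channel_with_fallback("nccl")
    assert used == "shared_memory"
    assert any(issubclass(w.category, UserWarning) for w in caught)


def test_path_with_communicator_is_unaffected():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        used = create_channel_with_fallback("nccl", communicator_id="group-0")
    assert used == "accelerator"
    assert not caught  # no warning when a communicator ID is present


if __name__ == "__main__":
    test_fallback_warns_and_uses_shared_memory()
    test_path_with_communicator_is_unaffected()
```

Recording warnings with `warnings.catch_warnings(record=True)` keeps the sketch free of third-party dependencies; under pytest the same check is usually written with `pytest.warns(UserWarning)`.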

Testing:

  • Verified that traditional Compiled Graphs (where the compiler assigns a communicator_id) are unaffected.
  • Tested successfully using unit tests and mock integration suites.

@caosfourn caosfourn requested a review from a team as a code owner June 21, 2026 05:27

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a fallback mechanism to CPU/shared-memory transport with a warning when TorchTensorType requires an accelerator but is used outside of a Compiled Graph (i.e., _communicator_id is None). It also adds corresponding unit tests. A review comment points out that the warning message incorrectly refers to transport='nccl' instead of transport='accelerator', which is the correct option.

Comment thread python/ray/experimental/channel/torch_tensor_type.py
@caosfourn caosfourn force-pushed the nccl-noncompiled-fallback branch 2 times, most recently from 79a793e to 1a7ba86 Compare June 21, 2026 06:03
@ray-gardener ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Jun 21, 2026
…ccl') in non-compiled graphs

Signed-off-by: CaosFourN <huynhdnhannd@gmail.com>
@caosfourn caosfourn force-pushed the nccl-noncompiled-fallback branch from 1a7ba86 to 5c3bf9a Compare June 22, 2026 08:51