
feat(io): fall back RDMA backend to XGMI when no RDMA device is present #316

Merged: maning00 merged 1 commit into main from rdma-xgmi-fallback on May 15, 2026

maning00 commented May 13, 2026

Summary

Make IOEngine::CreateBackend(BackendType::RDMA, ...) behave gracefully on hosts without an active RDMA NIC. Previously the call would hit assert(availDevices.size() > 0) deep inside RdmaManager and abort with SIGABRT, which prevented SGLang's mori PD-disaggregation path (sgl-project/sglang#25094) from running on single-node setups even when XGMI/Infinity Fabric is available between local GPUs.
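
The difference between aborting and throwing can be sketched as follows. This is a minimal illustration, not mori's code: FakeContext, FakeManager, FakeBackend and TryCreate are hypothetical stand-ins for RdmaContext, RdmaManager, RdmaBackend and a caller, and the error message is illustrative. It shows a constructor that throws on an empty device list, and a unique_ptr member that is released automatically when a later member's construction throws.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for RdmaContext; counts live instances so a leak would be visible.
struct FakeContext {
  static int live;
  FakeContext() { ++live; }
  ~FakeContext() { --live; }
};
int FakeContext::live = 0;

// Hypothetical stand-in for RdmaManager: throws instead of asserting.
struct FakeManager {
  explicit FakeManager(const std::vector<std::string>& availDevices) {
    // An assert here would abort the process in Debug builds and
    // disappear entirely when NDEBUG is defined (typical Release builds).
    if (availDevices.empty()) {
      throw std::runtime_error("no active RDMA devices (illustrative message)");
    }
  }
};

// Hypothetical stand-in for RdmaBackend.
struct FakeBackend {
  std::unique_ptr<FakeContext> ctx;  // destroyed automatically if mgr's construction throws
  std::unique_ptr<FakeManager> mgr;

  explicit FakeBackend(const std::vector<std::string>& devices)
      : ctx(std::make_unique<FakeContext>()),
        mgr(std::make_unique<FakeManager>(devices)) {}
};

// Returns true if the backend could be constructed, false if construction threw.
bool TryCreate(const std::vector<std::string>& devices) {
  try {
    FakeBackend backend(devices);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

With an exception, the caller decides what to do next (report the error, or try another backend); with an assert, the process is gone before any such decision can be made.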

This PR is the mori-side counterpart: when no RDMA device exists and the operator opts in via MORI_DISABLE_AUTO_XGMI=0, CreateBackend(RDMA) is transparently rerouted to create an XgmiBackend instead. SGLang's existing engine.create_backend(BackendType.RDMA, rdma_cfg) keeps working unchanged on no-RDMA hosts.

Behavior

| Scenario | New behavior |
| --- | --- |
| Has RDMA device | Unchanged (creates RdmaBackend; EnsureXgmiBackendCreatedIfSupported() still runs) |
| No RDMA device, default env (MORI_DISABLE_AUTO_XGMI unset / not 0) | throw std::runtime_error("...; set MORI_DISABLE_AUTO_XGMI=0 to enable XGMI-only fallback") (no SIGABRT) |
| No RDMA device, opt-in (MORI_DISABLE_AUTO_XGMI=0), GPU P2P available | Reroutes to XgmiBackend, engine_desc.port = 1 (sentinel); one-line WARN log |
| No RDMA device, opt-in, no GPU P2P | throw std::runtime_error("...no usable GPU P2P; cannot create any backend"); startup fail-fast |
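
The four scenarios above reduce to a small decision function. The sketch below is an assumption-laden illustration rather than the PR's implementation: Chosen, XgmiFallbackOptedIn and DecideBackend are hypothetical names, the opt-in check assumes the variable must equal exactly "0", and the exception messages are illustrative rather than the exact strings used in mori.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical result type; mori's real code constructs backend objects instead.
enum class Chosen { Rdma, Xgmi };

// Assumption: the opt-in is active only when the variable is set to exactly "0".
// Pass the result of std::getenv("MORI_DISABLE_AUTO_XGMI") as envValue.
bool XgmiFallbackOptedIn(const char* envValue) {
  return envValue != nullptr && std::string(envValue) == "0";
}

// Mirrors the scenario table: RDMA wins when present; otherwise fall back
// only on explicit opt-in and only if GPU P2P is usable.
Chosen DecideBackend(bool hasRdmaDevice, bool optedIn, bool hasGpuP2p) {
  if (hasRdmaDevice) {
    return Chosen::Rdma;
  }
  if (!optedIn) {
    // Illustrative message, not the PR's exact text.
    throw std::runtime_error(
        "no RDMA device; set MORI_DISABLE_AUTO_XGMI=0 to enable XGMI-only fallback");
  }
  if (!hasGpuP2p) {
    // Fail fast at startup: neither transport can work.
    throw std::runtime_error("no RDMA device and no usable GPU P2P");
  }
  return Chosen::Xgmi;
}
```

Keeping the decision separate from backend construction makes each row of the table testable without real hardware.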

Implementation notes

  • RdmaManager ctor: assert → throw std::runtime_error (assert is compiled out in Release, masking failures).
  • RdmaBackend ctor: wrap RdmaContext* in unique_ptr to avoid leak on the new throw path.
  • RdmaBackend::CanHandle(): rejects sentinel-port remotes — otherwise a peer's BuildRdmaConn → Connect hits SYSCALL_RETURN_ZERO and exit(-1)s the whole peer process.
  • IOEngine::SelectBackend(): every routing path now checks CanHandle(); the previous "first backend wins" tail-fallback is replaced with return nullptr, so unhandled transfers surface ERR_BAD_STATE via the existing SELECT_BACKEND_AND_RETURN_IF_NONE macro.
  • Sentinel port 1: must be > 0 (SGLang asserts port > 0); unbindable by non-root, easy to spot in netstat. Defensive checks reject config.port == 1 and any RDMA ephemeral bind that lands on 1.
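
The routing notes above (sentinel port, CanHandle() checks, and nullptr instead of "first backend wins") can be sketched with simplified types. Everything here is hypothetical: EngineDesc, Backend, RdmaLikeBackend, XgmiLikeBackend and this SelectBackend signature are stand-ins, and the rule that the XGMI-like backend handles only same-node peers is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Sentinel advertised by an XGMI-only engine (value taken from the PR description).
constexpr uint16_t kXgmiOnlySentinelPort = 1;

// Hypothetical, simplified engine descriptor.
struct EngineDesc {
  int nodeId;
  uint16_t port;
};

struct Backend {
  virtual ~Backend() = default;
  virtual bool CanHandle(const EngineDesc& local, const EngineDesc& remote) const = 0;
};

struct RdmaLikeBackend : Backend {
  bool CanHandle(const EngineDesc&, const EngineDesc& remote) const override {
    // A sentinel-port remote has no RDMA listener, so refuse instead of
    // attempting a connection that cannot succeed.
    return remote.port != kXgmiOnlySentinelPort;
  }
};

struct XgmiLikeBackend : Backend {
  bool CanHandle(const EngineDesc& local, const EngineDesc& remote) const override {
    // Assumption for this sketch: XGMI only reaches peers on the same node.
    return local.nodeId == remote.nodeId;
  }
};

// Returns the first backend that accepts the transfer, or nullptr if none does.
// There is deliberately no "fall back to backends[0]" tail.
Backend* SelectBackend(const std::vector<std::unique_ptr<Backend>>& backends,
                       const EngineDesc& local, const EngineDesc& remote) {
  for (const auto& backend : backends) {
    assert(backend != nullptr);
    if (backend->CanHandle(local, remote)) {
      return backend.get();
    }
  }
  return nullptr;
}
```

Returning nullptr pushes the failure to the caller, where it can be reported as an error code, rather than handing the transfer to a backend that will fail later in a less recoverable way.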

Test plan

New Case* tests in tests/cpp/io/test_engine.cpp, registered in the cases list. Use MORI_RDMA_DEVICES="__mori_no_such_device_for_test__" to deterministically simulate "no RDMA":

  • rdma_backend_has_active_devices_returns_false_when_no_device
  • rdma_manager_throws_when_no_active_devices
  • create_backend_rdma_throws_by_default_when_no_rdma_device
  • create_backend_rdma_falls_back_to_xgmi_when_opted_in
  • create_backend_rdma_throws_when_opted_in_but_no_xgmi (no-GPU only)
  • explicit_xgmi_then_rdma_without_opt_in_still_throws
  • explicit_xgmi_then_rdma_with_opt_in_refreshes_port
  • rdma_backend_refuses_sentinel_port_config
  • select_backend_returns_null_for_cross_node_under_xgmi_only
  • rdma_backend_can_handle_rejects_sentinel_port_remote

Existing RDMA tests are guarded by if (!RdmaBackend::HasActiveDevices()) throw TestSkip(...) so the assert → throw change does not break test_engine on no-RDMA CI.
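
The skip guard can be illustrated with a tiny runner. This is a sketch under assumptions, not the harness in tests/cpp/io/test_engine.cpp: this TestSkip definition, Summary, Case, RunCases and HasActiveDevicesStub are hypothetical, with the stub standing in for RdmaBackend::HasActiveDevices() on a host without RDMA.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical skip signal; the real TestSkip type may be defined differently.
struct TestSkip : std::runtime_error {
  using std::runtime_error::runtime_error;
};

struct Summary {
  int passed = 0;
  int skipped = 0;
  int failed = 0;
};

using Case = std::pair<std::string, std::function<void()>>;

// Runs every case; a TestSkip counts as skipped, any other exception as failed.
Summary RunCases(const std::vector<Case>& cases) {
  Summary summary;
  for (const auto& testCase : cases) {
    try {
      testCase.second();
      ++summary.passed;
    } catch (const TestSkip&) {
      ++summary.skipped;
    } catch (const std::exception&) {
      ++summary.failed;
    }
  }
  return summary;
}

// Stand-in for RdmaBackend::HasActiveDevices() on a no-RDMA host.
bool HasActiveDevicesStub() { return false; }

// An RDMA-dependent case: skipped, not failed, when no device is present.
void CaseNeedsRdma() {
  if (!HasActiveDevicesStub()) {
    throw TestSkip("no active RDMA device");
  }
  // RDMA-specific checks would go here.
}
```

The TestSkip handler must come before the generic std::exception handler, since TestSkip derives from std::runtime_error and would otherwise be counted as a failure.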

Co-authored-by: Clint Greene <Clint.Greene@amd.com>

clintg6 commented May 14, 2026

Finished reviewing and testing the PR. I can confirm this enables the XGMI-only fallback path on nodes with no active RDMA devices when MORI_DISABLE_AUTO_XGMI=0 is set. In SGLang, we can gate that by setting MORI_DISABLE_AUTO_XGMI=0 only when SGLANG_MORI_USE_XGMI=1. I have updated the SGLang PR to let MORI handle XGMI selection. @maning00

maning00 merged commit 6ad812c into main on May 15, 2026
18 of 19 checks passed