feat(io): fall back RDMA backend to XGMI when no RDMA device is present #316
Merged
Co-authored-by: Clint Greene <Clint.Greene@amd.com>
Finished reviewing and testing the PR. I can confirm this enables the XGMI-only fallback path on nodes with no active RDMA devices when `MORI_DISABLE_AUTO_XGMI=0` is set. In SGLang, we can gate that by setting `MORI_DISABLE_AUTO_XGMI=0` only when `SGLANG_MORI_USE_XGMI=1`. I have updated the SGLang PR to let MORI handle XGMI selection. @maning00
Summary
Make `IOEngine::CreateBackend(BackendType::RDMA, ...)` behave gracefully on hosts without an active RDMA NIC. Previously the call would hit `assert(availDevices.size() > 0)` deep inside `RdmaManager` and raise SIGABRT, which prevents SGLang's mori PD-disaggregation path (sgl-project/sglang#25094) from running on single-node setups even when XGMI/Infinity Fabric is available between local GPUs.
This PR is the mori-side counterpart: when no RDMA device exists and the operator opts in via `MORI_DISABLE_AUTO_XGMI=0`, `CreateBackend(RDMA)` is transparently rerouted to create an `XgmiBackend` instead. SGLang's existing `engine.create_backend(BackendType.RDMA, rdma_cfg)` keeps working unchanged on no-RDMA hosts.

Behavior
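| Scenario | Result |
| --- | --- |
| RDMA device present | `RdmaBackend`; `EnsureXgmiBackendCreatedIfSupported()` still runs |
| No RDMA device, no opt-in (`MORI_DISABLE_AUTO_XGMI` unset / not `0`) | `throw std::runtime_error("...; set MORI_DISABLE_AUTO_XGMI=0 to enable XGMI-only fallback")` (no SIGABRT) |
| No RDMA device, opt-in (`MORI_DISABLE_AUTO_XGMI=0`), GPU P2P available | `XgmiBackend`, `engine_desc.port = 1` (sentinel); one-line `WARN` log |
| No RDMA device, opt-in, no usable GPU P2P | `throw std::runtime_error("...no usable GPU P2P; cannot create any backend")`; startup fail-fast |

Implementation notes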
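The decision flow in the Behavior section can be sketched as a small pure function. This is only an illustration under stated assumptions: `ChosenBackend`, `Decision`, `XgmiFallbackOptedIn`, and `DecideBackend` are hypothetical names, not mori's actual API, and the exception messages are placeholders rather than the real strings.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical types for illustration only; not part of mori.
enum class ChosenBackend { Rdma, Xgmi };

struct Decision {
  ChosenBackend backend;
  int port;  // 1 is the sentinel port; 0 here means "not decided by this sketch"
};

constexpr int kSentinelPort = 1;

// envValue is the raw value of MORI_DISABLE_AUTO_XGMI, or nullptr if unset.
// Only the exact value "0" counts as opting in to the XGMI-only fallback.
inline bool XgmiFallbackOptedIn(const char* envValue) {
  return envValue != nullptr && std::string(envValue) == "0";
}

// Mirrors the four scenarios: RDMA present, no opt-in, opt-in with GPU P2P,
// and opt-in without GPU P2P. Message text is illustrative.
inline Decision DecideBackend(bool hasRdmaDevice, bool optedIn, bool hasGpuP2p) {
  if (hasRdmaDevice) {
    return {ChosenBackend::Rdma, 0};
  }
  if (!optedIn) {
    throw std::runtime_error(
        "no active RDMA device; set MORI_DISABLE_AUTO_XGMI=0 to enable "
        "XGMI-only fallback");
  }
  if (!hasGpuP2p) {
    throw std::runtime_error("no usable GPU P2P; cannot create any backend");
  }
  return {ChosenBackend::Xgmi, kSentinelPort};
}
```

A caller would typically compute `optedIn` once, for example with `XgmiFallbackOptedIn(std::getenv("MORI_DISABLE_AUTO_XGMI"))` (which needs `<cstdlib>`), and treat a thrown `std::runtime_error` as a startup failure.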
- `RdmaManager` ctor: `assert` → `throw std::runtime_error` (assert is compiled out in Release, masking failures).
- `RdmaBackend` ctor: wrap `RdmaContext*` in `unique_ptr` to avoid a leak on the new throw path.
- `RdmaBackend::CanHandle()`: rejects sentinel-port remotes; otherwise a peer's `BuildRdmaConn → Connect` hits `SYSCALL_RETURN_ZERO` and `exit(-1)`s the whole peer process.
- `IOEngine::SelectBackend()`: every routing path now checks `CanHandle()`; the previous "first backend wins" tail-fallback is replaced with `return nullptr`, so unhandled transfers surface `ERR_BAD_STATE` via the existing `SELECT_BACKEND_AND_RETURN_IF_NONE` macro.
- Sentinel port `1`: must be `> 0` (SGLang asserts `port > 0`); unbindable by non-root, easy to spot in netstat. Defensive checks reject `config.port == 1` and any RDMA ephemeral bind that lands on `1`.

Test plan
New `Case*` tests in `tests/cpp/io/test_engine.cpp`, registered in the `cases` list. Use `MORI_RDMA_DEVICES="__mori_no_such_device_for_test__"` to deterministically simulate "no RDMA":

- `rdma_backend_has_active_devices_returns_false_when_no_device`
- `rdma_manager_throws_when_no_active_devices`
- `create_backend_rdma_throws_by_default_when_no_rdma_device`
- `create_backend_rdma_falls_back_to_xgmi_when_opted_in`
- `create_backend_rdma_throws_when_opted_in_but_no_xgmi` (no-GPU only)
- `explicit_xgmi_then_rdma_without_opt_in_still_throws`
- `explicit_xgmi_then_rdma_with_opt_in_refreshes_port`
- `rdma_backend_refuses_sentinel_port_config`
- `select_backend_returns_null_for_cross_node_under_xgmi_only`
- `rdma_backend_can_handle_rejects_sentinel_port_remote`

Existing RDMA tests are guarded by `if (!RdmaBackend::HasActiveDevices()) throw TestSkip(...)` so the `assert → throw` change does not break test_engine on no-RDMA CI.
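The skip guard described above might look roughly like the following. This is a minimal sketch, not the code in `test_engine.cpp`: the `TestSkip` definition, the `Outcome` enum, `RunCase`, and `RdmaOnlyCase` are assumptions made for illustration, and a plain `bool` stands in for `RdmaBackend::HasActiveDevices()`.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the TestSkip type mentioned in the test plan.
struct TestSkip : std::runtime_error {
  using std::runtime_error::runtime_error;
};

enum class Outcome { Passed, Skipped, Failed };

// Runs one test case and classifies the result. TestSkip is caught before
// the generic handler so a skip is never reported as a failure.
inline Outcome RunCase(const std::function<void()>& body) {
  try {
    body();
    return Outcome::Passed;
  } catch (const TestSkip&) {
    return Outcome::Skipped;
  } catch (const std::exception&) {
    return Outcome::Failed;
  }
}

// hasActiveDevices stands in for RdmaBackend::HasActiveDevices().
inline void RdmaOnlyCase(bool hasActiveDevices) {
  if (!hasActiveDevices) {
    throw TestSkip("no active RDMA device; skipping RDMA-only case");
  }
  // RDMA-specific checks would go here.
}
```

With this shape, a case run on a no-RDMA host reports `Outcome::Skipped` instead of failing on the exception that now replaces the old `assert`.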