[CRL] Add & implement P2P interface for integration with nixl#466
[CRL] Add & implement P2P interface for integration with nixl#466yzhang35 wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9b01fce64
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } devCtx = {it->second.ibDevN}; | ||
| engine->adaptor->deregMr(&devCtx, it->second.mhandle); | ||
| } | ||
| gMemRegInfo.clear(); |
There was a problem hiding this comment.
Scope MR teardown to the engine being destroyed
flagcxP2pEngineDestroy clears the process-global gMemRegInfo, so destroying any one engine deregisters memory regions that were registered by other still-active engines in the same process. In a multi-engine setup (e.g., separate client/server engines), subsequent transfers on the surviving engine can fail because its MR handles have been removed underneath it.
Useful? React with 👍 / 👎.
| if (conn == NULL || numIovs <= 0 || transferId == NULL) | ||
| return -1; |
There was a problem hiding this comment.
Validate read-vector argument lengths before indexing
This guard only checks numIovs > 0, but the function later indexes dstVec[i], sizeVec[i], and descs[i] up to numIovs - 1 without verifying those vectors are at least that long. If a caller provides mismatched lengths, this will read past vector bounds and can crash or corrupt transfer parameters.
Useful? React with 👍 / 👎.
| if (conn == NULL || numIovs <= 0 || transferId == NULL) | ||
| return -1; |
There was a problem hiding this comment.
Validate write-vector argument lengths before indexing
Like the read path, this entry check does not ensure dstVec, sizeVec, and descs contain at least numIovs elements before the loop indexes them. A malformed or partially constructed vectored write request can trigger out-of-bounds access and undefined behavior during transfer setup.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Adds a new FlagCX P2P (one-sided RDMA) engine interface and implementation intended for NIXL integration, along with unit tests, a perf benchmark driver, and build/install plumbing to produce a separate libflagcx_p2p.so and export the new public header.
Changes:
- Introduces a new public P2P engine API (
flagcx_p2p.h) and its implementation (flagcx_p2p.cc) for RDMA READ/WRITE plus metadata/notification helpers. - Adds new P2P READ-path unit tests and updates the P2P unittest Makefile to build GoogleTest objects directly.
- Adds a NIXL-oriented FlagCX backend benchmark script and updates the top-level build/install targets for
libflagcx_p2p.soand public headers.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
flagcx/include/flagcx_p2p.h |
New public P2P engine API surface (descriptors, connect/accept, reg, read/write, metadata, notifications, IPC helpers). |
flagcx/core/flagcx_p2p.cc |
New implementation of the P2P engine on top of the IB P2P net adaptor + topology + notification side-channel. |
Makefile |
Builds/installs libflagcx_p2p.so and installs public headers; copies headers into build output tree. |
test/unittest/p2p/test_p2p_engine_read.cpp |
Adds unit tests for one-sided READ path (metadata handshake + RDMA read + completion polling). |
test/unittest/p2p/Makefile |
Links unit tests with locally-built gtest objects instead of $(GTEST_LINK). |
test/perf/p2p/benchmark_flagcx.py |
Adds a ZMQ-coordinated benchmark script to exercise NIXL + FlagCX backend transfers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #if defined(__linux__) | ||
| if (engine->notifEpollFd >= 0) { | ||
| struct epoll_event event; | ||
| memset(&event, 0, sizeof(event)); | ||
| event.data.fd = fd; | ||
| event.events = EPOLLIN | EPOLLET; | ||
| #ifdef EPOLLRDHUP |
| * @param inPython Whether the engine is being created from Python. | ||
| * @return Pointer to the engine instance, or NULL on failure. | ||
| */ | ||
| FlagcxP2pEngine *flagcxP2pEngineCreate(int numCpus, bool inPython); |
There was a problem hiding this comment.
This API declaration is inherited from UCCL, but we don't actually need numCpus and inPython in our implementation. Removing these 2 parameters might make the code more understandable
| all: $(LIBDIR)/$(TARGET) | ||
| FLAGCX_P2P_TARGET = libflagcx_p2p.so | ||
| FLAGCX_P2P_SRC = flagcx/core/flagcx_p2p.cc | ||
| FLAGCX_P2P_OBJ = $(OBJDIR)/flagcx/core/flagcx_p2p.o |
There was a problem hiding this comment.
Currently, flagcx_p2p.cc is compiled into both libflagcx.so and libflagcx_p2p.so, some global variables may exist in both so files. Since libflagcx_p2p.so can't live on its own, a better approach would be to compile everything into libflagcx.so
|
|
||
| extern struct flagcxNetAdaptor flagcxNetIbP2p; | ||
|
|
||
| struct FlagcxP2pMrHandleView { |
There was a problem hiding this comment.
FlagcxP2pMrHandleView -> flagcxP2pMrHandleView
| int ibDevN; | ||
| }; | ||
|
|
||
| struct FlagcxP2pListenHandleView { |
PR Category
CRL (Communication Runtime Layer)
PR Types
New Features
PR Description
Add & implement P2P interface for integration with nixl