Skip to content

[CRL] Add & implement P2P interface for integration with nixl#466

Open
yzhang35 wants to merge 15 commits into
flagos-ai:mainfrom
yzhang35:main
Open

[CRL] Add & implement P2P interface for integration with nixl#466
yzhang35 wants to merge 15 commits into
flagos-ai:mainfrom
yzhang35:main

Conversation

@yzhang35
Copy link
Copy Markdown
Collaborator

PR Category

CRL (Communication Runtime Layer)

PR Types

New Features

PR Description

Add & implement P2P interface for integration with nixl

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9b01fce64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread flagcx/core/flagcx_p2p.cc
} devCtx = {it->second.ibDevN};
engine->adaptor->deregMr(&devCtx, it->second.mhandle);
}
gMemRegInfo.clear();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope MR teardown to the engine being destroyed

flagcxP2pEngineDestroy clears the process-global gMemRegInfo, so destroying any one engine deregisters memory regions that were registered by other still-active engines in the same process. In a multi-engine setup (e.g., separate client/server engines), subsequent transfers on the surviving engine can fail because its MR handles have been removed underneath it.

Useful? React with 👍 / 👎.

Comment thread flagcx/core/flagcx_p2p.cc
Comment on lines +1282 to +1283
if (conn == NULL || numIovs <= 0 || transferId == NULL)
return -1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate read-vector argument lengths before indexing

This guard only checks numIovs > 0, but the function later indexes dstVec[i], sizeVec[i], and descs[i] up to numIovs - 1 without verifying those vectors are at least that long. If a caller provides mismatched lengths, this will read past vector bounds and can crash or corrupt transfer parameters.

Useful? React with 👍 / 👎.

Comment thread flagcx/core/flagcx_p2p.cc
Comment on lines +1408 to +1409
if (conn == NULL || numIovs <= 0 || transferId == NULL)
return -1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate write-vector argument lengths before indexing

Like the read path, this entry check does not ensure dstVec, sizeVec, and descs contain at least numIovs elements before the loop indexes them. A malformed or partially constructed vectored write request can trigger out-of-bounds access and undefined behavior during transfer setup.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new FlagCX P2P (one-sided RDMA) engine interface and implementation intended for NIXL integration, along with unit tests, a perf benchmark driver, and build/install plumbing to produce a separate libflagcx_p2p.so and export the new public header.

Changes:

  • Introduces a new public P2P engine API (flagcx_p2p.h) and its implementation (flagcx_p2p.cc) for RDMA READ/WRITE plus metadata/notification helpers.
  • Adds new P2P READ-path unit tests and updates the P2P unittest Makefile to build GoogleTest objects directly.
  • Adds a NIXL-oriented FlagCX backend benchmark script and updates the top-level build/install targets for libflagcx_p2p.so and public headers.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
flagcx/include/flagcx_p2p.h New public P2P engine API surface (descriptors, connect/accept, reg, read/write, metadata, notifications, IPC helpers).
flagcx/core/flagcx_p2p.cc New implementation of the P2P engine on top of the IB P2P net adaptor + topology + notification side-channel.
Makefile Builds/installs libflagcx_p2p.so and installs public headers; copies headers into build output tree.
test/unittest/p2p/test_p2p_engine_read.cpp Adds unit tests for one-sided READ path (metadata handshake + RDMA read + completion polling).
test/unittest/p2p/Makefile Links unit tests with locally-built gtest objects instead of $(GTEST_LINK).
test/perf/p2p/benchmark_flagcx.py Adds a ZMQ-coordinated benchmark script to exercise NIXL + FlagCX backend transfers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread flagcx/include/flagcx_p2p.h
Comment thread flagcx/include/flagcx_p2p.h Outdated
Comment thread flagcx/core/flagcx_p2p.cc
Comment thread flagcx/core/flagcx_p2p.cc
Comment on lines +470 to +476
#if defined(__linux__)
if (engine->notifEpollFd >= 0) {
struct epoll_event event;
memset(&event, 0, sizeof(event));
event.data.fd = fd;
event.events = EPOLLIN | EPOLLET;
#ifdef EPOLLRDHUP
Comment thread flagcx/core/flagcx_p2p.cc
Comment thread test/perf/p2p/benchmark_flagcx.py
Comment thread Makefile Outdated
Comment thread flagcx/include/flagcx_p2p.h
Comment thread flagcx/core/flagcx_p2p.cc
Comment thread flagcx/core/flagcx_p2p.cc
Comment thread flagcx/include/flagcx_p2p.h Outdated
* @param inPython Whether the engine is being created from Python.
* @return Pointer to the engine instance, or NULL on failure.
*/
FlagcxP2pEngine *flagcxP2pEngineCreate(int numCpus, bool inPython);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API declaration is inherited from UCCL, but we don't actually need numCpus and inPython in our implementation. Removing these 2 parameters might make the code more understandable

Comment thread Makefile Outdated
all: $(LIBDIR)/$(TARGET)
FLAGCX_P2P_TARGET = libflagcx_p2p.so
FLAGCX_P2P_SRC = flagcx/core/flagcx_p2p.cc
FLAGCX_P2P_OBJ = $(OBJDIR)/flagcx/core/flagcx_p2p.o
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, flagcx_p2p.cc is compiled into both libflagcx.so and libflagcx_p2p.so, some global variables may exist in both so files. Since libflagcx_p2p.so can't live on its own, a better approach would be to compile everything into libflagcx.so

@MC952-arch MC952-arch changed the title Add & implement P2P interface for integration with nixl [CRL] Add & implement P2P interface for integration with nixl May 9, 2026
Comment thread flagcx/core/flagcx_p2p.cc

extern struct flagcxNetAdaptor flagcxNetIbP2p;

struct FlagcxP2pMrHandleView {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FlagcxP2pMrHandleView -> flagcxP2pMrHandleView

Comment thread flagcx/core/flagcx_p2p.cc
int ibDevN;
};

struct FlagcxP2pListenHandleView {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Copy Markdown
Collaborator

@mikethegoblin mikethegoblin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants