Add Python-side guardrail for DeepEP IB limits #4719

Merged
ko3n1g merged 2 commits into NVIDIA:main from janEbert:fix-deepep-buffer-guard
May 11, 2026

@janEbert
Contributor

The previous attempt in #4094 checked the limits incorrectly: it was overly conservative, which was only noticed in Megatron-Bridge tests, and it had to be reverted (#4718). This PR adds considerably more code that essentially re-computes DeepEP's internal handling. That is necessary to properly handle all the cases DeepEP considers.

Another approach we could try is to filter the error raised by DeepEP and emit a more helpful error message when we encounter the known one.
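
As a rough illustration of that alternative (it is not what this PR implements), the error filtering could look something like the sketch below. The names `call_with_friendlier_error` and `GuardrailError` are hypothetical, and the actual DeepEP error text is not quoted in this PR, so it is passed in as the `known_substring` parameter rather than guessed. The sketch also assumes the failure surfaces in Python as a `RuntimeError`.

```python
from typing import Any, Callable


class GuardrailError(RuntimeError):
    """Hypothetical error type that carries a more helpful message."""


def call_with_friendlier_error(
    fn: Callable[..., Any],
    *args: Any,
    known_substring: str,
    hint: str,
    **kwargs: Any,
) -> Any:
    """Call `fn` and translate one known error into a more helpful one.

    `known_substring` is a placeholder for the text of the known error;
    `hint` is the extra guidance to show to the user.
    """
    try:
        return fn(*args, **kwargs)
    except RuntimeError as err:
        # Only rewrite the error we recognize; everything else passes through.
        if known_substring in str(err):
            raise GuardrailError(f"{hint} (original error: {err})") from err
        raise
```

The trade-off of this design is that it depends on the wording of the underlying error staying stable, whereas re-computing the limits up front does not.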

Ref #4094, #4718, fix #3999.

@janEbert janEbert requested review from a team as code owners May 11, 2026 08:56
@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 11, 2026 08:57
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@janEbert janEbert added the "Run MBridge tests" label (Attach this for testing this PR against MBridge main) May 11, 2026
janEbert added 2 commits May 11, 2026 11:45
Also rename `seq_len` -> `num_tokens` for better understandability.
@janEbert janEbert force-pushed the fix-deepep-buffer-guard branch from ef74c85 to 5c723f5 Compare May 11, 2026 09:45
@janEbert janEbert marked this pull request as ready for review May 11, 2026 12:05
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 11, 2026 12:05
@ko3n1g
Contributor

ko3n1g commented May 11, 2026

PR passes internal MBridge tests

@ko3n1g ko3n1g merged commit ad58411 into NVIDIA:main May 11, 2026
62 of 65 checks passed
svcnvidia-nemo-ci added a commit that referenced this pull request May 12, 2026
Merges 8 commits from main into dev. Dev already contains yesterday's
sync (PR #4716) plus follow-up fixes, so this PR only carries main
commits made after that sync.

Notable changes:
- 434368c build(deps): bump nvidia-modelopt to 0.43 (#4723)
- e42e2fa ci: Major refactor of release-workflows (#4602)
- 33d47e0 [ci] fix: treat cancelled run-main-script step as failure (#4727)
- 5123f6a ci: revert bad uv.lock bump and label future bumps with
  Run functional tests (#4730)
- ad58411 Add Python-side guardrail for DeepEP IB limits (#4719)
- e93755e chore(beep boop): Bump (main) (2026-05-11)
- a2ec5c1 Revert Add Python-side guardrail for HybridEP IB limit (#4718)
- 5e31514 Create a Protocol for the MLP layer of TransformerLayer (#3435)

Kept dev's pyproject.toml, uv.lock, docker/Dockerfile.ci.dev, and
.github/CODEOWNERS (per nightly-sync skill).

Ran black + isort on changed Python files.

Labels

complexity: low
Run MBridge tests (Attach this for testing this PR against MBridge main)

Development

Successfully merging this pull request may close these issues.

[Bug] HybridEP dispatcher passes incorrect max_num_of_tokens_per_rank to DeepEP, causing RDMA QP assertion failure

3 participants