[ci, build] chore: Revert PR #3702 (hybridep docker move) to investigate VLM perf regression #3729
Merged
Conversation
…om fw_base to Dockerfile.ci (#3702)"

This reverts commit dec6d63. L0_Launch_converter::test_generate_vlm regressed from ~1m44s to ~9m starting at this commit. Reverting to confirm whether the DeepEP install (and its env vars HYBRID_EP_MULTINODE / LD_LIBRARY_PATH) being baked into the CI image is the cause.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Contributor
Author
/ok to test c77b31a
Contributor
LGTM -- clean revert of #3702. The DeepEP install block moves back to Dockerfile.fw_base (which CI does not build), removing DeepEP and its env vars (HYBRID_EP_MULTINODE, RDMA_CORE_HOME, LD_LIBRARY_PATH prepend) from the CI image. README updates are correct.

Suggested test cases -- no perf tests impacted. The test plan in the PR body covers the right cases: (1) L0_Launch_converter passes within its 30-minute budget, (2) test_generate_vlm returns to the ~1m44s baseline, (3) the L2 full test suite is green (no DeepEP-dependent test regressed).
ko3n1g added a commit that referenced this pull request on May 7, 2026
… move)

Reverts 4f21a31 (#3729), restoring the hybridep docker install move from Dockerfile.fw_base into Dockerfile.ci that PR #3702 originally introduced. Done to retest whether the VLM perf regression suspected to be caused by the move is still reproducible on current main.

Original revert message: This reverts commit 4f21a31.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Summary
Revert of #3702 (dec6d631) to validate it as the cause of a 5× VLM test slowdown.

Why

tests/functional_tests/test_groups/converter/test_generate_vlm_from_hf.py::TestGenerateVLMFromHF::test_generate_vlm regressed from ~1m44s → ~9m in L0_Launch_converter starting exactly at commit dec6d631. Symptoms persist on every L0 run since. Combined with transient HF Hub flakiness on PR #3719 (lockfile bump), this pushed L0_Launch_converter over its 30-minute budget.

PR #3702 moved the DeepEP/NVSHMEM install from Dockerfile.fw_base → Dockerfile.ci. Notably, the CI workflow builds Dockerfile.ci directly on nvcr.io/nvidia/pytorch:26.04-py3 (not on fw-base), so prior to PR #3702 DeepEP was not present in the CI image at all. After PR #3702, the CI image bakes in DeepEP plus the env vars HYBRID_EP_MULTINODE=1, RDMA_CORE_HOME, and a prepended LD_LIBRARY_PATH=/usr/local/cuda/lib64/... -- any of which could plausibly affect Qwen2.5-VL-3B-Instruct distributed init.

Test plan

- L0_Launch_converter passes within budget
- test_generate_vlm returns to ~1m44s

cc @okoenig
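As a quick way to confirm whether the revert actually removed the baked-in variables from the CI image, one could run a small diagnostic inside the container. This is a hedged sketch, not part of the PR: the idea of invoking it via `docker run`, the placeholder image name, and the script name are assumptions; only the variable names come from the PR description.

```shell
#!/usr/bin/env bash
# Diagnostic sketch: report whether the DeepEP-related env vars that
# PR #3702 baked into the CI image are present in the current environment.
# Intended to be run inside the CI container, e.g.
#   docker run --rm <ci-image> bash check_deepep_env.sh
# (<ci-image> and the script name are placeholders, not from this PR).
deepep_env_report() {
  for v in HYBRID_EP_MULTINODE RDMA_CORE_HOME; do
    # printenv exits non-zero when the variable is unset
    if val="$(printenv "$v")"; then
      echo "$v=$val"
    else
      echo "$v is unset"
    fi
  done
}
deepep_env_report
```

After the revert merges, both variables should report as unset in a freshly built CI image; before it, HYBRID_EP_MULTINODE should report 1.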