[ci, build] chore: Revert PR #3702 (hybridep docker move) to investigate VLM perf regression #3729
Merged
Conversation
…om fw_base to Dockerfile.ci (#3702)"

This reverts commit dec6d63. L0_Launch_converter::test_generate_vlm regressed from ~1m44s to ~9m starting at this commit. Reverting to confirm whether the DeepEP install (and its env vars HYBRID_EP_MULTINODE / LD_LIBRARY_PATH) being baked into the CI image is the cause.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Contributor
Author
/ok to test c77b31a
Contributor
LGTM -- clean revert of #3702. The DeepEP install block moves back to Dockerfile.fw_base (which CI does not build), removing DeepEP and its env vars (HYBRID_EP_MULTINODE, RDMA_CORE_HOME, LD_LIBRARY_PATH prepend) from the CI image. README updates are correct.

Suggested test cases -- no perf tests impacted. The test plan in the PR body covers the right cases: (1) L0_Launch_converter passes within its 30-minute budget, (2) test_generate_vlm returns to the ~1m44s baseline, (3) the L2 full test suite is green (no DeepEP-dependent test regressed).
ko3n1g added a commit that referenced this pull request on May 7, 2026
… move)

Reverts 4f21a31 (#3729), restoring the hybridep docker install move from Dockerfile.fw_base into Dockerfile.ci that PR #3702 originally introduced. Done to retest whether the VLM perf regression suspected to be caused by the move is still reproducible on current main.

Original revert message: This reverts commit 4f21a31.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Summary
Revert of #3702 (dec6d631) to validate it as the cause of a 5× VLM test slowdown.

Why

tests/functional_tests/test_groups/converter/test_generate_vlm_from_hf.py::TestGenerateVLMFromHF::test_generate_vlm regressed from ~1m44s → ~9m in L0_Launch_converter starting exactly at commit dec6d631. Symptoms persist on every L0 run since. Combined with transient HF Hub flakiness on PR #3719 (lockfile bump), this pushed L0_Launch_converter over its 30-minute budget.

PR #3702 moved the DeepEP/NVSHMEM install from Dockerfile.fw_base → Dockerfile.ci. Notably, the CI workflow builds Dockerfile.ci directly on nvcr.io/nvidia/pytorch:26.04-py3 (not on fw-base), so prior to PR #3702 DeepEP was not present in the CI image at all. After PR #3702, the CI image bakes in DeepEP plus the env vars HYBRID_EP_MULTINODE=1, RDMA_CORE_HOME, and a prepended LD_LIBRARY_PATH=/usr/local/cuda/lib64/... -- any of which could plausibly affect Qwen2.5-VL-3B-Instruct distributed init.

Test plan

- L0_Launch_converter passes within budget
- test_generate_vlm returns to ~1m44s

cc @okoenig
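As a quick way to confirm whether the revert actually removed the baked-in variables from the CI image, one could run a small diagnostic inside the container. This is a hedged sketch, not part of the PR: the idea of invoking it via `docker run`, the placeholder image name, and the script name are assumptions; only the variable names come from the PR description.

```shell
#!/usr/bin/env bash
# Diagnostic sketch: report whether the DeepEP-related env vars that
# PR #3702 baked into the CI image are present in the current environment.
# Intended to be run inside the CI container, e.g.
#   docker run --rm <ci-image> bash check_deepep_env.sh
# (<ci-image> and the script name are placeholders, not from this PR).
deepep_env_report() {
  for v in HYBRID_EP_MULTINODE RDMA_CORE_HOME; do
    # printenv exits non-zero when the variable is unset
    if val="$(printenv "$v")"; then
      echo "$v=$val"
    else
      echo "$v is unset"
    fi
  done
}
deepep_env_report
```

After the revert merges, both variables should report as unset in a freshly built CI image; before it, HYBRID_EP_MULTINODE should report 1.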