
[ci, build] chore: Revert PR #3702 (hybridep docker move) to investigate VLM perf regression #3729

Merged
yaoyu-33 merged 1 commit into main from yuya/revert-3702-hybridep-docker
May 7, 2026


@yaoyu-33 yaoyu-33 commented May 7, 2026

Summary

Revert of #3702 (dec6d631) to validate it as the cause of a 5× VLM test slowdown.

Why

tests/functional_tests/test_groups/converter/test_generate_vlm_from_hf.py::TestGenerateVLMFromHF::test_generate_vlm regressed from ~1m44s → ~9m in L0_Launch_converter starting exactly at commit dec6d631:

| Commit | test_generate_vlm |
| --- | --- |
| 2faedbf (PR #3148) | 1m44s |
| dec6d63 (PR #3702) | 8m42s |
| ad8ef84 (PR #3700) | 9m26s |
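
As a quick check on the "5×" figure, the timings in the table can be converted to seconds and compared against the 2faedbf baseline. This is only an illustrative sketch: the helper names `parse_duration` and `slowdown` are hypothetical and not part of the repository.

```python
import re

def parse_duration(text: str) -> int:
    """Convert a duration such as '1m44s' or '9m' into whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds

# Timings copied from the table in this PR description.
timings = {"2faedbf": "1m44s", "dec6d63": "8m42s", "ad8ef84": "9m26s"}
baseline = parse_duration(timings["2faedbf"])

def slowdown(commit: str) -> float:
    """Ratio of a commit's test time to the 2faedbf baseline."""
    return parse_duration(timings[commit]) / baseline

for commit in timings:
    seconds = parse_duration(timings[commit])
    print(f"{commit}: {seconds}s ({slowdown(commit):.2f}x baseline)")
```

The two post-regression commits come out at roughly 5.0x and 5.4x the baseline, consistent with the slowdown quoted in the summary.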

Symptoms persist on every L0 run since. Combined with transient HF Hub flakiness on PR #3719 (lockfile bump), this pushed L0_Launch_converter over its 30-minute budget.

PR #3702 moved the DeepEP/NVSHMEM install from Dockerfile.fw_base to Dockerfile.ci. Notably, the CI workflow builds Dockerfile.ci directly on nvcr.io/nvidia/pytorch:26.04-py3 (not on fw-base), so prior to PR #3702 DeepEP was not present in the CI image at all. After PR #3702, the CI image bakes in DeepEP plus the env vars HYBRID_EP_MULTINODE=1, RDMA_CORE_HOME, and a prepended LD_LIBRARY_PATH=/usr/local/cuda/lib64/...; any of these could plausibly affect Qwen2.5-VL-3B-Instruct distributed init.
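
If the revert confirms the regression, one possible way to narrow it down further is to rerun the test with each suspect variable removed in turn. The sketch below only builds the candidate environments. The helper `ablated_envs` is hypothetical, and dropping `LD_LIBRARY_PATH` entirely (rather than only the prepended entry) is a simplifying assumption.

```python
import os
from typing import Dict, Iterator, Mapping, Tuple

# Variables named in the PR description as newly baked into the CI image.
SUSPECT_VARS = ("HYBRID_EP_MULTINODE", "RDMA_CORE_HOME", "LD_LIBRARY_PATH")

def ablated_envs(base: Mapping[str, str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (removed_var, env) pairs, each env lacking exactly one suspect variable."""
    for name in SUSPECT_VARS:
        if name in base:
            env = dict(base)  # copy so the caller's mapping is not mutated
            del env[name]
            yield name, env

if __name__ == "__main__":
    for removed, env in ablated_envs(os.environ):
        # Each env could be passed to subprocess.run(..., env=env) to time the test.
        print(f"variant without {removed}: {len(env)} variables remain")
```

Timing test_generate_vlm under each variant would show whether a single variable accounts for the slowdown or whether the DeepEP install itself is responsible.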

Test plan

  • L0_Launch_converter passes within budget
  • test_generate_vlm returns to ~1m44s
  • L2 full-test-suite is green (no DeepEP-dependent test regressed)

cc @okoenig

…om fw_base to Dockerfile.ci (#3702)"

This reverts commit dec6d63.

L0_Launch_converter::test_generate_vlm regressed from ~1m44s to ~9m
starting at this commit. Reverting to confirm whether the DeepEP
install (and its env vars HYBRID_EP_MULTINODE / LD_LIBRARY_PATH) being
baked into the CI image is the cause.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33 yaoyu-33 requested a review from a team as a code owner May 7, 2026 02:47
copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 added the area:build, full-test-suite, needs-review, and area:ci labels May 7, 2026
yaoyu-33 commented May 7, 2026

/ok to test c77b31a

claude Bot commented May 7, 2026

LGTM -- clean revert of #3702. The DeepEP install block moves back to Dockerfile.fw_base (which CI does not build), removing DeepEP and its env vars (HYBRID_EP_MULTINODE, RDMA_CORE_HOME, LD_LIBRARY_PATH prepend) from the CI image. README updates are correct.

Suggested test cases -- No perf tests impacted. The test plan in the PR body covers the right cases: (1) L0_Launch_converter passes within its 30-minute budget, (2) test_generate_vlm returns to the ~1m44s baseline, (3) L2 full-test-suite is green (no DeepEP-dependent test regressed).

@yaoyu-33 yaoyu-33 merged commit 4f21a31 into main May 7, 2026
158 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya/revert-3702-hybridep-docker branch May 7, 2026 06:12
ko3n1g added a commit that referenced this pull request May 7, 2026
… move)

Reverts 4f21a31 (#3729), restoring the hybridep docker install move
from Dockerfile.fw_base into Dockerfile.ci that PR #3702 originally
introduced. Done to retest whether the VLM perf regression suspected
to be caused by the move is still reproducible on current main.

Original revert message:
This reverts commit 4f21a31.

Signed-off-by: oliver könig <okoenig@nvidia.com>