[ci, test] chore: Split L0 converter launch script into conversion and generation by cuichenx · Pull Request #3758 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-05-08T20:47:57Z

Summary

Split L0_Launch_converter.sh into two scripts so the heavier generation tests run as a separate CI matrix entry from the lighter import/export and multi-GPU conversion tests, which should roughly halve per-job wall-clock time on each side.
Drop an existing duplication: test_hf_fsdp_conversion.py was running both in the catch-all converter script and in the dedicated L0_Launch_converter_fsdp.sh. The split removes the duplicate run.
No workflow changes — cicd-main.yml's generate-test-matrix scans h100/active/L0_*.sh dynamically, so the new L0_Launch_converter_generate.sh is picked up automatically.
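The dynamic pickup described above can be illustrated with a minimal sketch. This is not the actual generate-test-matrix implementation in cicd-main.yml, which is not shown in this PR; the function name `discover_l0_scripts` and the temporary directory layout are hypothetical and only demonstrate the scan-by-glob idea.

```python
import tempfile
from pathlib import Path


def discover_l0_scripts(active_dir: Path) -> list[str]:
    """Return one matrix entry name per L0_*.sh script found in active_dir.

    Hypothetical helper: the real workflow may build its matrix differently.
    """
    return sorted(p.stem for p in active_dir.glob("L0_*.sh"))


def demo() -> list[str]:
    """Build a throwaway h100/active directory and scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        active = Path(tmp) / "h100" / "active"
        active.mkdir(parents=True)
        for name in (
            "L0_Launch_converter.sh",
            "L0_Launch_converter_generate.sh",  # the script added by this PR
            "L0_Launch_converter_fsdp.sh",
            "L1_something_else.sh",  # made-up name, must not match the L0_ glob
        ):
            (active / name).touch()
        return discover_l0_scripts(active)


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

Because the list is derived from the directory contents, adding a new `L0_*.sh` file is enough to create a new matrix entry in this sketch, which is the property the PR relies on.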

| Script | Tests covered |
| --- | --- |
| `L0_Launch_converter.sh` (modified) | `test_checkpoint_conversion.py` + `test_multi_gpu_conversion.py` |
| `L0_Launch_converter_generate.sh` (new) | `test_generate_from_hf.py` + `test_generate_vlm_from_hf.py` |
| `L0_Launch_converter_fsdp.sh` (unchanged) | `test_hf_fsdp_conversion.py` |
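The duplication this split removes can be checked mechanically. The sketch below is illustrative only: `AFTER_SPLIT` mirrors the script-to-test mapping listed above, while `BEFORE_SPLIT` is an assumption inferred from the description (a catch-all script that also ran the FSDP test), not a copy of the old script. The helper `duplicated_tests` is hypothetical.

```python
from collections import Counter

# Mapping as listed in this PR.
AFTER_SPLIT = {
    "L0_Launch_converter.sh": [
        "test_checkpoint_conversion.py",
        "test_multi_gpu_conversion.py",
    ],
    "L0_Launch_converter_generate.sh": [
        "test_generate_from_hf.py",
        "test_generate_vlm_from_hf.py",
    ],
    "L0_Launch_converter_fsdp.sh": ["test_hf_fsdp_conversion.py"],
}

# Assumed pre-split state, inferred from the PR description only.
BEFORE_SPLIT = {
    "L0_Launch_converter.sh": [
        "test_checkpoint_conversion.py",
        "test_multi_gpu_conversion.py",
        "test_generate_from_hf.py",
        "test_generate_vlm_from_hf.py",
        "test_hf_fsdp_conversion.py",
    ],
    "L0_Launch_converter_fsdp.sh": ["test_hf_fsdp_conversion.py"],
}


def duplicated_tests(coverage: dict[str, list[str]]) -> list[str]:
    """Return test files that appear in more than one launch script."""
    counts = Counter(test for tests in coverage.values() for test in tests)
    return sorted(test for test, n in counts.items() if n > 1)


if __name__ == "__main__":
    print("before:", duplicated_tests(BEFORE_SPLIT))
    print("after: ", duplicated_tests(AFTER_SPLIT))
```

Under these assumptions the pre-split mapping reports `test_hf_fsdp_conversion.py` as duplicated and the post-split mapping reports nothing, with all five test files still covered.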

Background

This was triggered by a CI timeout investigation on the converter L0 job. Locally on H100 with a populated HF cache, the full 10-test set passes cleanly (PP test = 47s call + 11s setup), so the CI timeout looks environmental (cold HF cache, hf_xet network behavior, runner contention) rather than a logic bug. Splitting the job gives each half its own timeout budget and isolates the heavier distributed-launch tests.

Test plan

CI: L0_Launch_converter passes (covers test_checkpoint_conversion + test_multi_gpu_conversion)
CI: L0_Launch_converter_generate passes (covers test_generate_from_hf + test_generate_vlm_from_hf)
CI: L0_Launch_converter_fsdp still passes (unchanged)
Verify the matrix generator picks up the new script (it scans h100/active/L0_*.sh automatically)

Follow-up (not in this PR)

Same split could be applied symmetrically to tests/functional_tests/launch_scripts/gb200/active/L0_Launch_converter.sh for consistency.

🤖 Generated with Claude Code

…d generation

Split L0_Launch_converter.sh into two scripts to reduce per-job runtime and isolate the heavier generation tests. Also drops an existing duplication where test_hf_fsdp_conversion.py was running both in the catch-all converter script and in the dedicated L0_Launch_converter_fsdp.sh.

- L0_Launch_converter.sh -> test_checkpoint_conversion.py + test_multi_gpu_conversion.py
- L0_Launch_converter_generate.sh -> test_generate_from_hf.py + test_generate_vlm_from_hf.py (new)
- L0_Launch_converter_fsdp.sh -> test_hf_fsdp_conversion.py (unchanged)

The CI matrix is generated dynamically by scanning the launch_scripts directory, so no workflow changes are required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-05-08T20:48:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cuichenx · 2026-05-08T20:48:48Z

/ok to test 0f129b2

claude · 2026-05-08T20:49:55Z

LGTM: clean split. All five converter test files are covered across the three launch scripts (L0_Launch_converter.sh, L0_Launch_converter_generate.sh, L0_Launch_converter_fsdp.sh), the new file is executable, and the duplicate test_hf_fsdp_conversion.py run is removed.

Suggested test cases

No perf tests impacted.

CI validation:
- L0_Launch_converter: should pass, covering test_checkpoint_conversion.py + test_multi_gpu_conversion.py
- L0_Launch_converter_generate: should pass, covering test_generate_from_hf.py + test_generate_vlm_from_hf.py
- L0_Launch_converter_fsdp: should pass unchanged, covering test_hf_fsdp_conversion.py
- Verify CI matrix generator picks up L0_Launch_converter_generate.sh automatically (no workflow changes needed)

cuichenx added area:ci ci CI, automation, test queue, or workflow infrastructure work needs-review PR is ready for code review and waiting on a reviewer labels May 8, 2026

cuichenx requested a review from ko3n1g May 8, 2026 20:48

copy-pr-bot Bot temporarily deployed to public May 8, 2026 20:49 Inactive

copy-pr-bot Bot temporarily deployed to test May 8, 2026 20:49 Inactive

ko3n1g approved these changes May 8, 2026

cuichenx added the ready-to-merge label (PR is approved, current, and only waiting for CI to pass before merge) and removed the needs-review label (PR is ready for code review and waiting on a reviewer) May 8, 2026

copy-pr-bot Bot temporarily deployed to public May 8, 2026 21:03 Inactive

copy-pr-bot Bot temporarily deployed to public May 8, 2026 21:04 Inactive

copy-pr-bot Bot temporarily deployed to public May 8, 2026 21:18 Inactive

cuichenx wants to merge 1 commit into main from chcui/split-l0-converter