[ci, test] chore: Split L0 converter launch script into conversion and generation#3758

Open
cuichenx wants to merge 1 commit into main from chcui/split-l0-converter

@cuichenx cuichenx commented May 8, 2026

Summary

  • Split L0_Launch_converter.sh into two scripts so the heavier generation tests run as a separate CI matrix entry from the lighter import/export and multi-GPU conversion tests, roughly halving per-job wall-clock time on each side.
  • Drop an existing duplication: test_hf_fsdp_conversion.py was running both in the catch-all converter script and in the dedicated L0_Launch_converter_fsdp.sh. The split removes the duplicate run.
  • No workflow changes — cicd-main.yml's generate-test-matrix scans h100/active/L0_*.sh dynamically, so the new L0_Launch_converter_generate.sh is picked up automatically.

| Script | Tests covered |
| --- | --- |
| L0_Launch_converter.sh (modified) | test_checkpoint_conversion.py + test_multi_gpu_conversion.py |
| L0_Launch_converter_generate.sh (new) | test_generate_from_hf.py + test_generate_vlm_from_hf.py |
| L0_Launch_converter_fsdp.sh (unchanged) | test_hf_fsdp_conversion.py |
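
The script-to-test mapping listed above can be sanity-checked mechanically. The following is a minimal sketch, not code from this PR: the helper names `find_duplicates` and `find_uncovered` are hypothetical, and the mapping is simply copied from the summary rather than parsed from the launch scripts themselves.

```python
from collections import Counter

# Script-to-test mapping copied from the PR summary (not parsed from the scripts).
SCRIPT_TESTS = {
    "L0_Launch_converter.sh": [
        "test_checkpoint_conversion.py",
        "test_multi_gpu_conversion.py",
    ],
    "L0_Launch_converter_generate.sh": [
        "test_generate_from_hf.py",
        "test_generate_vlm_from_hf.py",
    ],
    "L0_Launch_converter_fsdp.sh": ["test_hf_fsdp_conversion.py"],
}


def find_duplicates(script_tests):
    """Return test files that are run by more than one launch script."""
    counts = Counter(t for tests in script_tests.values() for t in tests)
    return sorted(t for t, n in counts.items() if n > 1)


def find_uncovered(script_tests, all_tests):
    """Return test files that no launch script runs."""
    covered = {t for tests in script_tests.values() for t in tests}
    return sorted(set(all_tests) - covered)
```

With the post-split mapping, `find_duplicates(SCRIPT_TESTS)` returns an empty list. Adding test_hf_fsdp_conversion.py back to the L0_Launch_converter.sh entry reproduces the duplication this PR removes.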

Background

This was triggered by a CI timeout investigation on the converter L0 job. Locally on H100 with a populated HF cache, the full 10-test set passes cleanly (PP test = 47s call + 11s setup), so the CI timeout looks environmental (cold HF cache, hf_xet network behavior, runner contention) rather than a logic bug. Splitting the job gives each half its own timeout budget and isolates the heavier distributed-launch tests.

Test plan

  • CI: L0_Launch_converter passes (covers test_checkpoint_conversion + test_multi_gpu_conversion)
  • CI: L0_Launch_converter_generate passes (covers test_generate_from_hf + test_generate_vlm_from_hf)
  • CI: L0_Launch_converter_fsdp still passes (unchanged)
  • Verify the matrix generator picks up the new script (it scans h100/active/L0_*.sh automatically)
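
The last item relies on glob-based discovery. As a rough illustration of why a new script needs no workflow edit, here is a sketch of that idea in Python. It is an assumption about the general technique, not the actual generate-test-matrix logic in cicd-main.yml; the function name `generate_test_matrix`, the entry fields, and the directory argument are all hypothetical.

```python
from pathlib import Path


def generate_test_matrix(active_dir, pattern="L0_*.sh"):
    """Build one matrix entry per launch script in active_dir matching pattern.

    Hypothetical stand-in for the real matrix generator: any file whose name
    matches the glob becomes its own CI job, so adding a script adds a job.
    """
    scripts = sorted(Path(active_dir).glob(pattern))
    # Use the file name without the .sh suffix as the job name.
    return [{"script": p.stem, "path": str(p)} for p in scripts]
```

Pointing this at a directory that contains L0_Launch_converter.sh, L0_Launch_converter_generate.sh, and L0_Launch_converter_fsdp.sh yields three entries, one per script, while files that do not match `L0_*.sh` are ignored.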

Follow-up (not in this PR)

  • Same split could be applied symmetrically to tests/functional_tests/launch_scripts/gb200/active/L0_Launch_converter.sh for consistency.

🤖 Generated with Claude Code

…d generation

Split L0_Launch_converter.sh into two scripts to reduce per-job runtime
and isolate the heavier generation tests. Also drops an existing
duplication where test_hf_fsdp_conversion.py was running both in the
catch-all converter script and in the dedicated L0_Launch_converter_fsdp.sh.

- L0_Launch_converter.sh           -> test_checkpoint_conversion.py + test_multi_gpu_conversion.py
- L0_Launch_converter_generate.sh  -> test_generate_from_hf.py + test_generate_vlm_from_hf.py (new)
- L0_Launch_converter_fsdp.sh      -> test_hf_fsdp_conversion.py (unchanged)

The CI matrix is generated dynamically by scanning the launch_scripts
directory, so no workflow changes are required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cuichenx added the labels area:ci ci (CI, automation, test queue, or workflow infrastructure work) and needs-review (PR is ready for code review and waiting on a reviewer) on May 8, 2026

cuichenx commented May 8, 2026

/ok to test 0f129b2


claude Bot commented May 8, 2026

LGTM, clean split. All five converter test files are covered across the three launch scripts (L0_Launch_converter.sh, L0_Launch_converter_generate.sh, L0_Launch_converter_fsdp.sh), the new file is executable, and the duplicate test_hf_fsdp_conversion.py run is removed.

Suggested test cases

No perf tests impacted.

CI validation:

  • L0_Launch_converter should pass, covering test_checkpoint_conversion.py + test_multi_gpu_conversion.py
  • L0_Launch_converter_generate should pass, covering test_generate_from_hf.py + test_generate_vlm_from_hf.py
  • L0_Launch_converter_fsdp should pass unchanged, covering test_hf_fsdp_conversion.py
  • Verify the CI matrix generator picks up L0_Launch_converter_generate.sh automatically (no workflow changes needed)

cuichenx added the label ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) and removed needs-review (PR is ready for code review and waiting on a reviewer) on May 8, 2026