[ci, test] chore: Split L0 converter launch script into conversion and generation #3758
cuichenx wants to merge 1 commit
…d generation

Split L0_Launch_converter.sh into two scripts to reduce per-job runtime and isolate the heavier generation tests. Also drops an existing duplication where test_hf_fsdp_conversion.py was running both in the catch-all converter script and in the dedicated L0_Launch_converter_fsdp.sh.

- L0_Launch_converter.sh -> test_checkpoint_conversion.py + test_multi_gpu_conversion.py
- L0_Launch_converter_generate.sh -> test_generate_from_hf.py + test_generate_vlm_from_hf.py (new)
- L0_Launch_converter_fsdp.sh -> test_hf_fsdp_conversion.py (unchanged)

The CI matrix is generated dynamically by scanning the launch_scripts directory, so no workflow changes are required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
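The directory scan described in the commit message can be pictured with a minimal Python sketch. It assumes the matrix is simply one entry per `L0_*.sh` file in a launch-script directory. The function name `build_test_matrix` and the entry key `script` are hypothetical and are not taken from the actual workflow, which may build its matrix differently.

```python
from pathlib import Path


def build_test_matrix(scripts_dir):
    """Return one matrix entry per L0_*.sh launch script (hypothetical helper)."""
    # Sort the matches so the matrix order is stable across runs.
    scripts = sorted(Path(scripts_dir).glob("L0_*.sh"))
    # Each entry carries the script name without its .sh suffix.
    return [{"script": path.stem} for path in scripts]
```

Under this assumption, adding `L0_Launch_converter_generate.sh` to the scanned directory is enough to create a new matrix entry, which is why no workflow file needs to change.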
/ok to test 0f129b2
LGTM, clean split. All five converter test files are covered across the three launch scripts (…)
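The coverage claim can be checked mechanically. The sketch below encodes the script-to-test mapping stated in this PR and reports any test file that appears under more than one launch script. The helper name `duplicated_tests` is hypothetical and the mapping is written out by hand here, not parsed from the scripts themselves.

```python
from collections import Counter

# Script-to-test mapping as described in this PR (hand-written, not parsed).
LAUNCH_SCRIPTS = {
    "L0_Launch_converter.sh": [
        "test_checkpoint_conversion.py",
        "test_multi_gpu_conversion.py",
    ],
    "L0_Launch_converter_generate.sh": [
        "test_generate_from_hf.py",
        "test_generate_vlm_from_hf.py",
    ],
    "L0_Launch_converter_fsdp.sh": ["test_hf_fsdp_conversion.py"],
}


def duplicated_tests(mapping):
    """Return test files that are listed under more than one launch script."""
    counts = Counter(test for tests in mapping.values() for test in tests)
    return sorted(test for test, n in counts.items() if n > 1)
```

Calling `duplicated_tests(LAUNCH_SCRIPTS)` on the post-split mapping returns an empty list. A mapping that also lists `test_hf_fsdp_conversion.py` under `L0_Launch_converter.sh`, as before the split, would report that file as duplicated.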
Summary
- Split `L0_Launch_converter.sh` into two scripts so the heavier generation tests run as a separate CI matrix entry from the lighter import/export and multi-GPU conversion tests, halving per-job wall clock for either side.
- `test_hf_fsdp_conversion.py` was running both in the catch-all converter script and in the dedicated `L0_Launch_converter_fsdp.sh`. The split removes the duplicate run.
- `cicd-main.yml`'s `generate-test-matrix` scans `h100/active/L0_*.sh` dynamically, so the new `L0_Launch_converter_generate.sh` is picked up automatically.

| Script | Tests |
| --- | --- |
| `L0_Launch_converter.sh` (modified) | `test_checkpoint_conversion.py` + `test_multi_gpu_conversion.py` |
| `L0_Launch_converter_generate.sh` (new) | `test_generate_from_hf.py` + `test_generate_vlm_from_hf.py` |
| `L0_Launch_converter_fsdp.sh` (unchanged) | `test_hf_fsdp_conversion.py` |

Background
This was triggered by a CI timeout investigation on the converter L0 job. Locally on H100 with a populated HF cache, the full 10-test set passes cleanly (PP test = 47s call + 11s setup), so the CI timeout looks environmental (cold HF cache, `hf_xet` network behavior, runner contention) rather than a logic bug. Splitting the job gives each half its own timeout budget and isolates the heavier distributed-launch tests.

Test plan
- `L0_Launch_converter` passes (covers `test_checkpoint_conversion` + `test_multi_gpu_conversion`)
- `L0_Launch_converter_generate` passes (covers `test_generate_from_hf` + `test_generate_vlm_from_hf`)
- `L0_Launch_converter_fsdp` still passes (unchanged)
- CI matrix includes the new script (picked up from `h100/active/L0_*.sh` automatically)

Follow-up (not in this PR)
- Apply the same split to `tests/functional_tests/launch_scripts/gb200/active/L0_Launch_converter.sh` for consistency.

🤖 Generated with Claude Code