Describe the bug
CI test tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline for mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing failed in job mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing - latest on the iteration-time metric. The other metrics (lm loss, num-zeros, mem-allocated-bytes, mem-max-allocated-bytes) all PASSED — only the wall-clock iteration-time ApproximateTest(atol=0, rtol=0.05) is tripping. The failure repeated on all three retry attempts in this run.
This appears to be a chronic flake: golden values for this test were already refreshed once on PR #4611 (commit abdc055) and the test still fails because per-step iteration-time on the CI host is more variable than the global rtol=0.05 allows (e.g. step 2 actual 15.87s vs golden 9.37s, step 4 actual 2.19s vs golden 2.51s).
Tagging @mcore-oncall to bring this issue to oncall's attention.
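The tolerance failure can be reproduced in isolation. A minimal sketch, using the per-step values copied from the log excerpt below, and assuming the ApproximateTest check reduces to the plain `np.isclose` call shown in the traceback:

```python
import numpy as np

# Steps 2-6 of the two iteration-time series from the failing run
# (seconds; the step-1 NaN warm-up entry is omitted).
golden = np.array([9.3743, 0.903, 2.50665, 0.90051, 0.93025])
actual = np.array([15.87007, 0.86351, 2.19113, 0.85531, 0.85478])

# Same element-wise check the traceback shows, ApproximateTest(atol=0, rtol=0.05):
# |actual - golden| <= 0.05 * |golden| per step.
close = np.isclose(actual, golden, rtol=0.05, atol=0)
print(close.tolist())  # [False, True, False, False, False] — only step 3 is within 5%
```

Note that it is not only the cited steps 2 and 4 that trip the check: on this excerpt, every steady-state step except step 3 drifts past the 5% band, which is consistent with a tolerance problem rather than a one-off slow step.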
Failing run
Error
ERROR tests.functional_tests.python_test_utils.common:common.py:263 Approximate comparison of iteration-time: FAILED
INFO tests.functional_tests.python_test_utils.common:common.py:263 DETERMINISTIC test for metric lm loss: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric lm loss: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 DETERMINISTIC test for metric num-zeros: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric num-zeros: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric mem-allocated-bytes: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric mem-max-allocated-bytes: PASSED
=================================== FAILURES ===================================
____________________________ test_regular_pipeline _____________________________
golden_values = {'iteration-time': Values (1,100,1): (1, nan), (2, 9.3743), (3, 0.903), (4, 2.50665), (5, 0.90051), (6, 0.93025), ...}
actual_values = {'iteration-time': Values (1,100,1): (1, nan), (2, 15.87007), (3, 0.86351), (4, 2.19113), (5, 0.85531), (6, 0.85478), ...}
model_config_path = './tests/functional_tests/test_cases/mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml'
checks = {'iteration-time': [ApproximateTest(atol=0, rtol=0.05)], ...}
is_close = np.isclose(actual, golden, rtol=test.rtol, atol=test.atol)
E AssertionError: The following metrics failed: iteration-time
tests/functional_tests/python_test_utils/common.py:268: AssertionError
##[error]❌ mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing — FAILED (exit 1)
(Full log available at the job URL above.)
Steps/Code to reproduce bug
Re-run the failing CI job linked above, or locally inside the dev container:
pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
against the mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing test case.
Suggested next steps
The iteration-time rtol of 0.05 (defined in CHECK_THRESHOLDS in tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py) appears too tight for this short-running test on shared CI hardware. Options:
- Drop iteration-time from this test's METRICS list in tests/functional_tests/test_cases/mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml (correctness is still covered by lm loss, num-zeros, and the memory metrics).
- Allow a per-test override of the iteration-time tolerance.
- Make the metric median-based rather than per-step np.isclose.
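The median-based option could look like the sketch below. This is a hypothetical helper, not the existing common.py code; the name `median_close` and its signature are assumptions, and the series values are from the log excerpt above:

```python
import numpy as np

# Hypothetical helper illustrating the median-based option: compare the
# medians of the two timing series instead of every step, so a single
# slow step cannot fail the whole metric. NaN warm-up entries are skipped.
def median_close(actual, golden, rtol=0.05):
    a = np.nanmedian(np.asarray(actual, dtype=float))
    g = np.nanmedian(np.asarray(golden, dtype=float))
    return bool(np.isclose(a, g, rtol=rtol, atol=0.0))

golden = [float("nan"), 9.3743, 0.903, 2.50665, 0.90051, 0.93025]
actual = [float("nan"), 15.87007, 0.86351, 2.19113, 0.85531, 0.85478]

# On this run's excerpt the medians (0.93025 vs 0.86351) still differ by
# roughly 7%, so a median check alone would also need a looser rtol.
print(median_close(actual, golden, rtol=0.05))  # False
print(median_close(actual, golden, rtol=0.10))  # True
```

Worth noting: even under a median comparison this particular run would fail at rtol=0.05, so the median change would likely need to be paired with a per-test tolerance override.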
Additional context
Triaged automatically via /create-issue. The mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml was last touched by @mehraakash (commit 0e19bf11, "Add CP + Sequence Packing support for Mimo (#2135)").