🐛 CI failure: tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline [mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing] #4654

@balasaajay

Describe the bug

The CI test tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline for mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing failed in the job "mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing - latest", on the iteration-time metric only. The other metrics (lm loss, num-zeros, mem-allocated-bytes, mem-max-allocated-bytes) all passed; the only check that trips is the wall-clock iteration-time ApproximateTest(atol=0, rtol=0.05). The failure repeated on all three retry attempts in this run.

This appears to be a chronic flake: golden values for this test were already refreshed once on PR #4611 (commit abdc055) and the test still fails because per-step iteration-time on the CI host is more variable than the global rtol=0.05 allows (e.g. step 2 actual 15.87s vs golden 9.37s, step 4 actual 2.19s vs golden 2.51s).

Tagging @mcore-oncall for attention on this issue.

Failing run

| Field | Value |
| --- | --- |
| PR | #4611: chore: Update Docker image version to 26.04-py3 |
| Run | 25416016966 |
| Job | mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing - latest |

Error

ERROR    tests.functional_tests.python_test_utils.common:common.py:263 Approximate comparison of iteration-time: FAILED
INFO     tests.functional_tests.python_test_utils.common:common.py:263 DETERMINISTIC test for metric lm loss: PASSED
INFO     tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric lm loss: PASSED
INFO     tests.functional_tests.python_test_utils.common:common.py:263 DETERMINISTIC test for metric num-zeros: PASSED
INFO     tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric num-zeros: PASSED
INFO     tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric mem-allocated-bytes: PASSED
INFO     tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric mem-max-allocated-bytes: PASSED

=================================== FAILURES ===================================
____________________________ test_regular_pipeline _____________________________

golden_values = {'iteration-time': Values (1,100,1): (1, nan), (2, 9.3743), (3, 0.903), (4, 2.50665), (5, 0.90051), (6, 0.93025), ...}
actual_values = {'iteration-time': Values (1,100,1): (1, nan), (2, 15.87007), (3, 0.86351), (4, 2.19113), (5, 0.85531), (6, 0.85478), ...}
model_config_path = './tests/functional_tests/test_cases/mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml'
checks = {'iteration-time': [ApproximateTest(atol=0, rtol=0.05)], ...}

    is_close = np.isclose(actual, golden, rtol=test.rtol, atol=test.atol)
E   AssertionError: The following metrics failed: iteration-time

tests/functional_tests/python_test_utils/common.py:268: AssertionError
##[error]❌ mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing — FAILED (exit 1)

(Full log available at the job URL above.)
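
The per-step check can be reproduced offline from the values printed in the traceback. The sketch below is a minimal illustration, not the actual code in common.py: it copies only the five non-NaN steps visible in the log (steps 2 to 6; later steps are elided there) and applies the same np.isclose(actual, golden, rtol=0.05, atol=0) call. The variable names are illustrative.

```python
import numpy as np

# Steps 2-6 as printed in the traceback (step 1 is NaN; later steps are elided in the log).
steps = np.array([2, 3, 4, 5, 6])
golden = np.array([9.3743, 0.903, 2.50665, 0.90051, 0.93025])
actual = np.array([15.87007, 0.86351, 2.19113, 0.85531, 0.85478])

# Same call as in the traceback: passes when |actual - golden| <= atol + rtol * |golden|.
is_close = np.isclose(actual, golden, rtol=0.05, atol=0)

# Signed relative deviation of each step from its golden value.
rel_dev = (actual - golden) / golden

failing_steps = steps[~is_close].tolist()
for step, dev, ok in zip(steps, rel_dev, is_close):
    print(f"step {step}: {dev:+.1%} {'ok' if ok else 'FAILED'}")
```

On these five steps only step 3 falls inside the 5% band. Steps 2 and 4 are far outside it, and the steady-state steps 5 and 6 are also outside it (by roughly 5.02% and 8.1%), in both cases because the run was faster than golden rather than slower.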

Steps/Code to reproduce bug

Re-run the failing CI job linked above, or locally inside the dev container:

pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline

against the mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing test case.

Suggested next steps

The iteration-time rtol of 0.05 (defined in CHECK_THRESHOLDS in tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py) appears too tight for this short-running test on shared CI hardware. Options:

  1. Drop iteration-time from this test's METRICS list in tests/functional_tests/test_cases/mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml (correctness is still covered by lm loss, num-zeros, and the memory metrics).
  2. Allow per-test override of iteration-time tolerance.
  3. Make the metric median-based rather than per-step np.isclose.
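
Options 2 and 3 could be combined along the following lines. This is a rough sketch under stated assumptions, not a patch against the test utilities: median_iteration_time_ok is a hypothetical helper, its rtol argument stands in for a per-test override, and the sample values are the steps visible in the traceback above.

```python
import numpy as np

def median_iteration_time_ok(actual, golden, rtol=0.05):
    """Hypothetical helper: compare median iteration time instead of every step.

    NaN entries (such as step 1 in the log) are ignored via np.nanmedian.
    Returns (passed, signed relative deviation of the medians).
    """
    actual_median = np.nanmedian(np.asarray(actual, dtype=float))
    golden_median = np.nanmedian(np.asarray(golden, dtype=float))
    rel_dev = (actual_median - golden_median) / golden_median
    return bool(abs(rel_dev) <= rtol), float(rel_dev)

# Steps 1-6 as printed in the traceback; the remaining steps are elided in the log.
golden = [np.nan, 9.3743, 0.903, 2.50665, 0.90051, 0.93025]
actual = [np.nan, 15.87007, 0.86351, 2.19113, 0.85531, 0.85478]

ok_default, dev = median_iteration_time_ok(actual, golden)
ok_relaxed, _ = median_iteration_time_ok(actual, golden, rtol=0.10)
print(f"median deviation {dev:+.1%}: rtol=0.05 -> {ok_default}, rtol=0.10 -> {ok_relaxed}")
```

The median discards the step 2 and step 4 outliers, but on this small sample the medians still differ by about 7%, so a median-based check may need to be paired with a looser per-test tolerance. The full 100-step series may behave differently.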

Additional context

Triaged automatically via /create-issue. The mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml was last touched by @mehraakash (commit 0e19bf11, "Add CP + Sequence Packing support for Mimo (#2135)").
