Describe the bug
CI test tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline for mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing failed in job mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing - latest on the iteration-time metric. The other metrics (lm loss, num-zeros, mem-allocated-bytes, mem-max-allocated-bytes) all PASSED — only the wall-clock iteration-time ApproximateTest(atol=0, rtol=0.05) is tripping. The failure repeated on all three retry attempts in this run.
This appears to be a chronic flake: golden values for this test were already refreshed once on PR #4611 (commit abdc055) and the test still fails because per-step iteration-time on the CI host is more variable than the global rtol=0.05 allows (e.g. step 2 actual 15.87s vs golden 9.37s, step 4 actual 2.19s vs golden 2.51s).
Tagging @mcore-oncall to bring this issue to oncall's attention.
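The tolerance failure can be reproduced in isolation. A minimal sketch, using the per-step values copied from the log excerpt below, and assuming the ApproximateTest check reduces to the plain `np.isclose` call shown in the traceback:

```python
import numpy as np

# Steps 2-6 of the two iteration-time series from the failing run
# (seconds; the step-1 NaN warm-up entry is omitted).
golden = np.array([9.3743, 0.903, 2.50665, 0.90051, 0.93025])
actual = np.array([15.87007, 0.86351, 2.19113, 0.85531, 0.85478])

# Same element-wise check the traceback shows, ApproximateTest(atol=0, rtol=0.05):
# |actual - golden| <= 0.05 * |golden| per step.
close = np.isclose(actual, golden, rtol=0.05, atol=0)
print(close.tolist())  # [False, True, False, False, False] — only step 3 is within 5%
```

Note that it is not only the cited steps 2 and 4 that trip the check: on this excerpt, every steady-state step except step 3 drifts past the 5% band, which is consistent with a tolerance problem rather than a one-off slow step.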
Failing run
Error
ERROR tests.functional_tests.python_test_utils.common:common.py:263 Approximate comparison of iteration-time: FAILED
INFO tests.functional_tests.python_test_utils.common:common.py:263 DETERMINISTIC test for metric lm loss: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric lm loss: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 DETERMINISTIC test for metric num-zeros: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric num-zeros: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric mem-allocated-bytes: PASSED
INFO tests.functional_tests.python_test_utils.common:common.py:263 APPROXIMATE test for metric mem-max-allocated-bytes: PASSED
=================================== FAILURES ===================================
____________________________ test_regular_pipeline _____________________________
golden_values = {'iteration-time': Values (1,100,1): (1, nan), (2, 9.3743), (3, 0.903), (4, 2.50665), (5, 0.90051), (6, 0.93025), ...}
actual_values = {'iteration-time': Values (1,100,1): (1, nan), (2, 15.87007), (3, 0.86351), (4, 2.19113), (5, 0.85531), (6, 0.85478), ...}
model_config_path = './tests/functional_tests/test_cases/mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml'
checks = {'iteration-time': [ApproximateTest(atol=0, rtol=0.05)], ...}
is_close = np.isclose(actual, golden, rtol=test.rtol, atol=test.atol)
E AssertionError: The following metrics failed: iteration-time
tests/functional_tests/python_test_utils/common.py:268: AssertionError
##[error]❌ mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing — FAILED (exit 1)
(Full log available at the job URL above.)
Steps/Code to reproduce bug
Re-run the failing CI job linked above, or locally inside the dev container:
pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
against the mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing test case.
Suggested next steps
The iteration-time rtol of 0.05 (defined in CHECK_THRESHOLDS in tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py) appears too tight for this short-running test on shared CI hardware. Options:
- Drop iteration-time from this test's METRICS list in tests/functional_tests/test_cases/mimo/mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml (correctness is still covered by lm loss, num-zeros, and the memory metrics).
- Allow a per-test override of the iteration-time tolerance.
- Make the metric median-based rather than per-step np.isclose.
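The median-based option could look like the sketch below. This is a hypothetical helper, not the existing common.py code; the name `median_close` and its signature are assumptions, and the series values are from the log excerpt above:

```python
import numpy as np

# Hypothetical helper illustrating the median-based option: compare the
# medians of the two timing series instead of every step, so a single
# slow step cannot fail the whole metric. NaN warm-up entries are skipped.
def median_close(actual, golden, rtol=0.05):
    a = np.nanmedian(np.asarray(actual, dtype=float))
    g = np.nanmedian(np.asarray(golden, dtype=float))
    return bool(np.isclose(a, g, rtol=rtol, atol=0.0))

golden = [float("nan"), 9.3743, 0.903, 2.50665, 0.90051, 0.93025]
actual = [float("nan"), 15.87007, 0.86351, 2.19113, 0.85531, 0.85478]

# On this run's excerpt the medians (0.93025 vs 0.86351) still differ by
# roughly 7%, so a median check alone would also need a looser rtol.
print(median_close(actual, golden, rtol=0.05))  # False
print(median_close(actual, golden, rtol=0.10))  # True
```

Worth noting: even under a median comparison this particular run would fail at rtol=0.05, so the median change would likely need to be paired with a per-test tolerance override.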
Additional context
Triaged automatically via /create-issue. The mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8_seq_packing/model_config.yaml was last touched by @mehraakash (commit 0e19bf11, "Add CP + Sequence Packing support for Mimo (#2135)").