Flaky: TestFusedApplyMLARope::test_forward_backward_for_q[thd] backward mismatch exceeds bf16 tolerances #4640

@ko3n1g

Summary

tests/unit_tests/fusions/test_mla_yarn_rope_apply.py::TestFusedApplyMLARope::test_forward_backward_for_q[thd] is intermittently failing in CI with a backward-pass numerical mismatch that exceeds bf16 tolerances.

Observed failure

CI run: https://github.com/NVIDIA/Megatron-LM/actions/runs/25408640123 (job tests/unit_tests/**/*.py - latest, ID 74525633131)

FAILED tests/unit_tests/fusions/test_mla_yarn_rope_apply.py::TestFusedApplyMLARope::test_forward_backward_for_q[thd]
E   AssertionError: Mismatch in bwd: Tensor-likes are not close!
E   Mismatched elements: 31 / 786432 (0.0%)
E   Greatest absolute difference: 3.015625 at index (104, 29, 170) (up to 0.05 allowed)
E   Greatest relative difference: 33.509803771972656 at index (104, 28, 166) (up to 0.02 allowed)

The assertion at tests/unit_tests/fusions/test_mla_yarn_rope_apply.py:111 compares the backward gradients of the reference apply_rotary_pos_emb with those of the fused fused_apply_mla_rope_for_q. The failing parametrization runs in bf16 with the thd packed-sequence layout and cu_seqlens=[0, 27, 54, 99, 128].
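To help localize the outliers, the cu_seqlens boundaries can be converted into per-sequence lengths, and a packed token index mapped back to the sequence it belongs to. This is a minimal sketch, assuming the first component of the reported index (104) is the packed token dimension, as is conventional for the thd layout. The helper names are hypothetical and are not part of the test file.

```python
from bisect import bisect_right

# Cumulative sequence boundaries taken from the failing parametrization.
cu_seqlens = [0, 27, 54, 99, 128]


def seq_lengths(cu):
    """Return the length of each packed sequence."""
    return [end - start for start, end in zip(cu, cu[1:])]


def locate_token(cu, token_idx):
    """Map a packed token index to (sequence index, position within that sequence)."""
    if not 0 <= token_idx < cu[-1]:
        raise IndexError(f"token index {token_idx} outside [0, {cu[-1]})")
    seq = bisect_right(cu, token_idx) - 1
    return seq, token_idx - cu[seq]


print(seq_lengths(cu_seqlens))        # [27, 27, 45, 29]
print(locate_token(cu_seqlens, 104))  # (3, 5)
```

Under that assumption, both reported indices (token 104) fall in the last packed sequence, which spans tokens 99 to 127, at local position 5.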

Why this is non-deterministic

  • Only 31 / 786,432 elements (~0.004%) exceed tolerance.
  • bf16 tolerances: atol=5e-2, rtol=2e-2.
  • The same merge-queue commit passed this job on rerun in workflow 25415024224.
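  • The suspected cause is numerical drift in the fused MLA YARN RoPE backward kernel on the thd (packed-sequence) path, which occasionally exceeds bf16 tolerances at a few outlier indices. In the failing run the worst outlier was far outside tolerance (absolute difference 3.015625 against atol=5e-2).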
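The numbers above can be checked with a short pure-Python sketch. It assumes the test uses torch.testing.assert_close (the "Tensor-likes are not close!" message suggests this), whose element-wise criterion is |actual - expected| <= atol + rtol * |expected|. The function name and the sample values passed to it are illustrative only.

```python
ATOL = 5e-2  # bf16 absolute tolerance quoted in the issue
RTOL = 2e-2  # bf16 relative tolerance quoted in the issue


def is_close(actual, expected, atol=ATOL, rtol=RTOL):
    """Scalar version of the closeness criterion used by torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Fraction of mismatched elements reported in the failing run.
mismatched, total = 31, 786432
percent = 100 * mismatched / total

print(f"{percent:.4f}%")  # 0.0039%
print(f"{percent:.1f}%")  # 0.0%  (one decimal place, as shown in the log)

# Illustrative values: a 0.04 error near 1.0 passes, while an error the size
# of the reported greatest absolute difference does not.
print(is_close(1.04, 1.0))      # True
print(is_close(3.015625, 0.0))  # False
```

The one-decimal formatting explains why the log prints "(0.0%)" even though 31 elements failed.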

Owning code

  • Test: tests/unit_tests/fusions/test_mla_yarn_rope_apply.py
  • Kernel: megatron/core/fusions/fused_mla_yarn_rope_apply.py
  • Introduced in MR !2949 (perf(mla, experimental): MLA RoPE fusion and YARN embedding cache); the experimental tag was removed in #2233 (Remove experimental tags for fused kernels).

Mitigation

The test was marked flaky_in_dev in #4639 as a stop-gap. The underlying numerical drift in the fused backward kernel on the thd path should be investigated and fixed so that the test can be re-enabled.
