Summary

tests/unit_tests/fusions/test_mla_yarn_rope_apply.py::TestFusedApplyMLARope::test_forward_backward_for_q[thd] is intermittently failing in CI with a backward-pass numerical mismatch that exceeds bf16 tolerances.

Observed failure

CI run: https://github.com/NVIDIA/Megatron-LM/actions/runs/25408640123 (job tests/unit_tests/**/*.py - latest, ID 74525633131)
FAILED tests/unit_tests/fusions/test_mla_yarn_rope_apply.py::TestFusedApplyMLARope::test_forward_backward_for_q[thd]
E AssertionError: Mismatch in bwd: Tensor-likes are not close!
E Mismatched elements: 31 / 786432 (0.0%)
E Greatest absolute difference: 3.015625 at index (104, 29, 170) (up to 0.05 allowed)
E Greatest relative difference: 33.509803771972656 at index (104, 28, 166) (up to 0.02 allowed)
The assertion at tests/unit_tests/fusions/test_mla_yarn_rope_apply.py:111 compares the backward gradient of the reference apply_rotary_pos_emb against the fused fused_apply_mla_rope_for_q in bf16, thd packed-sequence layout, cu_seqlens=[0, 27, 54, 99, 128].
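For orientation, here is a minimal sketch of the thd packed layout the test exercises. The head count and head dim are assumptions (the test file holds the real values): one factorization consistent with the reported element count is t=128, h=32, d=192, since 128 · 32 · 192 = 786,432, t matches cu_seqlens[-1], and the failing indices (104, 29, 170) fit inside those bounds.

```python
import torch

# Assumed shapes (see above): only cu_seqlens is taken verbatim from the
# test; t/h/d are a consistent guess, not read from the test file.
cu_seqlens = torch.tensor([0, 27, 54, 99, 128])
t, h, d = 128, 32, 192
q = torch.randn(t, h, d, dtype=torch.bfloat16)

# In the thd layout there is no batch axis: sequences are concatenated
# along the token dimension and cu_seqlens marks their boundaries.
for i in range(len(cu_seqlens) - 1):
    start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
    print(f"sequence {i}: tokens [{start}, {end}) -> shape {tuple(q[start:end].shape)}")
```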
Why this is non-deterministic
- Only 31 / 786,432 elements (~0.004%) exceed tolerance.
- bf16 tolerances: atol=5e-2, rtol=2e-2 (see the outlier-counting sketch after this list).
- The same merge-queue commit passed this job on rerun in workflow 25415024224.
- The fused MLA YARN RoPE backward kernel produces small numerical drift in the thd (packed-sequence) path that occasionally exceeds bf16 tolerances at outlier indices.
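As a triage aid (a sketch, not code from the repo), the snippet below counts elements that violate torch.testing.assert_close's criterion |actual − expected| ≤ atol + rtol · |expected| with the test's tolerances; grad_ref and grad_fused are hypothetical stand-ins for the two backward gradients, filled here with random data.

```python
import torch

ATOL, RTOL = 5e-2, 2e-2  # the test's bf16 tolerances

def count_outliers(actual: torch.Tensor, expected: torch.Tensor):
    """Count elements violating |actual - expected| <= atol + rtol * |expected|."""
    diff = (actual.float() - expected.float()).abs()
    mask = diff > ATOL + RTOL * expected.float().abs()
    return int(mask.sum()), mask.nonzero()

# Stand-in data; in a real repro these would be q.grad from the reference
# path and from the fused kernel, respectively.
grad_ref = torch.randn(128, 32, 192, dtype=torch.bfloat16)
grad_fused = grad_ref + 1e-2 * torch.randn_like(grad_ref)
n, idx = count_outliers(grad_fused, grad_ref)
print(f"{n} / {grad_ref.numel()} elements exceed tolerance")
```

Logging idx on a failing seed would show whether the outliers cluster near cu_seqlens boundaries, which would point at the packed-sequence indexing in the fused backward kernel.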
Owning code

Test: tests/unit_tests/fusions/test_mla_yarn_rope_apply.py
Kernel: megatron/core/fusions/fused_mla_yarn_rope_apply.py
Introduced in !2949 (perf(mla, experimental): MLA RoPE fusion and YARN embedding cache); the experimental tag was removed in #2233 (Remove experimental tags for fused kernels).

Mitigation

Marked flaky_in_dev in #4639 as a stop-gap. The underlying numerical drift in the fused backward kernel for thd should be investigated and tightened so the test can be re-enabled.
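For reference, the stop-gap has roughly this shape (a sketch assuming flaky_in_dev is a pytest marker registered in the repo's test config; see #4639 for the actual change):

```python
import pytest

class TestFusedApplyMLARope:
    # Assumed placement; the real test is also parametrized over layouts
    # (hence the [thd] suffix in the failing test id).
    @pytest.mark.flaky_in_dev
    def test_forward_backward_for_q(self):
        ...
```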