rocm: disable gradient_accumulation_fusion on gfx950 in test_sglang_config by sreerohi · Pull Request #1156 · radixark/miles

sreerohi · 2026-05-19T23:01:58Z

Depends on #1153. Related to #1105 (Miles CI gap between ROCm & CUDA).

hipBLASLt on gfx950 has no algorithm for TE's bias-fused wgrad GEMM (bf16→fp32 + BGRADB + accumulate). This conditionally disables gradient_accumulation_fusion and relaxes CI numerical checkers on ROCm. CUDA is unaffected.

hipBLASLt on gfx950 (MI350/MI355) has no algorithm for the combination of three settings: bf16 inputs with fp32 accumulation (the bf16→fp32 GEMM above), the HIPBLASLT_EPILOGUE_BGRADB epilogue, and accumulate=True. The failure fires during TE's LayerNormLinear backward when gradient_accumulation_fusion=True and the layer has a bias. Root cause: the bridge provider defaults gradient_accumulation_fusion to True (via can_enable_gradient_accumulation_fusion) and ignores --no-gradient-accumulation-fusion from the CLI. The fix propagates the flag from Megatron args to the bridge provider.
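
A minimal sketch of how a test config could gate the workaround on ROCm, assuming the test assembles its training arguments as a list of CLI strings. The helper name `build_fusion_args` and the `base_args` parameter are hypothetical; only `--no-gradient-accumulation-fusion` comes from the description above. The checker relaxation is not shown because the flags it uses are not named here.

```python
def build_fusion_args(is_rocm: bool, base_args: list[str]) -> list[str]:
    """Return a copy of base_args, adding the fusion opt-out on ROCm only.

    Hypothetical helper for illustration; not taken from the PR diff.
    """
    args = list(base_args)  # copy so the caller's list is not mutated
    flag = "--no-gradient-accumulation-fusion"
    if is_rocm and flag not in args:
        args.append(flag)
    return args
```

On CUDA the argument list is returned unchanged, which matches the statement that CUDA is unaffected.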

gemini-code-assist

Code Review

This pull request enables the gradient_accumulation_fusion flag to be controlled via Megatron CLI arguments to address compatibility issues on ROCm/gfx950. It also updates the E2E configuration tests to conditionally disable specific checkers and fusions when running in a ROCm environment. Review feedback highlights a potential AttributeError when checking the PyTorch version on non-ROCm systems and suggests using a consistent null-check pattern for configuration arguments to avoid overwriting default values.

gemini-code-assist · 2026-05-19T23:05:22Z


 import miles.utils.external_utils.command_utils as U

+IS_ROCM = torch.version.hip is not None


Accessing torch.version.hip directly will raise an AttributeError on non-ROCm (e.g., CUDA) builds of PyTorch, as the hip attribute is only defined in ROCm-enabled builds. This will cause the test to crash during collection or execution on CUDA environments. Use getattr to safely check for its existence.

Suggested change

-IS_ROCM = torch.version.hip is not None
+IS_ROCM = getattr(torch.version, "hip", None) is not None
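
The defensive check can be exercised without PyTorch installed. The sketch below uses `types.SimpleNamespace` objects as stand-ins for `torch.version`; the helper name `detect_rocm` and the version strings are illustrative assumptions. Whether a given PyTorch build actually omits the `hip` attribute is not verified here. The `getattr` form gives the same answer whether the attribute is missing or present and set to `None`.

```python
from types import SimpleNamespace


def detect_rocm(version_module) -> bool:
    """True when the version module reports a HIP (ROCm) build."""
    return getattr(version_module, "hip", None) is not None


# Stand-ins for torch.version on three kinds of build (illustrative values).
rocm_build = SimpleNamespace(hip="6.4.0", cuda=None)
cuda_build = SimpleNamespace(hip=None, cuda="12.4")
no_hip_attr = SimpleNamespace(cuda="12.4")  # attribute missing entirely
```

In the test module this would be used as `IS_ROCM = detect_rocm(torch.version)`.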

gemini-code-assist · 2026-05-19T23:05:22Z

+        provider.gradient_accumulation_fusion = getattr(
+            args, "gradient_accumulation_fusion", provider.gradient_accumulation_fusion
+        )


To maintain consistency with the existing code in this file (e.g., lines 105-112) and to avoid overwriting the provider's default value if the argument is None, consider using the if ... is not None pattern. This ensures that the provider's default configuration is preserved unless explicitly overridden by a non-null CLI argument.

if getattr(args, "gradient_accumulation_fusion", None) is not None:
    provider.gradient_accumulation_fusion = args.gradient_accumulation_fusion

References

Maintain consistency with existing patterns in the file for handling CLI arguments and defaults.
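
The two patterns differ only when the argument exists but is None. The sketch below compares them, using `SimpleNamespace` objects as stand-ins for the Megatron args and the bridge provider. The function names `propagate_with_default`, `propagate_if_set`, and `run` are hypothetical, and the provider default of True is taken from the PR description.

```python
from types import SimpleNamespace


def propagate_with_default(args, provider):
    # Pattern from the PR diff: the provider value is used only when the
    # attribute is absent from args; an explicit None is copied over.
    provider.gradient_accumulation_fusion = getattr(
        args, "gradient_accumulation_fusion", provider.gradient_accumulation_fusion
    )


def propagate_if_set(args, provider):
    # Pattern from the review suggestion: a missing or None value leaves
    # the provider's default untouched.
    value = getattr(args, "gradient_accumulation_fusion", None)
    if value is not None:
        provider.gradient_accumulation_fusion = value


def run(propagate, args):
    # Fresh stand-in provider whose default is True, as described in the PR.
    provider = SimpleNamespace(gradient_accumulation_fusion=True)
    propagate(args, provider)
    return provider.gradient_accumulation_fusion
```

Both patterns yield False when the CLI disables fusion and keep True when the attribute is missing. Only the first lets an explicit None replace the provider default.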

sreerohi added 2 commits May 19, 2026 16:29

rocm: gate BGRADB workaround flags on IS_ROCM in test_sglang_config

3b966db

sreerohi requested review from Zhichenzzz, fzyzcjy, maocheng23, yueming-yuan and yushengsu-thu as code owners May 19, 2026 23:01

sreerohi wants to merge 2 commits into radixark:main from sreerohi:rocm/bgradb-sglang-config
