rocm/ci: disable gradient accumulation fusion for GLM-4-9B on ROCm#1166

Open
sreerohi wants to merge 1 commit into radixark:main from sreerohi:rocm/glm4-9b-gradient-fusion-fix

sreerohi commented May 21, 2026

The Transformer Engine (TE) fused wgrad GEMM fails with 'Unable to find any suitable algorithms' on MI350. This PR adds --no-gradient-accumulation-fusion to the GLM-4-9B test arguments, gated on ROCm.

Relates to #1105


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the GLM4-9B end-to-end test to include a check for ROCm environments, which triggers the addition of the --no-gradient-accumulation-fusion argument. The review feedback suggests simplifying the ROCm detection logic by removing a redundant hasattr check to align with existing codebase patterns.


ENABLE_EVAL = U.get_bool_env_var("MILES_TEST_ENABLE_EVAL", "1")
TIGHT_DEVICE_MEMORY = U.get_bool_env_var("MILES_TEST_TIGHT_DEVICE_MEMORY", "1")
IS_ROCM = hasattr(torch.version, "hip") and torch.version.hip is not None

medium

For consistency with other parts of the codebase (e.g., miles/backends/megatron_utils/model.py line 809, which uses if torch.version.hip:) and for conciseness, you can simplify this check. The torch.version.hip attribute is guaranteed to exist and will be None on non-ROCm builds, so the hasattr check is redundant.

Suggested change
- IS_ROCM = hasattr(torch.version, "hip") and torch.version.hip is not None
+ IS_ROCM = torch.version.hip is not None
References
  1. Maintain consistency with existing patterns in the repository for checking ROCm availability and avoid redundant hasattr checks for attributes guaranteed to exist in the environment.
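
For illustration, the gating discussed in this review could look roughly like the sketch below. It is not the code in this PR: the helper names `is_rocm_build` and `build_train_args` and the sample base argument are hypothetical, and the HIP version is passed in as a parameter so the logic can be exercised without a ROCm build. Only the `torch.version.hip` check and the `--no-gradient-accumulation-fusion` flag come from the PR itself.

```python
from typing import List, Optional


def is_rocm_build(hip_version: Optional[str]) -> bool:
    """Hypothetical helper mirroring `torch.version.hip is not None`.

    On ROCm builds of PyTorch, torch.version.hip is a version string;
    on other builds it is None.
    """
    return hip_version is not None


def build_train_args(base_args: List[str], hip_version: Optional[str]) -> List[str]:
    """Hypothetical helper: return a copy of base_args, adding the flag only on ROCm."""
    args = list(base_args)  # copy so the caller's list is not mutated
    if is_rocm_build(hip_version):
        # Flag named in the PR description; works around the TE fused wgrad GEMM failure.
        args.append("--no-gradient-accumulation-fusion")
    return args


# Read the real value when PyTorch is installed; fall back to None otherwise.
try:
    import torch

    HIP_VERSION = torch.version.hip
except ImportError:
    HIP_VERSION = None

# "--bf16" is only a placeholder base argument for this sketch.
TRAIN_ARGS = build_train_args(["--bf16"], HIP_VERSION)
```

Because a ROCm build reports a non-empty version string, the truthiness form `if torch.version.hip:` cited from model.py and the explicit `is not None` form should select the same branch in practice.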
