[ROCm] Add async-mode test_glm5_744b_a40b_4layer_rocm.py by lizamd · Pull Request #1124 · radixark/miles

lizamd · 2026-05-13T21:42:32Z

ROCm-adapted variant of test_glm5_744b_a40b_4layer.py for AMD MI300 / MI355X. Mirrors the existing pattern of NV/AMD-specific test/script pairs in the repo.

Key substitutions vs the NVIDIA test (full table in the file docstring):

Mode: colocate 8-GPU -> async (train_async.py) 4 train + 4 rollout
Megatron attn: --attention-backend flash -> fused
MoE: drop --moe-enable-deepep, use --moe-token-dispatcher-type alltoall
Optimizer: add --optimizer-cpu-offload (matches Qwen3 async)
SGLang NSA: flashmla_sparse -> tilelang (ROCm port)
SGLang layout: 1 engine x 8 GPUs -> 2 engines x 2 GPUs (DP=2 EP=2)
Drop --sglang-page-size 64 (NSA pool requires page_size=1)
Drop --disable-weights-backuper (triggers torch_memory_saver in async)
Add --sglang-disable-custom-all-reduce

The test depends on (already in flight as separate PRs):

The indexer topk clamp (fix/glm5-indexer-topk-clamp).
The sparse-MLA fwd/bwd kernel retuning for gfx950 (fix/glm5-sparse-mla-gfx950).
Optionally TE PR [ROCm] Allow bf16/bf16/fp32 in nvte_multi_tensor_gemm dispatcher ROCm/TransformerEngine#573 (CK grouped GEMM dispatcher).

Verified end-to-end on MI355X (8x): training step 0 completes in 26s steady state, train_rollout_logprob_abs_diff ~= 0.077.

ROCm-adapted variant of test_glm5_744b_a40b_4layer.py for AMD MI300 / MI355X. Mirrors the existing pattern of NV/AMD-specific test/script pairs in the repo. Key substitutions vs the NVIDIA test (full table in the file docstring): Mode: colocate 8-GPU -> async (train_async.py) 4 train + 4 rollout Megatron attn: --attention-backend flash -> fused MoE: drop --moe-enable-deepep, use --moe-token-dispatcher-type alltoall Optimizer: add --optimizer-cpu-offload (matches Qwen3 async) SGLang NSA: flashmla_sparse -> tilelang (ROCm port) SGLang layout: 1 engine x 8 GPUs -> 2 engines x 2 GPUs (DP=2 EP=2) Drop --sglang-page-size 64 (NSA pool requires page_size=1) Drop --disable-weights-backuper (triggers torch_memory_saver in async) Add --sglang-disable-custom-all-reduce The test depends on (already in flight as separate PRs): - The indexer topk clamp (fix/glm5-indexer-topk-clamp). - The sparse-MLA fwd/bwd kernel retuning for gfx950 (fix/glm5-sparse-mla-gfx950). - Optionally TE PR ROCm/TransformerEngine#573 (CK grouped GEMM dispatcher). Verified end-to-end on MI355X (8x): training step 0 completes in 26s steady state, train_rollout_logprob_abs_diff ~= 0.077.

gemini-code-assist

Code Review

This pull request adds a ROCm-adapted end-to-end test for the GLM-5 744B model, tailored for AMD MI300/MI355 hardware. The implementation features async training, checkpoint patching for architecture compatibility, and ROCm-specific SGLang configurations using the tilelang backend. Review feedback highlights a critical inconsistency in SGLang parallelism settings that would likely cause initialization failure due to over-allocation of GPUs. Additionally, the reviewer suggested replacing several hardcoded GPU counts and environment strings with values derived from the NUM_GPUS constant to improve the test's flexibility and maintainability.

gemini-code-assist · 2026-05-13T21:44:55Z

+        "--sglang-dp-size 2 "
+        "--sglang-ep-size 2 "


The SGLang parallelism configuration appears inconsistent with the allocated GPU budget per engine. With --rollout-num-gpus 4 and --rollout-num-gpus-per-engine 2, each engine is allocated 2 GPUs. However, the configuration --sglang-dp-size 2 and --sglang-ep-size 2 would require 4 GPUs per engine (assuming TP=1). Since miles defaults the engine's tp_size to the number of GPUs per engine (2 in this case), this configuration will likely fail during SGLang initialization as the product of parallelisms (TP=2 * DP=2 * EP=2 = 8) exceeds the allocated GPUs (2). Consider adjusting these values to fit the 2-GPU-per-engine budget (e.g., --sglang-dp-size 1 --sglang-ep-size 2).

Thanks for the catch on the others! On this one I think the TP × DP × EP = 8 analysis doesn't apply once --sglang-enable-dp-attention is set: under DP-attention each DP replica owns a slice of the engine's GPUs (TP internally = engine_gpus / dp_size), so the factors divide rather than multiply.

In our actual run on MI355X with rollout-num-gpus-per-engine=2, dp_size=2, ep_size=2, the engine logged tp_size=2, dp_size=2, ep_size=2 and successfully loaded the DeepseekV32 weights (Load weight end. elapsed=11.36 s, ... mem usage=13.79 GB) and completed training step 0 with train_rollout_logprob_abs_diff ≈ 0.077 — i.e. the train-side forward and the SGLang rollout forward agree to ~8%, which is the expected signature of a working setup. Repro in the PR body if useful.

gemini-code-assist · 2026-05-13T21:44:55Z

+    # flashmla_sparse -> tilelang (the ROCm NSA backend, matches the upstream
+    # MI35x SGLang eval test test_glm5_eval_mi35x.py).
+    sglang_args = (
+        "--rollout-num-gpus 4 "


Avoid hardcoding the rollout GPU count. Derive it from the NUM_GPUS constant to ensure consistency if the test configuration is modified.

Suggested change

"--rollout-num-gpus 4 "

f"--rollout-num-gpus {NUM_GPUS // 2} "

References

Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.

gemini-code-assist · 2026-05-13T21:44:55Z

+        "--actor-num-gpus-per-node 4 "
+        "--num-gpus-per-node 8 "


Avoid hardcoding GPU counts in the command arguments. Use the NUM_GPUS constant and derive the actor GPU count from it to ensure consistency if the test configuration is modified.

Suggested change

"--actor-num-gpus-per-node 4 "

"--num-gpus-per-node 8 "

f"--actor-num-gpus-per-node {NUM_GPUS // 2} "

f"--num-gpus-per-node {NUM_GPUS} "

References

Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.

gemini-code-assist · 2026-05-13T21:44:55Z

+            # per-worker. Otherwise SGLang's import-time is_gfx95_supported()
+            # probe fails with "No HIP GPUs are available".
+            "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES": "1",
+            "HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",


Avoid hardcoding the HIP_VISIBLE_DEVICES string. It should be derived from the NUM_GPUS constant to maintain consistency.

Suggested change

"HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",

"HIP_VISIBLE_DEVICES": ",".join(map(str, range(NUM_GPUS))),

References

Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.

Addresses three review comments suggesting we avoid hardcoded GPU counts in command args and env vars. NUM_GPUS // 2 is the train/rollout half-split for async mode. The HIGH-priority comment about SGLang parallelism (TP*DP*EP=8 > 2 GPUs/engine) is NOT addressed here -- with --sglang-enable-dp-attention the TP and DP factors divide the GPUs/engine rather than multiplying. The actual run logged tp_size=2/dp_size=2/ep_size=2 with rollout-num-gpus-per-engine=2 and successfully loaded weights and completed training step 0. Will reply on the PR.

- Drop DeepEP; use alltoall dispatcher - Swap flashmla_sparse NSA backend for tilelang - Set sglang-page-size to 1 (required by tilelang NSA)

…ark#1122 , radixark#1123): - Drop DeepEP; use alltoall dispatcher - Swap flashmla_sparse NSA backend for tilelang - Set sglang-page-size to 1 (required by tilelang NSA)

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

indianspeedster mentioned this pull request May 13, 2026

Miles CI gap Between ROCm & CUDA #1105

Open

80 tasks

sreerohi added a commit to sreerohi/miles_fork that referenced this pull request May 21, 2026

Mirrors the substitutions from PR radixark#1124:

589c5f0

- Drop DeepEP; use alltoall dispatcher - Swap flashmla_sparse NSA backend for tilelang - Set sglang-page-size to 1 (required by tilelang NSA)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Add async-mode test_glm5_744b_a40b_4layer_rocm.py#1124

[ROCm] Add async-mode test_glm5_744b_a40b_4layer_rocm.py#1124
lizamd wants to merge 2 commits into
radixark:mainfrom
lizamd:add/test-glm5-rocm

lizamd commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

lizamd May 13, 2026

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	"--rollout-num-gpus 4 "
	f"--rollout-num-gpus {NUM_GPUS // 2} "

	"HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
	"HIP_VISIBLE_DEVICES": ",".join(map(str, range(NUM_GPUS))),

Conversation

lizamd commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

lizamd May 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant