[ROCm] Add async-mode test_glm5_744b_a40b_4layer_rocm.py#1124
Conversation
ROCm-adapted variant of test_glm5_744b_a40b_4layer.py for AMD MI300 /
MI355X. Mirrors the existing pattern of NV/AMD-specific test/script
pairs in the repo.
Key substitutions vs the NVIDIA test (full table in the file
docstring):
Mode: colocate 8-GPU -> async (train_async.py) 4 train + 4 rollout
Megatron attn: --attention-backend flash -> fused
MoE: drop --moe-enable-deepep, use --moe-token-dispatcher-type alltoall
Optimizer: add --optimizer-cpu-offload (matches Qwen3 async)
SGLang NSA: flashmla_sparse -> tilelang (ROCm port)
SGLang layout: 1 engine x 8 GPUs -> 2 engines x 2 GPUs (DP=2 EP=2)
Drop --sglang-page-size 64 (NSA pool requires page_size=1)
Drop --disable-weights-backuper (triggers torch_memory_saver in async)
Add --sglang-disable-custom-all-reduce
The test depends on (already in flight as separate PRs):
- The indexer topk clamp (fix/glm5-indexer-topk-clamp).
- The sparse-MLA fwd/bwd kernel retuning for gfx950
(fix/glm5-sparse-mla-gfx950).
- Optionally TE PR ROCm/TransformerEngine#573 (CK grouped GEMM
dispatcher).
Verified end-to-end on MI355X (8x): training step 0 completes in 26s
steady state, train_rollout_logprob_abs_diff ~= 0.077.
There was a problem hiding this comment.
Code Review
This pull request adds a ROCm-adapted end-to-end test for the GLM-5 744B model, tailored for AMD MI300/MI355 hardware. The implementation features async training, checkpoint patching for architecture compatibility, and ROCm-specific SGLang configurations using the tilelang backend. Review feedback highlights a critical inconsistency in SGLang parallelism settings that would likely cause initialization failure due to over-allocation of GPUs. Additionally, the reviewer suggested replacing several hardcoded GPU counts and environment strings with values derived from the NUM_GPUS constant to improve the test's flexibility and maintainability.
| "--sglang-dp-size 2 " | ||
| "--sglang-ep-size 2 " |
There was a problem hiding this comment.
The SGLang parallelism configuration appears inconsistent with the allocated GPU budget per engine. With --rollout-num-gpus 4 and --rollout-num-gpus-per-engine 2, each engine is allocated 2 GPUs. However, the configuration --sglang-dp-size 2 and --sglang-ep-size 2 would require 4 GPUs per engine (assuming TP=1). Since miles defaults the engine's tp_size to the number of GPUs per engine (2 in this case), this configuration will likely fail during SGLang initialization as the product of parallelisms (TP=2 * DP=2 * EP=2 = 8) exceeds the allocated GPUs (2). Consider adjusting these values to fit the 2-GPU-per-engine budget (e.g., --sglang-dp-size 1 --sglang-ep-size 2).
There was a problem hiding this comment.
Thanks for the catch on the others! On this one I think the TP × DP × EP = 8 analysis doesn't apply once --sglang-enable-dp-attention is set: under DP-attention each DP replica owns a slice of the engine's GPUs (TP internally = engine_gpus / dp_size), so the factors divide rather than multiply.
In our actual run on MI355X with rollout-num-gpus-per-engine=2, dp_size=2, ep_size=2, the engine logged tp_size=2, dp_size=2, ep_size=2 and successfully loaded the DeepseekV32 weights (Load weight end. elapsed=11.36 s, ... mem usage=13.79 GB) and completed training step 0 with train_rollout_logprob_abs_diff ≈ 0.077 — i.e. the train-side forward and the SGLang rollout forward agree to ~8%, which is the expected signature of a working setup. Repro in the PR body if useful.
| # flashmla_sparse -> tilelang (the ROCm NSA backend, matches the upstream | ||
| # MI35x SGLang eval test test_glm5_eval_mi35x.py). | ||
| sglang_args = ( | ||
| "--rollout-num-gpus 4 " |
There was a problem hiding this comment.
Avoid hardcoding the rollout GPU count. Derive it from the NUM_GPUS constant to ensure consistency if the test configuration is modified.
| "--rollout-num-gpus 4 " | |
| f"--rollout-num-gpus {NUM_GPUS // 2} " |
References
- Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.
| "--actor-num-gpus-per-node 4 " | ||
| "--num-gpus-per-node 8 " |
There was a problem hiding this comment.
Avoid hardcoding GPU counts in the command arguments. Use the NUM_GPUS constant and derive the actor GPU count from it to ensure consistency if the test configuration is modified.
| "--actor-num-gpus-per-node 4 " | |
| "--num-gpus-per-node 8 " | |
| f"--actor-num-gpus-per-node {NUM_GPUS // 2} " | |
| f"--num-gpus-per-node {NUM_GPUS} " |
References
- Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.
| # per-worker. Otherwise SGLang's import-time is_gfx95_supported() | ||
| # probe fails with "No HIP GPUs are available". | ||
| "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES": "1", | ||
| "HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7", |
There was a problem hiding this comment.
Avoid hardcoding the HIP_VISIBLE_DEVICES string. It should be derived from the NUM_GPUS constant to maintain consistency.
| "HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7", | |
| "HIP_VISIBLE_DEVICES": ",".join(map(str, range(NUM_GPUS))), |
References
- Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.
Addresses three review comments suggesting we avoid hardcoded GPU counts in command args and env vars. NUM_GPUS // 2 is the train/rollout half-split for async mode. The HIGH-priority comment about SGLang parallelism (TP*DP*EP=8 > 2 GPUs/engine) is NOT addressed here -- with --sglang-enable-dp-attention the TP and DP factors divide the GPUs/engine rather than multiplying. The actual run logged tp_size=2/dp_size=2/ep_size=2 with rollout-num-gpus-per-engine=2 and successfully loaded weights and completed training step 0. Will reply on the PR.
- Drop DeepEP; use alltoall dispatcher - Swap flashmla_sparse NSA backend for tilelang - Set sglang-page-size to 1 (required by tilelang NSA)
…ark#1122 , radixark#1123): - Drop DeepEP; use alltoall dispatcher - Swap flashmla_sparse NSA backend for tilelang - Set sglang-page-size to 1 (required by tilelang NSA)
ROCm-adapted variant of test_glm5_744b_a40b_4layer.py for AMD MI300 / MI355X. Mirrors the existing pattern of NV/AMD-specific test/script pairs in the repo.
Key substitutions vs the NVIDIA test (full table in the file docstring):
Mode: colocate 8-GPU -> async (train_async.py) 4 train + 4 rollout
Megatron attn: --attention-backend flash -> fused
MoE: drop --moe-enable-deepep, use --moe-token-dispatcher-type alltoall
Optimizer: add --optimizer-cpu-offload (matches Qwen3 async)
SGLang NSA: flashmla_sparse -> tilelang (ROCm port)
SGLang layout: 1 engine x 8 GPUs -> 2 engines x 2 GPUs (DP=2 EP=2)
Drop --sglang-page-size 64 (NSA pool requires page_size=1)
Drop --disable-weights-backuper (triggers torch_memory_saver in async)
Add --sglang-disable-custom-all-reduce
The test depends on (already in flight as separate PRs):
Verified end-to-end on MI355X (8x): training step 0 completes in 26s steady state, train_rollout_logprob_abs_diff ~= 0.077.