Skip to content

[ROCm] Add async-mode test_glm5_744b_a40b_4layer_rocm.py#1124

Open
lizamd wants to merge 2 commits into
radixark:mainfrom
lizamd:add/test-glm5-rocm
Open

[ROCm] Add async-mode test_glm5_744b_a40b_4layer_rocm.py#1124
lizamd wants to merge 2 commits into
radixark:mainfrom
lizamd:add/test-glm5-rocm

Conversation

@lizamd
Copy link
Copy Markdown
Contributor

@lizamd lizamd commented May 13, 2026

ROCm-adapted variant of test_glm5_744b_a40b_4layer.py for AMD MI300 / MI355X. Mirrors the existing pattern of NV/AMD-specific test/script pairs in the repo.

Key substitutions vs the NVIDIA test (full table in the file docstring):

Mode: colocate 8-GPU -> async (train_async.py) 4 train + 4 rollout
Megatron attn: --attention-backend flash -> fused
MoE: drop --moe-enable-deepep, use --moe-token-dispatcher-type alltoall
Optimizer: add --optimizer-cpu-offload (matches Qwen3 async)
SGLang NSA: flashmla_sparse -> tilelang (ROCm port)
SGLang layout: 1 engine x 8 GPUs -> 2 engines x 2 GPUs (DP=2 EP=2)
Drop --sglang-page-size 64 (NSA pool requires page_size=1)
Drop --disable-weights-backuper (triggers torch_memory_saver in async)
Add --sglang-disable-custom-all-reduce

The test depends on (already in flight as separate PRs):

Verified end-to-end on MI355X (8x): training step 0 completes in 26s steady state, train_rollout_logprob_abs_diff ~= 0.077.

ROCm-adapted variant of test_glm5_744b_a40b_4layer.py for AMD MI300 /
MI355X. Mirrors the existing pattern of NV/AMD-specific test/script
pairs in the repo.

Key substitutions vs the NVIDIA test (full table in the file
docstring):

  Mode: colocate 8-GPU -> async (train_async.py) 4 train + 4 rollout
  Megatron attn: --attention-backend flash -> fused
  MoE: drop --moe-enable-deepep, use --moe-token-dispatcher-type alltoall
  Optimizer: add --optimizer-cpu-offload (matches Qwen3 async)
  SGLang NSA: flashmla_sparse -> tilelang (ROCm port)
  SGLang layout: 1 engine x 8 GPUs -> 2 engines x 2 GPUs (DP=2 EP=2)
  Drop --sglang-page-size 64 (NSA pool requires page_size=1)
  Drop --disable-weights-backuper (triggers torch_memory_saver in async)
  Add --sglang-disable-custom-all-reduce

The test depends on (already in flight as separate PRs):
  - The indexer topk clamp (fix/glm5-indexer-topk-clamp).
  - The sparse-MLA fwd/bwd kernel retuning for gfx950
    (fix/glm5-sparse-mla-gfx950).
  - Optionally TE PR ROCm/TransformerEngine#573 (CK grouped GEMM
    dispatcher).

Verified end-to-end on MI355X (8x): training step 0 completes in 26s
steady state, train_rollout_logprob_abs_diff ~= 0.077.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a ROCm-adapted end-to-end test for the GLM-5 744B model, tailored for AMD MI300/MI355 hardware. The implementation features async training, checkpoint patching for architecture compatibility, and ROCm-specific SGLang configurations using the tilelang backend. Review feedback highlights a critical inconsistency in SGLang parallelism settings that would likely cause initialization failure due to over-allocation of GPUs. Additionally, the reviewer suggested replacing several hardcoded GPU counts and environment strings with values derived from the NUM_GPUS constant to improve the test's flexibility and maintainability.

Comment on lines +204 to +205
"--sglang-dp-size 2 "
"--sglang-ep-size 2 "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The SGLang parallelism configuration appears inconsistent with the allocated GPU budget per engine. With --rollout-num-gpus 4 and --rollout-num-gpus-per-engine 2, each engine is allocated 2 GPUs. However, the configuration --sglang-dp-size 2 and --sglang-ep-size 2 would require 4 GPUs per engine (assuming TP=1). Since miles defaults the engine's tp_size to the number of GPUs per engine (2 in this case), this configuration will likely fail during SGLang initialization as the product of parallelisms (TP=2 * DP=2 * EP=2 = 8) exceeds the allocated GPUs (2). Consider adjusting these values to fit the 2-GPU-per-engine budget (e.g., --sglang-dp-size 1 --sglang-ep-size 2).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch on the others! On this one I think the TP × DP × EP = 8 analysis doesn't apply once --sglang-enable-dp-attention is set: under DP-attention each DP replica owns a slice of the engine's GPUs (TP internally = engine_gpus / dp_size), so the factors divide rather than multiply.

In our actual run on MI355X with rollout-num-gpus-per-engine=2, dp_size=2, ep_size=2, the engine logged tp_size=2, dp_size=2, ep_size=2 and successfully loaded the DeepseekV32 weights (Load weight end. elapsed=11.36 s, ... mem usage=13.79 GB) and completed training step 0 with train_rollout_logprob_abs_diff ≈ 0.077 — i.e. the train-side forward and the SGLang rollout forward agree to ~8%, which is the expected signature of a working setup. Repro in the PR body if useful.

# flashmla_sparse -> tilelang (the ROCm NSA backend, matches the upstream
# MI35x SGLang eval test test_glm5_eval_mi35x.py).
sglang_args = (
"--rollout-num-gpus 4 "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid hardcoding the rollout GPU count. Derive it from the NUM_GPUS constant to ensure consistency if the test configuration is modified.

Suggested change
"--rollout-num-gpus 4 "
f"--rollout-num-gpus {NUM_GPUS // 2} "
References
  1. Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.

Comment on lines +231 to +232
"--actor-num-gpus-per-node 4 "
"--num-gpus-per-node 8 "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid hardcoding GPU counts in the command arguments. Use the NUM_GPUS constant and derive the actor GPU count from it to ensure consistency if the test configuration is modified.

Suggested change
"--actor-num-gpus-per-node 4 "
"--num-gpus-per-node 8 "
f"--actor-num-gpus-per-node {NUM_GPUS // 2} "
f"--num-gpus-per-node {NUM_GPUS} "
References
  1. Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.

# per-worker. Otherwise SGLang's import-time is_gfx95_supported()
# probe fails with "No HIP GPUs are available".
"RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES": "1",
"HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid hardcoding the HIP_VISIBLE_DEVICES string. It should be derived from the NUM_GPUS constant to maintain consistency.

Suggested change
"HIP_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
"HIP_VISIBLE_DEVICES": ",".join(map(str, range(NUM_GPUS))),
References
  1. Magic numbers should be replaced with named constants with explanatory comments to improve code readability and maintainability.

Addresses three review comments suggesting we avoid hardcoded GPU
counts in command args and env vars. NUM_GPUS // 2 is the train/rollout
half-split for async mode.

The HIGH-priority comment about SGLang parallelism (TP*DP*EP=8 > 2
GPUs/engine) is NOT addressed here -- with --sglang-enable-dp-attention
the TP and DP factors divide the GPUs/engine rather than multiplying.
The actual run logged tp_size=2/dp_size=2/ep_size=2 with
rollout-num-gpus-per-engine=2 and successfully loaded weights and
completed training step 0. Will reply on the PR.
sreerohi added a commit to sreerohi/miles_fork that referenced this pull request May 21, 2026
- Drop DeepEP; use alltoall dispatcher
- Swap flashmla_sparse NSA backend for tilelang
- Set sglang-page-size to 1 (required by tilelang NSA)
sreerohi added a commit to sreerohi/miles_fork that referenced this pull request May 21, 2026
…ark#1122 , radixark#1123):

- Drop DeepEP; use alltoall dispatcher
- Swap flashmla_sparse NSA backend for tilelang
- Set sglang-page-size to 1 (required by tilelang NSA)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant