
[bug] CUDA Graphs Hang in MoE Token Dispatcher #3517

@ngu3

Description

Benchmarking jobs for GPT-OSS (BF16) using CUDA graphs are hanging in the 26.02 release. This appears to be a regression, as the same configurations (64 and 128 GPUs) were functional in version 25.11.

The hang occurs during the CUDA graph capture/replay phase.
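
For reference, the sketch below shows the standard torch.cuda.CUDAGraph capture/replay split in PyTorch. It is illustrative only, not Megatron's actual code path (our run goes through the transformer_engine graph implementation), but the two phases are the same in spirit; the hang appears between the end of capture and the first replays.

  import torch

  # Illustrative capture/replay phases; model and shapes are placeholders.
  model = torch.nn.Linear(1024, 1024).cuda()
  x = torch.randn(64, 1024, device="cuda")

  # Warm-up on a side stream, required before capture.
  s = torch.cuda.Stream()
  s.wait_stream(torch.cuda.current_stream())
  with torch.cuda.stream(s):
      for _ in range(3):
          y = model(x)
  torch.cuda.current_stream().wait_stream(s)

  # Capture phase: kernels are recorded into the graph, not executed.
  g = torch.cuda.CUDAGraph()
  with torch.cuda.graph(g):
      y = model(x)

  # Replay phase: the recorded work is launched on every training step.
  g.replay()
  torch.cuda.synchronize()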

Environment

  • Release: 26.02 (Regression from 25.11)
  • Model: GPT-OSS BF16
  • Hardware: 64, 128, and 256 GPU GB300 clusters
  • Config:
    • moe_flex_dispatcher_backend = None
    • moe_token_dispatcher_type = alltoall
    • cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]

Observed Behavior

When CUDA graphs are enabled with the scopes above, the process hangs indefinitely.

  • File Location: megatron/core/transformer/moe/token_dispatcher.py
  • GPU Utilization: 0% (idle)
  • CPU Utilization: ~200% per process, which indicates a busy-wait or host-side deadlock (see the diagnostic sketch after this list)
  • Scaling Impact:
    • 64/128 GPUs: immediate hang in 26.02 (functional in 25.11)
    • 256 GPUs: fails in both 25.11 and 26.02
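
To confirm where each rank is blocked, a stdlib-only stack dump can be wired into the training entrypoint before capture starts. This is a hypothetical diagnostic sketch (the 600 s timeout is an arbitrary choice of ours), not something the workload already does:

  import faulthandler
  import signal

  # Dump Python stacks of all threads on SIGUSR1 so a hung rank can be
  # inspected with `kill -USR1 <pid>` from inside the pod.
  faulthandler.register(signal.SIGUSR1, all_threads=True)

  # Also dump automatically once if we are still stuck after 10 minutes.
  faulthandler.dump_traceback_later(timeout=600, repeat=False)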

Comparison Data (NPI NV Roofline)

| GPU Count | Scopes | 25.11 (vs Roofline Target) | 26.02 (vs Roofline Target) |
| :--- | :--- | :--- | :--- |
| 64 | attn, moe_router, moe_preprocess | PASS | FAIL (hang) |
| 128 | attn, moe_router, moe_preprocess | PASS | FAIL (hang) |
| 256 | attn, moe_router, moe_preprocess | FAIL | FAIL (hang) |

Additional Context

Release 26.04 exhibits the same hang and the same behavior.

Minimal repro

We run the above workload on a GKE cluster; the launch script we use is:

  python scripts/performance/custom_setup_experiment.py \
    --model_family_name gpt_oss \
    --model_recipe_name gpt_oss_120b \
    --config_variant v1 \
    --gpu gb300 \
    --num_gpus 64 \
    --gpus_per_node 4 \
    --compute_dtype bf16 \
    --seq_length 4096 \
    --global_batch_size 512 \
    --micro_batch_size 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1 \
    --expert_model_parallel_size 64 \
    --expert_tensor_parallel_size 1 \
    --cuda_graph_impl transformer_engine \
    --cuda_graph_scope attn,moe_router,moe_preprocess

We use custom_setup_experiment.py for the GKE environment; on a Slurm cluster, setup_experiment.py can be tried instead.
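
As a more distilled reproduction, the core pattern the scoped dispatcher capture exercises can be sketched as below. This is a hypothetical reduction on our part: the function name and buffers are ours rather than Megatron's, and it assumes an already-initialized NCCL process group with one GPU per rank and a PyTorch/NCCL build that supports graph-capturing collectives.

  import torch
  import torch.distributed as dist

  def captured_alltoall(inp: torch.Tensor):
      """Capture a single all_to_all_single into a CUDA graph.

      Hypothetical stand-in for the alltoall token dispatch; assumes
      dist.init_process_group("nccl") has already run on every rank.
      """
      out = torch.empty_like(inp)

      # Warm-up on a side stream, as required before graph capture.
      s = torch.cuda.Stream()
      s.wait_stream(torch.cuda.current_stream())
      with torch.cuda.stream(s):
          for _ in range(3):
              dist.all_to_all_single(out, inp)
      torch.cuda.current_stream().wait_stream(s)

      # Every rank must reach this capture region in the same collective
      # order; any rank that diverges (or syncs with the host inside the
      # region) leaves the others spinning, matching the observed hang.
      g = torch.cuda.CUDAGraph()
      with torch.cuda.graph(g):
          dist.all_to_all_single(out, inp)
      return g, out

If the real dispatcher needs host-visible token counts to size the all-to-all splits inside the captured region, the device-to-host sync that requires could stall capture, but that is speculation we have not confirmed.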

Expected behavior

Training proceeds past CUDA graph capture and runs normally, as it does in 25.11. Instead, it gets stuck right after CUDA graph capture completes.

Affected area

area:model

Regression?

Yes
