Miles CI gap between ROCm & CUDA #1105

@indianspeedster

This issue tracks adding ROCm CI for the miles repo, mirroring the existing NVIDIA PR Test workflow (.github/workflows/pr-test.yml). The NVIDIA workflow runs 11 per-commit suites; we want the same suites green on AMD Instinct MI300/MI355X with ROCm.

To list what each suite contains:

python3 -m tests.ci.run_suite --hw cpu  --suite stage-a-fast --list-only
python3 -m tests.ci.run_suite --hw cuda --suite <suite-name> --list-only
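
To list every suite in one pass, the two commands above can be generated for each suite name that appears in this issue. This is a minimal sketch, not part of the repo: the `SUITES` list and the `list_command` helper are hypothetical, and the `--hw` values simply mirror the two commands above (whether `run_suite` takes a different value on ROCm hosts is not covered here).

```python
import shlex

# Suite names copied from this issue (hypothetical list, not read from the repo).
SUITES = [
    "stage-a-fast",
    "stage-b-fast-1-gpu",
    "stage-b-sglang-8-gpu",
    "stage-b-short-8-gpu",
    "stage-c-fsdp-8-gpu",
    "stage-c-megatron-8-gpu",
    "stage-c-precision-8-gpu",
    "stage-c-ckpt-8-gpu",
    "stage-c-long-8-gpu",
    "stage-c-lora-8-gpu",
    "stage-c-glm5-8-gpu",
]


def list_command(suite: str) -> list[str]:
    """Build the --list-only command for one suite, as shown in the issue."""
    # Assumption: only stage-a-fast uses --hw cpu; every other suite uses --hw cuda.
    hw = "cpu" if suite == "stage-a-fast" else "cuda"
    return [
        "python3", "-m", "tests.ci.run_suite",
        "--hw", hw, "--suite", suite, "--list-only",
    ]


# Print one shell-ready command per suite; nothing is executed here.
for suite in SUITES:
    print(shlex.join(list_command(suite)))
```

The script only prints the commands, so it can be inspected first and then piped into a shell from the repo root.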

Update (2026-05-21): synced with fork/main. Net changes since this issue was opened: stage-b-sglang-1-gpu renamed to stage-b-sglang-8-gpu (#1107); test_session_server_multi_role.py moved from stage-b-short-8-gpu to stage-b-sglang-8-gpu (#1107); both moonlight tests deleted in #1137 (so PR #1165 is moot); test_glm47_flash_r3_mtp.py and test_qwen3_30B_A3B_r3.py re-enabled upstream in #1137; 4 new files added to stage-a-fast (#1117, #1137, b8649e6).

Stage A

stage-a-fast (CPU) — 40 / 44 PASS, 4 new files pending verification. 1 already skipped upstream.

  • tests/fast/test_megatron_cli_flags.py
  • tests/fast/router/test_session_pretokenized_e2e.py
  • tests/fast/router/test_session_race_conditions.py
  • tests/fast/router/test_sessions.py
  • tests/fast/router/test_linear_trajectory.py
  • tests/fast/router/test_router.py
  • tests/fast/rollout/generate_utils/test_sample_utils.py
  • tests/fast/rollout/generate_utils/test_openai_endpoint_utils.py
  • tests/fast/rollout/inference_rollout/test_compatibility.py
  • tests/fast/rollout/rm_hub/test_math_dapo_utils.py
  • tests/fast/rollout/rm_hub/test_math_utils.py
  • tests/fast/rollout/rm_hub/test_rm_hub.py
  • tests/fast/rollout/rm_hub/test_deepscaler.py
  • tests/fast/rollout/rm_hub/test_gpqa.py
  • tests/fast/rollout/rm_hub/test_f1.py
  • tests/fast/utils/test_arguments.py
  • tests/fast/utils/test_types.py
  • tests/fast/utils/test_misc.py
  • tests/fast/utils/test_mask_utils.py
  • tests/fast/utils/test_async_utils.py
  • tests/fast/utils/test_dumper_utils.py
  • tests/fast/utils/test_lora_arguments.py
  • tests/fast/utils/test_http_utils.py
  • tests/fast/utils/test_env_report.py
  • tests/fast/utils/test_logging_utils.py
  • tests/fast/utils/chat_template_utils/test_tito_tokenizer.py
  • tests/fast/utils/chat_template_utils/test_template.py
  • tests/fast/utils/chat_template_utils/test_pretokenized_via_tito.py
  • tests/fast/utils/chat_template_utils/test_token_seq_comparator.py
  • tests/fast/utils/chat_template_utils/test_pretokenized_chat.py
  • tests/fast/utils/test_utils/test_mock_sglang_server.py
  • tests/fast/utils/test_utils/test_mock_tools.py
  • tests/fast/backends/megatron_utils/test_lora_hf_weight_iterator.py
  • tests/fast/backends/megatron_utils/test_fp32_param_utils.py
  • tests/fast/backends/megatron_utils/test_lora_model_branches.py
  • tests/fast/backends/megatron_utils/test_lora_checkpoint_helpers.py
  • tests/fast/backends/megatron_utils/test_lora_utils.py
  • tests/fast/backends/megatron_utils/test_lora_weight_sync_validation.py
  • tests/fast/backends/megatron_utils/test_lora_update_weight.py
  • tests/utils/test_sglang_config.py
  • tests/fast/rollout/generate_hub/test_tool_call_utils.py (re-added in fix: restoring CI tests #1137)
  • tests/fast/backends/megatron_utils/test_model_provider_true_on_policy.py (new, b8649e6)
  • tests/fast/backends/megatron_utils/test_qwen2_true_on_policy_conversion.py (new, b8649e6)
  • tests/fast/utils/test_utils/test_session_verify_runner.py (new in [TITO] model support for Kimi-k2/2.5/2.6, nemotron-3-super, mimimax-m2.5/2.7 #1117)

Stage B

stage-b-fast-1-gpu — 15 / 15 PASS.

  • tests/fast/rollout/inference_rollout/integration/test_semaphore.py
  • tests/fast/rollout/inference_rollout/integration/test_dynamic_filter.py
  • tests/fast/rollout/inference_rollout/integration/test_basic.py
  • tests/fast/rollout/inference_rollout/integration/test_deterministic.py
  • tests/fast/rollout/inference_rollout/integration/test_over_sampling.py
  • tests/fast/rollout/inference_rollout/integration/test_agent_metadata.py
  • tests/fast/rollout/inference_rollout/integration/test_multi_turn.py
  • tests/fast/rollout/inference_rollout/integration/test_group_rm.py
  • tests/fast/rollout/inference_rollout/integration/test_multi_sample.py
  • tests/fast/rollout/inference_rollout/integration/test_sample_filter.py
  • tests/fast/rollout/generate_hub/test_multi_turn.py
  • tests/fast/rollout/generate_hub/test_single_turn.py
  • tests/fast/utils/test_nvfp4_quantizer.py
  • tests/fast/utils/test_mxfp8_quantizer.py
  • tests/fast/utils/test_quantizer_ci.py

stage-b-sglang-8-gpu (renamed from stage-b-sglang-1-gpu in #1107) — 1 / 4 PASS, in process.

stage-b-short-8-gpu — 5 / 5 PASS. 1 already skipped upstream.

  • tests/e2e/short/test_qwen2.5_0.5B_gsm8k_async_short.py
  • tests/e2e/short/test_qwen2.5_0.5B_gsm8k_short.py
  • tests/e2e/sglang_config/test_sglang_config_mixed_offload.py
  • tests/e2e/sglang_config/test_sglang_config_mixed_offload_ft.py
  • tests/e2e/sglang_config/test_sglang_config.py

Stage C

stage-c-fsdp-8-gpu — 0 enabled, all skipped upstream.

stage-c-megatron-8-gpu — 4 / 6 PASS, in process. (test_moonlight_16B_A3B.py and test_moonlight_16B_A3B_r3.py deleted in #1137; test_glm47_flash_r3_mtp.py and test_qwen3_30B_A3B_r3.py re-enabled in #1137.)

stage-c-precision-8-gpu — 0 enabled, all skipped upstream.

stage-c-ckpt-8-gpu — 2 / 2 PASS.

stage-c-long-8-gpu — 2 / 2 PASS.

  • tests/e2e/long/test_qwen2.5_0.5B_gsm8k_async.py
  • tests/e2e/long/test_qwen2.5_0.5B_gsm8k.py

stage-c-lora-8-gpu — 1 / 1 PASS.

  • tests/e2e/lora/test_lora_qwen2.5_0.5B.py

stage-c-glm5-8-gpu — 1 / 1 PASS.

Roll-up: 71 / 80 enabled (non-skipped) tests confirmed PASS on MI355X (89%).
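
The roll-up can be re-derived from the per-suite status lines above. A minimal sketch, assuming the (passed, enabled) pairs below are copied correctly from this issue; the `RESULTS` dict and `roll_up` helper are illustrative names, not part of the repo. Suites with everything skipped upstream contribute 0 / 0.

```python
# (passed, enabled) per suite, copied from the status lines in this issue.
RESULTS = {
    "stage-a-fast": (40, 44),
    "stage-b-fast-1-gpu": (15, 15),
    "stage-b-sglang-8-gpu": (1, 4),
    "stage-b-short-8-gpu": (5, 5),
    "stage-c-fsdp-8-gpu": (0, 0),       # all skipped upstream
    "stage-c-megatron-8-gpu": (4, 6),
    "stage-c-precision-8-gpu": (0, 0),  # all skipped upstream
    "stage-c-ckpt-8-gpu": (2, 2),
    "stage-c-long-8-gpu": (2, 2),
    "stage-c-lora-8-gpu": (1, 1),
    "stage-c-glm5-8-gpu": (1, 1),
}


def roll_up(results: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum passed and enabled test counts across all suites."""
    passed = sum(p for p, _ in results.values())
    enabled = sum(e for _, e in results.values())
    return passed, enabled


passed, enabled = roll_up(RESULTS)
print(f"{passed} / {enabled} ({passed / enabled:.0%})")  # prints "71 / 80 (89%)"
```

Updating one pair when a suite's status changes keeps the roll-up line consistent with the per-suite counts.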
