[ATOM SGL] runtime extraction by ZhiweiYan-96 · Pull Request #30 · zejunchen-zejun/ATOM

ZhiweiYan-96 · 2026-05-18T15:27:03Z

No description provided.

) * enable configurable weight bpreshuffle for fp8 blockscale gemm Signed-off-by: ganyi <ygan@amd.com> * add moe for configurable bpreshuffle Signed-off-by: ganyi <ygan@amd.com> --------- Signed-off-by: ganyi <ygan@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com>

* Support deepseek v4 torch compile * remove hc head custom op

* sparse attn mtp wip * maybe I need these * wip * remove logs * remove logs * fix mem * run pass * rm logs * fix format * add test * fix draft model * dsa: remove block_table_convert_triton in dsa * commit * commit * clear code * function pass * fix zeros * fix format * Fix merge issue --------- Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>

* Add the benchmark flow for OOT * Update the benchmark * Add attention-backend * Update the vLLM-ATOM GPT-OSS recipe * Change the recipe of kimi-k2 * revert the changes in recipe * Update atom-vllm-benchmark-guide.md

…usion (ROCm#723)
* perf(deepseek_v4): fused_compress kernel + DualRMSNorm fusion + decode buffer cleanup
Fused-compress kernel (atom/model_ops/v4_kernels/fused_compress.py)
- Split K-loop into two phases (state-only `[0, window_len)` then input-only `[window_len, K)`): on AMD CDNA a masked tl.load still issues the LD instruction; the predicate only suppresses the register write. Issuing only the live side's loads cuts HBM traffic ~40% in the bandwidth-bound regime.
  Microbench (median per launch, BATCH=50, 200 iters):
    HCA (ratio=128 K=128): 35.07us → 28.76us (-18%)
    CSA (ratio=4 K=8 ): unchanged (launch-overhead floor at K=8)
- Padding invariant verified: window_len = K - min(j_in_seq+1, K) bounds K-1-position, so padding (`s < 0`) lies entirely in the state phase; the input phase needs no padding mask.
- Eviction hints: ragged kv_in/score_in marked evict_first (single-use per program), ape evict_last (small + reused).
- FP8 quant fusion path: `tl.clamp + plain .to(fp8)` (aiter style; avoids the slow `fp_downcast_rounding="rtne"` path on AMD that bypasses `v_cvt_pk_fp8_f32`), with UE8M0 scale + MFMA 16x16 preshuffle + .cs streaming stores.
- Bit-exact match vs `fused_compress_attn_reference` (verified at BF16 precision; ≤1 BF16 ULP due to `tl.exp` HW vs libm).
DualRMSNorm fusion (atom/model_ops/layernorm.py, atom/models/deepseek_v4.py)
- q_norm2 + kv_norm (per-head Q + KV, both head_dim=128) routed through existing `DualRMSNorm` + `_fused_qk_norm_single_kernel`.
- q_norm2 carries no learnable weight: added `_make_weightless_rmsnorm` factory (`del weight; weight = None`) so the parameter is absent from `state_dict` (no loader warning), and a `Q_HAS_WEIGHT` constexpr in the fused kernel skips the load when the weight is None.
- `DualRMSNorm._eps` resolved once with explicit None-check fallback (handles both `variance_epsilon` and `eps` attribute names).
Compressor refactor (atom/models/deepseek_v4.py)
- Side-effecting `forward()` returns None: the prior caller-visible BF16 return was vestigial (paged_decode/paged_prefill read scattered entries directly from `unified_kv` / FP8 indexer pool).
- `cache_scale` strided fp32 view binding for the Indexer-inner Compressor (FP8 scale region of the same allocation).
- Auto-detects quant via `kv_cache.dtype != bfloat16`.
CompressPlan decode capacity (atom/model_ops/v4_kernels/compress_plan.py)
- New `decode_capacity_per_ratio` arg: when supplied, the returned `compress_plan_gpu` slice has fixed length = decode-tight bound (`max_decode_tokens // ratio + max_bs`) instead of prefill worst case (~13× larger), with sentinel-fill of trailing rows so the captured kernel grid is decode-sized but address-stable.
- Empty-fwd path now also produces CompressPlans pointing at the pre-allocated buffers (sentinel-filled), so capture-time and replay-time addresses match even on a zero-token fwd.
V4 attention builder (atom/model_ops/attentions/deepseek_v4_attn.py)
- `_decode_compress_cap[ratio]` plumbing for CG decode path.
- Indexer-inner Compressor `cache_scale` view bound from `runner.v4_csa_idx_kv` per-layer.
- Removed `_build_indexer_compress_slot_mapping` and `compress_slot_mapping_gpu` (Indexer-inner now uses block_tables directly).
- Dropped `v4_indexer_decode_logits` and `v4_indexer_decode_topk_indices` from the metadata pool; these are write-once GPU scratch with no CPU mirror, allocated per-fwd via `torch.empty` in `Indexer._score_topk_decode`. Under CG capture the allocations land in the graph's private pool and replay reuses the same address (saves ~2 MiB pinned host + ~2 MiB GPU on the prior `CpuGpuBuffer` overhead).
mark_trace typing (atom/utils/decorators.py)
- `@overload` + ParamSpec: pyright/pylance no longer flags DualRMSNorm-style decorated callables as "not callable".
Triton MoE block_m default (atom/model_ops/fused_moe_triton.py)
- `ATOM_TRITON_MOE_BLOCK_M` default 64 → 32 (better MI355X tile occupancy at typical MoE shapes).
GSM8K nshot=5 (DeepSeek-V4-Pro, --level 0, ATOM_USE_TRITON_MOE=1): flexible-extract 0.9522 ±0.0059 / strict-match 0.9530 ±0.0058 (baseline 0.953/0.954; within 1σ, no regression)
* test remove torch compile
* remove level0
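The decode-tight bound quoted under "CompressPlan decode capacity" is simple integer arithmetic. The sketch below only illustrates that formula and the sentinel-fill idea; the function names and the sentinel value `-1` are hypothetical and are not taken from `compress_plan.py`.

```python
from __future__ import annotations


def decode_capacity(max_decode_tokens: int, ratio: int, max_bs: int) -> int:
    """Decode-tight row bound from the commit text:
    max_decode_tokens // ratio + max_bs."""
    return max_decode_tokens // ratio + max_bs


def sentinel_fill(rows: list[int], capacity: int, sentinel: int = -1) -> list[int]:
    """Pad a plan to a fixed length so its size does not depend on how many
    rows are live. The sentinel value here is an assumption for illustration."""
    if len(rows) > capacity:
        raise ValueError("plan has more rows than the decode capacity")
    return rows + [sentinel] * (capacity - len(rows))


if __name__ == "__main__":
    cap = decode_capacity(max_decode_tokens=512, ratio=4, max_bs=128)
    print(cap, sentinel_fill([3, 7], 5))
```

A fixed-length, sentinel-padded plan is what lets a captured kernel grid keep the same shape on every replay, at the cost of a few no-op rows.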

* CI: add runner model-cache diagnostics in atom tests. * CI: gate ATOM tests on Pre Checkin workflow status Replace the signal artifact handoff with direct workflow run status checks so downstream CI avoids artifact lookup latency and pagination issues.

* [fix](gpt-oss): change the accuracy test for gpt-oss * [fix](ci): change gpt-oss accuracy test in ci * [feat](ci): add client command for gpt-oss --------- Co-authored-by: perzhang <perzhang@amd.com>

* add triton fallback for ds & gptoss * fix format * add triton mha layout * remove useless * update limit * refactor block table

…cipe (ROCm#670) * correct the quant type based on recipe for fuse_qknorm_quant * remove debug print * use quant type Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * use unified aiter fused_qk_norm interface Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * change the import path for fused_qknorm Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * fix qknorm quant compile path * document qknorm quant compile guard * simplify qknorm quant dispatch * use single rmsnorm quant entrypoint * update api * clean the qk rmsnorm code * fix corner case --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* [DSV4]moe flydsl * update * update * refactor moe shuffle * update * update * modify atom env * update env --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

) Replace the chunk + double torch.clamp + F.silu * up sequence in Expert.forward with a single aiter.silu_and_mul(out, combined, limit) call. The new limit parameter folds the swiglu_limit clamp (gate <= limit, up in [-limit, limit]) into the kernel via the v_med3_f32 intrinsic, removing several launch-bound ops on the per-token critical path. Requires aiter PR ROCm/aiter#3104 (already merged), which adds the limit parameter and HAS_LIMIT compile-time specialization to silu_and_mul. Verified on DeepSeek-V4-Pro tp=8 --level 0: GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1): run 1: 0.9515 / 0.9522 (flexible / strict) run 2: 0.9522 / 0.9530 Matches V4-Pro baseline (0.9522 / 0.9530), within 1 sigma stderr.
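As a reading aid, the clamp semantics described above (gate <= limit, up in [-limit, limit], then SiLU(gate) * up) can be written as a scalar reference. This is a pure-Python sketch of the math only; it is not the aiter kernel, it does not model the memory layout of `combined`, and `silu_and_mul_ref` is a hypothetical name.

```python
import math
from typing import List, Optional


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul_ref(
    gate: List[float], up: List[float], limit: Optional[float] = None
) -> List[float]:
    """Reference for SiLU(gate) * up with the optional swiglu_limit clamp."""
    out = []
    for g, u in zip(gate, up):
        if limit is not None:
            g = min(g, limit)                 # gate is clamped from above only
            u = sorted((-limit, u, limit))[1]  # median of three == clamp to [-limit, limit]
        out.append(silu(g) * u)
    return out


if __name__ == "__main__":
    print(silu_and_mul_ref([10.0, -1.0], [9.0, -9.0], limit=7.0))
```

The two-sided clamp is written as a median of three because that is the operation a med3-style instruction computes when the lower bound does not exceed the upper bound.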

Co-authored-by: Guanbao Yu <gyu@amd.com>

* temp reinstall triton==3.6.0 * reinstall triton for vLLM-ATOM * update comments for better tracking --------- Co-authored-by: Guanbao Yu <gyu@amd.com>

* add _aiter_fused_routing_from_topk switch * ruff fix * change import * black format

* remove chunk split Signed-off-by: ganyi <ygan@amd.com> * fix accuracy Signed-off-by: ganyi <ygan@amd.com> * add None input for not quantize case Signed-off-by: ganyi <ygan@amd.com> * fix ci Signed-off-by: ganyi <ygan@amd.com> * maybe fix the ci benchmark crash Signed-off-by: ganyi <ygan@amd.com> * add env var for bare atom acc test Signed-off-by: ganyi <ygan@amd.com> * add fp8 kv cache for qwen3next bare atom ci Signed-off-by: ganyi <ygan@amd.com> * make fusion as default Signed-off-by: ganyi <ygan@amd.com> * disable non-persistent pa Signed-off-by: ganyi <ygan@amd.com> * add num tokens Signed-off-by: ganyi <ygan@amd.com> --------- Signed-off-by: ganyi <ygan@amd.com>

ROCm#733)
Three related changes that together cut input-quant overhead on V4-Pro attention:
1. atom/model_ops/layernorm.py: single dispatch covering all dynamic quant types
   - Replace the mxfp4-only `mxfp4_rms_quant_fuse` (triton wrapper) with a unified `_aiter_rms_quant` helper that dispatches on quant_type to aiter.{rmsnorm_quant, add_rmsnorm_quant} (HIP).
   - Now supports per_1x32 (MXFP4 fp4x2 + UE8M0 scale + preshuffle), per_1x128 (FP8 block + transpose-aware scale), and per_Token (FP8 dynamic per-token).
   - `_aiter_transpose_scale` resolved once at __init__ instead of per-forward env lookup.
2. atom/models/deepseek_v4.py: fuse q_norm with wq_b input quant
   - q_norm now constructed with fused_quant=True, quant_config=qc; emits (qr_fp8, qr_scale) in one launch.
   - Outer wq_b (ColumnParallel) and Indexer.wq_b (Replicated) both consume the pre-quantized pair via `x_scale=`, which saves two redundant per_1x128 input quants per layer.
   - Drop `otype=torch.float32` on wkv_gate (Compressor): kernel's internal fp32 accumulator already handles the upcast; bf16 intermediate halves the (kv, score) buffer bandwidth.
3. .github/workflows/atom-benchmark.yaml: fix env_vars expression injection
   - Multi-line `env_vars` JSON values (e.g. "AITER_BF16_FP8_MOE_BOUND=0\nATOM_MOE_GU_ITLV=1") were inlined directly into a `for ev in ${{ ... }}` loop, breaking host bash with `syntax error near unexpected token`.
   - Switch both Start CI container blocks to the env-block + docker --env-file pattern already used by atom-test.yaml.
   - Drop the redundant inline `${{ matrix.cell.env_vars }}` prefix on the regression launch step (vars are now in the container env via --env-file); this also fixes a silent bug that dropped the first var.
Verified on DeepSeek-V4-Pro tp=8 --level 0:
GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1):
  baseline : 0.9522 / 0.9530
  + rmsnorm_quant refactor: 0.9507 / 0.9515
  + q_norm fuse : 0.9538 / 0.9538
  + drop wkv_gate fp32 : 0.9583 / 0.9591 ← this PR
All within 1 sigma stderr (~0.0058) of baseline; no regression.
1024/1024 c=64 microbench (q_norm fuse vs prior commit):
  Total token throughput : 2735.80 → 2813.00 tok/s (+2.8%)
  Median TPOT : 41.28 → 38.83 ms (-5.9%)
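The env-file approach in item 3 avoids pasting a multi-line value into shell source at all. A minimal sketch of the idea follows: turn the multi-line `env_vars` string into one `KEY=VALUE` per line, which is the format `docker run --env-file` reads. The helper name is hypothetical and the actual workflow does this in YAML and bash, not Python.

```python
def env_vars_to_env_file(raw: str) -> str:
    """Convert a multi-line 'KEY=VALUE' string into env-file text.

    Each non-empty line must look like KEY=VALUE; anything else is rejected
    rather than being passed on to a shell.
    """
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, _value = line.partition("=")
        if not sep or not key:
            raise ValueError(f"not a KEY=VALUE line: {line!r}")
        lines.append(line)
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    raw = "AITER_BF16_FP8_MOE_BOUND=0\nATOM_MOE_GU_ITLV=1"
    print(env_vars_to_env_file(raw), end="")
```

Because the values are written to a file and never interpolated into a `for` loop, a newline inside the input can no longer change how the host shell parses the script.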

…fault cudagraph capture to 512 (ROCm#737)
ParallelHead refactor:
- Delegate vocab-axis sharding, weight_loader, last-token slicing, bf16 a16w16 GEMM, and TP all-gather to ParallelLMHead.
- Remove ~30 lines of duplicated logic.
- Drop fp32 weight (CDNA3/CDNA4 bf16 MFMA accumulates in fp32 natively; no precision change). Saves ~1.85 GB cluster VRAM.
- Hoist Expert.forward's local silu_and_mul import to top-level.
CUDAGraph capture default:
- Add 512 to default --cudagraph-capture-sizes list. Required to capture the full graph at max_num_seqs=512; previously the cap was 256 and size-512 graphs were filtered out by model_runner.
Validated:
- GSM8K 1319q 0.9591 +/- 0.0058 (statistically equivalent to baseline ~0.952).
- c=128 1k/1k throughput unchanged at ~4855 tok/s.
- c=512 1k/1k throughput +50.7% (7427 -> 11195 tok/s) when paired with --max-num-seqs 512.
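To see why a missing 512 entry matters, consider a runner that pads each decode batch up to the nearest captured graph size and falls back to eager execution when none is large enough. That selection policy and the size list below are assumptions for illustration only; they are not taken from ATOM's model_runner.

```python
from bisect import bisect_left
from typing import List, Optional


def pick_graph_size(batch_size: int, capture_sizes: List[int]) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if the batch
    is larger than every captured graph (hypothetical policy)."""
    sizes = sorted(capture_sizes)
    i = bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


if __name__ == "__main__":
    old_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # hypothetical default list
    print(pick_graph_size(300, old_sizes))           # no graph large enough
    print(pick_graph_size(300, old_sizes + [512]))   # padded up to 512
```

Under this policy, any batch above 256 sequences would run without a graph until 512 is added to the captured sizes.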

* [Kimi] support Eagle3 speculative decoding for Kimi K2.5 Adds Eagle3 spec decode for Kimi K2.5 (MLA target + standard MHA draft): - Eagle3LlamaModel: 1-layer Llama draft (dual-norm input, wide QKV, independent embed/lm_head) matching the lightseekorg/kimi-k2.5-eagle3 checkpoint - Eagle3DraftBuilder: implements the post-ROCm#659 builder protocol (compute_block_bytes / allocate_kv_cache_tensors / build_kv_cache_tensor) for the draft's independent non-MLA KV cache, attached to the runner from EagleProposer.__init__ via runner.eagle3_draft_builder. ModelRunner delegates KV pool sizing, allocation, and per-module binding through this hook with no eagle3-specific code in the runner KV path - Aux hidden state pipeline: target forward returns (hidden, aux_hidden_states), captured through CUDAGraph via graph_aux_hidden and fed to the draft's combine_hidden_states (fc) as input - SpeculativeConfig: --method eagle3 + --draft-model CLI; eagle3 vs MTP branching at construction time; fail-fast if draft is MLA - Scheduler: spec_stats only updated when speculation actually ran (matches vLLM's gating) - propose: draft-perspective predicate `draft_uses_mha = hasattr(runner, "eagle3_draft_builder")` drives both the metadata-flow special-cases (slot_mapping re-slice, context_lens += 1, tuple-unpack of the draft return value); is_eagle3 string comparison is gone from the hot path Result on Kimi-K2.5-MXFP4 + kimi-k2.5-eagle3, 8x MI355X, gsm8k 5-shot: acceptance 67.85%, accuracy 93.78%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: add nightly Eagle3 spec-decode accuracy test for Kimi-K2.5 Reuses the base Kimi-K2.5-MXFP4 model + lightseekorg/kimi-k2.5-eagle3 draft, runs at TP=8 (Eagle3 draft KV needs full 8-rank sharding) under nightly schedule. Local case_verify_v9_gluon measured GSM8K 5-shot flexible-extract = 0.9257 (vLLM = 0.9280); threshold set to 0.91 with ~1.5pp noise headroom. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* support three stream in Deepseek V4 * modify hca dual stream if no indexer * clean code logic

* Update the vLLM-ATOM benchmark scope * Update benchmark model tags and weekday scheduling. Normalize model tagging to MET/OOB/AW prefixes and labels, and replace nightly A/B/C date rotation with weekday-based grouping (Mon/Wed MET, Tue/Thu AW, Fri ALL, weekends skipped) for clearer benchmark cadence control. * Align AW Minimax serve args and warmup behavior. Add --kv-cache-dtype fp8 for MiniMax-M2.5 AW TP2/4/8 so startup flags match expected cache settings, and always pass --num-warmups=$((2 * CONC)) for vllm bench serve runs to keep warmup load consistent. * Normalize AW metadata and gpt-oss memory env settings. Add nightly_group=B for all AW model entries, align all gpt-oss variants to OOT_GPU_MEMORY_UTILIZATION=0.5, and restrict scheduled benchmark cron to weekdays to avoid weekend empty runs.
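The weekday-based grouping described above (Mon/Wed MET, Tue/Thu AW, Fri ALL, weekends skipped) can be sketched as a small lookup. The function name and the use of `None` for skipped days are illustrative choices, not the workflow's actual implementation.

```python
import datetime as dt
from typing import Optional


def nightly_group(day: dt.date) -> Optional[str]:
    """Map a calendar date to the benchmark group named in the commit text."""
    weekday = day.weekday()  # Monday == 0 ... Sunday == 6
    if weekday in (0, 2):
        return "MET"
    if weekday in (1, 3):
        return "AW"
    if weekday == 4:
        return "ALL"
    return None  # weekends are skipped


if __name__ == "__main__":
    for offset in range(7):
        day = dt.date(2026, 5, 18) + dt.timedelta(days=offset)
        print(day.isoformat(), nightly_group(day))
```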

…c quant recipe (ROCm#747) * [fix][acc] fix accuracy of fp8 attn weights model using ptpc quant recipe Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

* [atom-vllm benchmark] refine atom-vllm benchmark UI Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * add Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * Restore main's (MET)/(OOB)/(AW) naming, weekday nightly rotation, and AW benchmark logic - Restore (MET)/(OOB)/(AW) suffix in all model display names and prefixes - Remove feature-branch-only TP variants, keep exactly main's 27 models - Add all 15 AW model variants from main with bench_args "--random-range-ratio 1" - Restore weekday-based nightly rotation: Mon/Wed A-MET, Tue/Thu B-AW, Fri C-ALL - Restore is_aw_model logic in build-benchmark-matrix (AW ISL/OSL pairs + extra concurrency) - Restore per-model BENCH_CLIENT override: AW models auto-use vLLM bench client Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * Fix dashboard upload defaults and AW model data accuracy - Enable both dashboard uploads (gh-pages + LLM-Booster) for nightly schedule runs - Set publish_to_dashboard default to true for manual runs (matching main) - Fix AW model benchmark_client in LLM-Booster data: read per-result client from enriched JSON instead of using global env (AW models override to vLLM bench) - Fix AW model random_range_ratio: set to "1" in matrix params so filename, benchmark command, and dashboard data are all consistent - Add dashboard_model for DeepSeek-R1 FP8/MXFP4 and gpt-oss-120b MET to avoid (MET) suffix leaking into LLM-Booster display names Co-Authored-By: Claude Opus 4 
<noreply@anthropic.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

* Add upload step for runner fleet report Added a step to upload the runner fleet report as an artifact. * Add error handling for missing files in artifact upload

* [MoE] align generic MXFP4 shuffle layout Use the shared separated-layout weight shuffle path for non-interleaved MXFP4 MoE weights and keep the interleaved scale shuffle explicit, matching AITER GateMode layout expectations. Co-authored-by: Cursor <cursoragent@cursor.com> * [MoE] mark shuffled MXFP4 weights * update code --------- Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Guanbao Yu <gyu@amd.com>

* update kimi k2/k2.5 recipe * update command --------- Co-authored-by: Guanbao Yu <gyu@amd.com>

* [fix](gpt-oss): fix gpt-oss model accuracy bug * [fix](server): fix gpu-memory-utilization args --------- Co-authored-by: perzhang <perzhang@amd.com>

* fix all tokens hit issue * refine

* preshuffle indexer cache Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * use persistent mode mla Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * Format sparse MLA plugin files * Remove --block-size 1 from DeepSeek-V3.2 benchmark and accuracy configs The sparse MLA plugin now defaults to block_size=64 for preshuffled indexer cache. The hardcoded --block-size 1 in CI configs would override this default and prevent the performance gains from taking effect. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> --------- Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

…OCm#777) * add qwen3.5 cases for sglang-atom Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * remove redundant env Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> --------- Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

* fix(qwen3_5): packed_modules_mapping for in_proj_qkvzba BF16 weight has a fused in_proj_qkvzba; FP8/MXFP4 ship pre-split in_proj_qkvz and in_proj_ba. Add a helper that rewrites the packed_modules_mapping when quant_dtype is bfloat16 so the loader slices the fused tensor along the qkv/z/b/a axes correctly. Apply at the two atom-native conditional-generation entrypoints; the vllm plugin path keeps its existing inline fixup. * ci(accuracy): add Qwen3.5-397B-A17B BF16 nightly entry Adds the BF16 weight variant alongside the existing FP8 / MXFP4 entries. Use --kv_cache_dtype fp8 to avoid the known MI355 BF16 KV cache accuracy issue. Threshold/baseline are placeholders copied from the FP8 entry; refresh after the first CI measurement. --------- Co-authored-by: JiaoliangYu <jiaolyu@hjbog-srdc-39.amd.com>

* mtp 1 acc right Signed-off-by: ganyi <ygan@amd.com> * add recipe for qwen3-next-mtp Signed-off-by: ganyi <ygan@amd.com> * modify some qwen3.5 recipe Signed-off-by: ganyi <ygan@amd.com> * black Signed-off-by: ganyi <ygan@amd.com> * remove redundant code Signed-off-by: ganyi <ygan@amd.com> * remove redundant code Signed-off-by: ganyi <ygan@amd.com> * add spec decode convert for vllm plugin Signed-off-by: ganyi <ygan@amd.com> * remove vllm related branch Signed-off-by: ganyi <ygan@amd.com> * use atom spec decode config for plugin loading Signed-off-by: ganyi <ygan@amd.com> * remove unnecessary changes in modeling Signed-off-by: ganyi <ygan@amd.com> * format Signed-off-by: ganyi <ygan@amd.com> * add qwen3next mtp into benchmark Signed-off-by: ganyi <ygan@amd.com> * [ci] disable FP8 blockscale weight preshuffle for Qwen3.5/Qwen3-Next Add ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to all Qwen3.5 and Qwen3-Next model configs across benchmark, nightly accuracy, and recipe files. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * [ci] fix Qwen3-Next MTP benchmark label from MET to AW Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * [docs] fix Qwen3.5 recipe: update env var count and add preshuffle doc Remove stale "three" count (now variable list), add ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to both the Important section and Key Environment Variables section. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> --------- Signed-off-by: ganyi <ygan@amd.com> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

…#785) * ci: branch-aware docker release + fix benchmark model selection docker-release: use github.ref_name for branch-aware builds so non-default branches produce tagged images (nightly_YYYYMMDDHHMM-{branch}) without overwriting the latest tag. atom-benchmark: fix model selection bug where MiniMax and GLM-5.1 models were always included regardless of checkbox state, caused by input name mismatches (MiniMax-M2.5 vs M2.7 prefix, missing glm-5.1-fp4 input). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * support tag by self --------- Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
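The tag shape quoted above, `nightly_YYYYMMDDHHMM-{branch}`, can be sketched as follows. The helper name is hypothetical, and the branch sanitization step is an added assumption (Docker tags only allow letters, digits, `_`, `.` and `-`); the workflow itself may handle branch names differently.

```python
import datetime as dt
import re


def nightly_branch_tag(now: dt.datetime, branch: str) -> str:
    """Build a nightly_YYYYMMDDHHMM-{branch} tag for a non-default branch."""
    stamp = now.strftime("%Y%m%d%H%M")
    # Assumption: replace characters that are not valid in a Docker tag.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    return f"nightly_{stamp}-{safe_branch}"


if __name__ == "__main__":
    print(nightly_branch_tag(dt.datetime(2026, 5, 18, 15, 27), "release/v1"))
```

Embedding the branch in the tag keeps images from side branches distinguishable and leaves the `latest` tag untouched.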

* adapt mtp for glm5 (vllm plugin) Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * add patch to support mtp>1 Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * fix model load failure of draft model Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * adapt full graph with mtp enabled Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * fix MLA MTP acceptance issue Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * fall back to vllm-style mtp position Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * fix embedding sharing failure for mtp Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * fix lint Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * fix comment Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * remove warning log Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * add mtp support for glm4 Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix rope double apply for mha Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * guard vllm forward context retrieval Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * add glm4.7 mtp to workflow Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * clean up Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * format: fix black formatting in glm4_moe_mtp.py Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> --------- Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: kliuae-amd <kuanfliu@amd.com> Co-authored-by: whx-sjtu <xiaowang990929@gmail.com> Co-authored-by: kuanfliu@amd.com <kuanfliu@amd.com@mia1-vm-amd-prj3-k8s-005.amd3.mia.tensorwave.lan> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

…ROCm#795) * Add inferencex-pr skill for ATOM benchmark comparison and PR creation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Replace inferencex-pr skill with inferencex-sync Renames the skill to better reflect its purpose: comparing ATOM upstream benchmark results against InferenceX and reporting regression/improvement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Map vLLM cache dtype aliases before passing them into AITER metadata setup so the default auto value does not fail dtype parsing. Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

* [perf](deepseek): add fused indexer path Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): clear fused indexer rope args Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): simplify fused indexer path Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): drop unrelated indexer weights change Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): gate fused indexer path Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): guard indexer wk fusion by dtype Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): remove unused sparse import Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): address fused indexer review feedback Keep indexer weight fusion decisions consistent with quant fallback paths and make dummy/profile behavior deterministic. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): align indexer review fixes with black Keep the PR compatible with the repository's Black formatting check after the review fixes. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): simplify indexer fusion guard Collapse the review helper stack while keeping a single model-level fusion decision for weight loading. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): move pending fp8 wk before dequant Ensure threaded checkpoint loading dequantizes pending FP8 indexer wk weights on the same device as their scales. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): tighten indexer fusion fallback Keep the env-disabled path aligned with the unfused baseline and fail clearly when fused wk weights are incomplete. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): decouple indexer wk fusion guard Keep wk/weights projection fusion capability independent from the QK RoPE cache fusion runtime switch. 
Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): restrict fused indexer kernel configs Avoid enabling the fused indexer kernels for GLM configs with incompatible RoPE cache shapes while keeping compatible sparse indexer configs eligible. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

* modify logic with running p1 daily and p1/p2 on weekends * Monday-Thursday: P0; Friday: P1&P2; Weekend: C-ALL * modify code * modify the running time logic * Add MTP cases and run the case according to P0/P1/P2 * Merge 'cron' * Change the scheduling time --------- Co-authored-by: root <root@hjbog-srdc-19.amd.com>

* Add benchmark checkout ref inputs Allow manual benchmark runs to target a specific ATOM branch or commit while preserving the default scheduled behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * Simplify benchmark ref input Keep only the commit override input for manual benchmark runs and place it after the regular benchmark parameters. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: root <root@SHAHYI01.amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

…X PR creation (ROCm#802) * Add inferencex-pr skill for ATOM benchmark comparison and PR creation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Replace inferencex-pr skill with inferencex-sync Renames the skill to better reflect its purpose: comparing ATOM upstream benchmark results against InferenceX and reporting regression/improvement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * pr flow update Signed-off-by: seungrokj <seungrok.jung@amd.com> * (skill)[inferencex-sync]: Clarify all-jobs-passed condition for latest run Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Signed-off-by: seungrokj <seungrok.jung@amd.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* [fix](deepseek): handle padded sparse indexer inputs Keep fused indexer cache writes limited to actual vLLM plugin tokens and pass RoPE caches in the 2D shape expected by the AITER indexer kernel. Co-authored-by: Cursor <cursoragent@cursor.com> * [fix](deepseek): pass unsliced sparse indexer inputs Rely on AITER's mapped-token bounds for padded graph inputs so the sparse indexer plugin can pass the original tensors through both fused and fallback cache update paths. * [fix](deepseek): keep sparse indexer plugin unchanged Leave the plugin sparse indexer path unchanged so this branch only carries the DeepSeek-V3.2 RoPE cache layout fix. --------- Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: perzhang <perzhang@amd.com>

* [ATOM SGL] Qwen3.5 mha support * precheckin --------- Co-authored-by: wuhuikx <hattie.wu@amd.com>

…MTP-1 CI (ROCm#817) * feat(deepseek_v4_mtp): wrapper-owned MTP with share_with_target hook Redesign V4 MTP to match V2/V3/Qwen convention: - Target (DeepseekV4ForCausalLM) carries no MTP modules / weights - Wrapper (DeepseekV4MTP) self-contains MTP blocks + loads mtp.* via the standard load_model path with spec_decode=True - EagleProposer.share_with_target() rebinds embed/head from the target's already-loaded instances; no double KV / double weight load Loader: replace per-call `not spec_decode` mtp short-circuit with `need_load_mtp = spec_decode and any("mtp" in n for n in params_dict)` so target loads and Eagle3 drafts (no `mtp.*` params) skip cleanly. EagleProposer: add `share_with_target` escape valve for models whose embed/head naming diverges from the standard `embed_tokens`/`lm_head` (V4 uses `model.embed`/`model.head`). ModelRunner: thread `extra_output_dims` so V4's `[N, hc_mult, dim]` residual-stack output (hc_head + LM head deferred to compute_logits) flows correctly into the outputs buffer. SpeculativeConfig: register `deepseek_v4` -> `deepseek_v4_mtp` mapping and stamp `model_type=deepseek_v4_mtp` onto the draft hf_config; the runner's is_deepseek_v4() now accepts both so V4-specific paths fire on the draft too. * fix(deepseek_v4): SWA+MTP cache aliasing, num_rejected rollback, EOS truncation - SWA ring buffer: cache_size = window_size + max_spec_steps to prevent draft-vs-verified slot aliasing when k>=1 spec tokens share a row's window. - Per-row window_topk indexing collapses case_a/b/c into a single formula: (pos - W + 1 + w) % cache_size with abs<0 / abs>pos masked to -1. - prepare_decode rolls back context_lens by num_rejected (mirrors aiter_mla), fixing stale ctx after MTP rejections. - scheduler.postprocess truncates bonus tokens emitted after EOS / stop_token / stop_sequence (rejection_sampler greedy kernel does not check EOS), removing trailing <BOS> in MTP outputs. 
- Cleanup: drop dead _build_window_topk_batched, drop start_pos_per_token param, rename kernel param win -> cache_size for clarity. Validated: GSM8K 0.9545 (flex/strict), MTP-1 acceptance 86.27%. * wip * feat(deepseek_v4): MTP-K>1 support via prepare_mtp_decode + backend-agnostic eagle mid-step - eagle.propose mid-step: gate flat kv_indices/kv_indptr reads on `has_flat_kv` (hoisted out of the loop) so V4 path skips them; MLA/MHA unchanged - DeepseekV4AttentionMetadataBuilder.prepare_mtp_decode: per-draft-step V4 region metadata rebuild for 1-token-per-seq shape, zero D2H (mirrors eagle's GPU context_lens += 1 on CPU side instead of reading back). Entry short-circuits when max_per_req_cache_slots is unset (warmup case; attr is set by ModelRunner.get_num_blocks which runs after warmup_model). - CommonAttentionBuilder.__init__: centralize num_attention_heads (was duplicated in V4, MLA, MHA builders). - start_simple_inference.sh: drop env-var translation for engine flags, all engine config now passes through EXTRA_ARGS as native CLI args. - stop_atom_server.sh: also SIGTERM lm_eval so the next launch starts clean. Validated: V4 simple_inference (MTP1 + MTP3), V4 GSM8K full-1319 flexible-extract MTP3=0.9568 / MTP1=0.9545 (within stderr, no regression), acceptance distribution {0: 6.5%, 1: 24.1%, 2: 39.7%, 3: 29.7%} for MTP3, avg toks/fwd 2.92 vs 1.87 for MTP1 (+56%). * fix(deepseek_v4): wire write_v4_paged_decode_indices kernel + tooling - _attach_v4_paged_decode_meta: replace 6 transient tensors + 2 x index_copy_ chain with the existing write_v4_paged_decode_indices Triton kernel that reads only persistent forward_vars buffers. Eliminates allocator-churn race surfaced as ASSERT_TRAP in index_copy_kernel_impl<OpaqueType<4>> under MTP-3 long prefill. - _stage: harden silent astype auto-cast into an assertion so dtype drift fails fast at the call site instead of corrupting downstream triton reads. 
- scripts: move run_benchmark{,_sweep}.sh to atom/scripts/ and add wait_infer_drain.sh (hang vs drain monitor used during the V4 MTP investigation).
- skills: convert flat .claude/skills/*.md into the SKILL.md/<name> subdirectory layout that the Skill tool actually loads; add new capture-trace skill (English, correct script paths, structured to match peer skills).

* refactor(deepseek_v4): rename seq_lens→context_lens, require plan_buffers, drop dead device arg

- `_build_compress_plans` + `make_compress_plans`: rename `seq_lens_np` / `seq_lens_cpu` → `context_lens_np` / `context_lens_cpu`. Every caller already passes `context_lens`; the old name suggested "raw seq length" while the actual semantics is "absolute seq_len AFTER the new extend tokens" (= prefix + extend). Internal `prefix = context_lens - extend_lens` reconstruction unchanged.
- `make_compress_plans`: `plan_buffers` is now required (was `Optional[dict]`). The legacy fresh-`torch.from_numpy(...).to(device)` fallback is removed, along with the now-unused `device` positional arg. That fallback reproduces the exact allocator-churn race we fixed in `write_v4_paged_decode_indices` (transient tensors handed to in-flight kernels). Forcing every caller through the pre-allocated CpuGpuBuffer pool keeps data pointers stable and prevents the race from being re-introduced.
- `_build_compress_plans`: keep the `assert isinstance(..., np.ndarray)` guards added in the same series so callers can't silently pass torch tensors and trigger a hidden D2H sync.

Verified on V4-Pro no-MTP GSM8K (TP=8, level=0, CG on):
HEAD c454322f: 0.95 flexible-extract
this commit: 0.9560 flexible-extract (within 1σ; 1319/1319)

* docs(skill): debug-agent-locate-kernel — `--enforce-eager`/`--level 0` are optional fallbacks

Reword pre-flight item 2 + the matching anti-pattern + the example launch command. Previous text claimed both flags were required ("Always pass both"). That's wrong: the debug agent runs fine under hipgraph in most cases.
The flags only become useful when the symptom points at graph mode or Inductor:
- `--enforce-eager` for faults that don't reproduce in eager / capture-replay crashes under the agent's no-caching-allocator behavior;
- `--level 0` for AMD Inductor `cluster_dims` autotune crashes at warmup.

Adding them blindly hides graph-mode-only bugs (current branch: V4-Pro no-MTP server-mode prefill hangs ONLY under `--enforce-eager`; CG path works fine — we'd never have spotted that with the old "always pass both" rule). CLAUDE.md `run_debug_agent.sh` row updated in lockstep.

* fix(deepseek_v4): gate alt_stream async-compress on CUDAGraph capture

eager mode triggered a hipStream deadlock on V4-Pro: 60 layers of main Compressor launches accumulate on alt_stream, and the first splitk GEMM workspace allocation (the third bf16gemm_*_splitk_clean kernel load) hits a caching-allocator race with the MoE shared_experts GEMM also targeting alt_stream. Verified hang on both small (~800 token) and large (>2k token) prefill batches in eager.

Inside a CUDAGraph capture block this same pattern is safe: the graph records the fork-join edges and replay re-uses the same stream layout without per-launch allocator contention. So the fix is to gate `use_async_compress` on a new `ForwardContext.in_hipgraph` flag set only by `model_runner.capture_model`. Replay does not re-execute Python forward, so the flag is irrelevant there.

While here:
- Cache `torch.cuda.current_stream()` once per fwd in `ForwardContext.main_stream` and have V4 attention / MoE / launcher read it instead of querying repeatedly (eliminated per-call handle allocation; metadata builder still queries directly because it runs before `set_forward_context`).
- Module-level cache `_CUDA_AVAILABLE` so the cuda-availability check doesn't fire per `set_forward_context`.
- Fix `wait_infer_drain.sh` drain detection: once the eval client is gone and a single poll shows no new output, declare drain immediately instead of waiting STUCK_POLLS polls (the old logic added ~120 s of false-positive wait to every run).

Verified end-to-end:
- V4-Pro eager + full GSM8K (1319 reqs): 420s, 0.9553/0.9560 (was deadlocking; now matches CG baseline)
- V4-Pro CG mode + full GSM8K: 240s, 0.9492/0.9500
- V4-Pro CG + MTP1 + full GSM8K: 180s, 0.9575/0.9583
- V4-Pro CG + MTP3 still crashes on large prefill batch (pre-existing MEMORY_VIOLATION unrelated to these changes)

* fix(deepseek_v4): race-free swa_write_indices + unify is_pure_decode source

Under concurrent serving the swa_write kernel hit MEMORY_VIOLATION on `tl.load(kv_ptr + src_id * head_dim + d_offsets)` because the shared pinned `v4_meta_swa_write_indices` buffer was rewritten by the next fwd's CPU side BEFORE the previous fwd's async H2D DMA actually fired, producing torn `src_id` values that exceed `kv.shape[0]`. Captured under rocm-debug-agent: all faulting waves at PC +1144 (`s_waitcnt vmcnt(0)`) in `_swa_write_kernel`, confirming the prior async kv load as the OOB source.

The fix writes into the pre-allocated shared GPU buffer with a static GPU iota source for the pure-decode arange path and a fresh local numpy for prefill — never touches the pinned `.np` alias whose next-fwd rewrite would tear the in-flight DMA.

Co-changes:
- Hoist `is_pure_decode` to AttentionMetaData_DSV4 construction time (single source of truth: prepare_decode / prepare_prefill / build_for_cudagraph_capture each declare their own semantics). `_attach_v4_per_fwd_meta` and `_attach_v4_paged_decode_meta` now read instead of recomputing.
- Replace the old `(token_num_per_seq == 1).all()` arange-shortcut proxy with `is_pure_decode` (MTP verifier batches no longer take the slow concat path).
- Tighten swa-write grid: prefill (eager) uses `num_write` exactly, decode/MTP CG uses the existing `padded_bs * (1+max_spec_steps)` bucket. Long-prefill chunks no longer launch up to ~64x sentinel-bail programs.
- Tighten `v4_meta_swa_write_indices` from `mnbt` to `max_bs * win` (~16x smaller; universal worst case across paths).
- Pre-allocate static GPU iota `self._swa_iota` as the H2D-free source for the is_pure_decode arange.
- scripts/wait_infer_drain.sh: add offline simple_inference mode (process-exit + no-fault = drain) and detect "Memory access fault by GPU" alongside MEMORY_VIOLATION / ASSERT_TRAP.

Verified: eager mode gsm8k 5-shot conc=65, 0 MEMORY_VIOLATION, exact_match flexible 0.9553 / strict 0.9545 (baseline ~0.95+).

* ci(deepseek_v4): add MTP-1 per-PR accuracy + benchmark entries

* fix(deepseek_v4): ruff lint + restore prep_stream sync + cache_size in fused swa_write

- Drop unused `device` / `lru_cache` / first `prep_stream` assignment
- Restore `prep_stream.wait_stream` / `with torch.cuda.stream(prep_stream)` around prepare_prefill H2D staging (suspected root for CG+MTP-3 tail deadlock at recv_mtp_status_async)
- fused_qk_norm_rope_swa_write: pass `cache_size` instead of undefined `win` so the ring stride matches `swa_kv.shape[1]` under MTP

* feat(benchmark): custom message encoder fallback + wait_infer_drain client-log support

- benchmark_serving.py: route `--use-chat-template` through atom.entrypoints.openai.chat_encoders so models without a Jinja chat_template in tokenizer_config.json (e.g. DeepSeek-V4-Pro) render via the model-shipped custom encoder
- wait_infer_drain.sh: treat LOG_FILE mtime growth as a progress signal alongside the engine "output send" marker, so passing a client log (e.g.
benchmark.log with tqdm output) no longer triggers a false-positive HANG

* refactor(deepseek_v4): inline window_topk + ragged-packed decode kv_indices

Two coupled simplifications to the V4 paged-decode index build (decode/MTP path); GSM8K 3-shot 0.949 and 1k/1k c=64 MTP1 4247 tok/s match pre-refactor.

1) Drop the [mnbt, win] `v4_meta_window_topk` CG buffer (~4 MB) and the CPU `_build_window_topk_np` function. `write_v4_paged_decode_indices` now derives `n = min(positions[t]+1, win)` and `ring = (pos - n + 1 + i) % cs` inline from `var["positions"].gpu` — no intermediate.
2) Decode `kv_indices_swa/csa/hca` now use ragged-packed per-token slot counts (`actual_swa_count[t] + n_compress[t]`), matching the prefill layout instead of the prior uniform `win + n_compress` with -1 sentinel padding. `skip_prefix_len_csa[t]` is now `actual_swa_count[t]` in both paths (was hard-coded `win` in decode).

* refactor(deepseek_v4): metadata int32 sweep + drain log auto-discovery + run-atom-workload skill

deepseek_v4_attn.py:
- Drop all unnecessary int64 in CPU metadata paths (cu_seqlens, indptr cumsums, segment indices, n_committed_{csa,hca}_per_seq). Keeps int64 only where GPU fancy-index ABI requires it (batch_id_per_token, swa_write_indices). Saves widen/narrow churn; no perf change but cleaner mental model.
- Hoist n_committed_{csa,hca}_per_seq_cpu as dataclass single source of truth, replacing 3 independent ctx//k recomputes in paged_decode_meta / paged_prefill_meta / v4_indexer_meta.
- Drop dead `start_pos_per_seq_cpu` dataclass field — only one real consumer (_build_paged_prefill_meta); inline as local var, drop 3 dead setters (prepare_decode, CG capture, prepare_mtp_decode).
- Drop dead `positions_np` + `start_pos_per_seq_cpu` parameters from _attach_v4_per_fwd_meta; update 5 call sites.
- Drop `cu_seqlens_q_np` + `start_pos_per_seq` params from _attach_v4_indexer_meta (read from attn_metadata.n_committed_csa_per_seq_cpu).
- Drop dead helper `_clear_v4_paged_decode_meta` (defaults already None).
- Drop dead CpuGpuBuffers: v4_meta_start_pos_per_seq + v4_meta_token_num_per_seq (allocated but never staged).
- Drop redundant `if True:` wrapper + 3 redundant `import numpy as np` inside methods (module-level import suffices).
- net -178 lines, ruff clean.

scripts/wait_infer_drain.sh (v1.2):
- Auto-discover server log via readlink /proc/<pid>/fd/1 of the atom.entrypoints process. Eliminates the recurring bug where callers passed the wrong LOG_FILE (e.g. lm_eval's silent gsm8k_eval.log) causing false HANG verdicts.
- Caller-supplied LOG_FILE is now optional (was: defaulted to hardcoded /app/logs_claude/atom_server.log). Used only as supplementary signal for fault grep (dual-log scan) and tqdm-style mtime detection.
- Path-portable across repo layouts; no hardcoded log paths.

.claude/skills/run-atom-workload/SKILL.md (new):
- Codifies the canonical 4-step ATOM workload flow (stop → start → workload-in-shell-bg → wait_infer_drain → stop) for accuracy eval (GSM8K), benchmark, sweep, offline simple_inference, and fault repro under rocm-debug-agent.
- Pins model-family env vars (V4-Pro: AITER_BF16_FP8_MOE_BOUND=0 ATOM_MOE_GU_ITLV=1; Kimi: HSA_NO_SCRATCH_RECLAIM=1; etc.).
- Hard rules block past failure modes: no wrapper scripts in /app/logs_claude/, no && chaining, no double-backgrounding the server, always step 5 teardown.
- Uses project-relative scripts/ paths (repo-portable).

Verification:
- MTP-1 GSM8K 3-shot = 0.9492 +/- 0.006 (baseline 0.9545 +/- 0.0057, within 1 sigma, no regression).
- MTP-1 benchmark 1024/1024 c=64 = 4748 Total tok/s, 24.3 ms TPOT (vs prior baseline 4666 tok/s, +1.8% / no regression).
- Drain auto-discovery smoke-tested on live MTP-1 server: discovered /app/logs_claude/atom_server.log via /proc, detected DRAINED cleanly.
- MTP-3 + CG still hangs on high-concurrency GSM8K (known issue per feedback_v4_cg_mtp_status_deadlock.md); this commit does not address it.
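
The SWA ring-buffer indexing described in the commit messages above (`cache_size = window_size + max_spec_steps`, `n = min(positions[t]+1, win)`, `ring = (pos - n + 1 + i) % cs`) can be illustrated with a small pure-Python sketch. This is not the Triton kernel from the PR: the function names are hypothetical, and the sketch assumes a token at absolute position `p` is stored in slot `p % cache_size`.

```python
from typing import List


def swa_cache_size(window_size: int, max_spec_steps: int) -> int:
    """Ring size quoted in the commit message: window plus room for draft tokens."""
    return window_size + max_spec_steps


def window_ring_slots(pos: int, win: int, cache_size: int) -> List[int]:
    """Slots of the last min(pos + 1, win) tokens ending at `pos`, oldest first.

    Mirrors the inline formula n = min(pos + 1, win); ring = (pos - n + 1 + i) % cs.
    """
    n = min(pos + 1, win)
    return [(pos - n + 1 + i) % cache_size for i in range(n)]


def draft_write_slots(pos: int, num_spec: int, cache_size: int) -> List[int]:
    """Slots that `num_spec` draft tokens after `pos` would write.

    Assumption (not stated in the source): slot = absolute position % cache_size.
    """
    return [(pos + 1 + k) % cache_size for k in range(num_spec)]


def aliases_window(pos: int, win: int, num_spec: int, cache_size: int) -> bool:
    """True if any draft write lands on a slot still inside the verified window."""
    window = set(window_ring_slots(pos, win, cache_size))
    return any(s in window for s in draft_write_slots(pos, num_spec, cache_size))
```

Under these assumptions, a ring of exactly `win` slots lets the first draft token overwrite the oldest in-window slot, while `win + max_spec_steps` slots keep the window and the draft writes disjoint, because together they span at most `cache_size` consecutive positions.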

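The EOS truncation mentioned in the SWA+MTP fix above (bonus tokens emitted after EOS / stop_token are dropped in postprocessing because the greedy rejection sampler does not check EOS) might look roughly like the following. This is a minimal sketch, not the scheduler code: `truncate_after_stop` is a hypothetical helper, it handles single-token stops only (multi-token stop sequences are omitted), and keeping the stop token itself in the output is an assumption.

```python
from typing import Iterable, List


def truncate_after_stop(token_ids: List[int], stop_token_ids: Iterable[int]) -> List[int]:
    """Return the tokens up to and including the first stop token.

    Tokens accepted after the first stop token in the same speculative step
    are discarded. If no stop token is present, a copy of the input is returned.
    """
    stops = set(stop_token_ids)
    for i, tok in enumerate(token_ids):
        if tok in stops:
            # Assumption: the stop token is kept; everything after it is dropped.
            return token_ids[: i + 1]
    return list(token_ids)
```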
* MTP(num_step=1) for DeepSeek * Add work log for claude debug * adopt new attn constructor args * rm worklog * use atom_parameter * kwargs handle * rebase main * precheckin * fix k_scale v_scale error * new commit * fix blank * fix qwen3.5 acc --------- Co-authored-by: ZhiweiYan-96 <ZhiweiYan@amd.com> Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>

…ention backend

Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention backend as the host/backend layers. Shrink deepseek_mla_forward.py into a helper module and clarify absorbed vs non-absorbed path naming.

ganyi1996ppo and others added 30 commits May 8, 2026 15:21

Support torch compile in deepseek v4 (ROCm#705)

68d0fbe

* Support deepseek v4 torch compile * remove hc head custom op

Add the benchmark flow for ATOM vLLM plugin (ROCm#514)

247e9b1

* Add the benchmark flow for OOT * Update the benchmark * Add attention-backend * Update the vLLM-ATOM GPT-OSS recipe * Change the recipe of kimi-k2 * revert the changes in recipe * Update atom-vllm-benchmark-guide.md

[fix](gpt-oss): change the accuracy test for gpt-oss (ROCm#720)

abdb093

* [fix](gpt-oss): change the accuracy test for gpt-oss * [fix](ci): change gpt-oss accuracy test in ci * [feat](ci): add client command for gpt-oss --------- Co-authored-by: perzhang <perzhang@amd.com>

Add triton fallback for deepseek & gptoss (ROCm#721)

77b3d1e

* add triton fallback for ds & gptoss * fix format * add triton mha layout * remove useless * update limit * refactor block table

Dsv4 moe flydsl (ROCm#718)

3f4c425

* [DSV4]moe flydsl * update * update * refactor moe shuffle * update * update * modify atom env * update env --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

Optimize Deepseek V4 prepare decode (ROCm#728)

fc64fb6

fix quant mapping for kimi k25 (ROCm#732)

28f59f2

Co-authored-by: Guanbao Yu <gyu@amd.com>

[dockerfile] temp reinstall triton==3.6.0 for accuracy (ROCm#726)

3e06687

* temp reinstall triton==3.6.0 * reinstall triton for vLLM-ATOM * update comments for better tracking --------- Co-authored-by: Guanbao Yu <gyu@amd.com>

[Triton] add fused_routing_from_topk switch (ROCm#725)

303a2e4

* add _aiter_fused_routing_from_topk switch * ruff fix * change import * black format

support three stream in Deepseek V4 (ROCm#736)

a82e5cd

* support three stream in Deepseek V4 * modify hca dual stream if no indexer * clean code logic

[fix][acc][sgl-atom] fix accuracy of fp8 attn weights model using ptp…

7934d5e

…c quant recipe (ROCm#747) * [fix][acc] fix accuracy of fp8 attn weights model using ptpc quant recipe Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

Add upload step for runner fleet report (ROCm#754)

247bb23

* Add upload step for runner fleet report Added a step to upload the runner fleet report as an artifact. * Add error handling for missing files in artifact upload

Remove lru_cache in get_per_token_exponential func (ROCm#742)

23f2272

fix kimi k25 config error with trust-remote-code (ROCm#751)

c88261e

Co-authored-by: Guanbao Yu <gyu@amd.com>

support dp attention in deepseek v4 (ROCm#745)

2b43d39

[recipe][vLLM-ATOM] update kimi k2/k2.5 recipe (ROCm#730)

7452ea9

* update kimi k2/k2.5 recipe * update command --------- Co-authored-by: Guanbao Yu <gyu@amd.com>

[fix](gpt-oss): fix gpt-oss model accuracy bug (ROCm#748)

48068eb

* [fix](gpt-oss): fix gpt-oss model accuracy bug * [fix](server): fix gpu-memory-utilization args --------- Co-authored-by: perzhang <perzhang@amd.com>

jiayyu and others added 28 commits May 14, 2026 22:16

fix prefix default on (ROCm#784)

e77a5ce

* fix all tokens hit issue * refine

(ci)[SGLang-ATOM]: Add Qwen3.5 cases for ci, nightly and benchmark (R…

f993245

…OCm#777) * add qwen3.5 cases for sglang-atom Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * remove redundant env Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> --------- Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

[fix](mha): handle scalar kv scales in prefix gather (ROCm#793)

9d279f0

Co-authored-by: Cursor <cursoragent@cursor.com>

fix: handle auto kv cache dtype in vLLM plugin for sparse mla (ROCm#806)

ebc5154

Update minimax_m2.py (ROCm#820)

beae6cd

CI: extend aiter wheel artifact retention (ROCm#823)

aa6d2c8

[fix](gpt-oss): fix quark quantized model in moe bias (ROCm#787)

aa7c25a

Co-authored-by: perzhang <perzhang@amd.com>

rm mtp no prefix cache assert (ROCm#824)

a076ab3

[ATOM SGL] Qwen3.5 mha support (ROCm#819)

6619cc7

* [ATOM SGL] Qwen3.5 mha support * precheckin --------- Co-authored-by: wuhuikx <hattie.wu@amd.com>

add work log

29408d1

[ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full att…

e1d06c7

…ention backend

remove work log

3598254

[ATOM SGL] runtime extraction

4aec32a

ZhiweiYan-96 force-pushed the zhiwei/attn_refractory_runtime branch from edde9c6 to 4aec32a, May 19, 2026 11:43

ZhiweiYan-96 mentioned this pull request May 21, 2026

[ATOM SGLang] SGL plugin Attention Refractory ROCm/ATOM#863

Open

ZhiweiYan-96 wants to merge 76 commits into zhiwei/attn_refractory from zhiwei/attn_refractory_runtime

ZhiweiYan-96 commented May 18, 2026

20 participants
