[ATOM SGL] runtime extraction#30

Open
ZhiweiYan-96 wants to merge 76 commits into zhiwei/attn_refractory from zhiwei/attn_refractory_runtime

Conversation

@ZhiweiYan-96
Collaborator

No description provided.

ganyi1996ppo and others added 30 commits May 8, 2026 15:21
)

* enable configurable weight bpreshuffle for fp8 blockscale gemm

Signed-off-by: ganyi <ygan@amd.com>

* add moe for configurable bpreshuffle

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
* Support deepseek v4 torch compile

* remove hc head custom op
* sparse attn mtp wip

* maybe I need these

* wip

* remove logs

* remove logs

* fix mem

* run pass

* rm logs

* fix format

* add test

* fix draft model

* dsa: remove block_table_convert_triton in dsa

* commit

* commit

* clear code

* function pass

* fix zeros

* fix format

* Fix merge issue

---------

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
* Add the benchmark flow for OOT

* Update the benchmark

* Add attention-backend

* Update the vLLM-ATOM GPT-OSS recipe

* Change the recipe of kimi-k2

* revert the changes in recipe

* Update atom-vllm-benchmark-guide.md
…usion (ROCm#723)

* perf(deepseek_v4): fused_compress kernel + DualRMSNorm fusion + decode buffer cleanup

Fused-compress kernel (atom/model_ops/v4_kernels/fused_compress.py)
  - Split K-loop into two phases (state-only `[0, window_len)` then
    input-only `[window_len, K)`): on AMD CDNA masked tl.load still
    issues the LD instruction, predicate only suppresses the register
    write. Issuing only the live side's loads cuts HBM traffic ~40%
    in the bandwidth-bound regime.
    Microbench (median per launch, BATCH=50, 200 iters):
      HCA (ratio=128 K=128): 35.07us → 28.76us (-18%)
      CSA (ratio=4   K=8 ): unchanged (launch-overhead floor at K=8)
  - Padding invariant verified: window_len = K - min(j_in_seq+1, K)
    bounds K-1-position, so padding (`s < 0`) lies entirely in the
    state phase — input phase needs no padding mask.
  - Eviction hints: ragged kv_in/score_in marked evict_first
    (single-use per program), ape evict_last (small + reused).
  - FP8 quant fusion path: `tl.clamp + plain .to(fp8)` (aiter style;
    avoids the slow `fp_downcast_rounding="rtne"` path on AMD that
    bypasses `v_cvt_pk_fp8_f32`), with UE8M0 scale + MFMA 16x16
    preshuffle + .cs streaming stores.
  - Bit-exact match vs `fused_compress_attn_reference` (verified at
    BF16 precision; ≤1 BF16 ULP due to `tl.exp` HW vs libm).
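
The two-phase split and its padding invariant can be illustrated with a small host-side sketch. This is not the Triton kernel: only the `window_len` formula comes from the commit message, while `masked_sum`, `split_sum`, and the toy `state` / `inputs` lists are hypothetical stand-ins. The sketch only shows that a state-only loop followed by an input-only loop visits the same elements as one masked loop while issuing half as many loads.

```python
def window_len(j_in_seq: int, K: int) -> int:
    """Leading K-positions served from saved state (formula from the commit message)."""
    return K - min(j_in_seq + 1, K)


def masked_sum(state, inputs, j_in_seq):
    """One loop over [0, K): both sides are loaded each step and a mask picks one."""
    K = len(state)
    wl = window_len(j_in_seq, K)
    total, loads = 0.0, 0
    for k in range(K):
        s, x = state[k], inputs[k]  # two loads per step, one value is discarded
        loads += 2
        total += s if k < wl else x
    return total, loads


def split_sum(state, inputs, j_in_seq):
    """Two loops: state-only over [0, wl), then input-only over [wl, K)."""
    K = len(state)
    wl = window_len(j_in_seq, K)
    total, loads = 0.0, 0
    for k in range(wl):       # state phase: only state is read
        total += state[k]
        loads += 1
    for k in range(wl, K):    # input phase: only input is read, no mask needed
        total += inputs[k]
        loads += 1
    return total, loads
```

Both functions accumulate in the same order, so they return identical totals; the load counters differ by a factor of two in this toy model, which is a simplification of the traffic saving the commit reports.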

DualRMSNorm fusion (atom/model_ops/layernorm.py, atom/models/deepseek_v4.py)
  - q_norm2 + kv_norm (per-head Q + KV, both head_dim=128) routed
    through existing `DualRMSNorm` + `_fused_qk_norm_single_kernel`.
  - q_norm2 carries no learnable weight — added `_make_weightless_rmsnorm`
    factory (`del weight; weight = None`) so the parameter is absent
    from `state_dict` (no loader warning), and a `Q_HAS_WEIGHT` constexpr
    in the fused kernel skips the load when the weight is None.
  - `DualRMSNorm._eps` resolved once with explicit None-check fallback
    (handles both `variance_epsilon` and `eps` attribute names).

Compressor refactor (atom/models/deepseek_v4.py)
  - Side-effecting `forward()` returns None — the prior caller-visible
    BF16 return was vestigial (paged_decode/paged_prefill read scattered
    entries directly from `unified_kv` / FP8 indexer pool).
  - `cache_scale` strided fp32 view binding for the Indexer-inner
    Compressor (FP8 scale region of the same allocation).
  - Auto-detects quant via `kv_cache.dtype != bfloat16`.

CompressPlan decode capacity (atom/model_ops/v4_kernels/compress_plan.py)
  - New `decode_capacity_per_ratio` arg: when supplied, the returned
    `compress_plan_gpu` slice has fixed length = decode-tight bound
    (`max_decode_tokens // ratio + max_bs`) instead of prefill worst
    case (~13× larger), with sentinel-fill of trailing rows so the
    captured kernel grid is decode-sized but address-stable.
  - Empty-fwd path now also produces CompressPlans pointing at the
    pre-allocated buffers (sentinel-filled), so capture-time and
    replay-time addresses match even on a zero-token fwd.
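
As a rough illustration of the decode-tight bound and the sentinel fill, the sketch below uses the formula quoted above. The function names, the example sizes, and the sentinel value `-1` are assumptions made for illustration and are not taken from `compress_plan.py`.

```python
def decode_capacity(max_decode_tokens: int, ratio: int, max_bs: int) -> int:
    """Decode-tight row bound: max_decode_tokens // ratio + max_bs (from the commit)."""
    return max_decode_tokens // ratio + max_bs


def pad_plan(rows, capacity: int, sentinel: int = -1):
    """Pad a plan to a fixed length so the launch grid keeps one size.

    The sentinel value is hypothetical; the commit only says that trailing
    rows are sentinel-filled.
    """
    if len(rows) > capacity:
        raise ValueError("plan exceeds decode capacity")
    return list(rows) + [sentinel] * (capacity - len(rows))


# Hypothetical sizes: 2048 decode tokens, batch size up to 512.
caps = {ratio: decode_capacity(2048, ratio, 512) for ratio in (4, 128)}
```

With these made-up sizes the capacities are 1024 rows for ratio 4 and 528 rows for ratio 128, and a plan is always padded to that fixed length regardless of how many rows are live.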

V4 attention builder (atom/model_ops/attentions/deepseek_v4_attn.py)
  - `_decode_compress_cap[ratio]` plumbing for CG decode path.
  - Indexer-inner Compressor `cache_scale` view bound from
    `runner.v4_csa_idx_kv` per-layer.
  - Removed `_build_indexer_compress_slot_mapping` and
    `compress_slot_mapping_gpu` (Indexer-inner now uses
    block_tables directly).
  - Dropped `v4_indexer_decode_logits` and
    `v4_indexer_decode_topk_indices` from the metadata pool — these
    are write-once GPU scratch with no CPU mirror; allocated per-fwd
    via `torch.empty` in `Indexer._score_topk_decode`. Under CG capture
    the allocations land in the graph's private pool and replay
    reuses the same address (saves ~2 MiB pinned host + ~2 MiB GPU
    on the prior `CpuGpuBuffer` overhead).

mark_trace typing (atom/utils/decorators.py)
  - `@overload` + ParamSpec: pyright/pylance no longer flags
    DualRMSNorm-style decorated callables as "not callable".

Triton MoE block_m default (atom/model_ops/fused_moe_triton.py)
  - `ATOM_TRITON_MOE_BLOCK_M` default 64 → 32 (better MI355X tile
    occupancy at typical MoE shapes).

GSM8K nshot=5 (DeepSeek-V4-Pro, --level 0, ATOM_USE_TRITON_MOE=1):
  flexible-extract 0.9522 ±0.0059 / strict-match 0.9530 ±0.0058
  (baseline 0.953/0.954 — within 1σ, no regression)

* test remove torch compile

* remove level0
* CI: add runner model-cache diagnostics in atom tests.

* CI: gate ATOM tests on Pre Checkin workflow status

Replace the signal artifact handoff with direct workflow run status checks so downstream CI avoids artifact lookup latency and pagination issues.
* [fix](gpt-oss): change the accuracy test for gpt-oss

* [fix](ci): change gpt-oss accuracy test in ci

* [feat](ci): add client command for gpt-oss

---------

Co-authored-by: perzhang <perzhang@amd.com>
* add triton fallback for ds & gptoss

* fix format

* add triton mha layout

* remove useless

* update limit

* refactor block table
…cipe (ROCm#670)

* correct the quant type based on recipe for fuse_qknorm_quant

* remove debug print

* use quant type

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* use unified aiter fused_qk_norm interface

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* change the import path for fused_qknorm

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* fix qknorm quant compile path

* document qknorm quant compile guard

* simplify qknorm quant dispatch

* use single rmsnorm quant entrypoint

* update api

* clean the qk rmsnorm code

* fix corner case

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
* [DSV4]moe flydsl

* update

* update

* refactor moe shuffle

* update

* update

* modify atom env

* update env

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
)

Replace the chunk + double torch.clamp + F.silu * up sequence in
Expert.forward with a single aiter.silu_and_mul(out, combined, limit)
call. The new limit parameter folds the swiglu_limit clamp (gate <=
limit, up in [-limit, limit]) into the kernel via the v_med3_f32
intrinsic, removing several launch-bound ops on the per-token critical
path.
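
For reference, the unfused sequence that the kernel replaces can be written as a plain-Python sketch. It assumes that `combined` holds the gate values in its first half and the up values in its second half (the "chunk" layout); interleaved layouts are not covered. `silu_and_mul_reference` is a hypothetical name and is not the aiter API.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul_reference(combined, limit=None):
    """Chunk into (gate, up), optionally clamp, then return silu(gate) * up.

    Clamp semantics follow the commit message: gate <= limit and
    up in [-limit, limit]. The half/half layout is an assumption.
    """
    half = len(combined) // 2
    gate, up = combined[:half], combined[half:]
    out = []
    for g, u in zip(gate, up):
        if limit is not None:
            g = min(g, limit)                # one-sided clamp on gate
            u = max(-limit, min(u, limit))   # two-sided clamp on up
        out.append(silu(g) * u)
    return out
```

A fused kernel can be compared against such a reference element by element within a floating-point tolerance.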

Requires aiter PR ROCm/aiter#3104 (already merged), which adds the
limit parameter and HAS_LIMIT compile-time specialization to
silu_and_mul.

Verified on DeepSeek-V4-Pro tp=8 --level 0:
  GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1):
    run 1: 0.9515 / 0.9522 (flexible / strict)
    run 2: 0.9522 / 0.9530
  Matches V4-Pro baseline (0.9522 / 0.9530), within 1 sigma stderr.
Co-authored-by: Guanbao Yu <gyu@amd.com>
* temp reinstall triton==3.6.0

* reinstall triton for vLLM-ATOM

* update comments for better tracking

---------

Co-authored-by: Guanbao Yu <gyu@amd.com>
* add _aiter_fused_routing_from_topk switch

* ruff fix

* change import

* black format
* remove chunk split

Signed-off-by: ganyi <ygan@amd.com>

* fix accuracy

Signed-off-by: ganyi <ygan@amd.com>

* add None input for not quantize case

Signed-off-by: ganyi <ygan@amd.com>

* fix ci

Signed-off-by: ganyi <ygan@amd.com>

* maybe fix the ci benchmark crash

Signed-off-by: ganyi <ygan@amd.com>

* add env var for bare atom acc test

Signed-off-by: ganyi <ygan@amd.com>

* add fp8 kv cache for qwen3next bare atom ci

Signed-off-by: ganyi <ygan@amd.com>

* make fusion as default

Signed-off-by: ganyi <ygan@amd.com>

* disable non-persistent pa

Signed-off-by: ganyi <ygan@amd.com>

* add num tokens

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
ROCm#733)

Three related changes that together cut input-quant overhead on V4-Pro
attention:

1. atom/model_ops/layernorm.py — single dispatch covering all dynamic
   quant types
   - Replace the mxfp4-only `mxfp4_rms_quant_fuse` (triton wrapper) with
     a unified `_aiter_rms_quant` helper that dispatches on quant_type
     to aiter.{rmsnorm_quant, add_rmsnorm_quant} (HIP).
   - Now supports per_1x32 (MXFP4 fp4x2 + UE8M0 scale + preshuffle),
     per_1x128 (FP8 block + transpose-aware scale), and per_Token
     (FP8 dynamic per-token).
   - `_aiter_transpose_scale` resolved once at __init__ instead of
     per-forward env lookup.

2. atom/models/deepseek_v4.py — fuse q_norm with wq_b input quant
   - q_norm now constructed with fused_quant=True, quant_config=qc;
     emits (qr_fp8, qr_scale) in one launch.
   - Outer wq_b (ColumnParallel) and Indexer.wq_b (Replicated) both
     consume the pre-quantized pair via `x_scale=` — saves two
     redundant per_1x128 input quants per layer.
   - Drop `otype=torch.float32` on wkv_gate (Compressor): kernel's
     internal fp32 accumulator already handles the upcast; bf16
     intermediate halves the (kv, score) buffer bandwidth.

3. .github/workflows/atom-benchmark.yaml — fix env_vars expression
   injection
   - Multi-line `env_vars` JSON values (e.g.
     "AITER_BF16_FP8_MOE_BOUND=0\nATOM_MOE_GU_ITLV=1") were inlined
     directly into a `for ev in ${{ ... }}` loop, breaking host bash
     with `syntax error near unexpected token`.
   - Switch both Start CI container blocks to the env-block + docker
     --env-file pattern already used by atom-test.yaml.
   - Drop the redundant inline `${{ matrix.cell.env_vars }}` prefix on
     the regression launch step (vars are now in the container env via
     --env-file) — also fixes a silent bug that dropped the first var.

Verified on DeepSeek-V4-Pro tp=8 --level 0:
  GSM8K nshot=5 (AITER_BF16_FP8_MOE_BOUND=0 + ATOM_MOE_GU_ITLV=1):
    baseline                : 0.9522 / 0.9530
    + rmsnorm_quant refactor: 0.9507 / 0.9515
    + q_norm fuse           : 0.9538 / 0.9538
    + drop wkv_gate fp32    : 0.9583 / 0.9591  ← this PR
  All within 1 sigma stderr (~0.0058) of baseline; no regression.

  1024/1024 c=64 microbench (q_norm fuse vs prior commit):
    Total token throughput : 2735.80 → 2813.00 tok/s  (+2.8%)
    Median TPOT            : 41.28 → 38.83 ms          (-5.9%)
…fault cudagraph capture to 512 (ROCm#737)

ParallelHead refactor:
- Delegate vocab-axis sharding, weight_loader, last-token slicing,
  bf16 a16w16 GEMM, and TP all-gather to ParallelLMHead.
- Remove ~30 lines of duplicated logic.
- Drop fp32 weight (CDNA3/CDNA4 bf16 MFMA accumulates in fp32 natively;
  no precision change). Saves ~1.85 GB cluster VRAM.
- Hoist Expert.forward's local silu_and_mul import to top-level.

CUDAGraph capture default:
- Add 512 to default --cudagraph-capture-sizes list. Required to capture
  the full graph at max_num_seqs=512; previously the cap was 256 and
  size-512 graphs were filtered out by model_runner.

Validated: GSM8K 1319q 0.9591 +/- 0.0058 (statistically equivalent to
baseline ~0.952). c=128 1k/1k throughput unchanged at ~4855 tok/s.
c=512 1k/1k throughput +50.7% (7427 -> 11195 tok/s) when paired with
--max-num-seqs 512.
* [Kimi] support Eagle3 speculative decoding for Kimi K2.5

Adds Eagle3 spec decode for Kimi K2.5 (MLA target + standard MHA draft):

- Eagle3LlamaModel: 1-layer Llama draft (dual-norm input, wide QKV,
  independent embed/lm_head) matching the lightseekorg/kimi-k2.5-eagle3
  checkpoint
- Eagle3DraftBuilder: implements the post-ROCm#659 builder protocol
  (compute_block_bytes / allocate_kv_cache_tensors / build_kv_cache_tensor)
  for the draft's independent non-MLA KV cache, attached to the runner from
  EagleProposer.__init__ via runner.eagle3_draft_builder. ModelRunner
  delegates KV pool sizing, allocation, and per-module binding through this
  hook with no eagle3-specific code in the runner KV path
- Aux hidden state pipeline: target forward returns
  (hidden, aux_hidden_states), captured through CUDAGraph via
  graph_aux_hidden and fed to the draft's combine_hidden_states (fc) as
  input
- SpeculativeConfig: --method eagle3 + --draft-model CLI; eagle3 vs MTP
  branching at construction time; fail-fast if draft is MLA
- Scheduler: spec_stats only updated when speculation actually ran
  (matches vLLM's gating)
- propose: draft-perspective predicate `draft_uses_mha = hasattr(runner,
  "eagle3_draft_builder")` drives both the metadata-flow special-cases
  (slot_mapping re-slice, context_lens += 1, tuple-unpack of the draft
  return value); is_eagle3 string comparison is gone from the hot path

Result on Kimi-K2.5-MXFP4 + kimi-k2.5-eagle3, 8x MI355X, gsm8k 5-shot:
acceptance 67.85%, accuracy 93.78%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add nightly Eagle3 spec-decode accuracy test for Kimi-K2.5

Reuses the base Kimi-K2.5-MXFP4 model + lightseekorg/kimi-k2.5-eagle3
draft, runs at TP=8 (Eagle3 draft KV needs full 8-rank sharding) under
nightly schedule. Local case_verify_v9_gluon measured GSM8K 5-shot
flexible-extract = 0.9257 (vLLM = 0.9280); threshold set to 0.91 with
~1.5pp noise headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* support three stream in Deepseek V4

* modify hca dual stream if no indexer

* clean code logic
* Update the vLLM-ATOM benchmark scope

* Update benchmark model tags and weekday scheduling.

Normalize model tagging to MET/OOB/AW prefixes and labels, and replace nightly A/B/C date rotation with weekday-based grouping (Mon/Wed MET, Tue/Thu AW, Fri ALL, weekends skipped) for clearer benchmark cadence control.
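
A minimal sketch of the weekday grouping described above, assuming the group is derived from the calendar date. The function name and the string labels are illustrative; the workflow itself may compute this in shell or YAML.

```python
import datetime
from typing import Optional


def nightly_group(day: datetime.date) -> Optional[str]:
    """Map a date to a benchmark group: Mon/Wed MET, Tue/Thu AW, Fri ALL."""
    wd = day.weekday()  # Monday is 0, Sunday is 6
    if wd in (0, 2):
        return "MET"
    if wd in (1, 3):
        return "AW"
    if wd == 4:
        return "ALL"
    return None  # weekends are skipped
```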

* Align AW Minimax serve args and warmup behavior.

Add --kv-cache-dtype fp8 for MiniMax-M2.5 AW TP2/4/8 so startup flags match expected cache settings, and always pass --num-warmups=$((2 * CONC)) for vllm bench serve runs to keep warmup load consistent.

* Normalize AW metadata and gpt-oss memory env settings.

Add nightly_group=B for all AW model entries, align all gpt-oss variants to OOT_GPU_MEMORY_UTILIZATION=0.5, and restrict scheduled benchmark cron to weekdays to avoid weekend empty runs.
…c quant recipe (ROCm#747)

* [fix][acc] fix accuracy of fp8 attn weights model using ptpc quant recipe

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
* [atom-vllm benchmark] refine atom-vllm benchmark UI

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* add

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* Restore main's (MET)/(OOB)/(AW) naming, weekday nightly rotation, and AW benchmark logic

- Restore (MET)/(OOB)/(AW) suffix in all model display names and prefixes
- Remove feature-branch-only TP variants, keep exactly main's 27 models
- Add all 15 AW model variants from main with bench_args "--random-range-ratio 1"
- Restore weekday-based nightly rotation: Mon/Wed A-MET, Tue/Thu B-AW, Fri C-ALL
- Restore is_aw_model logic in build-benchmark-matrix (AW ISL/OSL pairs + extra concurrency)
- Restore per-model BENCH_CLIENT override: AW models auto-use vLLM bench client

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* Fix dashboard upload defaults and AW model data accuracy

- Enable both dashboard uploads (gh-pages + LLM-Booster) for nightly schedule runs
- Set publish_to_dashboard default to true for manual runs (matching main)
- Fix AW model benchmark_client in LLM-Booster data: read per-result client
  from enriched JSON instead of using global env (AW models override to vLLM bench)
- Fix AW model random_range_ratio: set to "1" in matrix params so filename,
  benchmark command, and dashboard data are all consistent
- Add dashboard_model for DeepSeek-R1 FP8/MXFP4 and gpt-oss-120b MET to avoid
  (MET) suffix leaking into LLM-Booster display names

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* Add upload step for runner fleet report

Added a step to upload the runner fleet report as an artifact.

* Add error handling for missing files in artifact upload
* [MoE] align generic MXFP4 shuffle layout

Use the shared separated-layout weight shuffle path for non-interleaved MXFP4 MoE weights and keep the interleaved scale shuffle explicit, matching AITER GateMode layout expectations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [MoE] mark shuffled MXFP4 weights

* update code

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
* update kimi k2/k2.5 recipe

* update command

---------

Co-authored-by: Guanbao Yu <gyu@amd.com>
* [fix](gpt-oss): fix gpt-oss model accuracy bug

* [fix](server): fix gpu-memory-utilization args

---------

Co-authored-by: perzhang <perzhang@amd.com>
jiayyu and others added 28 commits May 14, 2026 22:16
* fix all tokens hit issue

* refine
* preshuffle indexer cache

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* use persistent mode mla

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* Format sparse MLA plugin files

* Remove --block-size 1 from DeepSeek-V3.2 benchmark and accuracy configs

The sparse MLA plugin now defaults to block_size=64 for preshuffled
indexer cache. The hardcoded --block-size 1 in CI configs would override
this default and prevent the performance gains from taking effect.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
…OCm#777)

* add qwen3.5 cases for sglang-atom

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

* remove redundant env

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

---------

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
* fix(qwen3_5): packed_modules_mapping for in_proj_qkvzba

BF16 weight has a fused in_proj_qkvzba; FP8/MXFP4 ship pre-split
in_proj_qkvz and in_proj_ba. Add a helper that rewrites the
packed_modules_mapping when quant_dtype is bfloat16 so the loader
slices the fused tensor along the qkv/z/b/a axes correctly.

Apply at the two atom-native conditional-generation entrypoints; the
vllm plugin path keeps its existing inline fixup.

* ci(accuracy): add Qwen3.5-397B-A17B BF16 nightly entry

Adds the BF16 weight variant alongside the existing FP8 / MXFP4
entries. Use --kv_cache_dtype fp8 to avoid the known MI355 BF16 KV
cache accuracy issue. Threshold/baseline are placeholders copied
from the FP8 entry; refresh after the first CI measurement.

---------

Co-authored-by: JiaoliangYu <jiaolyu@hjbog-srdc-39.amd.com>
* mtp 1 acc right

Signed-off-by: ganyi <ygan@amd.com>

* add recipe for qwen3-next-mtp

Signed-off-by: ganyi <ygan@amd.com>

* modify some qwen3.5 recipe

Signed-off-by: ganyi <ygan@amd.com>

* black

Signed-off-by: ganyi <ygan@amd.com>

* remove redundant code

Signed-off-by: ganyi <ygan@amd.com>

* remove redundant code

Signed-off-by: ganyi <ygan@amd.com>

* add spec decode convert for vllm plugin

Signed-off-by: ganyi <ygan@amd.com>

* remove vllm related branch

Signed-off-by: ganyi <ygan@amd.com>

* use atom spec decode config for plugin loading

Signed-off-by: ganyi <ygan@amd.com>

* remove unnecessary changes in modeling

Signed-off-by: ganyi <ygan@amd.com>

* format

Signed-off-by: ganyi <ygan@amd.com>

* add qwen3next mtp into benchmark

Signed-off-by: ganyi <ygan@amd.com>

* [ci] disable FP8 blockscale weight preshuffle for Qwen3.5/Qwen3-Next

Add ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to all Qwen3.5 and
Qwen3-Next model configs across benchmark, nightly accuracy, and
recipe files.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* [ci] fix Qwen3-Next MTP benchmark label from MET to AW

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* [docs] fix Qwen3.5 recipe: update env var count and add preshuffle doc

Remove stale "three" count (now variable list), add
ATOM_FP8_BLOCKSCALE_WEIGHT_PRESHUFFLE=0 to both the Important section
and Key Environment Variables section.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
…#785)

* ci: branch-aware docker release + fix benchmark model selection

docker-release: use github.ref_name for branch-aware builds so non-default
branches produce tagged images (nightly_YYYYMMDDHHMM-{branch}) without
overwriting the latest tag.

atom-benchmark: fix model selection bug where MiniMax and GLM-5.1 models
were always included regardless of checkbox state, caused by input name
mismatches (MiniMax-M2.5 vs M2.7 prefix, missing glm-5.1-fp4 input).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* support tag by self

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* adapt mtp for glm5 (vllm plugin)

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* add patch to support mtp>1

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* fix model load failure of draft model

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* adapt full graph with mtp enabled

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* fix MLA MTP acceptance issue

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* fall back to vllm-style mtp position

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* fix embedding sharing failure for mtp

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* fix lint

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* fix comment

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* remove warning log

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* add mtp support for glm4

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix rope double apply for mha

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* guard vllm forward context retrieval

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* add glm4.7 mtp to workflow

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* clean up

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* format: fix black formatting in glm4_moe_mtp.py

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae-amd <kuanfliu@amd.com>
Co-authored-by: whx-sjtu <xiaowang990929@gmail.com>
Co-authored-by: kuanfliu@amd.com <kuanfliu@amd.com@mia1-vm-amd-prj3-k8s-005.amd3.mia.tensorwave.lan>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
…ROCm#795)

* Add inferencex-pr skill for ATOM benchmark comparison and PR creation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Replace inferencex-pr skill with inferencex-sync

Renames the skill to better reflect its purpose: comparing ATOM upstream
benchmark results against InferenceX and reporting regression/improvement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Map vLLM cache dtype aliases before passing them into AITER metadata setup so the default auto value does not fail dtype parsing.

Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [perf](deepseek): add fused indexer path

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): clear fused indexer rope args

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): simplify fused indexer path

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): drop unrelated indexer weights change

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): gate fused indexer path

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): guard indexer wk fusion by dtype

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): remove unused sparse import

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): address fused indexer review feedback

Keep indexer weight fusion decisions consistent with quant fallback paths and make dummy/profile behavior deterministic.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): align indexer review fixes with black

Keep the PR compatible with the repository's Black formatting check after the review fixes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): simplify indexer fusion guard

Collapse the review helper stack while keeping a single model-level fusion decision for weight loading.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): move pending fp8 wk before dequant

Ensure threaded checkpoint loading dequantizes pending FP8 indexer wk weights on the same device as their scales.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): tighten indexer fusion fallback

Keep the env-disabled path aligned with the unfused baseline and fail clearly when fused wk weights are incomplete.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): decouple indexer wk fusion guard

Keep wk/weights projection fusion capability independent from the QK RoPE cache fusion runtime switch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): restrict fused indexer kernel configs

Avoid enabling the fused indexer kernels for GLM configs with incompatible RoPE cache shapes while keeping compatible sparse indexer configs eligible.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
* modify logic with running p1 daily and p1/p2 on weekends

* Monday-Thursday: P0; Friday: P1&P2; Weekend: C-ALL

* modify code

* modify the running time logic

* Add MTP cases and run the case according to P0/P1/P2

* Merge 'cron'

* Change the scheduling time

---------

Co-authored-by: root <root@hjbog-srdc-19.amd.com>
* Add benchmark checkout ref inputs

Allow manual benchmark runs to target a specific ATOM branch or commit while preserving the default scheduled behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Simplify benchmark ref input

Keep only the commit override input for manual benchmark runs and place it after the regular benchmark parameters.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: root <root@SHAHYI01.amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…X PR creation (ROCm#802)

* Add inferencex-pr skill for ATOM benchmark comparison and PR creation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Replace inferencex-pr skill with inferencex-sync

Renames the skill to better reflect its purpose: comparing ATOM upstream
benchmark results against InferenceX and reporting regression/improvement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* pr flow update

Signed-off-by: seungrokj <seungrok.jung@amd.com>

* (skill)[inferencex-sync]: Clarify all-jobs-passed condition for latest run

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Signed-off-by: seungrokj <seungrok.jung@amd.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* [fix](deepseek): handle padded sparse indexer inputs

Keep fused indexer cache writes limited to actual vLLM plugin tokens and pass RoPE caches in the 2D shape expected by the AITER indexer kernel.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [fix](deepseek): pass unsliced sparse indexer inputs

Rely on AITER's mapped-token bounds for padded graph inputs so the sparse indexer plugin can pass the original tensors through both fused and fallback cache update paths.

* [fix](deepseek): keep sparse indexer plugin unchanged

Leave the plugin sparse indexer path unchanged so this branch only carries the DeepSeek-V3.2 RoPE cache layout fix.

---------

Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: perzhang <perzhang@amd.com>
* [ATOM SGL] Qwen3.5 mha support

* precheckin

---------

Co-authored-by: wuhuikx <hattie.wu@amd.com>
…MTP-1 CI (ROCm#817)

* feat(deepseek_v4_mtp): wrapper-owned MTP with share_with_target hook

Redesign V4 MTP to match V2/V3/Qwen convention:
- Target (DeepseekV4ForCausalLM) carries no MTP modules / weights
- Wrapper (DeepseekV4MTP) self-contains MTP blocks + loads mtp.* via
  the standard load_model path with spec_decode=True
- EagleProposer.share_with_target() rebinds embed/head from the
  target's already-loaded instances; no double KV / double weight load

Loader: replace per-call `not spec_decode` mtp short-circuit with
`need_load_mtp = spec_decode and any("mtp" in n for n in params_dict)`
so target loads and Eagle3 drafts (no `mtp.*` params) skip cleanly.
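
The predicate quoted above can be exercised in isolation. In this sketch the expression is the one from the commit message, but the wrapper function and the parameter names in the example dictionaries are hypothetical.

```python
def need_load_mtp(spec_decode: bool, params_dict) -> bool:
    """True only for a spec-decode load whose model actually owns mtp.* params."""
    return spec_decode and any("mtp" in n for n in params_dict)


# Hypothetical parameter names, for illustration only.
target_params = {"model.embed.weight": None, "model.head.weight": None}
mtp_params = {"mtp.0.attn.wq_b.weight": None, "model.embed.weight": None}
```

Under these assumptions a target load (`spec_decode=False`) and a draft without `mtp.*` parameters both skip MTP loading, while the MTP wrapper loaded with `spec_decode=True` does not.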

EagleProposer: add `share_with_target` escape valve for models whose
embed/head naming diverges from the standard `embed_tokens`/`lm_head`
(V4 uses `model.embed`/`model.head`).

ModelRunner: thread `extra_output_dims` so V4's `[N, hc_mult, dim]`
residual-stack output (hc_head + LM head deferred to compute_logits)
flows correctly into the outputs buffer.

SpeculativeConfig: register `deepseek_v4` -> `deepseek_v4_mtp` mapping
and stamp `model_type=deepseek_v4_mtp` onto the draft hf_config; the
runner's is_deepseek_v4() now accepts both so V4-specific paths fire
on the draft too.

* fix(deepseek_v4): SWA+MTP cache aliasing, num_rejected rollback, EOS truncation

- SWA ring buffer: cache_size = window_size + max_spec_steps to prevent
  draft-vs-verified slot aliasing when k>=1 spec tokens share a row's window.
- Per-row window_topk indexing collapses case_a/b/c into a single formula:
  (pos - W + 1 + w) % cache_size with abs<0 / abs>pos masked to -1.
- prepare_decode rolls back context_lens by num_rejected (mirrors aiter_mla),
  fixing stale ctx after MTP rejections.
- scheduler.postprocess truncates bonus tokens emitted after EOS / stop_token /
  stop_sequence (rejection_sampler greedy kernel does not check EOS), removing
  trailing <BOS> in MTP outputs.
- Cleanup: drop dead _build_window_topk_batched, drop start_pos_per_token
  param, rename kernel param win -> cache_size for clarity.
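
The ring-buffer indexing described in the first two bullets can be sketched in plain Python. This is an illustrative CPU-side model under stated assumptions, not the Triton kernel: the function name `window_slots` and the toy sizes (`W = 4`, `max_spec_steps = 1`) are hypothetical, while the formula and masking rule come from the commit message.

```python
def window_slots(pos, window_size, cache_size):
    """Ring slots of the last `window_size` positions ending at `pos`.

    Entry w maps absolute position abs_pos = pos - W + 1 + w to slot
    abs_pos % cache_size; positions outside [0, pos] are masked to -1.
    """
    slots = []
    for w in range(window_size):
        abs_pos = pos - window_size + 1 + w
        if abs_pos < 0 or abs_pos > pos:
            slots.append(-1)
        else:
            slots.append(abs_pos % cache_size)
    return slots


W, max_spec_steps = 4, 1          # toy sizes, chosen for illustration
cache_size = W + max_spec_steps   # sizing rule from the commit message

early = window_slots(2, W, cache_size)     # window not yet full
verified = window_slots(6, W, cache_size)  # window covers positions 3..6

# A draft token written at position 7 lands in slot 7 % cache_size.
draft_slot_padded = 7 % cache_size  # ring sized W + max_spec_steps
draft_slot_tight = 7 % W            # ring sized W only

print(early, verified, draft_slot_padded, draft_slot_tight)
```

With the padded ring the draft slot is not one of the slots the verified window reads, whereas with a ring of exactly `W` slots the draft write would overwrite the slot that still holds position 3.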

Validated: GSM8K 0.9545 (flex/strict), MTP-1 acceptance 86.27%.

* wip

* feat(deepseek_v4): MTP-K>1 support via prepare_mtp_decode + backend-agnostic eagle mid-step

- eagle.propose mid-step: gate flat kv_indices/kv_indptr reads on `has_flat_kv`
  (hoisted out of the loop) so V4 path skips them; MLA/MHA unchanged
- DeepseekV4AttentionMetadataBuilder.prepare_mtp_decode: per-draft-step V4
  region metadata rebuild for 1-token-per-seq shape, zero D2H (mirrors
  eagle's GPU context_lens += 1 on CPU side instead of reading back).
  Entry short-circuits when max_per_req_cache_slots is unset (warmup case;
  attr is set by ModelRunner.get_num_blocks which runs after warmup_model).
- CommonAttentionBuilder.__init__: centralize num_attention_heads (was
  duplicated in V4, MLA, MHA builders).
- start_simple_inference.sh: drop env-var translation for engine flags,
  all engine config now passes through EXTRA_ARGS as native CLI args.
- stop_atom_server.sh: also SIGTERM lm_eval so the next launch starts clean.

Validated: V4 simple_inference (MTP1 + MTP3), V4 GSM8K full-1319
flexible-extract MTP3=0.9568 / MTP1=0.9545 (within stderr, no regression),
acceptance distribution {0: 6.5%, 1: 24.1%, 2: 39.7%, 3: 29.7%} for MTP3,
avg toks/fwd 2.92 vs 1.87 for MTP1 (+56%).
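
As a rough cross-check of the figures above, the average tokens per forward can be recomputed from the acceptance distribution. This sketch assumes each verify step emits the accepted draft tokens plus one bonus token; that assumption is not stated in the commit message, and the input percentages are already rounded, so the result is only expected to land close to the reported 2.92.

```python
# Reported MTP3 acceptance distribution: accepted draft tokens -> probability.
acceptance_mtp3 = {0: 0.065, 1: 0.241, 2: 0.397, 3: 0.297}

# Expected number of accepted draft tokens per verify step.
expected_accepted = sum(k * p for k, p in acceptance_mtp3.items())

# Assumption: every forward also emits one bonus token.
tokens_per_fwd = expected_accepted + 1.0

# Relative gain computed from the two reported averages (2.92 vs 1.87).
speedup_pct = (2.92 / 1.87 - 1.0) * 100.0

print(f"{tokens_per_fwd:.3f} tokens/fwd, +{speedup_pct:.0f}% over MTP1")
```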

* fix(deepseek_v4): wire write_v4_paged_decode_indices kernel + tooling

- _attach_v4_paged_decode_meta: replace 6 transient tensors +
  2 x index_copy_ chain with the existing write_v4_paged_decode_indices
  Triton kernel that reads only persistent forward_vars buffers.
  Eliminates allocator-churn race surfaced as ASSERT_TRAP in
  index_copy_kernel_impl<OpaqueType<4>> under MTP-3 long prefill.
- _stage: harden silent astype auto-cast into an assertion so dtype
  drift fails fast at the call site instead of corrupting downstream
  triton reads.
- scripts: move run_benchmark{,_sweep}.sh to atom/scripts/ and add
  wait_infer_drain.sh (hang vs drain monitor used during the V4 MTP
  investigation).
- skills: convert flat .claude/skills/*.md into the SKILL.md/<name>
  subdirectory layout that the Skill tool actually loads; add new
  capture-trace skill (English, correct script paths, structured
  to match peer skills).

* refactor(deepseek_v4): rename seq_lens→context_lens, require plan_buffers, drop dead device arg

- `_build_compress_plans` + `make_compress_plans`: rename `seq_lens_np` /
  `seq_lens_cpu` → `context_lens_np` / `context_lens_cpu`. Every caller
  already passes `context_lens`; the old name suggested "raw seq length"
  while the actual semantics is "absolute seq_len AFTER the new extend
  tokens" (= prefix + extend). Internal `prefix = context_lens - extend_lens`
  reconstruction unchanged.
- `make_compress_plans`: `plan_buffers` is now required (was `Optional[dict]`).
  The legacy fresh-`torch.from_numpy(...).to(device)` fallback is removed,
  along with the now-unused `device` positional arg. That fallback reproduces
  the exact allocator-churn race we fixed in
  `write_v4_paged_decode_indices` (transient tensors handed to in-flight
  kernels). Forcing every caller through the pre-allocated CpuGpuBuffer
  pool keeps data pointers stable and prevents the race from being
  re-introduced.
- `_build_compress_plans`: keep the `assert isinstance(..., np.ndarray)`
  guards added in the same series so callers can't silently pass torch
  tensors and trigger a hidden D2H sync.

Verified on V4-Pro no-MTP GSM8K (TP=8, level=0, CG on):
  HEAD c454322f: 0.95 flexible-extract
  this commit:  0.9560 flexible-extract  (within 1σ; 1319/1319)

* docs(skill): debug-agent-locate-kernel — `--enforce-eager`/`--level 0` are optional fallbacks

Reword pre-flight item 2 + the matching anti-pattern + the example launch
command. Previous text claimed both flags were required ("Always pass both").
That's wrong: the debug agent runs fine under hipgraph in most cases.
The flags only become useful when the symptom points at graph mode or
Inductor:
  - `--enforce-eager` for faults that don't reproduce in eager / capture-
    replay crashes under the agent's no-caching-allocator behavior;
  - `--level 0` for AMD Inductor `cluster_dims` autotune crashes at warmup.

Adding them blindly hides graph-mode-only bugs (current branch: V4-Pro
no-MTP server-mode prefill hangs ONLY under `--enforce-eager`; CG path
works fine — we'd never have spotted that with the old "always pass both"
rule).

CLAUDE.md `run_debug_agent.sh` row updated in lockstep.

* fix(deepseek_v4): gate alt_stream async-compress on CUDAGraph capture

Eager mode triggered a hipStream deadlock on V4-Pro: 60 layers of
main Compressor launches accumulate on alt_stream, and the first
splitk GEMM workspace allocation (the third bf16gemm_*_splitk_clean
kernel load) hits a caching-allocator race with the MoE shared_experts
GEMM also targeting alt_stream. Verified hang on both small (~800
token) and large (>2k token) prefill batches in eager.

Inside a CUDAGraph capture block this same pattern is safe: the graph
records the fork-join edges and replay re-uses the same stream layout
without per-launch allocator contention. So the fix is to gate
`use_async_compress` on a new `ForwardContext.in_hipgraph` flag set
only by `model_runner.capture_model`. Replay does not re-execute
Python forward, so the flag is irrelevant there.
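
The gating described above can be sketched as follows. This is a simplified stand-in, not the real `ForwardContext` or `capture_model`: the `graph_capture` helper and the `async_requested` parameter are hypothetical, and only the idea of an `in_hipgraph` flag that is set during capture comes from the commit message.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class ForwardContext:
    """Hypothetical minimal stand-in for the real forward context."""
    in_hipgraph: bool = False


@contextmanager
def graph_capture(ctx):
    """Mark the context as capturing for the duration of the block."""
    ctx.in_hipgraph = True
    try:
        yield ctx
    finally:
        ctx.in_hipgraph = False  # eager forwards after capture see False


def use_async_compress(ctx, async_requested=True):
    """Take the alt-stream path only while a graph capture is in progress."""
    return async_requested and ctx.in_hipgraph


ctx = ForwardContext()
eager_choice = use_async_compress(ctx)
with graph_capture(ctx):
    capture_choice = use_async_compress(ctx)
after_capture_choice = use_async_compress(ctx)

print(eager_choice, capture_choice, after_capture_choice)
```

Resetting the flag in a `finally` block keeps a failed capture from leaving later eager forwards on the async path.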

While here:
  - Cache `torch.cuda.current_stream()` once per fwd in
    `ForwardContext.main_stream` and have V4 attention / MoE / launcher
    read it instead of querying repeatedly (eliminates per-call
    handle allocation; the metadata builder still queries directly
    because it runs before `set_forward_context`).
  - Module-level cache `_CUDA_AVAILABLE` so the cuda-availability
    check doesn't fire per `set_forward_context`.
  - Fix `wait_infer_drain.sh` drain detection: once the eval client
    is gone and a single poll shows no new output, declare drain
    immediately instead of waiting STUCK_POLLS polls (the old logic
    added ~120 s of false-positive wait to every run).

Verified end-to-end:
  - V4-Pro eager + full GSM8K (1319 reqs): 420s, 0.9553/0.9560 (was
    deadlocking; now matches CG baseline)
  - V4-Pro CG mode + full GSM8K: 240s, 0.9492/0.9500
  - V4-Pro CG + MTP1 + full GSM8K: 180s, 0.9575/0.9583
  - V4-Pro CG + MTP3 still crashes on large prefill batch (pre-existing
    MEMORY_VIOLATION unrelated to these changes)

* fix(deepseek_v4): race-free swa_write_indices + unify is_pure_decode source

Under concurrent serving the swa_write kernel hit MEMORY_VIOLATION on
`tl.load(kv_ptr + src_id * head_dim + d_offsets)` because the shared
pinned `v4_meta_swa_write_indices` buffer was rewritten by the next
fwd's CPU side BEFORE the previous fwd's async H2D DMA actually fired,
producing torn `src_id` values that exceed `kv.shape[0]`. Captured
under rocm-debug-agent: all faulting waves at PC +1144
(`s_waitcnt vmcnt(0)`) in `_swa_write_kernel`, confirming the prior
async kv load as the OOB source. The fix writes into the pre-allocated
shared GPU buffer with a static GPU iota source for the pure-decode
arange path and a fresh local numpy for prefill — never touches the
pinned `.np` alias whose next-fwd rewrite would tear the in-flight DMA.

Co-changes:
- Hoist `is_pure_decode` to AttentionMetaData_DSV4 construction time
  (single source of truth: prepare_decode / prepare_prefill /
  build_for_cudagraph_capture each declare their own semantics).
  `_attach_v4_per_fwd_meta` and `_attach_v4_paged_decode_meta` now
  read instead of recomputing.
- Replace the old `(token_num_per_seq == 1).all()` arange-shortcut
  proxy with `is_pure_decode` (MTP verifier batches no longer take
  the slow concat path).
- Tighten swa-write grid: prefill (eager) uses `num_write` exactly,
  decode/MTP CG uses the existing `padded_bs * (1+max_spec_steps)`
  bucket. Long-prefill chunks no longer launch up to ~64x sentinel-
  bail programs.
- Tighten `v4_meta_swa_write_indices` from `mnbt` to `max_bs * win`
  (~16x smaller; universal worst case across paths).
- Pre-allocate static GPU iota `self._swa_iota` as the H2D-free source
  for the is_pure_decode arange.
- scripts/wait_infer_drain.sh: add offline simple_inference mode
  (process-exit + no-fault = drain) and detect "Memory access fault
  by GPU" alongside MEMORY_VIOLATION / ASSERT_TRAP.

Verified: eager mode gsm8k 5-shot conc=65, 0 MEMORY_VIOLATION,
exact_match flexible 0.9553 / strict 0.9545 (baseline ~0.95+).

* ci(deepseek_v4): add MTP-1 per-PR accuracy + benchmark entries

* fix(deepseek_v4): ruff lint + restore prep_stream sync + cache_size in fused swa_write

- Drop unused `device` / `lru_cache` / first `prep_stream` assignment
- Restore `prep_stream.wait_stream` / `with torch.cuda.stream(prep_stream)`
  around prepare_prefill H2D staging (suspected root for CG+MTP-3 tail
  deadlock at recv_mtp_status_async)
- fused_qk_norm_rope_swa_write: pass `cache_size` instead of undefined
  `win` so the ring stride matches `swa_kv.shape[1]` under MTP

* feat(benchmark): custom message encoder fallback + wait_infer_drain client-log support

- benchmark_serving.py: route `--use-chat-template` through
  atom.entrypoints.openai.chat_encoders so models without a Jinja
  chat_template in tokenizer_config.json (e.g. DeepSeek-V4-Pro) render
  via the model-shipped custom encoder
- wait_infer_drain.sh: treat LOG_FILE mtime growth as a progress signal
  alongside the engine "output send" marker, so passing a client log
  (e.g. benchmark.log with tqdm output) no longer triggers a false-
  positive HANG

* refactor(deepseek_v4): inline window_topk + ragged-packed decode kv_indices

Two coupled simplifications to the V4 paged-decode index build (decode/MTP
path); GSM8K 3-shot 0.949 and 1k/1k c=64 MTP1 4247 tok/s match pre-refactor.

1) Drop the [mnbt, win] `v4_meta_window_topk` CG buffer (~4 MB) and the
   CPU `_build_window_topk_np` function. `write_v4_paged_decode_indices`
   now derives `n = min(positions[t]+1, win)` and `ring = (pos - n + 1
   + i) % cs` inline from `var["positions"].gpu` — no intermediate.

2) Decode `kv_indices_swa/csa/hca` now use ragged-packed per-token slot
   counts (`actual_swa_count[t] + n_compress[t]`), matching the prefill
   layout instead of the prior uniform `win + n_compress` with -1
   sentinel padding. `skip_prefix_len_csa[t]` is now `actual_swa_count[t]`
   in both paths (was hard-coded `win` in decode).
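
The inline derivation and the ragged-packed layout can be sketched together in plain Python. This is a CPU-side illustration under stated assumptions, not the `write_v4_paged_decode_indices` kernel: the function name and toy inputs are hypothetical, and the compressed-slot part (`n_compress[t]`) is omitted so only the SWA portion is shown.

```python
def ragged_swa_indices(positions, win, cache_size):
    """Pack per-token SWA ring slots back to back, with no -1 padding.

    For each token t: n = min(positions[t] + 1, win) live slots, where
    slot i is (pos - n + 1 + i) % cache_size. `indptr[t]:indptr[t + 1]`
    delimits token t's slice of the flat index list.
    """
    indptr = [0]
    flat = []
    for pos in positions:
        n = min(pos + 1, win)
        flat.extend((pos - n + 1 + i) % cache_size for i in range(n))
        indptr.append(indptr[-1] + n)
    return indptr, flat


# Toy batch: three tokens at positions 0, 2 and 9, window 4, ring of 5 slots.
indptr, flat = ragged_swa_indices([0, 2, 9], win=4, cache_size=5)
swa_counts = [indptr[t + 1] - indptr[t] for t in range(3)]

print(indptr, flat, swa_counts)
```

The per-token counts are what the commit message calls `actual_swa_count[t]`; a uniform layout would instead reserve `win` entries per token and fill the unused ones with -1.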

* refactor(deepseek_v4): metadata int32 sweep + drain log auto-discovery + run-atom-workload skill

deepseek_v4_attn.py:
- Drop all unnecessary int64 in CPU metadata paths (cu_seqlens, indptr
  cumsums, segment indices, n_committed_{csa,hca}_per_seq). Keeps int64
  only where GPU fancy-index ABI requires it (batch_id_per_token,
  swa_write_indices). Saves widen/narrow churn; no perf change but
  cleaner mental model.
- Hoist n_committed_{csa,hca}_per_seq_cpu as dataclass single source of
  truth, replacing 3 independent ctx//k recomputes in paged_decode_meta /
  paged_prefill_meta / v4_indexer_meta.
- Drop dead `start_pos_per_seq_cpu` dataclass field — only one real
  consumer (_build_paged_prefill_meta); inline as local var, drop 3
  dead setters (prepare_decode, CG capture, prepare_mtp_decode).
- Drop dead `positions_np` + `start_pos_per_seq_cpu` parameters from
  _attach_v4_per_fwd_meta; update 5 call sites.
- Drop `cu_seqlens_q_np` + `start_pos_per_seq` params from
  _attach_v4_indexer_meta (read from attn_metadata.n_committed_csa_per_seq_cpu).
- Drop dead helper `_clear_v4_paged_decode_meta` (defaults already None).
- Drop dead CpuGpuBuffers: v4_meta_start_pos_per_seq + v4_meta_token_num_per_seq
  (allocated but never staged).
- Drop redundant `if True:` wrapper + 3 redundant `import numpy as np`
  inside methods (module-level import suffices).
- net -178 lines, ruff clean.

scripts/wait_infer_drain.sh (v1.2):
- Auto-discover server log via readlink /proc/<pid>/fd/1 of the
  atom.entrypoints process. Eliminates the recurring bug where callers
  passed the wrong LOG_FILE (e.g. lm_eval's silent gsm8k_eval.log)
  causing false HANG verdicts.
- Caller-supplied LOG_FILE is now optional (was: defaulted to hardcoded
  /app/logs_claude/atom_server.log). Used only as supplementary signal
  for fault grep (dual-log scan) and tqdm-style mtime detection.
- Path-portable across repo layouts; no hardcoded log paths.

.claude/skills/run-atom-workload/SKILL.md (new):
- Codifies the canonical 4-step ATOM workload flow
  (stop → start → workload-in-shell-bg → wait_infer_drain → stop)
  for accuracy eval (GSM8K), benchmark, sweep, offline simple_inference,
  and fault repro under rocm-debug-agent.
- Pins model-family env vars (V4-Pro: AITER_BF16_FP8_MOE_BOUND=0
  ATOM_MOE_GU_ITLV=1; Kimi: HSA_NO_SCRATCH_RECLAIM=1; etc.).
- Hard rules block past failure modes: no wrapper scripts in
  /app/logs_claude/, no && chaining, no double-backgrounding the server,
  always step 5 teardown.
- Uses project-relative scripts/ paths (repo-portable).

Verification:
- MTP-1 GSM8K 3-shot = 0.9492 +/- 0.006 (baseline 0.9545 +/- 0.0057,
  within 1 sigma, no regression).
- MTP-1 benchmark 1024/1024 c=64 = 4748 Total tok/s, 24.3 ms TPOT
  (vs prior baseline 4666 tok/s, +1.8% / no regression).
- Drain auto-discovery smoke-tested on live MTP-1 server: discovered
  /app/logs_claude/atom_server.log via /proc, detected DRAINED cleanly.
- MTP-3 + CG still hangs on high-concurrency GSM8K (known issue per
  feedback_v4_cg_mtp_status_deadlock.md); this commit does not address it.
* MTP(num_step=1) for DeepSeek

* Add work log for claude debug

* adopt new attn constructor args

* rm worklog

* use atom_parameter

* kwargs handle

* rebase main

* precheckin

* fix k_scale v_scale error

* new commit

* fix blank

* fix qwen3.5 acc

---------

Co-authored-by: ZhiweiYan-96 <ZhiweiYan@amd.com>
Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into
SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention
backend as the host/backend layers. Shrink deepseek_mla_forward.py into a
helper module and clarify absorbed vs non-absorbed path naming.