[Feat] Intranode Dispatch&Combine Kernel #522

Draft
yanboshao wants to merge 1 commit into main from yanbo/dispatch_combine

Conversation

@yanboshao
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@yanboshao yanboshao marked this pull request as draft May 14, 2026 07:19
@yanboshao yanboshao changed the title feat(dispatch_combine): intranode dispatch/combine kernel [Feat]: intranode dispatch/combine kernel May 14, 2026
@yanboshao yanboshao changed the title [Feat]: intranode dispatch/combine kernel [Feat] Intranode Dispatch&Combine Kernel May 14, 2026
…lpers

Introduce the FlyDSL-implemented intranode EP dispatch / combine kernel
plus the FlyDSL Python-layer extensions it depends on, rebased on
origin/main.

Kernel (kernels/):
* dispatch_combine_intranode_kernel.py
  4-phase dispatch + 3+1-stage combine (P2P scatter / CrossDeviceBarrier
  / WarpAccum / weight accum). Supports bf16/fp8 transport, optional
  fp8-direct-cast Stage-3 bf16<->fp8, StdMoE atomic path, weight scatter
  pinned to the combine-kernel fabric, and the skip_stage1 fast-path
  for upstream fused GEMM2 epilogues (weight scatter implicitly enabled
  iff enable_weights=True; no separate keep_wts flag).
* dispatch_combine_intranode_op.py
  Launcher / JIT cache, including the combine_no_stage1 variant.
* tests/kernels/test_profiler_dispatch_combine.py
  Verify + bench harness (--mode verify/profile/bench, --chip gfx942/950).
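For readers new to expert-parallel (EP) MoE, the dispatch / combine pair named above can be pictured with a host-side reference model. This is only a sketch of the generic data movement, not the FlyDSL kernel: the function names, the `experts_per_rank` parameter, and the unweighted sum in `combine` are assumptions made for illustration, and it sends one copy per (token, expert) pair rather than deduplicating per rank.

```python
from collections import defaultdict


def dispatch(tokens, topk_ids, experts_per_rank):
    """Route each token to the rank owning each of its top-k experts.

    Returns {rank: [(src_token_index, expert_id, token_vector), ...]}.
    Hypothetical helper; the real kernel writes into peer GPU buffers.
    """
    inbox = defaultdict(list)
    for tok_idx, (vec, experts) in enumerate(zip(tokens, topk_ids)):
        for expert_id in experts:
            rank = expert_id // experts_per_rank  # assumed contiguous expert layout
            inbox[rank].append((tok_idx, expert_id, vec))
    return inbox


def combine(inbox, num_tokens, hidden, expert_fn):
    """Apply expert_fn on each rank, then sum results back per source token."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for rank in sorted(inbox):
        for tok_idx, expert_id, vec in inbox[rank]:
            y = expert_fn(expert_id, vec)
            for j in range(hidden):
                out[tok_idx][j] += y[j]  # assumption: plain (unweighted) accumulation
    return out


tokens = [[1.0, 2.0], [3.0, 4.0]]
topk_ids = [[0, 3], [1, 2]]  # k=2 experts per token, 4 experts over 2 ranks
inbox = dispatch(tokens, topk_ids, experts_per_rank=2)
result = combine(inbox, num_tokens=2, hidden=2, expert_fn=lambda e, v: v)
```

With an identity `expert_fn`, each token comes back as the sum of its k copies, which is a convenient invariant for a verify-style check.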

FlyDSL helper extensions (python/flydsl/):
* expr/arith.py: divui, remui, zext_i64, select_by_index.
* expr/vector.py: bitcast_i32_to_v2bf16, bitcast_v2bf16_to_i32.
* expr/rocdl/__init__.py: ballot_i64, readlane.
* compiler/ast_rewriter.py: stop treating method / attribute calls
  (e.g. ptr.store(...)) as variable assignments when collecting the
  scf.for / scf.if carried-set.
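The ast_rewriter change can be illustrated with Python's standard `ast` module. The snippet below is a minimal sketch of the idea, not the actual FlyDSL rewriter: `assigned_names` is a hypothetical helper that gathers only plain names bound by assignment statements, so a bare method call such as `ptr.store(...)` or an attribute write such as `obj.field = i` contributes nothing to the carried set.

```python
import ast


def assigned_names(source):
    """Collect plain variable names bound by assignment statements.

    Hypothetical stand-in for a loop/branch carried-set collector.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AugAssign, ast.AnnAssign)):
            targets = [node.target]
        else:
            # Expression statements such as ptr.store(...) are ast.Expr
            # nodes wrapping an ast.Call, so they are skipped here.
            continue
        for target in targets:
            for sub in ast.walk(target):
                # obj.field = ... has an Attribute target whose inner Name
                # is in Load context, so only true name bindings match.
                if isinstance(sub, ast.Name) and isinstance(sub.ctx, ast.Store):
                    names.add(sub.id)
    return names


body = (
    "acc = 0\n"
    "for i in range(4):\n"
    "    acc += i\n"
    "    ptr.store(acc)\n"
    "    obj.field = i\n"
)
carried = assigned_names(body)
```

Here `carried` contains only `acc`; neither `ptr` nor `obj` is treated as a reassigned variable.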

Style: rewritten on the FlyDSL high-level surface (Python operators,
scf.if-from-Python-if, FlyDSL Vector API for accumulators). Low-level
MLIR scaffolding is kept only where strictly required (ROCDL intrinsics,
buffer_ops, raw-LLVM atomic / pointer helpers).

Bug fixes folded in:
* dispatch StdMoE Phase 4: use unsigned comparison for is_local.
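Why signedness matters for a locality test can be shown with a small model. The exact expression used in Phase 4 is not given here, so the form below (an offset from a base compared against a count) is an assumption, and `as_u32`, `is_local_signed`, and `is_local_unsigned` are hypothetical names. The point is the general one: a single unsigned compare rejects negative offsets, while a signed compare lets them through.

```python
def as_u32(x):
    """Reinterpret an integer as a 32-bit unsigned value."""
    return x & 0xFFFFFFFF


def is_local_signed(idx, base, count):
    # Signed compare: a negative offset (idx below base) is still < count,
    # so out-of-range indices are wrongly classified as local.
    return (idx - base) < count


def is_local_unsigned(idx, base, count):
    # Unsigned compare: a negative offset wraps to a huge value and fails
    # the test, so one comparison covers both range bounds.
    return as_u32(idx - base) < count


below = (is_local_signed(3, 8, 8), is_local_unsigned(3, 8, 8))
inside = (is_local_signed(10, 8, 8), is_local_unsigned(10, 8, 8))
above = (is_local_signed(16, 8, 8), is_local_unsigned(16, 8, 8))
```

Only the `below` case differs between the two variants, which is the kind of input a signed comparison misclassifies.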

Verify (8 GPU, gfx942, bf16):
* default                bs=256/h=7168/k=8 -> ALL PASS
* --enable-std-moe       bs=512/h=7168/k=8 -> ALL PASS
@yanboshao yanboshao force-pushed the yanbo/dispatch_combine branch from 7f53c40 to 1a8596c Compare May 14, 2026 07:53
Comment thread python/flydsl/compiler/ast_rewriter.py

2 participants