[Feat] Align quant and fused rmsnorm kernels with aiter/triton#481
cschenjunlin wants to merge 13 commits into
Conversation
| y = (added * rrms) * g
| _store_vec(_to_elem_vec(y), out_div, idx)
| else:
No need to copy all the code into the else branch?
@cschenjunlin any update?
I have pushed new commits to resolve the duplication issue mentioned above. Some common helper functions were added to reduce code duplication across the variant kernels. The scope of the refactoring here is limited, however, since the variants introduce new data-flow logic. If further reduction of code duplication is needed, I can attempt to converge the
@coderfeli Sorry for the delay, please check this PR again. Thanks!
| return allocator, red_offset, red2_offset
| def _load_scalar(copy_atom, scalar_reg_ty, scalar_reg_lay, divided_tensor, index):
This now conflicts with main, which moved these register temporaries to fx.make_rmem_tensor/internal types. Please rebase and keep the shared helpers on the new API instead of reintroducing MemRefType + memref_alloca.
| return ok, flydsl_gpu_us
| def test_rmsnorm_fused_add_dynamicquant():
These new variant tests are pytest-only today. run_benchmark.sh executes this file as a script, but main only calls test_all(), so the fused/quant variants are not exercised in that benchmark path.
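One way to close the gap described above is to wire the variant tests into the script entry point that run_benchmark.sh hits. The function bodies below are stand-ins (an assumption, not the file's actual contents); only the call structure is the point:

```python
# Sketch: run_benchmark.sh executes this file as a script, so the variant
# tests must be invoked from the script entry point to reach that path.
CALLED = []

def test_all():
    CALLED.append("test_all")          # existing benchmark suite (assumed stub)

def test_rmsnorm_fused_add_dynamicquant():
    CALLED.append("fused_add_quant")   # new variant test from this PR (stub)

def main():
    test_all()
    test_rmsnorm_fused_add_dynamicquant()  # assumption: also run the variants here

if __name__ == "__main__":
    main()
```

With this shape, pytest still collects the `test_*` functions individually, while the script path exercises both the original suite and the new variants.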
Resolve RMSNorm conflicts by keeping the branch variants on the current make_rmem_tensor-based register helper API. Co-authored-by: Cursor <cursoragent@cursor.com>
Motivation
Align quant and fused kernels with aiter/triton
Technical Details
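The fused-add + dynamic-quant data flow being aligned here can be sketched as an unfused NumPy reference. Only the line `y = (added * rrms) * g` comes from the reviewed kernel; the per-row symmetric int8 scheme and the exact set of outputs are assumptions for illustration:

```python
import numpy as np

def fused_add_rmsnorm_dynamic_quant(x, residual, g, eps=1e-6):
    """Reference data flow: residual add -> RMSNorm -> per-row int8 dynamic quant."""
    added = x + residual                                  # fused residual add
    rrms = 1.0 / np.sqrt((added * added).mean(-1, keepdims=True) + eps)
    y = (added * rrms) * g                                # same expression as the reviewed kernel
    scale = np.abs(y).max(-1, keepdims=True) / 127.0      # dynamic per-row scale (assumed scheme)
    q = np.clip(np.rint(y / scale), -127, 127).astype(np.int8)
    return q, scale, added                                # 'added' is typically written back too
```

A fused kernel computes all of this in one pass over the row; this reference is only useful as a correctness oracle for the variant tests.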
Test Plan
Test Result
Tested on MI308+ROCm7.1:
quant rmsnorm performance comparison:
fused_add rmsnorm performance comparison:
fused add quant rmsnorm performance comparison:
Submission Checklist