
[Feat] Align quant and fused rmsnorm kernels with aiter/triton #481

Open
cschenjunlin wants to merge 13 commits into main from cjl/fused_quant_rmsnorm

Conversation

@cschenjunlin (Contributor) commented May 8, 2026

Motivation

Align the quant and fused RMSNorm kernels with aiter/triton.

Technical Details

Test Plan

Test Result

Tested on MI308 + ROCm 7.1:

Quant RMSNorm performance comparison:

====================================================================================================
Perf Compare (gpu us): FlyDSL vs AIter
====================================================================================================
op         shape              dtype  FlyDSL(gpu us)  AIter(gpu us)    speedup
rmsnorm_dq 64x256             f32              28.6           37.0      1.29x
rmsnorm_dq 128x1024           f32              28.1           37.5      1.33x
rmsnorm_dq 32x128             f16              29.5           36.9      1.25x
rmsnorm_dq 64x2000            f32              29.3           38.7      1.32x
rmsnorm_dq 16x512             bf16             29.1           37.1      1.27x
rmsnorm_dq 1024x8192          bf16             28.9           37.4      1.30x
rmsnorm_dq 32768x8192         bf16            400.8        1,089.9      2.72x
rmsnorm_sq 64x256             f32              30.9           38.8      1.25x
rmsnorm_sq 128x1024           f32              30.7           38.0      1.24x
rmsnorm_sq 32x128             f16              30.1           38.7      1.29x
rmsnorm_sq 64x2000            f32              30.3           38.4      1.27x
rmsnorm_sq 16x512             bf16             30.5           39.7      1.30x
rmsnorm_sq 1024x8192          bf16             30.6           46.6      1.52x
rmsnorm_sq 32768x8192         bf16            535.8        1,476.0      2.75x
====================================================================================================
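For orientation, a minimal NumPy sketch of what the two op families are assumed to compute — `_dq` deriving a dynamic per-row scale from the row's absolute max, `_sq` applying a caller-supplied static scale. The int8 output and qmax = 127 are illustrative assumptions (the actual kernels may target a different format), and the `_ref` names are ours, not the PR's:

```python
import numpy as np

def rmsnorm_ref(x, g, eps=1e-6):
    # RMSNorm: scale each row by rsqrt(mean(x^2)), then by the gain g.
    rrms = 1.0 / np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return x.astype(np.float32) * rrms * g

def rmsnorm_dq_ref(x, g, qmax=127.0):
    # Dynamic quant: per-row scale computed on the fly from the normalized row.
    y = rmsnorm_ref(x, g)
    scale = np.abs(y).max(axis=-1, keepdims=True) / qmax
    return np.clip(np.round(y / scale), -qmax, qmax).astype(np.int8), scale

def rmsnorm_sq_ref(x, g, scale, qmax=127.0):
    # Static quant: one precomputed scale supplied by the caller.
    y = rmsnorm_ref(x, g)
    return np.clip(np.round(y / scale), -qmax, qmax).astype(np.int8)
```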

Fused-add RMSNorm performance comparison:

====================================================================================================
Perf Compare (gpu us): FlyDSL vs AIter
====================================================================================================
op         shape              dtype  FlyDSL(gpu us)  AIter(gpu us)    speedup
rmsnorm_add 64x256             f32              31.5           52.1      1.65x
rmsnorm_add 128x1024           f32              31.5           51.5      1.63x
rmsnorm_add 32x128             f16              31.3           50.6      1.62x
rmsnorm_add 64x2000            f32              30.8           51.4      1.67x
rmsnorm_add 16x512             bf16             31.1           51.3      1.65x
rmsnorm_add 1024x8192          bf16             31.2           55.7      1.79x
rmsnorm_add 32768x8192         bf16            814.7        1,661.4      2.04x
====================================================================================================

Fused-add quant RMSNorm performance comparison:

====================================================================================================
Perf Compare (gpu us): FlyDSL vs AIter
====================================================================================================
op         shape              dtype  FlyDSL(gpu us)  AIter(gpu us)    speedup
rmsnorm_add_dq 64x256             f32              32.8           38.8      1.18x
rmsnorm_add_dq 128x1024           f32              33.3           37.6      1.13x
rmsnorm_add_dq 32x128             f16              33.2           37.6      1.13x
rmsnorm_add_dq 64x2000            f32              33.5           39.3      1.17x
rmsnorm_add_dq 16x512             bf16             32.8           39.2      1.20x
rmsnorm_add_dq 1024x8192          bf16             33.3           54.6      1.64x
rmsnorm_add_dq 32768x8192         bf16            731.4        1,562.0      2.14x
rmsnorm_add_sq 64x256             f32              35.2           42.2      1.20x
rmsnorm_add_sq 128x1024           f32              35.1           41.9      1.19x
rmsnorm_add_sq 32x128             f16              35.5           39.8      1.12x
rmsnorm_add_sq 64x2000            f32              36.4           40.8      1.12x
rmsnorm_add_sq 16x512             bf16             35.1           43.1      1.23x
rmsnorm_add_sq 1024x8192          bf16             35.8           64.1      1.79x
rmsnorm_add_sq 32768x8192         bf16            888.1        1,847.8      2.08x
====================================================================================================

Submission Checklist

  • quant_rms_norm_kernel
  • fused_add_rmsnorm_kernel
  • quant_fused_add_rmsnorm_kernel
  • rmsnorm_kernel_large_m_small_n

Comment thread on kernels/rmsnorm_kernel.py
y = (added * rrms) * g
_store_vec(_to_elem_vec(y), out_div, idx)

else:
Collaborator: No need to copy all the code in the else branch?

@coderfeli (Collaborator)

@cschenjunlin any update?

@cschenjunlin (Contributor, Author)

> @cschenjunlin any update?

I have pushed new commits to resolve the duplication issue mentioned above.

Some common helper functions were added to reduce code duplication across the variant kernels. However, the scope of refactoring here is limited, as the variants introduce some new data-flow logic.

For each variant kernel, a standalone build_xx_module function is retained. This aligns with the aiter/triton implementation, where each variant is likewise a standalone kernel separate from the base version.

If further reduction of code duplication is needed, I can converge the build_xx_module functions of all variants into a unified implementation and use flag parameters to differentiate the branch logic internally.
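A minimal sketch of that flag-parameter idea, with the branch selection hoisted to build time; every name here (`build_rmsnorm_kernel`, `fused_add`, `quant_mode`) is hypothetical and only illustrates the shape of a unified builder, not FlyDSL's actual module-building API:

```python
def build_rmsnorm_kernel(fused_add=False, quant_mode=None):
    # quant_mode: None | "dynamic" | "static"; variants chosen once at build time.
    def kernel(x, g, residual=None, scale=None, eps=1e-6, qmax=127.0):
        src = x + residual if fused_add else x
        rrms = 1.0 / np.sqrt(np.mean(src.astype(np.float32) ** 2,
                                     axis=-1, keepdims=True) + eps)
        y = src * rrms * g
        if quant_mode == "dynamic":
            scale = np.abs(y).max(axis=-1, keepdims=True) / qmax
        if quant_mode is not None:
            return np.clip(np.round(y / scale), -qmax, qmax).astype(np.int8), scale
        return y
    return kernel
```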

@i-chaochen

@coderfeli Sorry for the delay, please check this PR again.

Thanks!

Comment thread on kernels/rmsnorm_kernel.py (Outdated)
return allocator, red_offset, red2_offset


def _load_scalar(copy_atom, scalar_reg_ty, scalar_reg_lay, divided_tensor, index):
Collaborator: This now conflicts with main, which moved these register temporaries to fx.make_rmem_tensor/internal types. Please rebase and keep the shared helpers on the new API instead of reintroducing MemRefType + memref_alloca.

return ok, flydsl_gpu_us


def test_rmsnorm_fused_add_dynamicquant():
Collaborator: These new variant tests are pytest-only today. run_benchmark.sh executes this file as a script, but main only calls test_all(), so the fused/quant variants are not exercised in that benchmark path.
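One possible fix for this gap, sketched under the assumption that `test_all()` is the script entry point run_benchmark.sh reaches: extend it to call the new variant tests. `test_rmsnorm_fused_add_dynamicquant` appears in this PR's diff; the other test names are assumptions about the existing file:

```python
def test_all():
    test_rmsnorm()                         # assumed existing base test
    test_rmsnorm_dynamicquant()            # assumed variant test name
    test_rmsnorm_fused_add()               # assumed variant test name
    test_rmsnorm_fused_add_dynamicquant()  # present in this PR's diff
```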

Resolve RMSNorm conflicts by keeping the branch variants on the current make_rmem_tensor-based register helper API.

Co-authored-by: Cursor <cursoragent@cursor.com>
