
Group 1 - Hardware-Aware Transformer Optimisation: Integrating Programmable Attention, Triton Kernel Fusion, and Multi-Objective NAS #315

Open
aahaidar01 wants to merge 83 commits into DeepWok:main from aahaidar01:main

aahaidar01 commented Mar 27, 2026

Authors:

Ali Haidar, Dorijan Donaj Magasic, Yash Agarwal, Mahmoud El Etreby

Summary

  • FlexAttention integration pass — module-level transform replacing SDPA with torch.nn.attention.flex_attention, supporting causal, sliding-window, ALiBi, and document masking patterns with block-sparse acceleration. Up to 1.72x training speedup for sliding-window attention at seq=4096.

  • Fused Add+RMSNorm Triton kernel — custom forward/backward Triton kernel fusing the residual-add → RMSNorm pattern in transformer decoder layers. 2.98x faster than unfused PyTorch, 1.42x faster than Liger-Kernel, with 60% peak memory reduction per fusion site. Includes both FX graph-level and module-level MASE passes.

  • Automated multi-objective search pipeline — fills MASE's LatencyRunner stub with GPU timing, adds a ModuleSearchSpaceQuantizationFusion search space covering bit-width × fusion strategy, and wires everything into Optuna NSGA-II search. Produces Pareto frontiers over accuracy/perplexity, latency, and average bitwidth across BERT, TinyLlama, and Mistral.
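The masking patterns listed above are expressed in FlexAttention as small predicate functions. As a minimal sketch of the idea (pure Python rather than the real `torch.nn.attention.flex_attention` callable, and the `window` default is a hypothetical value, not one taken from this PR), a sliding-window causal mask combines two conditions per (query, key) pair:

```python
def sliding_window_causal(b, h, q_idx, kv_idx, window=256):
    """FlexAttention-style mask_mod predicate: returns True iff query
    position q_idx may attend to key position kv_idx. The (b, h) batch and
    head indices are unused here but are part of the mask_mod signature.
    Combines causality with a fixed-size sliding window."""
    causal = q_idx >= kv_idx           # no attending to future positions
    in_window = q_idx - kv_idx < window  # only the last `window` keys
    return causal and in_window
```

In the real pass, such a predicate is compiled into a block-sparse `BlockMask` via `create_block_mask`, which is what lets FlexAttention skip fully-masked tiles and deliver the speedups reported above.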

Key results

| Optimization | Best speedup | Condition |
| --- | --- | --- |
| FlexAttention SWA (inference) | 1.46x | seq=4096, Llama |
| FlexAttention SWA (training throughput) | 1.73x | seq=4096, 25K vs 14.5K tok/s |
| FlexAttention document masking | 2.25x | seq=8192 vs SDPA mask |
| Fused RMSNorm kernel vs PyTorch | 2.98x | L40S, BF16 |
| Fused RMSNorm kernel vs Liger-Kernel | 1.42x | L40S, BF16 |
| Fused RMSNorm memory reduction | 60% | per fusion site (forward) |
| Mistral sliding-window (search pipeline) | 7% | consistent from seq≥512 |

Files changed

New passes

  • src/chop/passes/module/transforms/attention/flex_attention_transform.py — FlexAttention pass
  • src/chop/passes/module/transforms/attention/score_mods.py — score/mask modification library
  • src/chop/passes/graph/transforms/fused_rmsnorm/ — Triton kernel + FX graph pass
  • src/chop/passes/module/transforms/fused_ops/rmsnorm_residual_fusion.py — module-level fusion pass
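The pattern the Triton kernel in `fused_rmsnorm/` fuses has simple reference semantics: add the residual stream into the hidden state, then RMS-normalize and scale. A minimal pure-Python sketch (illustration only; the function name and the choice to also return the summed residual, as fused add+norm kernels commonly do, are assumptions, not this PR's API):

```python
import math

def fused_add_rmsnorm_reference(x, residual, weight, eps=1e-6):
    """Reference semantics of the residual-add -> RMSNorm pattern:
    h = x + residual;  y = h / rms(h) * weight.
    Operates on flat lists of floats for clarity; the fused kernel does
    this in one pass to avoid materialising h in global memory twice."""
    h = [a + b for a, b in zip(x, residual)]              # residual add
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)  # root mean square
    y = [v / rms * w for v, w in zip(h, weight)]           # normalise + scale
    return y, h  # h is the updated residual stream carried to the next layer
```

Fusing these two steps is what yields the per-site memory reduction: the intermediate sum never round-trips through global memory between the add and the norm.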

Search infrastructure

  • src/chop/actions/search/strategies/runners/hardware/latency.py — GPU latency runner
  • src/chop/actions/search/search_space/quantization/module_fusion.py — fusion search space
  • src/chop/pipelines/optimization.py — pass-chain wrapper
  • configs/search/quantization_fusion_{bert,llama,mistral}.toml — search configs
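The search wiring above reports Pareto frontiers over competing objectives. As a sketch of what that frontier computation does (a plain non-dominated filter, not the PR's actual code or Optuna's internals; objective ordering in the tuples is a hypothetical choice):

```python
def pareto_front(points):
    """Return the Pareto-optimal subset under minimisation of every
    objective. Each point is a tuple of objective values, e.g.
    (perplexity, latency_ms, avg_bitwidth)."""
    def dominates(a, b):
        # a dominates b: no worse on every objective, strictly better on one
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

NSGA-II maintains exactly this kind of non-dominated ranking generation by generation, which is why the pipeline can hand back a frontier rather than a single "best" configuration.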

Experiments & benchmarks

  • experiments/flex_attention/ — 12 experiments with JSON results and figures
  • scripts/ — search runners, benchmarks, kernel profiling
  • test/ — FlexAttention tests (40 tests), LatencyRunner tests

Documentation

Models tested

  • BERT-base (SST-2 classification) — FlexAttention + quantization search
  • TinyLlama-1.1B (WikiText-2 LM) — full pipeline with causal FlexAttention
  • Mistral-7B (WikiText-2 LM) — sliding-window FlexAttention in float16

Hardware

NVIDIA L40S (48GB), PyTorch 2.6.0, Triton 3.3.1

aahaidar01 and others added 30 commits February 5, 2026 23:23
- Patch  to support 2D inputs in Binary quantizers (previously hardcoded for 4D).
- Fix  signatures in  and  to match PyTorch autograd requirements.
…n add our plots with relative paths to that folder and prepare it for .zip submission. Added further results to .md file including lab 1.
…nalities) into mase /src files. Added pytest test scripts to test the implemented functionalities.
…ch.compile compatibility, return tuple fixes, and add training/bf16/seq512 tests
…edup, fix GQA via enable_gqa=True, add block_mask caching, and expand test suite to 39 tests
…as it was. failing silently and falling back to eager causing OOM.
…xperiments. Add alibi score mod and compound alibi and sw into score_mods.py. Uploaded results temporarily to share with collaborators.
…computations. Added 3 more experiments: decode generation, throughput (tokens/s), gqa isolation.
MahmoudEletreby and others added 30 commits March 25, 2026 16:01
…_ import

- Checkout fused_rmsnorm Triton kernel files from feature/fused-rmsnorm-residual
  branch (triton_fused_add_rmsnorm.py was missing, causing ImportError at chop
  startup and crashing all search jobs)
- Wrap fused_ops import in transforms/__init__.py with try/except so a missing
  Triton kernel never cascades to crash the entire chop framework

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both module_fusion.py and benchmark_seqlen.py imported rmsnorm_residual_fusion_pass
which does not exist. The actual function is fused_rmsnorm_residual_transform_pass.
In the search space the error was silently swallowed by except ImportError, meaning
fused_rmsnorm was never applied in any trial. In the benchmark it caused [ERROR]
for int8_rmsnorm and int8_both strategies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without explicit cleanup, each trial's deep-copied model accumulates on
GPU memory across 100 trials. Move to CPU, delete, gc.collect(), and
empty_cache() after metrics are computed each trial.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
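The cleanup sequence described in that commit (move to CPU, delete, collect, empty the cache) can be sketched as a small helper. This is an illustration of the pattern, not the PR's code; the function name is hypothetical, and the device-cache hook is passed in (e.g. `torch.cuda.empty_cache`) so the sketch stays framework-agnostic:

```python
import gc

def release_trial_model(model, device_cache_empty=None):
    """Free one trial's deep-copied model between search trials so GPU
    memory does not accumulate across ~100 trials."""
    if hasattr(model, "to"):
        model.to("cpu")        # move parameters off the GPU first
    del model                  # drop this function's reference; the caller
                               # must also drop theirs for the object to die
    gc.collect()               # break any reference cycles promptly
    if device_cache_empty is not None:
        device_cache_empty()   # e.g. torch.cuda.empty_cache(): return cached
                               # allocator blocks to the device
```

Calling this after each trial's metrics are computed keeps peak GPU memory bounded by one model copy rather than the whole trial history.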
…f flex_attention. Removed redundant .pbs job scripts not required for PR.
FP32 shows no benefit from the fused RMSNorm kernel (per Section III
Fig 5). Loading in float16 matches production inference dtype and
surfaces the ~1.03x model-level latency improvement.
Anchors reference lines to baseline latency at seq=1024, making the
sub-quadratic FlexAttention-SWA scaling claim visually explicit.
