
Conversation

@HobbitQia (Collaborator) commented Jan 28, 2026

FuseKernelPass is an MLIR optimization pass that merges neura.kernel operations.

  • Step 1: Pattern Detection
    • Producer-Consumer: kernel_A → kernel_B (data dependency)
    • Sibling: kernel_A || kernel_B (shared inputs, no dependency)
  • Step 2: Profitability Analysis
    • Creates temporary test modules with cloned kernel bodies, then runs the full Neura transformation pipeline (assign-accelerator → lower-arith → canonicalize → transform-ctrl-to-dataflow)
    • Computes real metrics: RecMII, ResMII, max fanout, operation count
    • Estimates the final MII using the formula MII_est = ⌈(1 + α×ops/tiles) × (1 + β×max(fanout−4, 0)) × max(RecMII, ResMII)⌉, then compares fused vs. separate execution: fusion only proceeds if MII_fused ≤ max(MII_k1, MII_k2). Note that this equation is designed to take resources and congestion (fanout) into consideration; it will be tuned further (see the sketch after this list).
  • Step 3: Kernel Fusion
    • Producer-Consumer: Merges producer body into consumer, eliminates intermediate buffer
    • Sibling: Combines both kernel bodies, deduplicates shared inputs
    • Iteratively applies fusion until no more profitable opportunities exist. Here we prioritize Producer-Consumer fusion.
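
Below is a minimal C++ sketch of the Step 2 profitability check. KernelMetrics, estimateMII, isFusionProfitable, and the alpha/beta/numTiles parameters are illustrative stand-ins for the tuning constants α and β, not the actual interfaces of FuseKernelPass.

#include <algorithm>
#include <cmath>

// Hypothetical container for the metrics gathered from a temporary test
// module (RecMII, ResMII, op count, max fanout); not the pass's real type.
struct KernelMetrics {
  double recMII;
  double resMII;
  double numOps;
  double maxFanout;
};

// MII_est = ceil((1 + alpha*ops/tiles) * (1 + beta*max(fanout-4, 0))
//                * max(RecMII, ResMII))
// alpha/beta are tuning knobs; 4 approximates the per-tile fanout budget.
int estimateMII(const KernelMetrics &m, double numTiles,
                double alpha, double beta) {
  double utilization = 1.0 + alpha * m.numOps / numTiles;
  double congestion = 1.0 + beta * std::max(m.maxFanout - 4.0, 0.0);
  return static_cast<int>(
      std::ceil(utilization * congestion * std::max(m.recMII, m.resMII)));
}

// Fusion proceeds only if the fused kernel is estimated to be no slower
// than the slower of the two original kernels.
bool isFusionProfitable(const KernelMetrics &fused, const KernelMetrics &k1,
                        const KernelMetrics &k2, double numTiles,
                        double alpha, double beta) {
  int fusedMII = estimateMII(fused, numTiles, alpha, beta);
  return fusedMII <= std::max(estimateMII(k1, numTiles, alpha, beta),
                              estimateMII(k2, numTiles, alpha, beta));
}

If the estimate for the fused kernel exceeds the worse of the two originals, the candidate pair is skipped and the pass moves on to the next fusion opportunity.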

An example is shown below, where the two loops are wrapped in the same kernel:

// Before fusion
func.func @test_producer_consumer_fusion(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>, %arg3: memref<?xf32>) {
  %cst = arith.constant 2.000000e+00 : f32
  affine.for %arg4 = 0 to 64 {
    %0 = memref.load %arg0[%arg4] : memref<?xf32>
    %1 = memref.load %arg1[%arg4] : memref<?xf32>
    %2 = arith.addf %0, %1 : f32
    memref.store %2, %arg2[%arg4] : memref<?xf32>
  }
  affine.for %arg4 = 0 to 64 {
    %0 = memref.load %arg2[%arg4] : memref<?xf32>
    %1 = arith.mulf %0, %cst : f32
    memref.store %1, %arg3[%arg4] : memref<?xf32>
  }
  return
}
// After fusion
func.func @test_producer_consumer_fusion(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>, %arg3: memref<?xf32>) {
  %cst = arith.constant 2.000000e+00 : f32
  neura.kernel ins(%arg0, %arg1, %arg2, %cst, %arg3 : memref<?xf32>, memref<?xf32>, memref<?xf32>, f32, memref<?xf32>) attributes {kernel_name = "fused_sibling"} {
  ^bb0(%arg4: memref<?xf32>, %arg5: memref<?xf32>, %arg6: memref<?xf32>, %arg7: f32, %arg8: memref<?xf32>):
    affine.for %arg9 = 0 to 64 {
      %0 = memref.load %arg0[%arg9] : memref<?xf32>
      %1 = memref.load %arg1[%arg9] : memref<?xf32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %arg2[%arg9] : memref<?xf32>
    }
    affine.for %arg9 = 0 to 64 {
      %0 = memref.load %arg2[%arg9] : memref<?xf32>
      %1 = arith.mulf %0, %cst : f32
      memref.store %1, %arg3[%arg9] : memref<?xf32>
    }
  }
  return
}

@HobbitQia requested a review from tancheng, January 28, 2026 12:06

Contributor:

How do we use this file for testing?

Collaborator Author:

This file can be used for testing with Polygeist as the front end, which lowers C++ to affine-dialect IR. The IR after lowering is provided in test.mlir; kernel.cpp is kept only for reference.
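
For example, a Polygeist invocation via its cgeist driver would look roughly like the following; the exact flag spellings vary across Polygeist versions, so treat this as an assumed command rather than the one used to produce test.mlir.

cgeist kernel.cpp --function='*' -S -O2 --raise-scf-to-affine -o test.mlir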

Contributor:

Then can we actually leverage Polygeist to compile it, instead of keeping it just as a reference?

Comment on lines +16 to +18
#define START_PIPE_OCCUPY 1 // A multi-cycle op starts in the FU
#define END_PIPE_OCCUPY 2 // A multi-cycle op ends in the FU
#define IN_PIPE_OCCUPY 3 // A multi-cycle op is occupying the FU (pipelined)
Contributor:

Aren't the three *_PIPE_OCCUPY states overlapping with each other?

Collaborator Author (@HobbitQia) commented Jan 29, 2026:

Actually, 3 (IN_PIPE_OCCUPY) means the multi-cycle op does not occupy the input and output ports of the tile, so we can map other operations onto this tile. This is the inclusive execution we proposed in our DATE paper.

However, I have not finished implementing and testing inclusive execution yet. Here I just copied some content from CGRA-Mapper and will tune it in the future.
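
To make the intent concrete, here is a rough C++ sketch of one possible reading of the three states for an op with a given latency; occupancyStates is a hypothetical helper, not code from this PR or from CGRA-Mapper.

#include <vector>

#define START_PIPE_OCCUPY 1 // multi-cycle op starts in the FU (consumes input ports)
#define END_PIPE_OCCUPY   2 // multi-cycle op ends in the FU (drives output ports)
#define IN_PIPE_OCCUPY    3 // op is in flight inside the FU; ports stay free, so
                            // other ops can still be mapped onto this tile
                            // (the inclusive-execution case)

// Hypothetical helper: the occupancy state an op holds on its tile for each
// cycle it is resident. A latency of at least 2 cycles is assumed.
std::vector<int> occupancyStates(int latency) {
  std::vector<int> states(latency, IN_PIPE_OCCUPY);
  states.front() = START_PIPE_OCCUPY; // first cycle uses the input ports
  states.back() = END_PIPE_OCCUPY;    // last cycle uses the output ports
  return states;
}

Under this reading, only the START and END cycles tie up the tile's ports, while the IN_PIPE cycles leave them free for other operations.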

Contributor:

So IN_PIPE_OCCUPY does not include start and end, right?
