
Conversation

@HobbitQia (Collaborator) commented Jan 28, 2026

FuseKernelPass is an MLIR optimization pass that merges neura.kernel operations.

  • Step 1: Pattern Detection
    • Producer-Consumer: kernel_A → kernel_B (data dependency)
    • Sibling: kernel_A || kernel_B (shared inputs, no dependency)
  • Step 2: Profitability Analysis
    • Creates temporary test modules with cloned kernel bodies, then runs the full Neura transformation pipeline (assign-accelerator → lower-arith → canonicalize → transform-ctrl-to-dataflow)
    • Computes real metrics: RecMII, ResMII, max fanout, operation count
    • Estimates the final MII using the formula MII_est = ⌈(1 + α×ops/tiles) × (1 + β×max(fanout−4, 0)) × max(RecMII, ResMII)⌉, then compares fused vs. separate execution: fusion only proceeds if MII_fused ≤ max(MII_k1, MII_k2). Note that this equation is designed to take resources and congestion (fanout) into consideration; it will be tuned further (see the sketch after this list).
  • Step 3: Kernel Fusion
    • Producer-Consumer: Merges producer body into consumer, eliminates intermediate buffer
    • Sibling: Combines both kernel bodies, deduplicates shared inputs
    • Iteratively applies fusion until no more profitable opportunities exist. Here we prioritize Producer-Consumer fusion.
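
Below is a minimal C++ sketch of the Step 2 profitability check. KernelMetrics, estimateMII, isFusionProfitable, and the alpha/beta/numTiles parameters are illustrative stand-ins for the tuning constants α and β, not the actual interfaces of FuseKernelPass.

#include <algorithm>
#include <cmath>

// Hypothetical container for the metrics gathered from a temporary test
// module (RecMII, ResMII, op count, max fanout); not the pass's real type.
struct KernelMetrics {
  double recMII;
  double resMII;
  double numOps;
  double maxFanout;
};

// MII_est = ceil((1 + alpha*ops/tiles) * (1 + beta*max(fanout-4, 0))
//                * max(RecMII, ResMII))
// alpha/beta are tuning knobs; 4 approximates the per-tile fanout budget.
int estimateMII(const KernelMetrics &m, double numTiles,
                double alpha, double beta) {
  double utilization = 1.0 + alpha * m.numOps / numTiles;
  double congestion = 1.0 + beta * std::max(m.maxFanout - 4.0, 0.0);
  return static_cast<int>(
      std::ceil(utilization * congestion * std::max(m.recMII, m.resMII)));
}

// Fusion proceeds only if the fused kernel is estimated to be no slower
// than the slower of the two original kernels.
bool isFusionProfitable(const KernelMetrics &fused, const KernelMetrics &k1,
                        const KernelMetrics &k2, double numTiles,
                        double alpha, double beta) {
  int fusedMII = estimateMII(fused, numTiles, alpha, beta);
  return fusedMII <= std::max(estimateMII(k1, numTiles, alpha, beta),
                              estimateMII(k2, numTiles, alpha, beta));
}

If the estimate for the fused kernel exceeds the worse of the two originals, the candidate pair is skipped and the pass moves on to the next fusion opportunity.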

An example is shown below, where the two loops are wrapped in the same kernel:

// Before fusion
func.func @test_producer_consumer_fusion(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>, %arg3: memref<?xf32>) {
  %cst = arith.constant 2.000000e+00 : f32
  affine.for %arg4 = 0 to 64 {
    %0 = memref.load %arg0[%arg4] : memref<?xf32>
    %1 = memref.load %arg1[%arg4] : memref<?xf32>
    %2 = arith.addf %0, %1 : f32
    memref.store %2, %arg2[%arg4] : memref<?xf32>
  }
  affine.for %arg4 = 0 to 64 {
    %0 = memref.load %arg2[%arg4] : memref<?xf32>
    %1 = arith.mulf %0, %cst : f32
    memref.store %1, %arg3[%arg4] : memref<?xf32>
  }
  return
}
// After fusion
func.func @test_producer_consumer_fusion(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>, %arg3: memref<?xf32>) {
  %cst = arith.constant 2.000000e+00 : f32
  neura.kernel ins(%arg0, %arg1, %arg2, %cst, %arg3 : memref<?xf32>, memref<?xf32>, memref<?xf32>, f32, memref<?xf32>) attributes {kernel_name = "fused_sibling"} {
  ^bb0(%arg4: memref<?xf32>, %arg5: memref<?xf32>, %arg6: memref<?xf32>, %arg7: f32, %arg8: memref<?xf32>):
    affine.for %arg9 = 0 to 64 {
      %0 = memref.load %arg0[%arg9] : memref<?xf32>
      %1 = memref.load %arg1[%arg9] : memref<?xf32>
      %2 = arith.addf %0, %1 : f32
      memref.store %2, %arg2[%arg9] : memref<?xf32>
    }
    affine.for %arg9 = 0 to 64 {
      %0 = memref.load %arg2[%arg9] : memref<?xf32>
      %1 = arith.mulf %0, %cst : f32
      memref.store %1, %arg3[%arg9] : memref<?xf32>
    }
  }
  return
}

@HobbitQia requested a review from tancheng, January 28, 2026 12:06

Contributor:

How do we use this file for testing?

Collaborator Author:

This file can be used for testing with Polygeist as the front end, which lowers C++ to affine-dialect IR. The IR after lowering is provided in test.mlir; kernel.cpp is kept only for reference.
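
For example, a Polygeist invocation via its cgeist driver would look roughly like the following; the exact flag spellings vary across Polygeist versions, so treat this as an assumed command rather than the one used to produce test.mlir.

cgeist kernel.cpp --function='*' -S -O2 --raise-scf-to-affine -o test.mlir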

Contributor:

Then can we actually leverage Polygeist to compile it, instead of keeping it just as a reference?

Comment on lines +16 to +18
#define START_PIPE_OCCUPY 1 // A multi-cycle op starts in the FU
#define END_PIPE_OCCUPY 2 // A multi-cycle op ends in the FU
#define IN_PIPE_OCCUPY 3 // A multi-cycle op is occupying the FU (pipelined)
Contributor:

Aren't the three *_PIPE_OCCUPY states overlapping with each other?

Collaborator Author (@HobbitQia) commented Jan 29, 2026:

Actually, 3 (IN_PIPE_OCCUPY) means the multi-cycle op does not occupy the input and output ports of the tile, so we can map other operations onto this tile. This is the inclusive execution we proposed in our DATE paper.

However, I have not finished implementing and testing inclusive execution yet. Here I just copied some content from CGRA-Mapper and will tune it in the future.
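
To make the intent concrete, here is a rough C++ sketch of one possible reading of the three states for an op with a given latency; occupancyStates is a hypothetical helper, not code from this PR or from CGRA-Mapper.

#include <vector>

#define START_PIPE_OCCUPY 1 // multi-cycle op starts in the FU (consumes input ports)
#define END_PIPE_OCCUPY   2 // multi-cycle op ends in the FU (drives output ports)
#define IN_PIPE_OCCUPY    3 // op is in flight inside the FU; ports stay free, so
                            // other ops can still be mapped onto this tile
                            // (the inclusive-execution case)

// Hypothetical helper: the occupancy state an op holds on its tile for each
// cycle it is resident. A latency of at least 2 cycles is assumed.
std::vector<int> occupancyStates(int latency) {
  std::vector<int> states(latency, IN_PIPE_OCCUPY);
  states.front() = START_PIPE_OCCUPY; // first cycle uses the input ports
  states.back() = END_PIPE_OCCUPY;    // last cycle uses the output ports
  return states;
}

Under this reading, only the START and END cycles tie up the tile's ports, while the IN_PIPE cycles leave them free for other operations.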

Contributor:

So IN_PIPE_OCCUPY does not include start and end, right?
