
Add inline GEMM optimizations and general performance improvements#226

Merged
sdatkinson merged 16 commits into sdatkinson:main from jfsantos:feature/inline-gemm
Feb 23, 2026
Conversation

@jfsantos
Contributor

Hand-optimized GEMM kernels for small matrices common in NAM models, gated by #ifdef NAM_USE_INLINE_GEMM with Eigen fallback. Includes:

  • Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3, plus fully-unrolled paths for 2x2 through 8x8 channel configurations
  • Conv1x1 inline specializations for all common size combinations
  • FiLM inline path with 4-element loop unrolling
  • GatingActivation/BlendingActivation inline paths
  • Branchless hardswish, 4-element loop unrolling for all activations
  • SiLU added to LUT enable/disable
  • Ring buffer refactored to Eigen block operations
  • memcpy replacements for pure copy operations in wavenet
  • Optimized single-channel output path in WaveNet::process
  • Buffer size benchmark tool (benchmodel_bufsize)

Developed with support and sponsorship from TONE3000
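The `#ifdef NAM_USE_INLINE_GEMM` gating described above can be sketched roughly as follows. This is an illustrative standalone example, not the merged code: the function and kernel names (`matvec4`, `gemm4x4_inline`, `gemm_generic`) are hypothetical, plain float arrays stand in for Eigen matrices, and the generic triple loop stands in for the Eigen fallback path.

```cpp
#include <cstddef>

// Illustrative: enable the hand-unrolled path (the real flag is set by the build).
#define NAM_USE_INLINE_GEMM

// out[4] += W[4x4, row-major] * in[4], with the inner dot product fully unrolled.
inline void gemm4x4_inline(const float* W, const float* in, float* out)
{
  for (int r = 0; r < 4; ++r)
  {
    out[r] += W[r * 4 + 0] * in[0] + W[r * 4 + 1] * in[1]
              + W[r * 4 + 2] * in[2] + W[r * 4 + 3] * in[3];
  }
}

// Generic fallback (standing in for the Eigen product in the real code).
inline void gemm_generic(const float* W, const float* in, float* out, int rows, int cols)
{
  for (int r = 0; r < rows; ++r)
  {
    for (int c = 0; c < cols; ++c)
    {
      out[r] += W[r * cols + c] * in[c];
    }
  }
}

void matvec4(const float* W, const float* in, float* out)
{
#ifdef NAM_USE_INLINE_GEMM
  gemm4x4_inline(W, in, out); // specialized kernel for the common small size
#else
  gemm_generic(W, in, out, 4, 4); // Eigen path in the actual implementation
#endif
}
```

Both branches compute the same result; the point of the flag is that the specialized kernel avoids Eigen's generic dispatch overhead for the tiny matrices common in NAM models.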

João Felipe Santos and others added 3 commits February 6, 2026 09:45
Hand-optimized GEMM kernels for small matrices common in NAM models,
gated by #ifdef NAM_USE_INLINE_GEMM with Eigen fallback. Includes:

- Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3, plus
  fully-unrolled paths for 2x2 through 8x8 channel configurations
- Conv1x1 inline specializations for all common size combinations
- FiLM inline path with 4-element loop unrolling
- GatingActivation/BlendingActivation inline paths
- Branchless hardswish, 4-element loop unrolling for all activations
- SiLU added to LUT enable/disable
- Ring buffer refactored to Eigen block operations
- memcpy replacements for pure copy operations in wavenet
- Optimized single-channel output path in WaveNet::process
- Buffer size benchmark tool (benchmodel_bufsize)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner

@sdatkinson sdatkinson left a comment

[Haven't finished reviewing conv1d, film, gating_activations, wavenet.cpp, and benchmodel]

  • One crit about some funny business with the comments.
  • Another crit: Can you add tests to ensure that the code is correct?
  • Other nits.

…ise ops

  ARM assembly analysis (-O2 -DNDEBUG) confirmed:
  - GCC auto-unrolls simple activation loops; manual 4-wide gives no benefit
  - expf() serializes sigmoid/SiLU; unrolling can't help
  - Eigen element-wise ops (.leftCols + .leftCols) produce identical codegen
    to raw float* loops when assertions are disabled

  Simplify 5 activation classes to use inline helpers (relu, sigmoid, etc.)
  and revert 3 wavenet element-wise operations back to Eigen expressions.

  Inline GEMM (Conv1x1/Conv1D), depthwise unrolling, FiLM unrolling,
  bias broadcast, and memcpy optimizations are retained — those show
  measurable wins on both desktop and Cortex-M7.

Also restored comments that were accidentally removed from wavenet.h.
@jfsantos jfsantos force-pushed the feature/inline-gemm branch from 704f309 to 7844a41 Compare February 16, 2026 18:44
Owner

@sdatkinson sdatkinson left a comment

There's a potential huge risk with non-contiguous matrices (like when we're doing gated/blended activations).

Can you tell me if there are any "snapshot" tests that would verify that the calculations with and without this flag are the same? I'm just really concerned about something being different that I missed.

[No changes required if you can verify that the things I was concerned about are correct. Sorry; it's just too much for me to fit in my head all at once and some "simple" proof would be a big help.]
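The non-contiguity concern can be made concrete with a small sketch (illustrative only; `scale_block` is a hypothetical kernel, not NAM code). In a column-major matrix, a block like `topRows(b)` shares storage with its parent, so consecutive columns of the block are separated by the parent's row count (the outer stride), not by the block's own height. A raw-pointer kernel that assumes the block is densely packed reads and writes the wrong elements.

```cpp
// Scale the top `blockRows` rows of a column-major `parentRows` x `cols`
// matrix in place. Correct stride handling: each column starts at
// data + c * parentRows, NOT data + c * blockRows.
void scale_block(float* data, int parentRows, int blockRows, int cols, float s)
{
  for (int c = 0; c < cols; ++c)
  {
    float* col = data + c * parentRows; // outer stride = parent's row count
    for (int r = 0; r < blockRows; ++r)
    {
      col[r] *= s;
    }
  }
}
```

Using `blockRows` as the stride instead would corrupt the rows below the block, which is exactly the kind of silent error a with/without-flag snapshot test would catch.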

// Validate input dimensions (assert for real-time performance)
const int total_channels = 2 * num_channels;
assert(input.rows() == total_channels);
assert(input.rows() == 2 * num_channels);
Owner

Nit: I've changed my mind on these in favor of a throw that can be compiled out with #define NDEBUG
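One way to realize that suggestion is a check macro that throws a catchable error in debug builds but compiles away entirely under NDEBUG, like assert() but without aborting the process. This is only a sketch of the idea; `NAM_CHECK` and `checked_double_rows` are hypothetical names, not part of the merged code.

```cpp
#include <stdexcept>

// Debug-only validation: a throw that vanishes when NDEBUG is defined,
// so release/real-time builds pay no cost for the check.
#ifdef NDEBUG
  #define NAM_CHECK(cond, msg) ((void)0)
#else
  #define NAM_CHECK(cond, msg) \
    do \
    { \
      if (!(cond)) \
        throw std::runtime_error(msg); \
    } while (0)
#endif

// Example use, mirroring the quoted assert above.
int checked_double_rows(int inputRows, int numChannels)
{
  NAM_CHECK(inputRows == 2 * numChannels, "input rows != 2 * num_channels");
  return inputRows;
}
```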

// Use the GatingActivation class
// Extract the blocks first to avoid temporary reference issues
auto input_block = this->_z.leftCols(num_frames);
auto output_block = this->_z.topRows(bottleneck).leftCols(num_frames);
Owner

cf. the non-contiguous concern.

Either this does an allocation, which would mean it isn't real-time safe...

...but I haven't seen those tests fail, so these blocks must address the memory as it's stored, in which case the activation concern I raised may actually be an issue?

I really need to get to verifying that the results match the PyTorch...
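For the allocation half of the concern: an Eigen-style block expression like `_z.topRows(bottleneck).leftCols(num_frames)` need not allocate, because a block is just a view — a pointer into the parent's storage plus sizes and an outer stride. The struct below is an illustrative stand-in (not Eigen's actual implementation) showing why constructing such a view is allocation-free and why writes through it land in the parent's buffer.

```cpp
// Minimal stand-in for an Eigen block: a non-owning view into a
// column-major parent matrix. Constructing one copies four fields; no heap.
struct BlockView
{
  float* data;     // points into the parent's buffer
  int rows, cols;  // dimensions of the view
  int outerStride; // parent's row count (column-major storage)

  float& at(int r, int c) { return data[c * outerStride + r]; }
};

// View of the top-left rows x cols corner of a parentRows-tall matrix.
BlockView topLeft(float* parent, int parentRows, int rows, int cols)
{
  return BlockView{parent, rows, cols, parentRows}; // no allocation
}
```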

@jfsantos
Contributor Author

You were spot on about the risk with non-contiguous operations. I added tests and made sure the inline ops all work on non-contiguous matrices now.

João Felipe Santos added 5 commits February 23, 2026 11:44
@jfsantos jfsantos force-pushed the feature/inline-gemm branch from c327ae0 to 53568a8 Compare February 23, 2026 19:59
@jfsantos jfsantos force-pushed the feature/inline-gemm branch from 53568a8 to 7b29c19 Compare February 23, 2026 20:17
@sdatkinson
Owner

Adding the benchmark report vs main (92fab9d):

==========================================
Performance Benchmark Comparison Report
==========================================

Model: ../../REAPER/core-tests/Models/a1-nano.nam
Number of runs per branch: 10
Date: Mon Feb 23 15:45:51 PST 2026

----------------------------------------
Branch: main
----------------------------------------
Commit:   92fab9d826535f729da89778e2171ccc5e817595
Mean:     35.351 ms
Median:   33.554 ms
Min:      33.194 ms
Max:      51.246 ms
Std Dev:  5.308 ms

----------------------------------------
Branch: feature/inline-gemm
----------------------------------------
Commit:   0ba430bcf7437f3278d00c75406f1529d70cd480
Mean:     31.315 ms
Median:   31.163 ms
Min:      31.008 ms
Max:      31.867 ms
Std Dev:  0.322 ms

----------------------------------------
Comparison
----------------------------------------
Mean:     feature/inline-gemm is 11.00% FASTER than main
Median:   feature/inline-gemm is 7.00% FASTER than main

Raw Results (main):
  51.246 ms
  34.480 ms
  33.438 ms
  33.589 ms
  33.436 ms
  33.330 ms
  33.674 ms
  33.194 ms
  33.521 ms
  33.603 ms

Raw Results (feature/inline-gemm):
  31.365 ms
  31.867 ms
  31.754 ms
  31.718 ms
  31.008 ms
  31.040 ms
  31.198 ms
  31.042 ms
  31.130 ms
  31.028 ms

Neat!

Owner

@sdatkinson sdatkinson left a comment

LGTM!


for (int c = 0; c < 2; c++)
for (int f = 0; f < num_frames; f++)
assert(std::abs(actual(c, f) - expected(c, f)) < 1e-5f);
Owner

Nit: For style (I thought this was enforced?), I prefer all for-loops/ifs/etc. to be enclosed in braces, even for one-liners. [It's been a source of bugs in the past... along with me as the other source 😉]
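The braced style the reviewer asks for would read as below. Plain arrays stand in for the Eigen matrices to keep the sketch self-contained, and the comparison is factored into a hypothetical helper (`all_close`) rather than bare asserts.

```cpp
#include <cmath>

// Braced version of the quoted assertion loop: every for and if gets a body
// in braces, even when it holds a single statement.
bool all_close(const float* actual, const float* expected, int channels, int num_frames)
{
  for (int c = 0; c < channels; ++c)
  {
    for (int f = 0; f < num_frames; ++f)
    {
      if (std::fabs(actual[c * num_frames + f] - expected[c * num_frames + f]) >= 1e-5f)
      {
        return false;
      }
    }
  }
  return true;
}
```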

@sdatkinson sdatkinson merged commit a0c93c0 into sdatkinson:main Feb 23, 2026
3 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in NeuralAmpModelerCore v0.4.1 Feb 23, 2026
