Add inline GEMM optimizations and general performance improvements #226
sdatkinson merged 16 commits into sdatkinson:main from
Conversation
Hand-optimized GEMM kernels for small matrices common in NAM models, gated by `#ifdef NAM_USE_INLINE_GEMM` with Eigen fallback. Includes:

- Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3 paths, plus fully-unrolled paths for 2x2 through 8x8 channel configurations
- Conv1x1 inline specializations for all common size combinations
- FiLM inline path with 4-element loop unrolling
- GatingActivation/BlendingActivation inline paths
- Branchless hardswish; 4-element loop unrolling for all activations
- SiLU added to LUT enable/disable
- Ring buffer refactored to Eigen block operations
- memcpy replacements for pure copy operations in wavenet
- Optimized single-channel output path in WaveNet::process
- Buffer size benchmark tool (benchmodel_bufsize)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sdatkinson
left a comment
[Haven't finished reviewing conv1d, film, gating_activations, wavenet.cpp, and benchmodel]
- One crit on some funny business with the comments.
- Another crit: Can you add tests to ensure that the code is correct?
- Other nits.
…ise ops
ARM assembly analysis (-O2 -DNDEBUG) confirmed:
- GCC auto-unrolls simple activation loops; manual 4-wide gives no benefit
- expf() serializes sigmoid/SiLU; unrolling can't help
- Eigen element-wise ops (.leftCols + .leftCols) produce identical codegen
to raw float* loops when assertions are disabled
Simplify 5 activation classes to use inline helpers (relu, sigmoid, etc.)
and revert 3 wavenet element-wise operations back to Eigen expressions.
Inline GEMM (Conv1x1/Conv1D), depthwise unrolling, FiLM unrolling,
bias broadcast, and memcpy optimizations are retained — those show
measurable wins on both desktop and Cortex-M7.
Also restored comments that were accidentally removed from wavenet.h.
Force-pushed from 704f309 to 7844a41
sdatkinson
left a comment
There's a potential huge risk with non-contiguous matrices (like when we're doing gated/blended activations).
Can you tell me if there are any "snapshot" tests that would verify that the calculations with and without this flag are the same? I'm just really concerned about something being different that I missed.
[No changes required if you can verify that the things I was concerned about are correct. Sorry; it's just too much for me to fit in my head all at once and some "simple" proof would be a big help.]
```cpp
// Validate input dimensions (assert for real-time performance)
const int total_channels = 2 * num_channels;
assert(input.rows() == total_channels);
assert(input.rows() == 2 * num_channels);
```
Nit: I'd changed my mind on these in favor of a throw that can be compiled out with #define NDEBUG
```cpp
// Use the GatingActivation class
// Extract the blocks first to avoid temporary reference issues
auto input_block = this->_z.leftCols(num_frames);
auto output_block = this->_z.topRows(bottleneck).leftCols(num_frames);
```
cf. the non-contiguous concern.
Either this does an allocation, which would mean it isn't real-time safe...
...but I haven't seen those tests fail, so these must address the memory as it's stored, in which case the activation concern I raised must actually be an issue?
I really need to get to verifying that the results match the PyTorch...
You were spot on about the risk with non-contiguous operations. I added tests and made sure the inline ops all work on non-contiguous matrices now.
Force-pushed from c327ae0 to 53568a8
…delerCore into feature/inline-gemm
Force-pushed from 53568a8 to 7b29c19
…delerCore into feature/inline-gemm
> Adding the benchmark report vs

Neat!
```cpp
for (int c = 0; c < 2; c++)
  for (int f = 0; f < num_frames; f++)
    assert(std::abs(actual(c, f) - expected(c, f)) < 1e-5f);
```
Nit: For style (I thought this was enforced?), I prefer for all for-loops/ifs/etc to be enclosed w/ braces even if it's a one-liner. [It's been a source of bugs in the past...along with me as the other source 😉 ]
Developed with support and sponsorship from TONE3000