Add inline GEMM optimizations and general performance improvements #226
sdatkinson merged 16 commits into sdatkinson:main from
Conversation
Hand-optimized GEMM kernels for small matrices common in NAM models, gated by `#ifdef NAM_USE_INLINE_GEMM` with Eigen fallback. Includes:

- Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3 paths, plus fully-unrolled paths for 2x2 through 8x8 channel configurations
- Conv1x1 inline specializations for all common size combinations
- FiLM inline path with 4-element loop unrolling
- GatingActivation/BlendingActivation inline paths
- Branchless hardswish; 4-element loop unrolling for all activations
- SiLU added to LUT enable/disable
- Ring buffer refactored to Eigen block operations
- memcpy replacements for pure copy operations in wavenet
- Optimized single-channel output path in WaveNet::process
- Buffer size benchmark tool (benchmodel_bufsize)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sdatkinson
left a comment
[Haven't finished reviewing conv1d, film, gating_activations, wavenet.cpp, and benchmodel]
- One crit on some funny business with the comments.
- Another crit: Can you add tests to ensure that the code is correct?
- Other nits.
…ise ops
ARM assembly analysis (-O2 -DNDEBUG) confirmed:
- GCC auto-unrolls simple activation loops; manual 4-wide gives no benefit
- expf() serializes sigmoid/SiLU; unrolling can't help
- Eigen element-wise ops (.leftCols + .leftCols) produce identical codegen
to raw float* loops when assertions are disabled
Simplify 5 activation classes to use inline helpers (relu, sigmoid, etc.)
and revert 3 wavenet element-wise operations back to Eigen expressions.
Inline GEMM (Conv1x1/Conv1D), depthwise unrolling, FiLM unrolling,
bias broadcast, and memcpy optimizations are retained — those show
measurable wins on both desktop and Cortex-M7.
Also restored comments that were accidentally removed from wavenet.h.
Force-pushed from 704f309 to 7844a41
sdatkinson
left a comment
There's a potential huge risk with non-contiguous matrices (like when we're doing gated/blended activations).
Can you tell me if there are any "snapshot" tests that would verify that the calculations with and without this flag are the same? I'm just really concerned about something being different that I missed.
[No changes required if you can verify that the things I was concerned about are correct. Sorry; it's just too much for me to fit in my head all at once and some "simple" proof would be a big help.]
```cpp
// Validate input dimensions (assert for real-time performance)
const int total_channels = 2 * num_channels;
assert(input.rows() == total_channels);
assert(input.rows() == 2 * num_channels);
```
Nit: I'd changed my mind on these in favor of a throw that can be compiled out with #define NDEBUG
```cpp
// Use the GatingActivation class
// Extract the blocks first to avoid temporary reference issues
auto input_block = this->_z.leftCols(num_frames);
auto output_block = this->_z.topRows(bottleneck).leftCols(num_frames);
```
cf. the non-contiguous concern.
Either this does an allocation, which would mean it isn't real-time safe...
...but I haven't seen those tests fail, so these must address the memory as it's stored, in which case the activation concern I raised must actually be an issue?
I really need to get to verifying that the results match the PyTorch...
You were spot on about the risk with non-contiguous operations. I added tests and made sure the inline ops all work on non-contiguous matrices now.
Force-pushed from c327ae0 to 53568a8
…delerCore into feature/inline-gemm
Force-pushed from 53568a8 to 7b29c19
…delerCore into feature/inline-gemm
> Adding the benchmark report vs

Neat!
```cpp
for (int c = 0; c < 2; c++)
  for (int f = 0; f < num_frames; f++)
    assert(std::abs(actual(c, f) - expected(c, f)) < 1e-5f);
```
Nit: For style (I thought this was enforced?), I prefer for all for-loops/ifs/etc to be enclosed w/ braces even if it's a one-liner. [It's been a source of bugs in the past...along with me as the other source 😉 ]
Developed with support and sponsorship from TONE3000