Add inline GEMM optimizations and general performance improvements #226

Open
jfsantos wants to merge 3 commits into sdatkinson:main from jfsantos:feature/inline-gemm

Conversation

@jfsantos (Contributor):

Hand-optimized GEMM kernels for small matrices common in NAM models, gated by #ifdef NAM_USE_INLINE_GEMM with an Eigen fallback (a sketch of the gating pattern follows the list below). Includes:

  • Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3, plus fully-unrolled paths for 2x2 through 8x8 channel configurations
  • Conv1x1 inline specializations for all common size combinations
  • FiLM inline path with 4-element loop unrolling
  • GatingActivation/BlendingActivation inline paths
  • Branchless hardswish, 4-element loop unrolling for all activations
  • SiLU added to LUT enable/disable
  • Ring buffer refactored to Eigen block operations
  • memcpy replacements for pure copy operations in wavenet
  • Optimized single-channel output path in WaveNet::process
  • Buffer size benchmark tool (benchmodel_bufsize)
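
A minimal sketch of the gating pattern described above, assuming Eigen types; the function name apply_weight and the single 4x4 case shown here are illustrative, not the PR's actual kernels:

#include <Eigen/Dense>

// Illustrative gating: a fully-unrolled kernel for one small fixed shape,
// with Eigen handling every other shape.
static void apply_weight(const Eigen::MatrixXf& w, const Eigen::VectorXf& in, Eigen::VectorXf& out)
{
#ifdef NAM_USE_INLINE_GEMM
  if (w.rows() == 4 && w.cols() == 4)
  {
    // Hand-unrolled 4x4 matrix-vector product; no Eigen expression dispatch.
    for (int r = 0; r < 4; r++)
      out(r) = w(r, 0) * in(0) + w(r, 1) * in(1) + w(r, 2) * in(2) + w(r, 3) * in(3);
    return;
  }
#endif
  out.noalias() = w * in; // Eigen fallback for all other shapes
}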

Developed with support and sponsorship from TONE3000

João Felipe Santos and others added 3 commits February 6, 2026 09:45
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdatkinson (Owner) left a comment:


[Haven't finished reviewing conv1d, film, gating_activations, wavenet.cpp, and benchmodel]

  • One crit on some funny business with comments.
  • Another crit: Can you add tests to ensure that the code is correct?
  • Other nits.

@@ -37,6 +37,7 @@ std::unordered_map<std::string, nam::activations::Activation::Ptr> nam::activati

nam::activations::Activation::Ptr tanh_bak = nullptr;
@sdatkinson (Owner):

Nit (though not your fault originally): would be nice to point out the purpose of these "bak" pointers with a comment.
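
For illustration, a hedged sketch of what such a comment might document. Only Activation::Ptr and tanh_bak come from the diff; the registry and LUT names (s_activations, s_tanh_lut) and the function names are stand-ins:

#include <memory>
#include <string>
#include <unordered_map>

struct Activation { using Ptr = std::shared_ptr<Activation>; }; // stand-in type

std::unordered_map<std::string, Activation::Ptr> s_activations; // hypothetical registry
Activation::Ptr s_tanh_lut;                                     // hypothetical LUT implementation

// "bak" pointer: stashes the original activation before the LUT version is
// swapped in, so the disable path can restore it exactly.
Activation::Ptr tanh_bak = nullptr;

void enable_tanh_lut()
{
  tanh_bak = s_activations["Tanh"];   // remember the current implementation
  s_activations["Tanh"] = s_tanh_lut; // swap in the lookup-table version
}

void disable_tanh_lut()
{
  if (tanh_bak != nullptr)
    s_activations["Tanh"] = tanh_bak; // restore the original
}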

else
{
throw std::runtime_error("Tried to disable LUT for a function other than Tanh or Sigmoid");
throw std::runtime_error("Tried to disable LUT for a function other than Tanh, Sigmoid, or SiLU");
@sdatkinson (Owner):

This is where I understood why L40 was a new feature and not a bug fix :)

// hardswish(x) = x * relu6(x + 3) / 6
// = x * clamp(x + 3, 0, 6) / 6
const float t = x + 3.0f;
const float clamped = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
@sdatkinson (Owner):

Interesting; is this really better? I'd be surprised if a compiler wouldn't figure out that these are the same.
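
For reference, the two formulations side by side, completed into standalone functions (the final multiply is filled in from the formula in the hunk's comment). On mainstream compilers at -O2 both usually lower to the same min/max instructions, so a benchmark or a look at the generated assembly would settle it:

#include <algorithm>

float hardswish_ternary(float x)
{
  // Branchless via ternaries, as in the PR
  const float t = x + 3.0f;
  const float clamped = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
  return x * clamped * (1.0f / 6.0f);
}

float hardswish_clamp(float x)
{
  // The "obvious" version the compiler should treat identically
  return x * std::clamp(x + 3.0f, 0.0f, 6.0f) * (1.0f / 6.0f);
}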

for (; pos + 3 < size; pos += 4)
{
// Branchless ReLU using conditional
const float v0 = data[pos], v1 = data[pos + 1];
@sdatkinson (Owner):

Very interesting...I assume that this only works better on specific chips? No way some (most?) compilers don't know to do this?
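
A self-contained sketch of the pattern in the hunk (the stores and the scalar tail are filled in here and may differ from the PR). Ternary selects on floats typically compile to maxss/vmaxps on x86, so a plain loop often auto-vectorizes to the same code; any win is target- and compiler-dependent:

void relu_inplace(float* data, int size)
{
  int pos = 0;
  // Four elements per iteration, branchless selects
  for (; pos + 3 < size; pos += 4)
  {
    const float v0 = data[pos], v1 = data[pos + 1];
    const float v2 = data[pos + 2], v3 = data[pos + 3];
    data[pos] = v0 < 0.0f ? 0.0f : v0;
    data[pos + 1] = v1 < 0.0f ? 0.0f : v1;
    data[pos + 2] = v2 < 0.0f ? 0.0f : v2;
    data[pos + 3] = v3 < 0.0f ? 0.0f : v3;
  }
  // Scalar tail for the remaining 0-3 elements
  for (; pos < size; pos++)
    data[pos] = data[pos] < 0.0f ? 0.0f : data[pos];
}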

// Process 4 elements at a time: swish(x) = x * sigmoid(x) = x / (1 + exp(-x))
for (; pos + 3 < size; pos += 4)
{
const float x0 = data[pos], x1 = data[pos + 1];
@sdatkinson (Owner):

Nit: occurs to me looking at this versus the swish(data[pos]) on the left that some inlined "swish4" could look a tad cleaner? Not sure though.
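
One possible shape for the suggested helper (the name swish4 and its placement are hypothetical, following the comment's idea):

#include <cmath>

// Hypothetical "swish4": the same unrolled body, factored out so the call
// site reads as compactly as the scalar swish(data[pos]).
static inline void swish4(float* p)
{
  const float x0 = p[0], x1 = p[1], x2 = p[2], x3 = p[3];
  p[0] = x0 / (1.0f + std::exp(-x0));
  p[1] = x1 / (1.0f + std::exp(-x1));
  p[2] = x2 / (1.0f + std::exp(-x2));
  p[3] = x3 / (1.0f + std::exp(-x3));
}

// At the call site:
//   for (; pos + 3 < size; pos += 4)
//     swish4(data + pos);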

float* __restrict__ output_ptr = _output.data();
const float* __restrict__ bias_ptr = this->_bias.data();

// Specialized paths for common small channel counts
@sdatkinson (Owner):

Nit: is it worth doing everything from 1 to 8?
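
A hedged sketch of one way a 1-to-8 dispatch can stay cheap to maintain: a template kernel instantiated per channel count, so each case is fully unrolled without eight hand-written bodies. All names here are illustrative; whether every count actually occurs in shipped models is the crux of the question:

template <int C>
static void add_bias_unrolled(float* out, const float* bias, int num_frames)
{
  for (int f = 0; f < num_frames; f++)
    for (int c = 0; c < C; c++) // compile-time trip count: fully unrolled
      out[f * C + c] += bias[c];
}

static void add_bias(float* out, const float* bias, int num_frames, int channels)
{
  switch (channels)
  {
    case 1: add_bias_unrolled<1>(out, bias, num_frames); break;
    case 2: add_bias_unrolled<2>(out, bias, num_frames); break;
    case 4: add_bias_unrolled<4>(out, bias, num_frames); break;
    case 8: add_bias_unrolled<8>(out, bias, num_frames); break;
    default: // generic fallback for anything else
      for (int f = 0; f < num_frames; f++)
        for (int c = 0; c < channels; c++)
          out[f * channels + c] += bias[c];
  }
}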

// Write the input data at the write position using Eigen block operations
// This is more efficient than element-by-element copy as it allows
// the compiler to vectorize the operation.
_storage.middleCols(_write_pos, num_frames).noalias() = input.leftCols(num_frames);
@sdatkinson (Owner):

I wonder why I did that...is there a chance that this isn't real-time safe? I need to check the tests...
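
If the concern is heap allocation on the audio thread: Eigen block views over a preallocated matrix should not allocate, and this can be checked directly with Eigen's runtime malloc guard. A sketch, assuming a build with EIGEN_RUNTIME_NO_MALLOC defined:

#define EIGEN_RUNTIME_NO_MALLOC // enables Eigen::internal::set_is_malloc_allowed
#include <Eigen/Dense>

void write_block(Eigen::MatrixXf& storage, const Eigen::MatrixXf& input, long write_pos, long num_frames)
{
  // Eigen asserts if anything inside this region heap-allocates
  Eigen::internal::set_is_malloc_allowed(false);
  storage.middleCols(write_pos, num_frames).noalias() = input.leftCols(num_frames);
  Eigen::internal::set_is_malloc_allowed(true);
}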

// The layer objects
std::vector<_Layer> _layers;
// Output from last layer (for next layer array)

@sdatkinson (Owner):

Some funny business going on with comments here. Can you revert them in this file?

}

// Turn on fast tanh approximation
nam::activations::Activation::enable_fast_tanh();
@sdatkinson (Owner):

Nit: What do you think about making this an option?

I just got annoyed with it the other day independently of this PR :)
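
For illustration, one way the option could be plumbed into a benchmark's main. The flag name --fast-tanh and the helper are hypothetical; enable_fast_tanh is the call from the hunk above (assumes the NAM activations header is included):

#include <cstring>

void configure_activations(int argc, char* argv[])
{
  bool fast_tanh = false;
  for (int i = 1; i < argc; i++)
    if (std::strcmp(argv[i], "--fast-tanh") == 0)
      fast_tanh = true;

  if (fast_tanh)
    nam::activations::Activation::enable_fast_tanh(); // opt in instead of always-on
}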

@@ -0,0 +1,96 @@
#include <iostream>
@sdatkinson (Owner):

Can you tell me (docstring?) the difference between this and benchmodel.cpp?
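
A docstring sketch along the requested lines; the stated difference is inferred from the PR description's "Buffer size benchmark tool" bullet and would need the author to confirm:

// benchmodel_bufsize: unlike benchmodel.cpp, which times the model at a
// single fixed buffer size, this tool sweeps a range of buffer sizes and
// reports throughput for each, showing how per-buffer overhead amortizes.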
