Add inline GEMM optimizations and general performance improvements #226

Open
jfsantos wants to merge 3 commits into sdatkinson:main from jfsantos:feature/inline-gemm

Conversation

@jfsantos (Contributor):

Hand-optimized GEMM kernels for small matrices common in NAM models, gated by #ifdef NAM_USE_INLINE_GEMM with an Eigen fallback (a sketch of the gating pattern follows the list below). Includes:

  • Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3, plus fully-unrolled paths for 2x2 through 8x8 channel configurations
  • Conv1x1 inline specializations for all common size combinations
  • FiLM inline path with 4-element loop unrolling
  • GatingActivation/BlendingActivation inline paths
  • Branchless hardswish, 4-element loop unrolling for all activations
  • SiLU added to LUT enable/disable
  • Ring buffer refactored to Eigen block operations
  • memcpy replacements for pure copy operations in wavenet
  • Optimized single-channel output path in WaveNet::process
  • Buffer size benchmark tool (benchmodel_bufsize)
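
A minimal sketch of the gating pattern described above, assuming Eigen types; the function name apply_weight and the single 4x4 case shown here are illustrative, not the PR's actual kernels:

#include <Eigen/Dense>

// Illustrative gating: a fully-unrolled kernel for one small fixed shape,
// with Eigen handling every other shape.
static void apply_weight(const Eigen::MatrixXf& w, const Eigen::VectorXf& in, Eigen::VectorXf& out)
{
#ifdef NAM_USE_INLINE_GEMM
  if (w.rows() == 4 && w.cols() == 4)
  {
    // Hand-unrolled 4x4 matrix-vector product; no Eigen expression dispatch.
    for (int r = 0; r < 4; r++)
      out(r) = w(r, 0) * in(0) + w(r, 1) * in(1) + w(r, 2) * in(2) + w(r, 3) * in(3);
    return;
  }
#endif
  out.noalias() = w * in; // Eigen fallback for all other shapes
}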

Developed with support and sponsorship from TONE3000

João Felipe Santos and others added 3 commits February 6, 2026 09:45
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdatkinson (Owner) left a comment:


[Haven't finished reviewing conv1d, film, gating_activations, wavenet.cpp, and benchmodel]

  • One crit on some funny business with comments.
  • Another crit: Can you add tests to ensure that the code is correct?
  • Other nits.

@@ -37,6 +37,7 @@ std::unordered_map<std::string, nam::activations::Activation::Ptr> nam::activati

nam::activations::Activation::Ptr tanh_bak = nullptr;
@sdatkinson (Owner):

Nit (though not your fault originally): would be nice to point out the purpose of these "bak" pointers with a comment.
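
For illustration, a hedged sketch of what such a comment might document. Only Activation::Ptr and tanh_bak come from the diff; the registry and LUT names (s_activations, s_tanh_lut) and the function names are stand-ins:

#include <memory>
#include <string>
#include <unordered_map>

struct Activation { using Ptr = std::shared_ptr<Activation>; }; // stand-in type

std::unordered_map<std::string, Activation::Ptr> s_activations; // hypothetical registry
Activation::Ptr s_tanh_lut;                                     // hypothetical LUT implementation

// "bak" pointer: stashes the original activation before the LUT version is
// swapped in, so the disable path can restore it exactly.
Activation::Ptr tanh_bak = nullptr;

void enable_tanh_lut()
{
  tanh_bak = s_activations["Tanh"];   // remember the current implementation
  s_activations["Tanh"] = s_tanh_lut; // swap in the lookup-table version
}

void disable_tanh_lut()
{
  if (tanh_bak != nullptr)
    s_activations["Tanh"] = tanh_bak; // restore the original
}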

else
{
throw std::runtime_error("Tried to disable LUT for a function other than Tanh or Sigmoid");
throw std::runtime_error("Tried to disable LUT for a function other than Tanh, Sigmoid, or SiLU");
@sdatkinson (Owner):

This is where I understood why L40 was a new feature and not a bug fix :)

// hardswish(x) = x * relu6(x + 3) / 6
// = x * clamp(x + 3, 0, 6) / 6
const float t = x + 3.0f;
const float clamped = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
@sdatkinson (Owner):

Interesting; is this really better? I'd be surprised if a compiler wouldn't figure out that these are the same.
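
For reference, the two formulations side by side, completed into standalone functions (the final multiply is filled in from the formula in the hunk's comment). On mainstream compilers at -O2 both usually lower to the same min/max instructions, so a benchmark or a look at the generated assembly would settle it:

#include <algorithm>

float hardswish_ternary(float x)
{
  // Branchless via ternaries, as in the PR
  const float t = x + 3.0f;
  const float clamped = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
  return x * clamped * (1.0f / 6.0f);
}

float hardswish_clamp(float x)
{
  // The "obvious" version the compiler should treat identically
  return x * std::clamp(x + 3.0f, 0.0f, 6.0f) * (1.0f / 6.0f);
}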

for (; pos + 3 < size; pos += 4)
{
// Branchless ReLU using conditional
const float v0 = data[pos], v1 = data[pos + 1];
@sdatkinson (Owner):

Very interesting...I assume that this only works better on specific chips? No way some (most?) compilers don't know to do this?
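
A self-contained sketch of the pattern in the hunk (the stores and the scalar tail are filled in here and may differ from the PR). Ternary selects on floats typically compile to maxss/vmaxps on x86, so a plain loop often auto-vectorizes to the same code; any win is target- and compiler-dependent:

void relu_inplace(float* data, int size)
{
  int pos = 0;
  // Four elements per iteration, branchless selects
  for (; pos + 3 < size; pos += 4)
  {
    const float v0 = data[pos], v1 = data[pos + 1];
    const float v2 = data[pos + 2], v3 = data[pos + 3];
    data[pos] = v0 < 0.0f ? 0.0f : v0;
    data[pos + 1] = v1 < 0.0f ? 0.0f : v1;
    data[pos + 2] = v2 < 0.0f ? 0.0f : v2;
    data[pos + 3] = v3 < 0.0f ? 0.0f : v3;
  }
  // Scalar tail for the remaining 0-3 elements
  for (; pos < size; pos++)
    data[pos] = data[pos] < 0.0f ? 0.0f : data[pos];
}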

// Process 4 elements at a time: swish(x) = x * sigmoid(x) = x / (1 + exp(-x))
for (; pos + 3 < size; pos += 4)
{
const float x0 = data[pos], x1 = data[pos + 1];
@sdatkinson (Owner):

Nit: occurs to me looking at this versus the swish(data[pos]) on the left that some inlined "swish4" could look a tad cleaner? Not sure though.
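
One possible shape for the suggested helper (the name swish4 and its placement are hypothetical, following the comment's idea):

#include <cmath>

// Hypothetical "swish4": the same unrolled body, factored out so the call
// site reads as compactly as the scalar swish(data[pos]).
static inline void swish4(float* p)
{
  const float x0 = p[0], x1 = p[1], x2 = p[2], x3 = p[3];
  p[0] = x0 / (1.0f + std::exp(-x0));
  p[1] = x1 / (1.0f + std::exp(-x1));
  p[2] = x2 / (1.0f + std::exp(-x2));
  p[3] = x3 / (1.0f + std::exp(-x3));
}

// At the call site:
//   for (; pos + 3 < size; pos += 4)
//     swish4(data + pos);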

float* __restrict__ output_ptr = _output.data();
const float* __restrict__ bias_ptr = this->_bias.data();

// Specialized paths for common small channel counts
@sdatkinson (Owner):

Nit: is it worth doing everything from 1 to 8?
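
A hedged sketch of one way a 1-to-8 dispatch can stay cheap to maintain: a template kernel instantiated per channel count, so each case is fully unrolled without eight hand-written bodies. All names here are illustrative; whether every count actually occurs in shipped models is the crux of the question:

template <int C>
static void add_bias_unrolled(float* out, const float* bias, int num_frames)
{
  for (int f = 0; f < num_frames; f++)
    for (int c = 0; c < C; c++) // compile-time trip count: fully unrolled
      out[f * C + c] += bias[c];
}

static void add_bias(float* out, const float* bias, int num_frames, int channels)
{
  switch (channels)
  {
    case 1: add_bias_unrolled<1>(out, bias, num_frames); break;
    case 2: add_bias_unrolled<2>(out, bias, num_frames); break;
    case 4: add_bias_unrolled<4>(out, bias, num_frames); break;
    case 8: add_bias_unrolled<8>(out, bias, num_frames); break;
    default: // generic fallback for anything else
      for (int f = 0; f < num_frames; f++)
        for (int c = 0; c < channels; c++)
          out[f * channels + c] += bias[c];
  }
}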

// Write the input data at the write position using Eigen block operations
// This is more efficient than element-by-element copy as it allows
// the compiler to vectorize the operation.
_storage.middleCols(_write_pos, num_frames).noalias() = input.leftCols(num_frames);
@sdatkinson (Owner):

I wonder why I did that...is there a chance that this isn't real-time safe? I need to check the tests...
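
If the concern is heap allocation on the audio thread: Eigen block views over a preallocated matrix should not allocate, and this can be checked directly with Eigen's runtime malloc guard. A sketch, assuming a build with EIGEN_RUNTIME_NO_MALLOC defined:

#define EIGEN_RUNTIME_NO_MALLOC // enables Eigen::internal::set_is_malloc_allowed
#include <Eigen/Dense>

void write_block(Eigen::MatrixXf& storage, const Eigen::MatrixXf& input, long write_pos, long num_frames)
{
  // Eigen asserts if anything inside this region heap-allocates
  Eigen::internal::set_is_malloc_allowed(false);
  storage.middleCols(write_pos, num_frames).noalias() = input.leftCols(num_frames);
  Eigen::internal::set_is_malloc_allowed(true);
}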

// The layer objects
std::vector<_Layer> _layers;
// Output from last layer (for next layer array)

@sdatkinson (Owner):

Some funny business going on with comments here. Can you revert them in this file?

}

// Turn on fast tanh approximation
nam::activations::Activation::enable_fast_tanh();
@sdatkinson (Owner):

Nit: What do you think about making this an option?

I just got annoyed with it the other day independently of this PR :)
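
For illustration, one way the option could be plumbed into a benchmark's main. The flag name --fast-tanh and the helper are hypothetical; enable_fast_tanh is the call from the hunk above (assumes the NAM activations header is included):

#include <cstring>

void configure_activations(int argc, char* argv[])
{
  bool fast_tanh = false;
  for (int i = 1; i < argc; i++)
    if (std::strcmp(argv[i], "--fast-tanh") == 0)
      fast_tanh = true;

  if (fast_tanh)
    nam::activations::Activation::enable_fast_tanh(); // opt in instead of always-on
}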

@@ -0,0 +1,96 @@
#include <iostream>
@sdatkinson (Owner):

Can you tell me (docstring?) the difference between this and benchmodel.cpp?
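
A docstring sketch along the requested lines; the stated difference is inferred from the PR description's "Buffer size benchmark tool" bullet and would need the author to confirm:

// benchmodel_bufsize: unlike benchmodel.cpp, which times the model at a
// single fixed buffer size, this tool sweeps a range of buffer sizes and
// reports throughput for each, showing how per-buffer overhead amortizes.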
