Add inline GEMM optimizations and general performance improvements#226
jfsantos wants to merge 3 commits into sdatkinson:main (#226)
Conversation
Hand-optimized GEMM kernels for small matrices common in NAM models, gated by `#ifdef NAM_USE_INLINE_GEMM` with an Eigen fallback. Includes:

- Specialized Conv1D kernels: fused 4x4 and 2x2 kernel_size=3, plus fully-unrolled paths for 2x2 through 8x8 channel configurations
- Conv1x1 inline specializations for all common size combinations
- FiLM inline path with 4-element loop unrolling
- GatingActivation/BlendingActivation inline paths
- Branchless hardswish, 4-element loop unrolling for all activations
- SiLU added to LUT enable/disable
- Ring buffer refactored to Eigen block operations
- memcpy replacements for pure copy operations in wavenet
- Optimized single-channel output path in WaveNet::process
- Buffer size benchmark tool (benchmodel_bufsize)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
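For orientation, here is a rough sketch of the gating pattern described above. Only the `NAM_USE_INLINE_GEMM` macro and the Eigen fallback are taken from the PR; the function name and the 2x2 kernel are illustrative.

```cpp
#include <Eigen/Dense>

// Illustrative: hand-unrolled small-matrix path behind NAM_USE_INLINE_GEMM,
// with the generic Eigen product as the fallback.
void matvec(Eigen::MatrixXf& out, const Eigen::MatrixXf& w, const Eigen::MatrixXf& in)
{
#ifdef NAM_USE_INLINE_GEMM
  if (w.rows() == 2 && w.cols() == 2)
  {
    // Hand-unrolled 2x2 path (one of the small shapes the PR specializes)
    for (Eigen::Index c = 0; c < in.cols(); c++)
    {
      const float x0 = in(0, c), x1 = in(1, c);
      out(0, c) = w(0, 0) * x0 + w(0, 1) * x1;
      out(1, c) = w(1, 0) * x0 + w(1, 1) * x1;
    }
    return;
  }
#endif
  // Eigen fallback (also the only path when NAM_USE_INLINE_GEMM is not defined)
  out.noalias() = w * in;
}
```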
sdatkinson
left a comment
[Haven't finished reviewing conv1d, film, gating_activations, wavenet.cpp, and benchmodel]
- One crit on comments funny business.
- Another crit: Can you add tests to ensure that the code is correct?
- Other nits.
```
@@ -37,6 +37,7 @@ std::unordered_map<std::string, nam::activations::Activation::Ptr> nam::activati
  nam::activations::Activation::Ptr tanh_bak = nullptr;
```
Nit (though not your fault originally): would be nice to point out the purpose of these "bak" pointers with a comment.
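A sketch of the kind of comment being asked for, assuming the "bak" pointers hold the original activation objects so the corresponding disable_*() calls can restore them:

```cpp
// Backup of the original (non-LUT / non-fast) activation. enable_fast_tanh()
// and the LUT paths overwrite the entry in the activations map; this pointer
// lets the matching disable call restore the original implementation.
nam::activations::Activation::Ptr tanh_bak = nullptr;
```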
```
  else
  {
-   throw std::runtime_error("Tried to disable LUT for a function other than Tanh or Sigmoid");
+   throw std::runtime_error("Tried to disable LUT for a function other than Tanh, Sigmoid, or SiLU");
```
This is where I understood why L40 was a new feature and not a bug fix :)
```
  // hardswish(x) = x * relu6(x + 3) / 6
  //             = x * clamp(x + 3, 0, 6) / 6
  const float t = x + 3.0f;
  const float clamped = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
```
Interesting, is this really better? I'd be surprised if a compiler wouldn't figure out that these are the same.
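For comparison, a small sketch of the two formulations being discussed; with optimizations enabled they typically lower to the same min/max instructions (`std::clamp` needs C++17). The function names are illustrative.

```cpp
#include <algorithm>

inline float hardswish_ternary(float x)
{
  const float t = x + 3.0f;
  const float clamped = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
  return x * clamped / 6.0f;
}

inline float hardswish_clamp(float x)
{
  // Equivalent formulation; whether it codegens identically is compiler-dependent.
  return x * std::clamp(x + 3.0f, 0.0f, 6.0f) / 6.0f;
}
```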
```
  for (; pos + 3 < size; pos += 4)
  {
    // Branchless ReLU using conditional
    const float v0 = data[pos], v1 = data[pos + 1];
```
Very interesting...I assume that this only works better on specific chips? No way some (most?) compilers don't know to do this?
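For reference, a self-contained sketch of the 4-wide branchless ReLU pattern the diff uses (variable names illustrative). Whether it beats a plain loop depends on the compiler and target; many compilers will auto-vectorize either form.

```cpp
void relu_inplace(float* data, long size)
{
  long pos = 0;
  // Main loop: four elements per iteration, ternaries instead of branches
  for (; pos + 3 < size; pos += 4)
  {
    const float v0 = data[pos], v1 = data[pos + 1];
    const float v2 = data[pos + 2], v3 = data[pos + 3];
    data[pos] = v0 < 0.0f ? 0.0f : v0;
    data[pos + 1] = v1 < 0.0f ? 0.0f : v1;
    data[pos + 2] = v2 < 0.0f ? 0.0f : v2;
    data[pos + 3] = v3 < 0.0f ? 0.0f : v3;
  }
  // Scalar tail for the remaining 0-3 elements
  for (; pos < size; pos++)
    data[pos] = data[pos] < 0.0f ? 0.0f : data[pos];
}
```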
```
  // Process 4 elements at a time: swish(x) = x * sigmoid(x) = x / (1 + exp(-x))
  for (; pos + 3 < size; pos += 4)
  {
    const float x0 = data[pos], x1 = data[pos + 1];
```
Nit: occurs to me looking at this versus the swish(data[pos]) on the left that some inlined "swish4" could look a tad cleaner? Not sure though.
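A possible shape for the suggested helper; the name `swish4` is hypothetical, not something in the PR.

```cpp
#include <cmath>

// Applies swish(x) = x * sigmoid(x) = x / (1 + exp(-x)) to four contiguous floats.
inline void swish4(float* data)
{
  for (int i = 0; i < 4; i++)
  {
    const float x = data[i];
    data[i] = x / (1.0f + std::exp(-x));
  }
}

// Usage inside the unrolled loop:
//   for (; pos + 3 < size; pos += 4)
//     swish4(data + pos);
```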
```
  float* __restrict__ output_ptr = _output.data();
  const float* __restrict__ bias_ptr = this->_bias.data();

  // Specialized paths for common small channel counts
```
Nit: is it worth doing everything from 1 to 8?
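One way to cover 1 through 8 without eight hand-written bodies is a compile-time-sized kernel plus a small dispatch. A sketch only (names hypothetical, and not necessarily what the PR does):

```cpp
// N-channel matrix-vector product with a compile-time size so the inner
// loops fully unroll; instantiating N = 1..8 keeps each body tiny.
template <int N>
inline void matvec_nxn(float* __restrict__ out, const float* __restrict__ w,
                       const float* __restrict__ in)
{
  for (int i = 0; i < N; i++)
  {
    float acc = 0.0f;
    for (int j = 0; j < N; j++)
      acc += w[i * N + j] * in[j];
    out[i] = acc;
  }
}

// Dispatch on the runtime channel count; returns false when the caller
// should fall back to the generic (Eigen) path.
inline bool matvec_small(int channels, float* out, const float* w, const float* in)
{
  switch (channels)
  {
    case 2: matvec_nxn<2>(out, w, in); return true;
    case 3: matvec_nxn<3>(out, w, in); return true;
    case 4: matvec_nxn<4>(out, w, in); return true;
    case 8: matvec_nxn<8>(out, w, in); return true;
    default: return false;
  }
}
```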
```
  // Write the input data at the write position using Eigen block operations
  // This is more efficient than element-by-element copy as it allows
  // the compiler to vectorize the operation.
  _storage.middleCols(_write_pos, num_frames).noalias() = input.leftCols(num_frames);
```
I wonder why I did that...is there a chance that this isn't real-time safe? I need to check the tests...
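For what it's worth, a minimal sketch of the block-copy pattern in isolation (names illustrative): middleCols()/leftCols() return views into existing storage, so the assignment copies in place without heap allocation, which should keep it real-time safe as long as the block fits in the preallocated buffer.

```cpp
#include <Eigen/Dense>

void write_block(Eigen::MatrixXf& storage, const Eigen::MatrixXf& input,
                 Eigen::Index write_pos, Eigen::Index num_frames)
{
  // Both sides are Block expressions (views); no temporaries are allocated here.
  storage.middleCols(write_pos, num_frames).noalias() = input.leftCols(num_frames);
}
```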
```
  // The layer objects
  std::vector<_Layer> _layers;
  // Output from last layer (for next layer array)
```
Some funny business going on with comments here. Can you revert them in this file?
```
  }

  // Turn on fast tanh approximation
  nam::activations::Activation::enable_fast_tanh();
```
Nit: What do you think about making this an option?
I just got annoyed with it the other day independently of this PR :)
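A sketch of what making it an option could look like in the benchmark's main(); the flag name and exact include path are illustrative, only enable_fast_tanh() itself is from the code above.

```cpp
#include <cstring>

#include "NAM/activations.h" // assumed include path for the activations header

int main(int argc, char* argv[])
{
  bool fast_tanh = false;
  for (int i = 1; i < argc; i++)
    if (std::strcmp(argv[i], "--fast-tanh") == 0)
      fast_tanh = true;

  if (fast_tanh)
    nam::activations::Activation::enable_fast_tanh();
  // ... load the model and run the benchmark as before ...
  return 0;
}
```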
```
@@ -0,0 +1,96 @@
+#include <iostream>
```
Can you tell me (docstring?) the difference between this and benchmodel.cpp?
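In case it helps, a possible docstring along those lines, assuming benchmodel_bufsize sweeps buffer sizes while benchmodel.cpp benchmarks at a single fixed buffer size (the sweep range below is illustrative):

```cpp
// benchmodel_bufsize: like benchmodel.cpp, but re-runs the benchmark across a
// range of audio buffer sizes (e.g. 16..4096 frames) to show how throughput
// varies with the processing block size.
```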
Developed with support and sponsorship from TONE3000