[WIP] Quantized SDPA #3026
Conversation
The numbers seem quite good… a little too good to be true 😅 What's the difference between SDPA and Attention in the benchmark? Also, what's the query sequence length used for the benchmark?
Totally agree, must be missing something 🤔
Attention is a simple reference implementation built from unfused ops. The query sequence length here is 1 (q.shape = (1, 32, 1, 128)), so this benchmark measures the single-token decode case, where one new token attends to a long KV cache (L = 32768).
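For concreteness, here is a minimal sketch of the setup being described, using the public mlx Python API. Only the shapes come from the comment above; the tensor values and the exact benchmark script are assumptions:

```python
import math
import mlx.core as mx

# Shapes from the comment: single-token decode against a 32768-token KV cache.
B, H, Lq, L, D = 1, 32, 1, 32768, 128
q = mx.random.normal((B, H, Lq, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
scale = 1.0 / math.sqrt(D)

def reference_attention(q, k, v, scale):
    # Unfused baseline: two matmuls and a softmax built from regular ops.
    scores = (q * scale) @ mx.transpose(k, (0, 1, 3, 2))
    return mx.softmax(scores, axis=-1) @ v

mx.eval(reference_attention(q, k, v, scale))
# The fused kernel this PR extends to quantized inputs:
mx.eval(mx.fast.scaled_dot_product_attention(q, k, v, scale=scale))
```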
@awni
So if I'm understanding correctly, the fused implementation is slower in the quantized case than the unfused ops-based one?
Fused SDPA is faster:
Very nice!!
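To make the fused-vs-unfused comparison concrete, this is a hedged sketch of what an unfused quantized path can look like using existing mlx ops (single head and 2-D tensors to keep the shapes simple; the group size and bit width are assumptions, not necessarily what the PR uses):

```python
import math
import mlx.core as mx

GROUP_SIZE, BITS = 64, 4  # assumed quantization settings
L, D = 32768, 128

q = mx.random.normal((1, D))
k = mx.random.normal((L, D))
v = mx.random.normal((L, D))
scale = 1.0 / math.sqrt(D)

# Quantize the KV cache along the feature axis.
kw, ks, kb = mx.quantize(k, group_size=GROUP_SIZE, bits=BITS)
vw, vs, vb = mx.quantize(v, group_size=GROUP_SIZE, bits=BITS)

# q @ K^T computed directly against the packed keys ...
scores = mx.quantized_matmul(
    q * scale, kw, ks, kb, transpose=True, group_size=GROUP_SIZE, bits=BITS)
probs = mx.softmax(scores, axis=-1)
# ... then probs @ V against the packed values.
out = mx.quantized_matmul(
    probs, vw, vs, vb, transpose=False, group_size=GROUP_SIZE, bits=BITS)
mx.eval(out)
```

Note that `mx.quantize` defaults to affine quantization, which an unfused path like this handles fine; per the guards in the diff below, the fused kernel in this PR currently targets the fp quantization modes.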
```cpp
if (qmode == QuantizationMode::Nvfp4) {
  throw std::invalid_argument(
      "[quantized_scaled_dot_product_attention] Mode 'nvfp4' is not supported for fast attention.");
}
```
Why not nvfp4?
It’s on the way! I just wanted to make sure the PR structure was okay first.
```cpp
if (qmode == QuantizationMode::Affine) {
  throw std::invalid_argument(
      "[quantized_scaled_dot_product_attention] Only fp quantization modes are supported.");
}
```
Why not affine?
Proposed changes
Add Metal quantized SDPA vector kernels based on #1515
With M4, L = 32768, H = 32, D = 128, Lq = 1:
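A minimal sketch of how a configuration like this can be timed (the helper, warmup count, and iteration count are assumptions, not the script behind the numbers above):

```python
import time
import mlx.core as mx

def bench(fn, *args, warmup=5, iters=100):
    # Force evaluation each iteration so kernel time is actually measured.
    for _ in range(warmup):
        mx.eval(fn(*args))
    tic = time.perf_counter()
    for _ in range(iters):
        mx.eval(fn(*args))
    return (time.perf_counter() - tic) / iters
```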
TODO:
- `Affine` and `NVFP4`

What improves performance:
- `k/v`
- `clang` loop optimizer

Checklist
- I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes