Conversation

CC-Yeh (Contributor) commented Jan 20, 2026

Proposed changes

Add Metal quantized SDPA vector kernels based on #1515

With M4, L = 32768, H = 32, D = 128, Lq = 1:

| Precision | SDPA (ms) | Quant SDPA (ms) | Ops-Based (ms) | Quant Ops-Based (ms) |
| --- | --- | --- | --- | --- |
| mxfp4 | 98.63080 | 15.32626 | 43.72120 | 24.71464 |
| mxfp8 | 97.37316 | 18.71779 | 42.89932 | 46.47875 |

TODO:

What improved performance:

  • Removed per-thread storage of k and v to reduce register pressure (the previous version was stalling on synchronization).
  • Fused the computation with dequantization (a rough sketch of the idea follows this list).
  • Tuned the read size (`uint16_t`/`uint32_t`) for the quantized k/v.
  • Manual unrolling performed better than the clang loop optimizer.
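
To make the fusion point concrete, here is a minimal Python/NumPy sketch of the idea only, not the Metal kernel: each group of a quantized K row is decoded and consumed by the dot product right away, so a full floating-point row is never materialized. The group size, scale layout, and already-unpacked integer codes below are simplified assumptions.

```python
import numpy as np

def fused_quant_scores(q, k_codes, k_scales, group_size=32):
    # q: (D,) fp32 query for one head.
    # k_codes: (L, D) integer codes (assumed already unpacked for clarity).
    # k_scales: (L, D // group_size) per-group scales.
    L, D = k_codes.shape
    scores = np.empty(L, dtype=np.float32)
    for i in range(L):
        acc = np.float32(0)
        for g in range(D // group_size):
            sl = slice(g * group_size, (g + 1) * group_size)
            # Dequantize one group and use it immediately; nothing larger
            # than a group of K ever exists in fp, which is what saves
            # registers/bandwidth compared to dequantizing K up front.
            k_deq = k_codes[i, sl].astype(np.float32) * k_scales[i, g]
            acc += np.dot(q[sl], k_deq)
        scores[i] = acc
    return scores
```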

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

awni (Member) commented Jan 21, 2026

The numbers seem quite good.. a little too good to be true 😅

What's the difference between SDPA and Attention in the benchmark? Also what's the query sequence length used for the benchmark?

CC-Yeh (Contributor, Author) commented Jan 21, 2026

> The numbers seem quite good.. a little too good to be true 😅

Totally agree, must be missing something 🤔

> What's the difference between SDPA and Attention in the benchmark? Also what's the query sequence length used for the benchmark?

Attention is a simple reference implementation built from matmul + softmax + matmul (Maybe too naive?).
SDPA uses mx.fast.scaled_dot_product_attention, which hits the sdpa_vector_2pass kernels when Lq ≤ 8 (this case).

The query sequence length here is 1 (q.shape = (1, 32, 1, 128)), so this benchmark is measuring the single-token decode case, where one new token attends to a long KV cache (L = 32768).
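
For reference, a minimal sketch of the two paths being compared, assuming the shapes from this thread; the timing loop and the quantized variants added in this PR are left out.

```python
import math
import mlx.core as mx

B, H, Lq, L, D = 1, 32, 1, 32768, 128
scale = 1.0 / math.sqrt(D)

q = mx.random.normal((B, H, Lq, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

def attention_reference(q, k, v, scale):
    # Ops-based baseline: matmul + softmax + matmul.
    scores = (q * scale) @ mx.swapaxes(k, 2, 3)
    return mx.softmax(scores, axis=-1) @ v

out_ref = attention_reference(q, k, v, scale)
# Fused path; with Lq = 1 this dispatches to the sdpa_vector kernels.
out_sdpa = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)
mx.eval(out_ref, out_sdpa)
```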

CC-Yeh (Contributor, Author) commented Jan 21, 2026

@awni
Fixed some bugs in the 8-bit dequantization and in the benchmark (unnecessary dequantization steps).
Now the numbers make more sense 😃

awni (Member) commented Jan 21, 2026

So if I’m understanding correctly the fused implementation is slower in the quantized case than the unfused ops-based one?

CC-Yeh (Contributor, Author) commented Jan 21, 2026

Fused SDPA is faster: MXFP4 15.33 ms vs 24.71 ms, and MXFP8 26.09 ms vs 46.48 ms to decode a single query.

awni (Member) commented Jan 21, 2026

Very nice!!

Comment on lines +875 to +878:

```cpp
if (qmode == QuantizationMode::Nvfp4) {
  throw std::invalid_argument(
      "[quantized_scaled_dot_product_attention] Mode 'nvfp4' is not supported for fast attention.");
}
```

awni (Member): Why not nvfp4?

CC-Yeh (Contributor, Author): It’s on the way! I just wanted to make sure the PR structure was okay first.

Comment on lines +871 to +874:

```cpp
if (qmode == QuantizationMode::Affine) {
  throw std::invalid_argument(
      "[quantized_scaled_dot_product_attention] Only fp quantization modes are supported.");
}
```

awni (Member): Why not affine?
