Sparse Attention as Compact Kernel Regression

Official code for the Sparse Attention as Compact Kernel Regression paper.

Saul Santos, Nuno Gonçalves, Daniel C. McNamee, Marcos Treviso, and André F. T. Martins


**Abstract**: *Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya–Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of *sparse* attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and *compact* (bounded-support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation—including Epanechnikov, biweight, and triweight—correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers—**Memory Mosaics**—show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.*
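
For intuition, here is a minimal, self-contained sketch (not taken from this repository) of the correspondence described above: Nadaraya–Watson regression with a Gaussian kernel spreads weight over all keys (softmax-like attention), while the compact-support Epanechnikov kernel assigns exactly zero weight to keys outside its support, giving normalized-ReLU-style sparse attention. The helper `nw_attention` and the bandwidth choice below are illustrative assumptions, not the paper's implementation.

import torch

# Hypothetical helper (not from this repository): Nadaraya-Watson regression,
# y_hat(q) = sum_i K(||q - k_i|| / h) v_i / sum_j K(||q - k_j|| / h).
def nw_attention(query, keys, values, kernel, bandwidth):
    dists = torch.cdist(query, keys)                         # (n_queries, n_keys)
    weights = kernel(dists / bandwidth)
    weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-12)
    return weights @ values, weights

gaussian = lambda u: torch.exp(-0.5 * u ** 2)                # dense weights (softmax-like)
epanechnikov = lambda u: torch.clamp(1.0 - u ** 2, min=0.0)  # compact support -> exact zeros

torch.manual_seed(0)
keys, values, query = torch.randn(8, 4), torch.randn(8, 2), torch.randn(1, 4)
h = torch.cdist(query, keys).median()                        # illustrative bandwidth choice

_, w_gauss = nw_attention(query, keys, values, gaussian, h)
_, w_epan = nw_attention(query, keys, values, epanechnikov, h)
print("Gaussian kernel weights:    ", w_gauss)               # every key gets some weight
print("Epanechnikov kernel weights:", w_epan)                # keys outside the support get weight 0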

If you use this code in your work, please cite our paper.


Resources

  • [Paper](to add) (arXiv)

    All material is made available under the MIT license. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicating any changes that you've made.

Language Modeling

Installation and Reproducibility

Follow the Memory Mosaics instructions with our code and use the hyperparameters mentioned in the Appendix.

You will need to:

cd Library
pip install -r requirements.txt

In-Context Learning

Installation

Follow the Memory Mosaics instructions. You will need to:

cd ICLL
pip install -r requirements.txt

If you have problems with Triton in the adasplash package (which you also need to install), remove this line of code (line 3) from the `__init__.py` file of the adasplash library:

from .triton_entmax import triton_entmax 

Reproducibility

Simply run this command for each method with the tuned hyperparameters, replacing `model.method=softmax` with the desired method:

for n in 1000 2500 5000 10000 20000 40000; do python train.py train.test=True experiment=dfa/mm dataset.num_test_examples=1000 dataset.num_examples=$n model.n_layer=2 model.method=softmax model.n_embd=128 model.n_head=8 optimizer.lr=1e-4 optimizer.weight_decay=0.01; done

Length Generalization

Reuse the environment from the language modeling task and run, for example, the following command with the hyperparameters presented in the Appendix:

torchrun --standalone --nproc_per_node=1 \
train_memory_mosaics_synthetic.py --batch_size=256 --gradient_accumulation_steps=1 --n_layer 2 --n_head 8 --datapath=your_data_path --out_dir results_epoch/sort/1338 --task sort --dtype bfloat16 --seed=1338 --method top16_softmax

The data will be released soon; until then, contact us and we will provide it.

Acknowledgments

This code is based on [MemoryMosaics](https://github.com/facebookresearch/MemoryMosaics):

  • Zhang, J., Nolte, N., Sadhukhan, R., Chen, B., and Bottou, L. Memory Mosaics. In Yue, Y., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.), International Conference on Learning Representations, volume 2025, pp. 36412–36433, 2025.
