Sparse Attention as Compact Kernel Regression

Official code for the Sparse Attention as Compact Kernel Regression paper.

Saul Santos, Nuno Gonçalves, Daniel C. McNamee, Marcos Treviso, and André F. T. Martins


**Abstract**: *Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya–Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of *sparse* attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and *compact* (bounded-support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation—including Epanechnikov, biweight, and triweight—correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers—**Memory Mosaics**—show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.*
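
For intuition, here is a minimal, self-contained sketch (not taken from this repository) of the correspondence described above: Nadaraya–Watson regression with a Gaussian kernel spreads weight over all keys (softmax-like attention), while the compact-support Epanechnikov kernel assigns exactly zero weight to keys outside its support, giving normalized-ReLU-style sparse attention. The helper `nw_attention` and the bandwidth choice below are illustrative assumptions, not the paper's implementation.

import torch

# Hypothetical helper (not from this repository): Nadaraya-Watson regression,
# y_hat(q) = sum_i K(||q - k_i|| / h) v_i / sum_j K(||q - k_j|| / h).
def nw_attention(query, keys, values, kernel, bandwidth):
    dists = torch.cdist(query, keys)                         # (n_queries, n_keys)
    weights = kernel(dists / bandwidth)
    weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-12)
    return weights @ values, weights

gaussian = lambda u: torch.exp(-0.5 * u ** 2)                # dense weights (softmax-like)
epanechnikov = lambda u: torch.clamp(1.0 - u ** 2, min=0.0)  # compact support -> exact zeros

torch.manual_seed(0)
keys, values, query = torch.randn(8, 4), torch.randn(8, 2), torch.randn(1, 4)
h = torch.cdist(query, keys).median()                        # illustrative bandwidth choice

_, w_gauss = nw_attention(query, keys, values, gaussian, h)
_, w_epan = nw_attention(query, keys, values, epanechnikov, h)
print("Gaussian kernel weights:    ", w_gauss)               # every key gets some weight
print("Epanechnikov kernel weights:", w_epan)                # keys outside the support get weight 0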

If you use this code in your work, please cite our paper.


Resources

  • [Paper](to add) (arXiv)

    All material is made available under the MIT license. You can use, redistribute, and adapt the material for non-commercial purposes, as long as you give appropriate credit by citing our paper and indicating any changes that you've made.

Language Modeling

Installation and Reproducibility

Follow the Memory Mosaics instructions with our code and use the hyperparameters mentioned in the Appendix.

You will need to:

cd Library
pip install -r requirements.txt

In-Context Learning

Installation

Follow the Memory Mosaics instructions. You will need to:

cd ICLL
pip install -r requirements.txt

If you have problems with Triton in the adasplash package (which you also need to install), remove this line of code (line 3) from the `__init__.py` file of the adasplash library:

from .triton_entmax import triton_entmax 

Reproducibility

Simply run this command for each method with the tuned hyperparameters, replacing `model.method=softmax` with the desired method:

for n in 1000 2500 5000 10000 20000 40000; do python train.py train.test=True experiment=dfa/mm dataset.num_test_examples=1000 dataset.num_examples=$n model.n_layer=2 model.method=softmax model.n_embd=128 model.n_head=8 optimizer.lr=1e-4 optimizer.weight_decay=0.01; done

Length Generalization

Reuse the environment from the language modeling task and run, for example, the following command with the hyperparameters presented in the Appendix:

torchrun --standalone --nproc_per_node=1 \
train_memory_mosaics_synthetic.py --batch_size=256 --gradient_accumulation_steps=1 --n_layer 2 --n_head 8 --datapath=your_data_path --out_dir results_epoch/sort/1338 --task sort --dtype bfloat16 --seed=1338 --method top16_softmax

The data will be released soon; until then, contact us and we will provide it.

Acknowledgments

This code is based on [MemoryMosaics](https://github.com/facebookresearch/MemoryMosaics):

  • Zhang, J., Nolte, N., Sadhukhan, R., Chen, B., and Bottou, L. Memory Mosaics. In Yue, Y., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.), International Conference on Learning Representations, volume 2025, pp. 36412–36433, 2025.
