Presenter: Wenjie
| paper | conference | date | institution | group |
|---|---|---|---|---|
| Efficient streaming language models with attention sinks | ICLR’24 | Sept 23 | MIT | Song Han |
| MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | NeurIPS’24 | July 24 | MSR | Lili Qiu |
| Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML’24 | June 24 | MIT | Song Han |
| Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | ACL’25 (best paper award) | Feb 25 | DeepSeek-AI | DeepSeek-AI |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | blog | Aug 25 | MSR | Jianfeng Gao |
| On-Policy Distillation | blog | Oct 25 | Thinking Machines Lab | Thinking Machines Lab |
Presenter: Haolei
| paper | conference | date | institution | group |
|---|---|---|---|---|
| KernelBench: Can LLMs Write Efficient GPU Kernels? | ICML’25 | 14 Feb 2025 | Stanford | Azalia Mirhoseini |
| TritonBench: Benchmarking large language model capabilities for generating triton operators | ACL Findings’25 | 20 Feb 2025 | THUNLP | Maosong Sun |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| CUDA-LLM: LLMs Can Write Efficient CUDA Kernels | arXiv | 10 Jun 2025 | SJTU | An Zou |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| Kevin: Multi-Turn RL for Generating CUDA Kernels | ES-FoMo-III Workshop @ICML’25 | 16 Jul 2025 | Stanford, Cognition AI | Cognition AI |
| AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs | arXiv | 8 Jul 2025 | THUNLP | Maosong Sun |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| Astra: A Multi-Agent System for GPU Kernel Performance Optimization | DL4C workshop @NeurIPS’25 | 9 Sep 2025 | Stanford | Stanford |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| ConCuR: Conciseness Makes State-of-the-Art Kernel Generation | arXiv | 8 Oct 2025 | Westlake | Huan Wang |
| model | parameter | institution |
|---|---|---|
| KernelLLM | 8B | Meta |
| cudaLLM | 8B | ByteDance |
How Many Agents Does it Take to Beat PyTorch?
Presenter: Mingluo
| paper | date |
|---|---|
| Microscaling Data Formats for Deep Learning | Oct. 23 |
| NVFP4 | Jan. 25 |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys'24 | June 2023 | MIT | Song Han |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML'24 | Nov. 2022 | MIT | Song Han |
| OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR'24 | Aug. 2023 | Shanghai AI Laboratory | Ping Luo |
| paper | conference | date | institution | group |
|---|---|---|---|---|
| AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR'24 | March 2024 | XMU | Rongrong Ji |
| QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML'24 | March 2024 | Cornell | Albert Tseng |
| QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS'24 | Apr. 2024 | ETH | Saleh Ashkboos |
| DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | NeurIPS'24 | June 2024 | ZJU | Ying Wei |
| SpinQuant: LLM quantization with learned rotations | ICLR'25 | June 2024 | Meta | Tijmen Blankevoort |
| FlatQuant: Flatness Matters for LLM Quantization | ICML'25 | Oct. 2024 | Tsinghua | Jun Yao |