# Paper List

A paper list shared in the group meeting.

## Oct. 31, 2025

### Efficiency in Large Language Models

Presenter: Wenjie

#### KV Cache Compression

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS'23 | Jun 2023 | CMU | Beidi Chen |
| SnapKV: LLM Knows What You Are Looking for before Generation | NeurIPS'24 | Apr 2024 | UIUC | Deming Chen |
| R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | NeurIPS'25 | May 2025 | UWisconsin | Junjie Hu |
| PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | ACL'24 Findings | May 2024 | SJTU | Hai Zhao |
| CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR'25 | Mar 2025 | Ant | Jianguo Li |
| Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR'24 | Oct 2023 | MSR | Jianfeng Gao |
| Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | ICLR'25 | Oct 2024 | MSR | Wen Xiao |
| DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | ICLR'25 | Oct 2024 | MIT | Song Han |
| RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | ICLR'25 | Jul 2024 | Huawei | Gongyi Wang |
| Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? | arXiv | Jun 2025 | Princeton | Danqi Chen |
| Which Heads Matter for Reasoning? RL-Guided KV Cache Compression | arXiv | Oct 2025 | Westlake | Huan Wang |
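
Most eviction-style methods above share one primitive: score each cached token (e.g., by the attention mass it has received), always keep a few sink and recent tokens, and drop the rest down to a budget. A minimal NumPy sketch of that common pattern; the function name, scoring rule, and defaults are illustrative, not any single paper's exact method:

```python
import numpy as np

def evict_kv_cache(keys, values, attn_history, budget, sink=4, recent=8):
    """Heavy-hitter-style KV eviction sketch (illustrative, not H2O exactly).

    keys, values : (seq_len, head_dim) cached K/V for one head
    attn_history : (seq_len,) attention mass each cached token has
                   received from past queries (the "heavy hitter" score)
    budget       : number of KV entries to keep
    sink / recent: always keep the first `sink` and last `recent` tokens
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    keep = np.zeros(seq_len, dtype=bool)
    keep[:sink] = True          # attention-sink tokens
    keep[-recent:] = True       # local window of recent tokens
    # Fill the remaining budget with the highest-scoring middle tokens.
    remaining = max(budget - int(keep.sum()), 0)
    candidates = np.where(~keep)[0]
    top = candidates[np.argsort(attn_history[candidates])[::-1][:remaining]]
    keep[top] = True
    return keys[keep], values[keep]

# Toy usage: keep 16 of 64 cached tokens.
rng = np.random.default_rng(0)
k = rng.normal(size=(64, 128)); v = rng.normal(size=(64, 128))
scores = rng.random(64)
k_small, v_small = evict_kv_cache(k, v, scores, budget=16)
print(k_small.shape)  # (16, 128)
```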

#### Sparse Attention

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| Efficient Streaming Language Models with Attention Sinks | ICLR'24 | Sep 2023 | MIT | Song Han |
| MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | NeurIPS'24 | Jul 2024 | MSR | Lili Qiu |
| Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML'24 | Jun 2024 | MIT | Song Han |
| Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | ACL'25 (best paper award) | Feb 2025 | DeepSeek-AI | DeepSeek-AI |
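
A recurring structure in these papers is the sparsity pattern itself: StreamingLLM-style methods keep a few initial "attention sink" tokens plus a sliding local window. A small sketch of such a mask (the parameter values are assumptions, not any paper's settings):

```python
import numpy as np

def streaming_mask(seq_len, n_sink=4, window=256):
    """Boolean attention mask in the spirit of attention sinks plus a
    sliding local window. mask[i, j] == True -> query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i           # no attending to the future
    local = (i - j) < window  # recent tokens within the sliding window
    sink = j < n_sink         # the first few "sink" tokens are always visible
    return causal & (local | sink)

m = streaming_mask(1024)
print(m.sum() / (1024 * 1025 / 2))  # fraction of causal pairs kept
```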

#### Efficiency for Training

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| Your Efficient RL Framework Secretly Brings You Off-Policy RL Training | blog | Aug 2025 | Microsoft | Jianfeng Gao |
| On-Policy Distillation | blog | Oct 2025 | Thinking Machines Lab | Thinking Machines Lab |

### LLM Kernel Generation

Presenter: Haolei

#### Benchmark

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| KernelBench: Can LLMs Write Efficient GPU Kernels? | ICML'25 | 14 Feb 2025 | Stanford | Azalia Mirhoseini |
| TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | ACL'25 Findings | 20 Feb 2025 | THUNLP | Maosong Sun |
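
Both benchmarks grade a generated kernel on the same two axes: numerical correctness against a reference op, and wall-clock speedup over it. A rough PyTorch harness sketch of that evaluation loop; the function names, tolerances, and trial counts are assumptions, not either benchmark's actual API:

```python
import time
import torch

def eval_kernel(candidate, reference, make_inputs, n_trials=20, atol=1e-3, rtol=1e-3):
    """Score a generated kernel the way kernel benchmarks broadly do:
    (1) correctness vs. a reference op, (2) speedup over it.
    `candidate`/`reference` are callables; `make_inputs` builds fresh args."""
    args = make_inputs()
    # 1) Correctness gate: incorrect kernels get no speed score.
    if not torch.allclose(candidate(*args), reference(*args), atol=atol, rtol=rtol):
        return {"correct": False, "speedup": 0.0}

    def bench(fn):
        for _ in range(3):  # warmup
            fn(*args)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(n_trials):
            fn(*args)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / n_trials

    # 2) Speedup: reference time divided by candidate time.
    return {"correct": True, "speedup": bench(reference) / bench(candidate)}

# Toy usage: a stand-in "generated" kernel vs. the eager reference.
ref = lambda x: torch.relu(x) * 2
cand = lambda x: torch.clamp(x, min=0) * 2
print(eval_kernel(cand, ref, lambda: (torch.randn(1024, 1024),)))
```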

#### Prompt Engineering

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| CUDA-LLM: LLMs Can Write Efficient CUDA Kernels | arXiv | 10 Jun 2025 | SJTU | An Zou |

#### Reinforcement Learning

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| Kevin: Multi-Turn RL for Generating CUDA Kernels | ES-FoMo-III Workshop @ ICML'25 | 16 Jul 2025 | Stanford, Cognition AI | Cognition AI |
| AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs | arXiv | 8 Jul 2025 | THUNLP | Maosong Sun |

#### Agent

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| Astra: A Multi-Agent System for GPU Kernel Performance Optimization | DL4C Workshop @ NeurIPS'25 | 9 Sep 2025 | Stanford | Stanford |

#### Dataset

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| ConCuR: Conciseness Makes State-of-the-Art Kernel Generation | arXiv | 8 Oct 2025 | Westlake | Huan Wang |

#### Other Resources

##### Models

| model | parameters | institution |
| --- | --- | --- |
| KernelLLM | 8B | Meta |
| cudaLLM | 8B | ByteDance |

##### Related Blogs

- How Many Agents Does it Take to Beat PyTorch?

## Nov. 06, 2025

### Large Language Model Quantization

Presenter: Mingluo

#### Datatype

| paper | date |
| --- | --- |
| Microscaling Data Formats for Deep Learning | Oct 2023 |
| NVFP4 | Jan 2025 |
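
Microscaling (MX) formats store one shared power-of-two scale per small block of elements (32 in the spec), with each element kept in a narrow format such as FP4 or FP8; NVFP4 is a related 4-bit block-scaled format. A simplified NumPy sketch of the block-scaling idea, using a low-bit integer grid as a stand-in for the element format (this illustrates the concept, not the actual spec):

```python
import numpy as np

def mx_quantize(x, block=32, elem_bits=4):
    """Microscaling-style block quantization sketch: each block of `block`
    values shares one power-of-two scale (one 8-bit exponent per block in
    MX), and elements are stored on a narrow grid."""
    x = x.reshape(-1, block)
    qmax = 2 ** (elem_bits - 1) - 1                    # e.g. 7 for 4 bits
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared per-block scale, rounded up to a power of two.
    scale = 2.0 ** np.ceil(np.log2(amax / qmax + 1e-30))
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                                   # dequantized values

x = np.random.default_rng(0).normal(size=(1, 1024)).astype(np.float32)
err = np.abs(mx_quantize(x) - x.reshape(-1, 32)).mean()
print(f"mean abs error: {err:.4f}")
```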

#### Scaling

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys'24 | Jun 2023 | MIT | Song Han |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML'23 | Nov 2022 | MIT | Song Han |
| OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR'24 | Aug 2023 | Shanghai AI Laboratory | Ping Luo |
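
AWQ and SmoothQuant both rest on the same equivalence: per-input-channel scales can be folded into the adjacent weights, Y = (X diag(s)^-1)(diag(s) W), so quantization difficulty can be migrated between activations and weights without changing the layer's output. A NumPy sketch of SmoothQuant-style smoothing (alpha = 0.5 is the paper's default; the shapes and outlier setup are illustrative):

```python
import numpy as np

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    """SmoothQuant-style per-channel smoothing: pick s so that X' = X / s
    and W' = s * W are both easier to quantize. alpha balances how much
    difficulty is migrated from activations to weights."""
    return act_absmax**alpha / w_absmax ** (1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 512)); X[:, 0] *= 50   # one outlier activation channel
W = rng.normal(size=(512, 256))

s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s, W_s = X / s, W * s[:, None]
# The product is unchanged; only the quantization difficulty moved.
assert np.allclose(X @ W, X_s @ W_s)
print(np.abs(X).max(), "->", np.abs(X_s).max())  # outlier tamed
```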

#### Rotation

| paper | conference | date | institution | group |
| --- | --- | --- | --- | --- |
| AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR'24 | Mar 2024 | XMU | Rongrong Ji |
| QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML'24 | Mar 2024 | Cornell | Albert Tseng |
| QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS'24 | Apr 2024 | ETH | Saleh Ashkboos |
| DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | NeurIPS'24 | Jun 2024 | ZJU | Ying Wei |
| SpinQuant: LLM Quantization with Learned Rotations | ICLR'25 | Jun 2024 | Meta | Tijmen Blankevoort |
| FlatQuant: Flatness Matters for LLM Quantization | ICML'25 | Oct 2024 | Tsinghua | Jun Yao |
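
The rotation papers exploit a second equivalence: for an orthogonal R, Wx = (WR)(R^T x), and a Hadamard-style R spreads outlier channels across all coordinates, flattening the distribution before quantization. A minimal NumPy sketch of that trick (the sizes and outlier setup are illustrative):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of 2),
    normalized so that H @ H.T == I."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.normal(size=512); x[0] = 80.0   # activation with one outlier channel
W = rng.normal(size=(512, 512))

H = hadamard(512)
# Rotation equivalence: (W @ H) @ (H.T @ x) == W @ x, since H is orthogonal.
assert np.allclose((W @ H) @ (H.T @ x), W @ x)
# The rotated activation spreads the outlier's energy across channels,
# which is what makes low-bit quantization easier.
print(np.abs(x).max(), "->", np.abs(H.T @ x).max())
```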
