SparAlloc: A Simple and Modular Framework for Decoupled Sparsity Allocation in Layerwise Pruning for LLM

📌 Note (2025.06.05):
SparAlloc is an actively maintained and evolving project. The current release is a demo version, and some functionalities are still under testing.
Upcoming updates will include detailed usage commands, more supported allocation strategies, and extended compatibility. Stay tuned!

SparAlloc Diagram

SparAlloc is a modular and extensible framework for sparsity allocation in layerwise pruning methods for large language models (LLMs).
It decouples sparsity allocation from pruning algorithms and enables systematic exploration of how global sparsity distribution across layers impacts pruning performance.

SparAlloc serves both as a toolbox and a benchmark for sparsity allocation in LLM pruning pipelines.


❓ 1. Motivation

In the field of LLM pruning, layerwise pruning methods have demonstrated strong performance in one-shot (training-free) settings.
However, most research attention has focused on the pruning algorithms themselves (saliency criteria, weight-update strategies, and retraining schemes), while the role of global sparsity allocation across layers remains underexplored.

Many popular methods simply assign uniform sparsity to all transformer blocks, which ignores inter-layer importance differences and may lead to suboptimal results.

⚠️ Key Problems in Existing Work:

  1. Uniform Allocation in Popular Algorithms
    Methods such as Wanda and SparseGPT apply uniform sparsity across layers, missing opportunities for improved performance through intelligent allocation.

  2. Confounding Factors in Coupled Designs
    Approaches like SlimGPT show that better allocation significantly boosts results.
    However, they treat sparsity allocation as part of the default configuration, making it unclear whether the gains are due to:

    • the saliency criterion
    • the update/retraining scheme, or
    • the allocation policy
  3. Limited Open Resources for Allocation
    Some allocation-centric algorithms exist but are often not open-sourced or rely on computationally expensive methods (such as costly search procedures), limiting reproducibility and practicality.


🧰 2. What is SparAlloc?

As a Toolbox

  • Provides a modular set of sparsity allocation strategies (uniform, magnitude-based, saliency-aware, etc.)
  • Supports easy integration into pruning pipelines with just a few lines of code

As a Benchmark

  • Decouples sparsity allocation from pruning logic for clean comparisons
  • Enables fair and reproducible evaluation of sparsity allocation strategies
  • Encourages research into allocation strategies as a standalone problem

🧭 3. SparAlloc Pipeline Overview

The figure above illustrates the SparAlloc pipeline, which provides a unified and modular interface for sparsity allocation in LLM pruning workflows.

🔁 Overall Workflow

  • SparAlloc first determines the sparsity ratio for each local structure (e.g., Transformer block) using a selected sparsity allocation algorithm.
  • The generated sparsity configuration is then passed into a downstream pruning algorithm, such as Wanda or SparseGPT.
  • Finally, the performance of the pruned model can be lightly evaluated (e.g., via perplexity on a WikiText-2 subset).
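For the last step, a light perplexity check can be run with standard HuggingFace components. The sketch below is illustrative only (it is not the evaluation code shipped with SparAlloc) and assumes a causal LM, its tokenizer, and the WikiText-2 test split from the `datasets` library; the chunking scheme and sample count are arbitrary choices.

```python
# Minimal perplexity check on a WikiText-2 subset (illustrative sketch, not SparAlloc's
# bundled evaluation code).
import torch
from datasets import load_dataset

@torch.no_grad()
def quick_perplexity(model, tokenizer, device, n_samples=16, seq_len=2048):
    """Average perplexity over a few fixed-length chunks of the WikiText-2 test split."""
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(raw["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    nlls = []
    for i in range(n_samples):
        chunk = ids[:, i * seq_len:(i + 1) * seq_len]
        if chunk.shape[1] < seq_len:
            break
        # Causal-LM forward with labels == inputs returns the mean negative log-likelihood.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float())
    return torch.exp(torch.stack(nlls).mean()).item()
```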

🧠 4. Sources of Sparsity Allocation in SparAlloc

SparAlloc integrates a rich collection of sparsity allocation algorithms, which are broadly categorized into three sources:

4.1. Custom-designed Methods (Efficient & Practical)

These are lightweight and effective strategies we propose, suitable for fast layerwise evaluation and allocation. Examples include:

  • Using the L1 norm of weights in MHA or FFN layers (see the sketch after this list)
  • Measuring cosine similarity between hidden outputs of adjacent Transformer blocks
  • Estimating perplexity difference after removing each block independently
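As an illustration of the first bullet, the sketch below computes a per-block L1-norm importance score and maps it to per-block sparsity ratios whose mean matches the global target. The module path `model.model.layers` (HF LLaMA layout), the function names, and the linear score-to-sparsity mapping are assumptions for illustration, not SparAlloc's internal API.

```python
import torch

def l1_block_importance(model):
    """L1 norm (sum of |W|) over the 2-D weight matrices inside each Transformer block."""
    scores = []
    for block in model.model.layers:  # assumption: HF LLaMA-style module path
        total = sum(p.abs().sum() for p in block.parameters() if p.dim() == 2)
        scores.append(float(total))
    return torch.tensor(scores)

def importance_to_sparsity(scores, target=0.7, spread=0.1):
    """Give less important blocks more sparsity while keeping the mean near the global target."""
    z = (scores - scores.mean()) / (scores.std() + 1e-8)  # standardized importance
    ratios = target - spread * z                          # higher importance -> lower sparsity
    ratios = ratios + (target - ratios.mean())            # re-center on the target
    return ratios.clamp(0.0, 0.99).tolist()               # clamping may shift the mean slightly
```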

4.2. Decoupled from Existing Pruning Algorithms

Some allocation methods are extracted from pruning algorithms such as SlimGPT, which originally mix sparsity allocation and pruning logic in a single framework. SparAlloc separates out the allocation logic, allowing independent benchmarking.

4.3. Dedicated Allocation Algorithms

SparAlloc includes integration with advanced standalone sparsity allocation strategies, such as:

  • OWL (Outlier Weighted Layerwise), an open-source method designed specifically for optimized block-level sparsity distribution
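For intuition, here is a rough, weight-only sketch of the outlier-ratio idea behind OWL: count, per block, the fraction of weights whose magnitude exceeds M times the block's mean magnitude, then assign lower sparsity to blocks with more outliers. The actual OWL implementation works on weight-activation products with calibration data; this simplification, the `model.model.layers` path, and the function name are assumptions, not the vendored OWL code.

```python
import torch

@torch.no_grad()
def outlier_ratio_per_block(model, m=5.0):
    """Fraction of weights per block whose magnitude exceeds m x the block mean
    (a weight-only simplification; OWL itself uses weight-activation products)."""
    ratios = []
    for block in model.model.layers:  # assumption: HF LLaMA-style module path
        mags = torch.cat([p.abs().flatten() for p in block.parameters() if p.dim() == 2])
        ratios.append((mags > m * mags.mean()).float().mean().item())
    return torch.tensor(ratios)
```

Blocks with higher outlier ratios would then receive lower sparsity, for example through a mapping like the `importance_to_sparsity` sketch in Section 4.1.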

🧠 5. Layerwise Pruning for LLMs

| Method | Structured? | Paper | Code | SparAlloc Supported |
| --- | --- | --- | --- | --- |
| Wanda | Unstructured | Paper | GitHub | |
| SparseGPT | Unstructured | Paper | GitHub | |
| SlimGPT | Structured | Paper | OpenReview | |
| SoBP | Structured | Paper | | |

SparAlloc provides a clean interface to apply custom sparsity allocation to these pruning pipelines, enabling clearer analysis and reproducibility.


📚 6. Sparsity Allocation Methods For LLM Pruning (Decoupled or Specific)

| Paper | Code | SparAlloc Supported |
| --- | --- | --- |
| SlimGPT | code | |
| OWL | code | |
| AlphaPruning | code | |
| ALS | | |
| DSA | | |

Here "SparAlloc Supported" means the SparAlloc collects the code from the specific sparsity allocation algorithm or decouples the sparsity allocation from the pruning algorithm.


📋 7. SparAlloc: Supported Sparsity Allocation Strategies

The following table summarizes all sparsity allocation strategies supported by SparAlloc, organized by category, allocation uniformity, algorithm granularity, and runtime cost:

🧠 Table 1: Strategy Overview

| Category | Uniformity | Strategy | Strategy-Level | Time | Explanation |
| --- | --- | --- | --- | --- | --- |
| Default | Uniform | Uniform | | ~0s | Uniform sparsity allocation across all Transformer blocks. |
| Custom-designed | Non-uniform | Mag | L1-norm | ~1s | Use the L1 norm of weights to compute layer importance. |
| | | | L2-norm | ~1s | Use the L2 norm of weights to compute layer importance. |
| | | Blockwise Perplexity | | ~1h | Remove each transformer block and measure perplexity change. |
| | | Cosine-Similarity | | ~1min | Cosine similarity between adjacent hidden representations. |
| Extracted from Pruning Algorithm | Non-uniform | Simple-Function | Linear-Increase | ~0s | Sparsity increases linearly from bottom to top. |
| | | | Linear-Decrease | ~0s | Sparsity decreases linearly from bottom to top. |
| | | | Log-Increase | ~0s | Sparsity increases logarithmically across layers. |
| | | | Log-Decrease | ~0s | Sparsity decreases logarithmically across layers. |
| | | FrontBackByPass | | ~0s | Skip pruning for the first n and last m blocks. |
| Open-source | Non-uniform | OWL | | ~5min | Based on outlier weight distribution across transformer layers. |
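The Simple-Function strategies above are closed-form schedules that need no model statistics. A minimal sketch of the Linear-Increase case follows; the `spread` parameter controlling how steep the schedule is, is an illustrative choice, not the value used by SparAlloc.

```python
def linear_increase_sparsity(num_blocks, target=0.7, spread=0.2):
    """Per-block sparsity rising linearly from (target - spread/2) at the bottom block
    to (target + spread/2) at the top block; the mean equals the global target."""
    if num_blocks == 1:
        return [target]
    step = spread / (num_blocks - 1)
    return [target - spread / 2 + i * step for i in range(num_blocks)]

# Example: 32 LLaMA-7B blocks at 70% global sparsity.
ratios = linear_increase_sparsity(32, target=0.7, spread=0.2)
assert abs(sum(ratios) / len(ratios) - 0.7) < 1e-6
```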

🌲 8. SparAlloc: Method Tree Overview

You can use the following API to list all supported sparsity allocation strategies:

from lib.SparAlloc import show_sparalloc_method_tree
show_sparalloc_method_tree()

This will print a hierarchical overview of available strategies, with brief descriptions and estimated runtime (tested on a single A6000 GPU):

SparAlloc Strategies
├── Default
│   └── Uniform - Uniform sparsity allocation across all Transformer blocks. (~0s)
├── Custom-designed
│   ├── Mag
│   │   ├── L1-norm - Use the L1 norm of weights to compute layer importance. (~1s)
│   │   └── L2-norm - Use the L2 norm of weights to compute layer importance. (~1s)
│   ├── Blockwise Perplexity - Remove each transformer block and measure perplexity change. (~1h)
│   └── Cosine-Similarity - Measure cosine similarity between adjacent hidden representations. (~1min)
├── Extracted from pruning algorithm
│   ├── Simple-Function
│   │   ├── Linear-Increase - Sparsity increases linearly from bottom to top. (~0s)
│   │   ├── Linear-Decrease - Sparsity decreases linearly from bottom to top. (~0s)
│   │   ├── Log-Increase - Sparsity increases logarithmically across layers. (~0s)
│   │   └── Log-Decrease - Sparsity decreases logarithmically across layers. (~0s)
│   └── FrontBackByPass - Skip pruning for the first n and last m transformer blocks. (~0s)
└── Open-source
    └── OWL - Based on outlier weight distribution across transformer layers. (~5min)

📊 9. SparAlloc: Perplexity Comparison on LLaMA-7B (70% Sparsity, Wanda)

The following table reports the perplexity of each allocation strategy using LLaMA-7B, under 70% global sparsity (MHA + FFN), pruned via Wanda.

📈 Table 2: Perplexity Results

| Category | Uniformity | Strategy | Strategy-Level | Perplexity |
| --- | --- | --- | --- | --- |
| Default | Uniform | Uniform | | 83.16 |
| Custom-designed | Non-uniform | Mag | L1-norm | 1160.62 |
| | | | L2-norm | 275.22 |
| | | Blockwise Perplexity | | 117.78 |
| | | Cosine-Similarity | | 56.11 |
| Extracted from Pruning Algorithm | Non-uniform | Simple-Function | Linear-Increase | 27.67 |
| | | | Linear-Decrease | 8916.27 |
| | | | Log-Increase | 23.25 |
| | | | Log-Decrease | 2455.70 |
| | | FrontBackByPass | n=1, m=1 | 307.50 |
| Open-source | Non-uniform | OWL | M=5 | 24.48 |

🧪 10. Demo: Easily Integrate SparAlloc into Your Pruning Pipeline

The following example demonstrates how to integrate SparAlloc into a typical LLM pruning workflow using Wanda as the pruning algorithm and magnitude-based sparsity allocation.

You can:

  • 🔄 Replace Step 2 to benchmark different sparsity allocation strategies (e.g., cosine similarity, OWL, blockwise perplexity)
  • 🔄 Replace Step 3 to evaluate how different pruning algorithms respond to the same sparsity allocation

from lib.prune_all import prune_wanda
from lib.SparAlloc import mag_sparsity

# Step 1: Load model and set up arguments
# (use HuggingFace or your own checkpoint loader; `args`, `device`,
#  `prune_n`, and `prune_m` are expected to come from your argument parser)
# model, tokenizer = load_model(args)

# Step 2: Allocate per-block sparsity using the Mag-based strategy
sparsity_ratio = mag_sparsity(
    args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m
)

# Step 3: Prune the model using Wanda with the allocated sparsity ratios
prune_wanda(
    args, model, tokenizer, device,
    prune_n=prune_n, prune_m=prune_m,
    all_layer_ratio=sparsity_ratio
)

# Step 4: Evaluate model performance
# evaluate(model, tokenizer, ...)

🧪 11. Case Study: Text Generation after Pruning with Different Sparsity Allocation Strategies

We evaluate the qualitative impact of different sparsity allocation strategies on text generation using LLaMA-7B, pruned with Wanda under 70% global sparsity.

Prompt used for generation:

"Once upon a time in a distant galaxy,"

All outputs below are generated with the same pruning algorithm (Wanda) and differ only in the sparsity allocation method.
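Samples like these can be reproduced with standard HuggingFace generation on each pruned checkpoint. The sketch below assumes greedy decoding and an arbitrary token budget; the decoding settings used for the original outputs are not documented, so exact outputs may differ.

```python
import torch

@torch.no_grad()
def generate_sample(model, tokenizer, device, max_new_tokens=48):
    """Generate a continuation of the case-study prompt (greedy decoding assumed)."""
    prompt = "Once upon a time in a distant galaxy,"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```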


🧠 Strategy: Func (Log-Decrease)

📉 Perplexity: 2455.7
📝 Output: Once upon a time in a distant galaxy, OK Ö²NNOO DO Nor2 Generation N St Sex NorTN Ok0 DO OK EntertainmentDO Generation OK Sound Hash SuccessNN Se20 Nor0 ( New Sim Sports Success


🧠 Strategy: Mag (L2-norm)

📉 Perplexity: 275.22
📝 Output: Once upon a time in a distant galaxy, revers reverseiak reverse… tillobaum bit miak [  pro…quant fre deal hol ~[ hol… — — – revers ever… fancy repise rep pro ( bit … revers ( br


🧠 Strategy: Uniform

📉 Perplexity: 83.16
📝 Output: Once upon a time in a distant galaxy, OK Ö²NNOO DO Nor2 Generation N St Sex NorTN Ok0 DO OK EntertainmentDO Generation OK Sound Hash SuccessNN Se20 Nor0 ( New Sim Sports Success


🧠 Strategy: OWL

📉 Perplexity: 24.48
📝 Output: Once upon a time in a distant galaxy,ixedaisioP*************aiixedpertom Counio"*Comesent embre A few years ago, we can be happy to be one that is not


🧠 Strategy: Func (Log-Increase)

📉 Perplexity: 23.25
📝 Output: Once upon a time in a distant galaxy, our universe was not as it is now. The Universe has been completely changed by the people who were once known as the “Cosmos”. They are now called a Cosmos that can be


🙏 Acknowledgement

This repository is built upon the excellent OWL project.
We thank the authors for open-sourcing their work, which served as a valuable foundation.
