SparAlloc: A Simple and Modular Framework for Decoupled Sparsity Allocation in Layerwise Pruning for LLM

📌 Note (2025.06.05):
SparAlloc is an actively maintained and evolving project. The current release is a demo version, and some functionalities are still under testing.
Upcoming updates will include detailed usage commands, more supported allocation strategies, and extended compatibility. Stay tuned!

SparAlloc Diagram

SparAlloc is a modular and extensible framework for sparsity allocation in layerwise pruning methods for large language models (LLMs).
It decouples sparsity allocation from pruning algorithms and enables systematic exploration of how global sparsity distribution across layers impacts pruning performance.

SparAlloc serves both as a toolbox and a benchmark for sparsity allocation in LLM pruning pipelines.


❓ 1. Motivation

In the field of LLM pruning, layerwise pruning methods have demonstrated strong performance in one-shot (training-free) settings.
However, most research attention has focused on the pruning algorithms themselves (saliency criteria, weight-update strategies, and retraining schemes), while the role of global sparsity allocation across layers remains underexplored.

Many popular methods simply assign uniform sparsity to all transformer blocks, which ignores inter-layer importance differences and may lead to suboptimal results.

⚠️ Key Problems in Existing Work:

  1. Uniform Allocation in Popular Algorithms
    Methods such as Wanda and SparseGPT apply uniform sparsity across layers, missing opportunities for improved performance through intelligent allocation.

  2. Confounding Factors in Coupled Designs
    Approaches like SlimGPT show that better allocation significantly boosts results.
    However, they treat sparsity allocation as part of the default configuration, making it unclear whether the gains are due to:

    • the saliency criterion
    • the update/retraining scheme, or
    • the allocation policy
  3. Limited Open Resources for Allocation
    Some allocation-centric algorithms exist but are often not open-sourced or rely on computationally expensive methods (such as costly search procedures), limiting reproducibility and practicality.


🧰 2. What is SparAlloc?

As a Toolbox

  • Provides a modular set of sparsity allocation strategies (uniform, magnitude-based, saliency-aware, etc.)
  • Supports easy integration into pruning pipelines with just a few lines of code

As a Benchmark

  • Decouples sparsity allocation from pruning logic for clean comparisons
  • Enables fair and reproducible evaluation of sparsity allocation strategies
  • Encourages research into allocation strategies as a standalone problem

🧭 3. SparAlloc Pipeline Overview

The figure above illustrates the SparAlloc pipeline, which provides a unified and modular interface for sparsity allocation in LLM pruning workflows.

🔁 Overall Workflow

  • SparAlloc first determines the sparsity ratio for each local structure (e.g., Transformer block) using a selected sparsity allocation algorithm.
  • The generated sparsity configuration is then passed into a downstream pruning algorithm, such as Wanda or SparseGPT.
  • Finally, the performance of the pruned model can be lightly evaluated (e.g., via perplexity on a WikiText-2 subset).
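For the last step, a light perplexity check can be run with standard HuggingFace components. The sketch below is illustrative only (it is not the evaluation code shipped with SparAlloc) and assumes a causal LM, its tokenizer, and the WikiText-2 test split from the `datasets` library; the chunking scheme and sample count are arbitrary choices.

```python
# Minimal perplexity check on a WikiText-2 subset (illustrative sketch, not SparAlloc's
# bundled evaluation code).
import torch
from datasets import load_dataset

@torch.no_grad()
def quick_perplexity(model, tokenizer, device, n_samples=16, seq_len=2048):
    """Average perplexity over a few fixed-length chunks of the WikiText-2 test split."""
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(raw["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    nlls = []
    for i in range(n_samples):
        chunk = ids[:, i * seq_len:(i + 1) * seq_len]
        if chunk.shape[1] < seq_len:
            break
        # Causal-LM forward with labels == inputs returns the mean negative log-likelihood.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float())
    return torch.exp(torch.stack(nlls).mean()).item()
```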

🧠 4. Sources of Sparsity Allocation in SparAlloc

SparAlloc integrates a rich collection of sparsity allocation algorithms, which are broadly categorized into three sources:

4.1. Custom-designed Methods (Efficient & Practical)

These are lightweight and effective strategies we propose, suitable for fast layerwise evaluation and allocation. Examples include:

  • Using the L1 norm of weights in MHA or FFN layers (see the sketch after this list)
  • Measuring cosine similarity between hidden outputs of adjacent Transformer blocks
  • Estimating perplexity difference after removing each block independently
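As an illustration of the first bullet, the sketch below computes a per-block L1-norm importance score and maps it to per-block sparsity ratios whose mean matches the global target. The module path `model.model.layers` (HF LLaMA layout), the function names, and the linear score-to-sparsity mapping are assumptions for illustration, not SparAlloc's internal API.

```python
import torch

def l1_block_importance(model):
    """L1 norm (sum of |W|) over the 2-D weight matrices inside each Transformer block."""
    scores = []
    for block in model.model.layers:  # assumption: HF LLaMA-style module path
        total = sum(p.abs().sum() for p in block.parameters() if p.dim() == 2)
        scores.append(float(total))
    return torch.tensor(scores)

def importance_to_sparsity(scores, target=0.7, spread=0.1):
    """Give less important blocks more sparsity while keeping the mean near the global target."""
    z = (scores - scores.mean()) / (scores.std() + 1e-8)  # standardized importance
    ratios = target - spread * z                          # higher importance -> lower sparsity
    ratios = ratios + (target - ratios.mean())            # re-center on the target
    return ratios.clamp(0.0, 0.99).tolist()               # clamping may shift the mean slightly
```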

4.2. Decoupled from Existing Pruning Algorithms

Some allocation methods are extracted from pruning algorithms such as SlimGPT, which originally mix sparsity allocation and pruning logic in a single framework. SparAlloc separates out the allocation logic, allowing independent benchmarking.

4.3. Dedicated Allocation Algorithms

SparAlloc includes integration with advanced standalone sparsity allocation strategies, such as:

  • OWL (Outlier Weighted Layerwise), an open-source method designed specifically for optimized block-level sparsity distribution
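For intuition, here is a rough, weight-only sketch of the outlier-ratio idea behind OWL: count, per block, the fraction of weights whose magnitude exceeds M times the block's mean magnitude, then assign lower sparsity to blocks with more outliers. The actual OWL implementation works on weight-activation products with calibration data; this simplification, the `model.model.layers` path, and the function name are assumptions, not the vendored OWL code.

```python
import torch

@torch.no_grad()
def outlier_ratio_per_block(model, m=5.0):
    """Fraction of weights per block whose magnitude exceeds m x the block mean
    (a weight-only simplification; OWL itself uses weight-activation products)."""
    ratios = []
    for block in model.model.layers:  # assumption: HF LLaMA-style module path
        mags = torch.cat([p.abs().flatten() for p in block.parameters() if p.dim() == 2])
        ratios.append((mags > m * mags.mean()).float().mean().item())
    return torch.tensor(ratios)
```

Blocks with higher outlier ratios would then receive lower sparsity, for example through a mapping like the `importance_to_sparsity` sketch in Section 4.1.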

🧠 5. Layerwise Pruning for LLMs

| Method | Structured? | Paper | Code | SparAlloc Supported |
| --- | --- | --- | --- | --- |
| Wanda | Unstructured | Paper | GitHub | |
| SparseGPT | Unstructured | Paper | GitHub | |
| SlimGPT | Structured | Paper | OpenReview | |
| SoBP | Structured | Paper | | |

SparAlloc provides a clean interface to apply custom sparsity allocation to these pruning pipelines, enabling clearer analysis and reproducibility.


📚 6. Sparsity Allocation Methods For LLM Pruning (Decoupled or Specific)

| Paper | Code | SparAlloc Supported |
| --- | --- | --- |
| SlimGPT | code | |
| OWL | code | |
| AlphaPruning | code | |
| ALS | | |
| DSA | | |

Here "SparAlloc Supported" means the SparAlloc collects the code from the specific sparsity allocation algorithm or decouples the sparsity allocation from the pruning algorithm.


📋 7. SparAlloc: Supported Sparsity Allocation Strategies

The following table summarizes all sparsity allocation strategies supported by SparAlloc, organized by category, allocation uniformity, algorithm granularity, and runtime cost:

🧠 Table 1: Strategy Overview

| Category | Uniformity | Strategy | Strategy-Level | Time | Explanation |
| --- | --- | --- | --- | --- | --- |
| Default | Uniform | Uniform | | ~0s | Uniform sparsity allocation across all Transformer blocks. |
| Custom-designed | Non-uniform | Mag | L1-norm | ~1s | Use the L1 norm of weights to compute layer importance. |
| | | | L2-norm | ~1s | Use the L2 norm of weights to compute layer importance. |
| | | Blockwise Perplexity | | ~1h | Remove each transformer block and measure perplexity change. |
| | | Cosine-Similarity | | ~1min | Cosine similarity between adjacent hidden representations. |
| Extracted from Pruning Algorithm | Non-uniform | Simple-Function | Linear-Increase | ~0s | Sparsity increases linearly from bottom to top. |
| | | | Linear-Decrease | ~0s | Sparsity decreases linearly from bottom to top. |
| | | | Log-Increase | ~0s | Sparsity increases logarithmically across layers. |
| | | | Log-Decrease | ~0s | Sparsity decreases logarithmically across layers. |
| | | FrontBackByPass | | ~0s | Skip pruning for the first n and last m blocks. |
| Open-source | Non-uniform | OWL | | ~5min | Based on outlier weight distribution across transformer layers. |
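The Simple-Function strategies above are closed-form schedules that need no model statistics. A minimal sketch of the Linear-Increase case follows; the `spread` parameter controlling how steep the schedule is, is an illustrative choice, not the value used by SparAlloc.

```python
def linear_increase_sparsity(num_blocks, target=0.7, spread=0.2):
    """Per-block sparsity rising linearly from (target - spread/2) at the bottom block
    to (target + spread/2) at the top block; the mean equals the global target."""
    if num_blocks == 1:
        return [target]
    step = spread / (num_blocks - 1)
    return [target - spread / 2 + i * step for i in range(num_blocks)]

# Example: 32 LLaMA-7B blocks at 70% global sparsity.
ratios = linear_increase_sparsity(32, target=0.7, spread=0.2)
assert abs(sum(ratios) / len(ratios) - 0.7) < 1e-6
```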

🌲 8. SparAlloc: Method Tree Overview

You can use the following API to list all supported sparsity allocation strategies:

from lib.SparAlloc import show_sparalloc_method_tree
show_sparalloc_method_tree()

This will print a hierarchical overview of available strategies, with brief descriptions and estimated runtime (tested on a single A6000 GPU):

SparAlloc Strategies
├── Default
│   └── Uniform - Uniform sparsity allocation across all Transformer blocks. (~0s)
├── Custom-designed
│   ├── Mag
│   │   ├── L1-norm - Use the L1 norm of weights to compute layer importance. (~1s)
│   │   └── L2-norm - Use the L2 norm of weights to compute layer importance. (~1s)
│   ├── Blockwise Perplexity - Remove each transformer block and measure perplexity change. (~1h)
│   └── Cosine-Similarity - Measure cosine similarity between adjacent hidden representations. (~1min)
├── Extracted from pruning algorithm
│   ├── Simple-Function
│   │   ├── Linear-Increase - Sparsity increases linearly from bottom to top. (~0s)
│   │   ├── Linear-Decrease - Sparsity decreases linearly from bottom to top. (~0s)
│   │   ├── Log-Increase - Sparsity increases logarithmically across layers. (~0s)
│   │   └── Log-Decrease - Sparsity decreases logarithmically across layers. (~0s)
│   └── FrontBackByPass - Skip pruning for the first n and last m transformer blocks. (~0s)
└── Open-source
    └── OWL - Based on outlier weight distribution across transformer layers. (~5min)

📊 9. SparAlloc: Perplexity Comparison on LLaMA-7B (70% Sparsity, Wanda)

The following table reports the perplexity of each allocation strategy using LLaMA-7B, under 70% global sparsity (MHA + FFN), pruned via Wanda.

📈 Table 2: Perplexity Results

| Category | Uniformity | Strategy | Strategy-Level | Perplexity |
| --- | --- | --- | --- | --- |
| Default | Uniform | Uniform | | 83.16 |
| Custom-designed | Non-uniform | Mag | L1-norm | 1160.62 |
| | | | L2-norm | 275.22 |
| | | Blockwise Perplexity | | 117.78 |
| | | Cosine-Similarity | | 56.11 |
| Extracted from Pruning Algorithm | Non-uniform | Simple-Function | Linear-Increase | 27.67 |
| | | | Linear-Decrease | 8916.27 |
| | | | Log-Increase | 23.25 |
| | | | Log-Decrease | 2455.70 |
| | | FrontBackByPass | n=1, m=1 | 307.50 |
| Open-source | Non-uniform | OWL | M=5 | 24.48 |

🧪 10. Demo: Easily Integrate SparAlloc into Your Pruning Pipeline

The following example demonstrates how to integrate SparAlloc into a typical LLM pruning workflow using Wanda as the pruning algorithm and magnitude-based sparsity allocation.

You can:

  • 🔄 Replace Step 2 to benchmark different sparsity allocation strategies (e.g., cosine similarity, OWL, blockwise perplexity)
  • 🔄 Replace Step 3 to evaluate how different pruning algorithms respond to the same sparsity allocation

from lib.prune_all import prune_wanda
from lib.SparAlloc import mag_sparsity

# Step 1: Load model and set up arguments
# (use HuggingFace or your own checkpoint loader; `args`, `device`,
#  `prune_n`, and `prune_m` are expected to come from your argument parser)
# model, tokenizer = load_model(args)

# Step 2: Allocate per-block sparsity using the Mag-based strategy
sparsity_ratio = mag_sparsity(
    args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m
)

# Step 3: Prune the model using Wanda with the allocated sparsity ratios
prune_wanda(
    args, model, tokenizer, device,
    prune_n=prune_n, prune_m=prune_m,
    all_layer_ratio=sparsity_ratio
)

# Step 4: Evaluate model performance
# evaluate(model, tokenizer, ...)

🧪 11. Case Study: Text Generation after Pruning with Different Sparsity Allocation Strategies

We evaluate the qualitative impact of different sparsity allocation strategies on text generation using LLaMA-7B, pruned with Wanda under 70% global sparsity.

Prompt used for generation:

"Once upon a time in a distant galaxy,"

All outputs below are generated with the same pruning algorithm (Wanda) and differ only in the sparsity allocation method.
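Samples like these can be reproduced with standard HuggingFace generation on each pruned checkpoint. The sketch below assumes greedy decoding and an arbitrary token budget; the decoding settings used for the original outputs are not documented, so exact outputs may differ.

```python
import torch

@torch.no_grad()
def generate_sample(model, tokenizer, device, max_new_tokens=48):
    """Generate a continuation of the case-study prompt (greedy decoding assumed)."""
    prompt = "Once upon a time in a distant galaxy,"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```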


🧠 Strategy: Func (Log-Decrease)

📉 Perplexity: 2455.7
📝 Output: Once upon a time in a distant galaxy, OK Ö²NNOO DO Nor2 Generation N St Sex NorTN Ok0 DO OK EntertainmentDO Generation OK Sound Hash SuccessNN Se20 Nor0 ( New Sim Sports Success


🧠 Strategy: Mag (L2-norm)

📉 Perplexity: 275.22
📝 Output: Once upon a time in a distant galaxy, revers reverseiak reverse… tillobaum bit miak [  pro…quant fre deal hol ~[ hol… — — – revers ever… fancy repise rep pro ( bit … revers ( br


🧠 Strategy: Uniform

📉 Perplexity: 83.16
📝 Output: Once upon a time in a distant galaxy, OK Ö²NNOO DO Nor2 Generation N St Sex NorTN Ok0 DO OK EntertainmentDO Generation OK Sound Hash SuccessNN Se20 Nor0 ( New Sim Sports Success


🧠 Strategy: OWL

📉 Perplexity: 24.48
📝 Output: Once upon a time in a distant galaxy,ixedaisioP*************aiixedpertom Counio"*Comesent embre A few years ago, we can be happy to be one that is not


🧠 Strategy: Func (Log-Increase)

📉 Perplexity: 23.25
📝 Output: Once upon a time in a distant galaxy, our universe was not as it is now. The Universe has been completely changed by the people who were once known as the “Cosmos”. They are now called a Cosmos that can be


🙏 Acknowledgement

This repository is built upon the excellent OWL project.
We thank the authors for open-sourcing their work, which served as a valuable foundation.
