GeneralSparse is an open-source project for optimizing Sparse Matrix Multiplication (SpMM) on GPUs, targeting the diverse sparsity patterns that arise in scientific computing and in pruned-model inference.
Its core goals include:
- Bridging the performance gap of SpMM in pruned LLM inference on GPUs.
- Supporting various sparsity patterns, including both structured (e.g., N:M) and unstructured pruning.
- Leveraging GPU parallelism through custom CUDA implementations to accelerate sparse computations.
- Targeting CUDA 12.1 and NVIDIA A100/V100 GPUs.
The main components of the repository are:
- `cuda_code/`: CUDA header files for high-performance sparse matrix operations on the GPU.
- `operator/`: Defines sparse operation interfaces and logic, structured for maintainability and extension.
- `kernel_token/` & `reduction_token/`: Handle kernel identification and reduction operations to streamline computation.
- `transform_step/`: Data-preparation modules that convert inputs into formats suitable for sparse multiplication.
- `configor/`: A configuration system that allows tuning for different models and hardware setups.
- `baseline/`: Baseline implementations for performance comparison.
- `code_generator.*` & `code_builder.*`: Auto-generate and compile code tailored to specific sparsity patterns.
- `data_transform_*.*`: Utilities for transforming data into sparse-compatible formats.
- `matrix_example/`: Example inputs, including pruned weight matrices and matrices from the SuiteSparse collection.
- `data_source/`: Output location of the SpMM programs generated when running our method.
- Custom CUDA Kernels: Designed specifically for sparse matrix multiplications to maximize GPU efficiency.
- Modular Design: Each module is decoupled for flexibility and scalability.
- Auto Code Generation: The `code_generator` and `code_builder` components automate the generation of optimized kernels for various sparsity formats.
- Cost Model: The cost model is still under development and will be further improved in future work.
- Inference of pruned LLMs: Speeds up inference while maintaining model accuracy.
- High-performance computing: Suitable for scientific computing or engineering scenarios involving large-scale sparse matrix data.
To fully utilize GeneralSparse, follow these steps (a consolidated command sketch follows this list):
- Prepare your sparse matrix: Choose an appropriate sparse matrix file in `.mtx` format, as in `matrix_example/`.
- Configure the parameters: Edit the `global_config.json` file. In this file, set `ROOT_PATH_STR` and `spmv_header_file` according to your directory locations, and adjust `HALF` to select whether half precision is used.
- Compile the project: Run `make token_test -j16` to generate the executable `./token_test`.
- Generate the tailored program for a sparse matrix: Run `./token_test matrix_example/suite_collection/IG5-18.mtx 8`, where `8` is the number of columns of the dense matrix and can be adjusted.
- View the generated program: The generated programs can be found in the `data_source/` directory and executed via the `a.out` in each sub-directory.
- Other baselines: The other methods can be found in the `baseline/` directory. Here we provide the code implementation of cuSPARSE; the other methods are provided through their GitHub repository links.
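As a quick reference, the sketch below strings the steps above into one command sequence. It assumes you run from the repository root and that `global_config.json` has already been edited; the matrix path and the trailing column count are just the example values used above, and the generated sub-directory name under `data_source/` is a placeholder that depends on the input matrix.

```bash
# Minimal sketch of the workflow above, run from the repository root.
# Assumes global_config.json has been edited first: ROOT_PATH_STR and
# spmv_header_file must point into this checkout, and HALF toggles
# half precision.

# Build the code-generation driver.
make token_test -j16

# Generate a tailored SpMM program; the trailing 8 is the number of
# columns of the dense matrix and can be adjusted.
./token_test matrix_example/suite_collection/IG5-18.mtx 8

# Each run emits a generated program under data_source/; the exact
# sub-directory name depends on the input matrix.
ls data_source/

# Run the generated program from inside its sub-directory, e.g.:
# cd data_source/<generated-sub-directory> && ./a.out
```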
- Integration with models: Our method uses the FasterTransformer framework to accelerate models end-to-end.
- Pruning the model: The pruned weight matrices are instantiated at the `matrix_example/pruned_weight` location.
- Replacing the original library calls: The end-to-end usage is similar to Flash-LLM.
- Note that we do not directly provide model-level code or binary implementations.
- `baseline/`: Each baseline includes guidance on generating its executable file and the corresponding command.
- `matrix_example/`: Guidance on generating the sparse matrix input.
GeneralSparse is a powerful and flexible framework for efficient sparse matrix multiplication on GPUs, ideal for pruned LLM inference. Its modular architecture and automation tools make it easy to integrate, extend, and adapt to various scenarios.
It is highly recommended for developers aiming to deploy performant, sparse-aware models in real-world systems.
You are welcome to reach out to us at sdygwyy@163.com. If you use GeneralSparse in your work, please cite:
@inproceedings{10.5555/3768039.3768064,
author = {Wang, Yaoyu and Guo, Xiao and Xiao, Junmin and Chen, De and Tan, Guangming},
title = {GeneralSparse: bridging the gap in SpMM for pruned large language model inference on GPUs},
year = {2025},
isbn = {978-1-939133-48-9},
publisher = {USENIX Association},
address = {USA},
abstract = {The rapid growth of generative model parameters poses challenges in deployment, especially regarding weight storage and inference latency. The weight pruning is an effective technique to reduce the computational and memory overhead of Large Language Models (LLMs) while maintaining accuracy, which transforms the matmuls to Sparse Matrix Multiplication (SpMM) computation. However, the diverse pruning methods introduce varying sparsity patterns that challenge high-performance SpMM on GPUs. Existing solutions are limited with adaptability to these patterns, flexibility in handling different sparsity levels, and support for efficient optimizations.In this work, we present GeneralSparse, a novel solution that bridges this gap by leveraging the abstraction of memory access and reduction spaces. GeneralSparse designs the process of dividing box to adapt dynamically to diverse pruning patterns and proposes hierarchical reduction algorithms tailored to GPU hierarchies. Through evaluations on pruned LLM weight matrices and the SuiteSparse collection, GeneralSparse achieves up to 20.82\texttimes{} speedup over cuSPARSE libraries. At end-to-end inference time on LLMs, GeneralSparse achieves up to 2.33\texttimes{} speedup over counterparts.},
booktitle = {Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference},
articleno = {25},
numpages = {16},
location = {Boston, MA, USA},
series = {USENIX ATC '25}
}