# MAD-HiSpMV: MAtrix Adaptive Design for Highly Imbalanced SpMV Accelerator (with GeMV Support) on HBM-based FPGAs
MAD-HiSpMV is a high-performance FPGA accelerator for Sparse Matrix–Vector Multiplication (SpMV) with an optional dense overlay for GeMV support. It builds on our previous HiSpMV work with several key enhancements:
- Scalable HBM support: Multiple HBM channels are used to load input vectors and matrices efficiently.
- Hybrid Row Distribution Network: Routes PE outputs to dedicated y_Ax handlers for accumulation, balancing workload.
- Adder Chain Groups (ACG): Optional pre-addition of multiplication results to avoid read-after-write (RAW) dependencies in output accumulation and reduce pipeline stalls.
- Dense Overlay Support: Allows a single kernel to handle both SpMV and GeMV for mixed sparse-dense workloads.
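The pre-accumulation idea behind the Adder Chain Groups can be illustrated in software. The sketch below is illustrative only: the chain length and the run-based grouping are assumptions for exposition, not the hardware's actual parameters.

```python
def accumulate_naive(products, row_ids, y):
    # One read-modify-write of y per product: consecutive updates to the
    # same row depend on each other (RAW dependency on y[row]).
    for p, r in zip(products, row_ids):
        y[r] += p
    return y

def accumulate_with_chain(products, row_ids, y, chain_len=4):
    # Pre-add runs of products that target the same output row, so y[row]
    # is updated once per run instead of once per product.
    i = 0
    while i < len(products):
        j = i
        partial = 0.0
        # Pre-addition: combine up to chain_len consecutive products
        # that belong to the same output row.
        while j < len(products) and j - i < chain_len and row_ids[j] == row_ids[i]:
            partial += products[j]
            j += 1
        y[row_ids[i]] += partial  # single accumulator update per run
        i = j
    return y
```

Both functions compute the same result; the chained version simply issues fewer dependent updates to the accumulator, which is what lets the hardware pipeline avoid stalls.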
Data Flow Summary:
- Sparse matrix `A` and input vector `x` are streamed from HBM channels to PEGs.
- PEs multiply nonzero elements of `A` with the corresponding entries of `x`.
- Results are routed through the hybrid row distribution network to the correct y_Ax handlers.
- Optional adder chains pre-accumulate results before final accumulation.
- Final output `y` is streamed back to HBM.
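The steps above can be mirrored by a small software reference model. This is a sketch only: the PE count and the round-robin nonzero distribution are illustrative assumptions, not the accelerator's actual scheduling.

```python
# Reference model of the SpMV data flow: stream nonzeros to PEs,
# multiply, route partial results to per-row handlers, accumulate.
def spmv_reference(nnz, x, num_rows, num_pes=4):
    # nnz: list of (row, col, value) triples of the sparse matrix A
    y = [0.0] * num_rows
    # "Stream" nonzeros to PEs (round-robin distribution, illustrative)
    pe_streams = [nnz[i::num_pes] for i in range(num_pes)]
    # Each PE multiplies its nonzeros with the matching entries of x
    pe_outputs = [[(r, v * x[c]) for (r, c, v) in s] for s in pe_streams]
    # Row distribution network: route every partial result to the
    # handler owning that output row, which accumulates into y
    for out in pe_outputs:
        for r, partial in out:
            y[r] += partial
    return y
```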
⚠️ Note: The PASTA+AutoBridge repo is private until publication. Please request access if needed.
- Create and activate a Conda environment.
- Install PASTA following its instructions.
- Clone this repository and set up the environment:

  ```shell
  load_vitis23
  source miniconda3/bin/activate your_conda_env
  cd HiSpMV
  source setup
  cd -
  export CONDA_LOC=$(pwd)/miniconda3
  ```

  `load_vitis23` loads the Vitis HLS & XRT path variables; `setup` sets the required environment variables for MAD-HiSpMV.
- Install Python dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- Download benchmarking matrices:

  ```shell
  python get_tb_matrices.py
  ```
```
apps                  # Python apps: run SpMV/GeMV + sample DNN model
automation_tool       # Scripts to auto-generate accelerator configs (matrix-adaptive)
builds                # Source code + xclbin for U280/U50 configs, usage reports, floorplans
common                # Common host + kernel source code
cpu                   # CPU benchmarking (Intel MKL SpMV/GeMV + power measurement)
gpu                   # GPU benchmarking (cuSPARSE SpMV + power measurement)
matrices              # Storage for benchmarking matrices (downloaded by script)
pyhispmv              # pybind11 wrapper to invoke FPGA kernels via XRT
get_tb_matrices.py    # Script to fetch test/benchmarking matrices
requirements.txt      # Python dependencies
setup                 # Environment setup script
README.md             # Project documentation
```
- Build the `pyhispmv` package:

  ```shell
  cd pyhispmv
  python setup.py build_ext --inplace
  cd ..
  ```
- Run SpMV/GeMV tests:
  - General test (no arguments):

    ```shell
    cd apps
    python general_test.py
    ```

  - DNN model test (configurable):

    ```shell
    cd apps
    python model_test.py \
      --batch_size 1 \
      --input_size 4096 \
      --hidden_size_1 8192 \
      --hidden_size_2 8192 \
      --output_size 1024 \
      --density1 0.1 \
      --density2 0.25
    ```

  - Note on device selection: both scripts require setting `device_id` (the FPGA index). To find available devices, run `xbutil examine`, then update `device_id` in the scripts to match the U280 board.
CPU benchmarking:

```shell
cd cpu
make clean all
./run_spmv.sh   # Run SpMV benchmarks
./run_gemv.sh   # Run GeMV benchmarks
```

GPU benchmarking:

```shell
cd gpu
make clean all
./run_all.sh    # Run all SpMV benchmarks
```

The automation tool allows generating accelerator configurations either automatically (matrix-adaptive) or manually (explicit parameters).
`automation_tool/src/main.py` analyzes the input matrix and automatically chooses optimal parameters such as HBM channel usage and optimizations.
Command:

```shell
cd automation_tool/src
python main.py <build_dir> --device {U50|U280|V80} [--matrices <file_or_dir>] [--dense-overlay]
```

Arguments:
- `build_dir` (positional): Path to the build directory.
- `--device`: Target device (`U50`, `U280`, or `V80`) [required].
- `--matrices`: Path to a matrix file or a directory containing matrices.
- `--dense-overlay`: Enable dense overlay mode (SpMV kernel with GeMV support).
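For reference, the documented interface corresponds to an argparse definition along these lines. This is a reconstruction from the usage string above, not the tool's actual source:

```python
import argparse

# Hypothetical reconstruction of main.py's command-line interface
parser = argparse.ArgumentParser(
    description="Matrix-adaptive accelerator config generator")
parser.add_argument("build_dir", help="Path to the build directory")
parser.add_argument("--device", required=True, choices=["U50", "U280", "V80"],
                    help="Target device")
parser.add_argument("--matrices", help="Matrix file or directory of matrices")
parser.add_argument("--dense-overlay", action="store_true",
                    help="Enable dense overlay mode (SpMV kernel with GeMV support)")

# Parse a sample invocation matching the SpMV example below
args = parser.parse_args(["../../builds", "--device", "U280",
                          "--matrices", "../matrices/"])
```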
- In normal mode (without `--dense-overlay`), the tool uses the input matrix to tailor the accelerator design.
- In dense overlay mode, the design is not tailored to the input sparse matrix, and the `--matrices` argument is ignored. The generated kernel supports both SpMV and GeMV for mixed workloads.
Examples:
- Generate SpMV design for U280 with matrix directory:

  ```shell
  python main.py ../../builds --device U280 --matrices ../matrices/
  ```

- Generate SpMV+GeMV hybrid design for U50 (no matrices needed):

  ```shell
  python main.py ../../builds --device U50 --dense-overlay
  ```
`automation_tool/src/rsc/spmvcodegen.py` provides fine-grained control over accelerator parameters instead of relying on automation.
Command:

```shell
cd automation_tool/src/
python spmvcodegen.py <output_dir> --device {U50|U280} [options]
```

Arguments:
- `output_dir`: Path to the output directory.
- `--device`: Target FPGA device (`U50` or `U280`) [required].
- `--num-ch-A`: Number of HBM channels for sparse matrix A (default: 16).
- `--num-ch-x`: Number of HBM channels for input vector x (default: 1).
- `--num-ch-y`: Number of HBM channels for output vector y (default: 1).
- `--ch-width`: Width of HBM channels in bits (default: 512).
- `--urams-per-pe`: URAM banks per PE for output accumulation (default: 2).
- `--dense-overlay`: Enable dense overlay for GeMV support.
- `--pre-accumulator`: Enable pre-accumulator optimization.
- `--row-dist-net`: Enable row distribution network.
- `--high-freq`: Build hardware for 400 MHz kernel clock.
Example (small dense-overlay design):

```shell
python ../../automation_tool/src/spmvcodegen.py ../ --device U280 \
  --num-ch-A 4 --num-ch-x 1 --num-ch-y 1 --urams-per-pe 1 --row-dist-net --dense-overlay
```

Example log output:

```
20250822:204011 [INFO] Resource: FPGAResource(bram=128, uram=32, dsp=613, lut=134724, reg=135873)
20250822:204011 [INFO] Successfully Generated Code at ../Dense-HI-SpMV-4-1-1
```
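The channel parameters translate directly into an upper bound on streaming bandwidth. The quick calculation below is an idealization that assumes one transfer per channel per cycle and a 225 MHz baseline clock (the baseline frequency is our assumption; per the options above, `--high-freq` targets 400 MHz):

```python
# Idealized streaming-bandwidth estimate from the channel parameters.
def hbm_stream_bandwidth_gbs(num_channels, ch_width_bits, freq_mhz):
    # Bytes moved per kernel cycle across all channels
    bytes_per_cycle = num_channels * ch_width_bits // 8
    # Convert to GB/s (cycles per second * bytes per cycle)
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9

# Defaults: 16 channels for A, 512-bit channels
# (225 MHz is an assumed baseline; --high-freq targets 400 MHz)
default_bw = hbm_stream_bandwidth_gbs(16, 512, 225)
high_freq_bw = hbm_stream_bandwidth_gbs(16, 512, 400)
```

With 16 channels of 512 bits each, the kernel can consume 1024 bytes of matrix data per cycle, so the clock frequency directly scales the achievable matrix streaming rate.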
- Navigate to the generated design directory. The script automatically names the directory with configuration info:

  ```shell
  cd ../Dense-HI-SpMV-4-1-1
  ```

- Build host code:

  ```shell
  make host
  ```
- Run C simulation (HLS source code):
  - Sparse matrix input (SpMV):

    ```shell
    ./spmv-host ../../matrices/poli_large/poli_large.mtx
    ```

  - Dense matrix input (dense overlay / GeMV):

    ```shell
    ./spmv-host 512 512
    ```

    where `512 512` specifies rows and columns of the dense matrix.
- Run hardware-software co-simulation. First, synthesize the RTL code:

  ```shell
  make tapa
  ```

  Then run co-simulation using the Vivado TAPA fast cosim:

  ```shell
  ./spmv-host 512 512 --bitstream="spmv.xilinx_u280_gen3x16_xdma_1_202211_1.hw.xo"
  ```

  Note: More details about the TAPA fast co-simulation for RTL simulation can be found at https://tapa.readthedocs.io/en/main/user/cosim.html
- Build final hardware bitstream:

  ```shell
  make hw
  ```
- Run on actual FPGA hardware:

  ```shell
  ./spmv-host ../../matrices/analytics/analytics.mtx \
    --bitstream="vitis_run_hw/SpMV_xilinx_u280_gen3x16_xdma_1_202211_1.xclbin"
  ```
This workflow covers dense-overlay design generation, C simulation, co-simulation, and execution on real FPGA hardware.
If you use MAD-HiSpMV in your work, please cite our upcoming publication (to be added here after acceptance).
