# IntegratedSBP

This project provides an optimized implementation of Stochastic Block Partitioning (SBP) for community detection in graphs. It integrates four complementary acceleration heuristics into a unified, high-performance framework:
- Sampling via the SamBaS framework [IEEE HPEC Paper @ IEEE Xplore, Preprint, Journal Paper Preprint]
- Shared-memory parallelization via Hybrid SBP [ICPP Paper @ ACM, Preprint]
- Distributed-memory parallelization via EDiSt [IEEE CLUSTER Paper Preprint]
- Top-Down Computation via Top-Down SBP [ACM HPDC Paper]
Our SBP implementation is based on the Graph Challenge reference implementation and optimized for modern HPC systems with support for multi-threaded (OpenMP) and distributed (MPI) execution.
## Requirements

- C++17 compatible compiler (e.g., AMD Clang, GCC, Intel)
- CMake 3.10 or higher
- MPI implementation (OpenMPI, Cray MPICH, etc.)
- OpenMP support
## Building

```shell
git clone --recursive https://github.com/vtsynergy/IntegratedSBP.git
cd IntegratedSBP
mkdir build
cd build
cmake ..
make
```

For testing and debugging purposes:

- `cmake -DSCOREP=ON ..` - Enable Score-P instrumentation
- `cmake -DVALGRIND=ON ..` - Enable Valgrind instrumentation
- `cmake -DDEBUG=ON ..` - Enable AddressSanitizer (note: will cause test failures due to memory leak detection)
## Testing

Run the test suite with:

```shell
make test
```

Note: Some build options (particularly `-DDEBUG=ON`) will cause test failures.
## Running

For a single-node run with OpenMP parallelization:

```shell
./SBP --filepath /path/to/graph --threads <num_threads>
```

For best results with the integrated sampling, shared-memory, top-down computation, and distributed-memory approaches:

```shell
mpiexec -n <num_cluster_nodes> ./TopDownSBP \
    --filepath <path> \
    -a hybrid_mcmc \
    --threads <num_threads> \
    --batches 2 \
    --distribute none-edge-balanced \
    --nonparametric \
    -m 0.1 \
    --samplingalg expansion_snowball \
    --samplesize 0.5 \
    --cachesize 20000 \
    --split connectivity-snowball \
    --splitinit random
```

Most of these options have been made the default in newer builds, so you should get similar results using:

```shell
mpiexec -n <num_cluster_nodes> ./TopDownSBP --filepath <path>
```

## Input Formats

The code expects graph files in one of these formats:
- Matrix Market (`.mtx`): Standard sparse matrix format
- Tab-Separated Values (`.tsv`): Edge list format with tab delimiter
File naming convention: If `--filepath=/path/to/graph`, the code will look for:

- Graph: `/path/to/graph.mtx` or `/path/to/graph.tsv`
- Ground truth (optional): `/path/to/graph_truePartition.tsv`
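As a minimal illustration of this naming convention, the snippet below writes a toy two-community graph in `.tsv` form. The exact column layout (one tab-separated `source`/`target` pair per edge line, and `vertex`/`block` per ground-truth line) is an assumption based on the Graph Challenge convention; adjust it if your data differs.

```python
# Sketch: generate a toy 6-vertex graph with two triangle communities
# joined by one bridge edge, in the assumed tab-separated layout.
edges = [(1, 2), (2, 3), (3, 1),   # community 1
         (4, 5), (5, 6), (6, 4),   # community 2
         (3, 4)]                   # bridge edge
truth = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}

with open("graph.tsv", "w") as f:
    for src, dst in edges:
        f.write(f"{src}\t{dst}\n")

# Matches the --filepath naming convention: <name>_truePartition.tsv
with open("graph_truePartition.tsv", "w") as f:
    for vertex, block in truth.items():
        f.write(f"{vertex}\t{block}\n")
```

Running `./SBP --filepath graph` should then pick up both files.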
## Command-Line Options

- `-a, --algorithm <type>` - MCMC algorithm to use:
  - `hybrid_mcmc` (default, recommended): Hybrid approach balancing parallelism and accuracy
  - `metropolis_hastings`: Classic MH algorithm (no parallel implementation)
  - `async_gibbs`: Fully asynchronous, faster but may reduce quality
- `--threads <N>` - Number of OpenMP threads (default: 1)
- `--batches <N>` - Number of batches for parallel algorithms (default: 2, recommended ≥ 2)
- `--cachesize <N>` - Size of logarithm cache (default: 20000)
- `-m, --mh_percent <X>` - Fraction of high-influence vertices for sequential processing (default: 0.1). High-influence vertices are identified by edge degree product (or vertex degree if `--vertex-degree-sort` is used)
- `--samplesize <X>` - Fraction of vertices to sample (0.0 to 1.0, default: 1.0 = no sampling)
- `--samplingalg <type>` - Sampling algorithm:
  - `random`: Uniform random sampling
  - `max_degree`: Select highest degree vertices
  - `expansion_snowball`: Graph expansion from high-degree seeds
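For intuition, expansion snowball sampling grows the sample outward from a high-degree seed. The sketch below is an illustrative Python approximation of the idea, not the project's actual C++ implementation; the function name and the tie-breaking rule are ours.

```python
from collections import defaultdict

def expansion_snowball(edges, sample_frac):
    """Illustrative sketch: start from the highest-degree vertex and
    repeatedly absorb the frontier vertex with the most edges into the
    current sample, until the target sample size is reached."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    target = max(1, int(sample_frac * len(adj)))
    seed = max(adj, key=lambda v: len(adj[v]))  # high-degree seed
    sample, frontier = {seed}, set(adj[seed])
    while len(sample) < target and frontier:
        # pick the frontier vertex most connected to the sample
        best = max(frontier, key=lambda v: len(adj[v] & sample))
        sample.add(best)
        frontier.remove(best)
        frontier |= adj[best] - sample
    return sample

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
print(sorted(expansion_snowball(edges, 0.5)))  # → [0, 1, 2]
```

The sample stays well-connected by construction, which is why the tips below recommend it over `random` sampling except on very sparse graphs.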
- `--parametric` - Use parametric entropy instead of nonparametric (not recommended; nonparametric is default and more accurate)
- `--hastings-correction` - Compute Hastings correction (only useful in parametric mode; disabled by default)
- `--approximate` - Use faster, less accurate block merge (not recommended with nonparametric mode)
- `--degreecorrected` - Degree-corrected entropy (WARNING: not fully implemented, will produce poor results with nonparametric mode)
- `--distribute <scheme>` - Distribution scheme for MPI:
  - `none` (default): No distribution
  - `none-edge-balanced`: Balanced edge distribution (recommended for MPI)
  - `2hop-round-robin`, `2hop-size-balanced`, `2hop-snowball`: Alternative schemes (less tested)
- `--split <type>` - Split strategy for top-down SBP:
  - `connectivity-snowball` (default): Connectivity-aware graph partitioning
  - `random`: Random vertex assignment
  - `snowball`, `single-snowball`: Snowball expansion variants
- `--splitinit <type>` - Split initialization:
  - `random` (default): Random seed selection
  - `degree-weighted`: Weighted by vertex degree
  - `high-degree`: Select highest degree vertex
- `--no_transpose` - Disable transpose matrix storage (saves memory, slower column access)
- `--vertex-degree-sort` - Use vertex degree instead of edge degree product to identify high-influence vertices (may reduce accuracy)
- `--ordered` - Process vertices in order (disabled by default: vertices are processed in random order)
- `--detach` - Detach degree-1 vertices before community detection
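The `--detach` heuristic exploits the fact that a degree-1 vertex can simply adopt its sole neighbor's community, so it need not participate in the expensive MCMC phase. A rough Python sketch of the idea (not the actual implementation; names are ours):

```python
from collections import defaultdict

def detach_degree_one(edges):
    """Sketch: iteratively peel off degree-1 vertices, remembering the
    neighbor each one hung off of. Community detection then runs on the
    remaining core, and each detached vertex inherits its anchor's block
    (anchors may themselves be detached, so resolve chains transitively
    when reattaching)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    detached = {}  # vertex -> anchor neighbor
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) == 1:
                (anchor,) = adj[v]
                detached[v] = anchor
                adj[anchor].discard(v)
                del adj[v]
                changed = True
    return set(adj), detached

# Triangle 0-1-2 with a pendant chain 2-3-4: the chain peels off.
core, anchors = detach_degree_one([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])
print(sorted(core), anchors)  # → [0, 1, 2] {4: 3, 3: 2}
```

On graphs with many pendant vertices this can shrink the MCMC workload substantially at negligible accuracy cost.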
- `-f, --filepath <path>` - Path to graph file (without extension)
- `-j, --json <dir>` - Output directory for JSON results (default: `output`)
- `--output_file <name>` - Output filename (default: auto-generated from PID and timestamp)
- `--evaluate` - Evaluate and report accuracy metrics immediately (requires ground truth)
- `--tag <string>` - Custom tag for identifying runs
- `--undirected` - Treat graph as undirected
- `--noduplicates` - Check for and remove duplicate edges
- `--delimiter <char>` - Delimiter for TSV files (default: tab)
## Output

Results are saved as JSON files in the specified output directory. Each file contains:

- Detected community assignments
- Runtime statistics
- Algorithm parameters
- Entropy/DL values
- Evaluation metrics (if ground truth provided and `--evaluate` used)
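Since the exact JSON field names are not documented here (and may vary by build), a schema-agnostic way to inspect a run's results is to list each file's top-level fields; `output` below is the default `-j` directory, and `summarize_results` is our own helper name.

```python
import glob
import json
import os

def summarize_results(directory="output"):
    """Sketch: collect the sorted top-level field names of each JSON
    result file in the given directory. No schema is assumed."""
    summaries = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        with open(path) as f:
            summaries[path] = sorted(json.load(f).keys())
    return summaries

for path, fields in summarize_results().items():
    print(f"{path}: {', '.join(fields)}")
```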
## Tips

- Nonparametric entropy (default) improves accuracy. Do not use `--parametric` unless you have a specific reason.
- Set `--batches` to at least 2 (default) for parallel/distributed runs to improve accuracy.
- Edge degree product sorting (default) with `-m 0.1` (default) improves accuracy in parallel/distributed runs. `--vertex-degree-sort` and/or lower values of `m` can lead to poor accuracy.
- For distributed runs, use `--distribute none-edge-balanced`.
- For sampling, `--samplingalg expansion_snowball` with `--samplesize 0.5` provides a good speed/accuracy tradeoff. For very sparse datasets, use `--samplingalg max_degree` instead.
- Hastings correction is disabled by default. Avoid using `--hastings-correction` and `--approximate` together in nonparametric mode.
- For top-down SBP, the recommended `--split connectivity-snowball` (default) and `--splitinit random` (default) provide the best accuracy.
## Project Structure

```
IntegratedSBP/
├── include/                 # Header files
│   ├── args.hpp             # Command-line argument parsing
│   ├── sbp.hpp              # Main SBP algorithm
│   ├── blockmodel/          # Blockmodel data structures
│   ├── distributed/         # EDiSt distributed components
│   └── ...
├── src/                     # Implementation files
│   ├── main.cpp             # Entry point for SBP
│   ├── TopDownSBPMain.cpp   # Entry point for top-down SBP
│   ├── sbp.cpp              # Core SBP logic
│   ├── finetune.cpp         # MCMC algorithms
│   ├── block_merge.cpp      # Block merge phase
│   ├── entropy.cpp          # Entropy calculations
│   ├── distributed/         # Distributed memory components
│   └── ...
├── test/                    # Test suite
├── extern/                  # External dependencies
├── scripts/                 # Utility scripts
└── CMakeLists.txt           # Build configuration
```
## Citation

If you use this code, please cite our relevant papers:

Integrated SBP (IEEE HPEC 2023):

```bibtex
@INPROCEEDINGS{Wanye2023IntegratedSBP,
  author={Wanye, Frank and Gleyzer, Vitaliy and Kao, Edward and Feng, Wu-chun},
  booktitle={2023 IEEE High Performance Extreme Computing Conference (HPEC)},
  title={An Integrated Approach for Accelerating Stochastic Block Partitioning},
  year={2023},
  pages={1-7},
  month={9},
  address={Waltham, MA},
  doi={10.1109/HPEC58863.2023.10363599}
}
```

Top-Down SBP (ACM HPDC 2024):

```bibtex
@INPROCEEDINGS{Wanye2024TopDownSBP,
  author={Wanye, Frank and Gleyzer, Vitaliy and Kao, Edward and Feng, Wu-chun},
  booktitle={Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '24)},
  title={Top-Down Stochastic Block Partitioning for Scalable Community Detection},
  year={2024},
  publisher={Association for Computing Machinery},
  address={New York, NY, USA},
  doi={10.1145/3731545.3731589}
}
```

SamBaS (Sampling):

```bibtex
@INPROCEEDINGS{Gleyzer2022SamBaS,
  author={Gleyzer, Vitaliy and Kao, Edward and Feng, Wu-chun},
  booktitle={2022 IEEE High Performance Extreme Computing Conference (HPEC)},
  title={SamBaS: Sampling-Based Stochastic Block Partitioning},
  year={2022}
}
```

Hybrid SBP (Shared-memory parallelization):

```bibtex
@INPROCEEDINGS{Kao2022HybridSBP,
  author={Kao, Edward and Gleyzer, Vitaliy and Feng, Wu-chun},
  booktitle={2022 International Conference on Parallel Processing (ICPP)},
  title={Hybrid Stochastic Block Partitioning for Large-Scale Graphs},
  year={2022}
}
```

EDiSt (Distributed-memory parallelization):

```bibtex
@INPROCEEDINGS{Gleyzer2022EDiSt,
  author={Gleyzer, Vitaliy and Kao, Edward and Feng, Wu-chun},
  booktitle={2022 IEEE International Conference on Cluster Computing (CLUSTER)},
  title={EDiSt: Edge-Balanced Distribution for Stochastic Block Partitioning},
  year={2022}
}
```

## Known Issues

- The `--degreecorrected` option is not fully implemented and will produce incorrect results in nonparametric mode (default)
- The `--approximate` option may reduce quality in nonparametric mode (default)
- Some distribution schemes other than `none-edge-balanced` are not fully tested
## License

© Virginia Polytechnic Institute and State University, 2023.

Licensed under the GNU Lesser General Public License v2.1. See the LICENSE file for details.

## Acknowledgments

This work is based on the Graph Challenge Stochastic Block Partitioning reference implementation. We thank the Graph Challenge organizers and the original SBP authors for their foundational work.