From 06552115a69818e1a8748d0ea57cbf60af9277b0 Mon Sep 17 00:00:00 2001
From: Wayne
Date: Thu, 27 Nov 2025 00:10:24 +0800
Subject: [PATCH] updata cuda data

---
 README.md | 137 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 124 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 7f1a9e4..7b5e90f 100644
--- a/README.md
+++ b/README.md
@@ -9,6 +9,8 @@
 - **openmp**: Multi-threaded parallelization using OpenMP
 - **simd**: SIMD vectorization using AVX2 intrinsics
 - **full**: Complete optimization with OpenMP + SIMD (default)
+- **cuda**: GPU acceleration with CUDA kernels
+- **profiling**: Full version with detailed performance profiling enabled
 
 All variants provide the same API and functional correctness, differing only in performance characteristics.
 
@@ -27,11 +29,15 @@ Similarity search is a fundamental problem in many domains, including informatio
 **Implemented Optimizations:**
 - Multi-threading with OpenMP (centroid search, list probing, batch queries)
 - SIMD vectorization with AVX2 (L2 distance calculations)
+- GPU acceleration with CUDA (IVF-Flat search kernels)
 - Conditional compilation for easy performance comparison
 
-**Future Optimization Directions:**
-- Cache-aware data layouts
-- GPU acceleration (CUDA)
+**CUDA Implementation Highlights:**
+- Full GPU-accelerated IVF-Flat search pipeline
+- Hybrid kernel strategy: fast shared-memory kernel for small k, heap-based kernel for large k
+- Automatic kernel selection based on shared memory requirements
+- Supports k up to 100 with all nprobe values (1-64)
+- Achieves 82K+ QPS on SIFT1M dataset (k=10, nprobe=1)
 
 ## Target Users
 
@@ -141,16 +147,96 @@ ivf_index.write_index("my_index.bin")
 loaded_index = zenann.IVFFlatIndex.read_index("my_index.bin")
 ```
 
+### Using CUDA Acceleration
+
+The CUDA variant provides the same Python API with GPU acceleration:
+
+```python
+import zenann
+import numpy as np
+
+# Build with CUDA variant first: make cuda
+data = np.random.rand(1000000, 128).astype('float32')
+queries = np.random.rand(10000, 128).astype('float32')
+
+# Same API, GPU-accelerated backend
+ivf_index = zenann.IVFFlatIndex(dim=128, nlist=1024, nprobe=16)
+ivf_index.build(data)
+
+# GPU-accelerated search
+results = ivf_index.search_batch(queries, k=10)
+
+# CUDA achieves 82K+ QPS on SIFT1M (k=10, nprobe=1)
+```
+
+**Note**: The CUDA variant automatically uses GPU for IVF-Flat search operations. No API changes required.
+
+## Benchmarking
+
+ZenANN provides comprehensive benchmarking tools to evaluate performance across different optimization variants.
+
+### Quick Start
+
+```bash
+# Set library path
+export LD_LIBRARY_PATH=extern/faiss/build/install/lib:$LD_LIBRARY_PATH
+
+# Run comprehensive benchmark on SIFT1M
+python3 benchmark/comprehensive_bench.py \
+    --base data/sift/sift_base.fvecs \
+    --query data/sift/sift_query.fvecs \
+    --groundtruth data/sift/sift_groundtruth.ivecs \
+    --nlist 1024 \
+    --nprobe-list "1,2,4,8,16,32,64" \
+    --k-list "1,10,100" \
+    --index-file sift_index.bin \
+    --output-dir benchmark_results
+
+# Generate Recall-QPS trade-off plots
+python3 benchmark/plot_tradeoff.py benchmark_results/*.json
+```
+
+### Benchmark Metrics
+
+The benchmark suite measures:
+- **QPS (Queries Per Second)**: Throughput for batch queries
+- **Latency**: Mean, p50, p95, p99 response times
+- **Recall@k**: Accuracy for k=1, 10, 100
+- **Index Build Time**: Time to construct the index
+- **Memory Usage**: Bytes per vector
+
+### Comparing Variants
+
+```bash
+# Test OpenMP variant
+make openmp
+python3 benchmark/comprehensive_bench.py ... --output-dir results_openmp
+
+# Test CUDA variant
+make cuda
+python3 benchmark/comprehensive_bench.py ... --output-dir results_cuda
+
+# Compare results
+python3 benchmark/plot_tradeoff.py results_*/*.json
+```
+
+See [benchmark/BENCHMARK_GUIDE.md](benchmark/BENCHMARK_GUIDE.md) for detailed instructions.
+
 ## Build and Test
 
 ### Requirements
 
+**Base Requirements:**
 - C++17 compiler (g++, clang++)
 - Python >= 3.10
 - CMake >= 3.17 (for Faiss)
 - Ninja build system
 - OpenBLAS
 
+**Additional Requirements for CUDA variant:**
+- CUDA Toolkit >= 10.0
+- NVIDIA GPU with compute capability >= 6.0 (Pascal or newer)
+
 ### Build Instructions
 
 ```bash
@@ -176,6 +262,8 @@ make full # Same as above
 make naive # Build naive version (no optimizations)
 make openmp # Build OpenMP-only version
 make simd # Build SIMD-only version
+make cuda # Build CUDA version (GPU acceleration)
+make profiling # Build profiling version (Full + timing)
 
 # 4. Run tests
 LD_LIBRARY_PATH=extern/faiss/build/install/lib pytest tests/
@@ -191,6 +279,14 @@ Choose the appropriate variant for your needs:
 | `make openmp` | Multi-threading only | Study OpenMP impact |
 | `make simd` | SIMD (AVX2) only | Study vectorization impact |
 | `make full` | OpenMP + SIMD | Production use (default) |
+| `make cuda` | GPU kernels | GPU acceleration, highest QPS |
+| `make profiling` | Full + timing | Performance analysis |
+
+**CUDA Build Notes:**
+- Ensure `nvcc` is in your PATH and CUDA Toolkit is properly installed
+- Adjust `CUDA_ARCH` in Makefile to match your GPU (sm_60=Pascal, sm_75=Turing, sm_86=Ampere)
+- The CUDA variant uses pure GPU acceleration (no OpenMP/SIMD)
+- Hybrid kernel strategy automatically handles k values up to 100
 
 ### Running Tests
 
@@ -213,20 +309,23 @@ All variants provide **correct results** with different performance profiles:
 |---------|-------------|--------------|
 | **naive** | Baseline (1x) | Single-threaded, scalar operations |
 | **openmp** | ~10x faster | Multi-threaded parallelization |
-| **simd** | ~3 faster | AVX2 vectorized distance calculations |
-| **full** | ~15-20x faster | Combined OpenMP + SIMD optimizations |
+| **simd** | ~3x faster | AVX2 vectorized distance calculations |
+| **full** | ~15-20x faster | Combined OpenMP + SIMD optimizations , highest QPS in k = 100 |
+| **cuda** | ~20-25x faster | GPU parallelization, highest QPS in k=1,10|
 
 **Performance factors:**
 - Actual speedup depends on dataset size, dimensionality, and hardware
 - OpenMP scales with CPU core count (tested on 8-core systems)
 - SIMD provides consistent 3x speedup for L2 distance calculations
 - Combining optimizations often yields multiplicative benefits
+- CUDA achieves 82K+ QPS on SIFT1M (k=10, nprobe=1) with NVIDIA GPUs
 
 **Optimization breakdown:**
 - **Distance calculations**: SIMD provides ~3x speedup (processes 8 floats per instruction with AVX2)
-- **Centroid search**: OpenMP parallelizes across centroids
-- **List probing**: OpenMP parallelizes across probe lists with dynamic scheduling
-- **Batch queries**: OpenMP parallelizes across multiple queries
+- **Centroid search**: OpenMP parallelizes across centroids; CUDA uses GPU threads
+- **List probing**: OpenMP parallelizes across probe lists; CUDA uses 2D grid mapping
+- **Batch queries**: OpenMP parallelizes across multiple queries; CUDA processes batch on GPU
+- **Top-K selection**: CUDA uses hybrid strategy (shared memory vs heap-based) for optimal performance
 
 ## Project Structure
 
@@ -238,24 +337,36 @@ ZenANN/
 │   ├── HNSWIndex.h
 │   ├── KDTreeIndex.h
 │   ├── VectorStore.h
-│   └── SimdUtils.h # L2 distance with optional SIMD (conditional compilation)
+│   ├── SimdUtils.h # L2 distance with optional SIMD (conditional compilation)
+│   └── CudaUtils.h # CUDA kernel declarations
 ├── src/ # C++ implementation (with conditional OpenMP pragmas)
+│   ├── IndexBase.cpp
+│   ├── IVFFlatIndex.cpp
+│   ├── KDTreeIndex.cpp
+│   ├── HNSWIndex.cpp
+│   └── CudaUtils.cu # CUDA kernel implementations
 ├── python/ # Python bindings (pybind11)
 ├── tests/ # Unit tests (pytest)
 ├── benchmark/ # Performance benchmarks
+│   ├── comprehensive_bench.py # Complete benchmark suite
+│   ├── ivf-bench.py # IVF-specific benchmarks
+│   ├── hnsw-bench.py # HNSW-specific benchmarks
+│   ├── plot_tradeoff.py # Visualization tools
+│   └── BENCHMARK_GUIDE.md # Benchmarking documentation
+├── doc/ # Technical documentation
+│   ├── cuda.md # CUDA implementation guide
+│   └── cuda-fix.md # CUDA k=100 fix documentation
 ├── extern/faiss/ # Faiss submodule
 └── Makefile # Build configuration with multiple targets
 ```
 
-## Documentation
-
-- **uml.md** - Architecture diagrams (Mermaid)
+### Core Documentation
 - **tests/** - Usage examples in test files
 - **Makefile** - Run `make help` for build variant information
 
 ## Engineering Infrastructure
 
-- **Build**: GNU Make, CMake
+- **Build**: GNU Make
 - **Testing**: pytest
 - **CI/CD**: GitHub Actions (tests full variant)
 - **Version Control**: Git