5000user5000 · Nov 26, 2025
diff --git a/README.md b/README.md
@@ -9,6 +9,8 @@
 - **openmp**: Multi-threaded parallelization using OpenMP
 - **simd**: SIMD vectorization using AVX2 intrinsics
 - **full**: Complete optimization with OpenMP + SIMD (default)
+- **cuda**: GPU acceleration with CUDA kernels
+- **profiling**: Full version with detailed performance profiling enabled
 
 All variants provide the same API and functional correctness, differing only in performance characteristics.
 
@@ -27,11 +29,15 @@ Similarity search is a fundamental problem in many domains, including informatio
 **Implemented Optimizations:**
 - Multi-threading with OpenMP (centroid search, list probing, batch queries)
 - SIMD vectorization with AVX2 (L2 distance calculations)
+- GPU acceleration with CUDA (IVF-Flat search kernels)
 - Conditional compilation for easy performance comparison
 
-**Future Optimization Directions:**
-- Cache-aware data layouts
-- GPU acceleration (CUDA)
+**CUDA Implementation Highlights:**
+- Full GPU-accelerated IVF-Flat search pipeline
+- Hybrid kernel strategy: fast shared-memory kernel for small k, heap-based kernel for large k
+- Automatic kernel selection based on shared memory requirements
+- Supports k up to 100 with all nprobe values (1-64)
+- Achieves 82K+ QPS on SIFT1M dataset (k=10, nprobe=1)
 
 ## Target Users
 
@@ -141,16 +147,96 @@ ivf_index.write_index("my_index.bin")
 loaded_index = zenann.IVFFlatIndex.read_index("my_index.bin")
 ```
 
+### Using CUDA Acceleration
+
+The CUDA variant provides the same Python API with GPU acceleration:
+
+```python
+import zenann
+import numpy as np
+
+# Build the CUDA variant first: make cuda
+data = np.random.rand(1000000, 128).astype('float32')
+queries = np.random.rand(10000, 128).astype('float32')
+
+# Same API, GPU-accelerated backend
+ivf_index = zenann.IVFFlatIndex(dim=128, nlist=1024, nprobe=16)
+ivf_index.build(data)
+
+# GPU-accelerated search
+results = ivf_index.search_batch(queries, k=10)
+
+# For reference: CUDA reaches 82K+ QPS on SIFT1M with k=10, nprobe=1 (not on this random data)
+```
+
+**Note**: The CUDA variant automatically runs IVF-Flat search on the GPU; no API changes are required.
+
+## Benchmarking
+
+ZenANN provides comprehensive benchmarking tools to evaluate performance across different optimization variants.
+
+### Quick Start
+
+```bash
+# Set library path
+export LD_LIBRARY_PATH=extern/faiss/build/install/lib:$LD_LIBRARY_PATH
+
+# Run comprehensive benchmark on SIFT1M
+python3 benchmark/comprehensive_bench.py \
+    --base data/sift/sift_base.fvecs \
+    --query data/sift/sift_query.fvecs \
+    --groundtruth data/sift/sift_groundtruth.ivecs \
+    --nlist 1024 \
+    --nprobe-list "1,2,4,8,16,32,64" \
+    --k-list "1,10,100" \
+    --index-file sift_index.bin \
+    --output-dir benchmark_results
+
+# Generate Recall-QPS trade-off plots
+python3 benchmark/plot_tradeoff.py benchmark_results/*.json
+```
+
+### Benchmark Metrics
+
+The benchmark suite measures:
+- **QPS (Queries Per Second)**: Throughput for batch queries
+- **Latency**: Mean, p50, p95, p99 response times
+- **Recall@k**: Accuracy for k=1, 10, 100
+- **Index Build Time**: Time to construct the index
+- **Memory Usage**: Bytes per vector
+
+### Comparing Variants
+
+```bash
+# Test OpenMP variant
+make openmp
+python3 benchmark/comprehensive_bench.py ... --output-dir results_openmp
+
+# Test CUDA variant
+make cuda
+python3 benchmark/comprehensive_bench.py ... --output-dir results_cuda
+
+# Compare results
+python3 benchmark/plot_tradeoff.py results_*/*.json
+```
+
+See [benchmark/BENCHMARK_GUIDE.md](benchmark/BENCHMARK_GUIDE.md) for detailed instructions.
+
 ## Build and Test
 
 ### Requirements
 
+**Base Requirements:**
 - C++17 compiler (g++, clang++)
 - Python >= 3.10
 - CMake >= 3.17 (for Faiss)
 - Ninja build system
 - OpenBLAS
 
+**Additional Requirements for CUDA variant:**
+- CUDA Toolkit >= 10.0
+- NVIDIA GPU with compute capability >= 6.0 (Pascal or newer)
+
 ### Build Instructions
 
 ```bash
@@ -176,6 +262,8 @@ make full         # Same as above
 make naive        # Build naive version (no optimizations)
 make openmp       # Build OpenMP-only version
 make simd         # Build SIMD-only version
+make cuda         # Build CUDA version (GPU acceleration)
+make profiling    # Build profiling version (Full + timing)
 
 # 4. Run tests
 LD_LIBRARY_PATH=extern/faiss/build/install/lib pytest tests/
@@ -191,6 +279,14 @@ Choose the appropriate variant for your needs:
 | `make openmp` | Multi-threading only | Study OpenMP impact |
 | `make simd` | SIMD (AVX2) only | Study vectorization impact |
 | `make full` | OpenMP + SIMD | Production use (default) |
+| `make cuda` | GPU kernels | GPU acceleration; highest QPS at k=1 and k=10 |
+| `make profiling` | Full + timing | Performance analysis |
+
+**CUDA Build Notes:**
+- Ensure `nvcc` is in your PATH and the CUDA Toolkit is properly installed
+- Adjust `CUDA_ARCH` in the Makefile to match your GPU (sm_60=Pascal, sm_75=Turing, sm_86=Ampere; sm_86 requires CUDA Toolkit 11.1 or newer)
+- The CUDA variant uses pure GPU acceleration (no OpenMP/SIMD)
+- Hybrid kernel strategy automatically handles k values up to 100
 
 ### Running Tests
 
@@ -213,20 +309,23 @@ All variants provide **correct results** with different performance profiles:
 |---------|-------------|--------------|
 | **naive** | Baseline (1x) | Single-threaded, scalar operations |
 | **openmp** | ~10x faster | Multi-threaded parallelization |
-| **simd** | ~3 faster | AVX2 vectorized distance calculations |
-| **full** | ~15-20x faster | Combined OpenMP + SIMD optimizations |
+| **simd** | ~3x faster | AVX2 vectorized distance calculations |
+| **full** | ~15-20x faster | Combined OpenMP + SIMD optimizations; highest QPS at k=100 |
+| **cuda** | ~20-25x faster | GPU parallelization; highest QPS at k=1 and k=10 |
 
 **Performance factors:**
 - Actual speedup depends on dataset size, dimensionality, and hardware
 - OpenMP scales with CPU core count (tested on 8-core systems)
 - SIMD provides consistent 3x speedup for L2 distance calculations
 - Combining optimizations often yields multiplicative benefits
+- CUDA achieves 82K+ QPS on SIFT1M (k=10, nprobe=1) with NVIDIA GPUs
 
 **Optimization breakdown:**
 - **Distance calculations**: SIMD provides ~3x speedup (processes 8 floats per instruction with AVX2)
-- **Centroid search**: OpenMP parallelizes across centroids
-- **List probing**: OpenMP parallelizes across probe lists with dynamic scheduling
-- **Batch queries**: OpenMP parallelizes across multiple queries
+- **Centroid search**: OpenMP parallelizes across centroids; CUDA uses GPU threads
+- **List probing**: OpenMP parallelizes across probe lists; CUDA uses 2D grid mapping
+- **Batch queries**: OpenMP parallelizes across multiple queries; CUDA processes the batch on the GPU
+- **Top-K selection**: CUDA uses a hybrid strategy: a shared-memory kernel for small k and a heap-based kernel for large k
 
 ## Project Structure
 
@@ -238,24 +337,36 @@ ZenANN/
 │   ├── HNSWIndex.h
 │   ├── KDTreeIndex.h
 │   ├── VectorStore.h
-│   └── SimdUtils.h      # L2 distance with optional SIMD (conditional compilation)
+│   ├── SimdUtils.h      # L2 distance with optional SIMD (conditional compilation)
+│   └── CudaUtils.h      # CUDA kernel declarations
 ├── src/                  # C++ implementation (with conditional OpenMP pragmas)
+│   ├── IndexBase.cpp
+│   ├── IVFFlatIndex.cpp
+│   ├── KDTreeIndex.cpp
+│   ├── HNSWIndex.cpp
+│   └── CudaUtils.cu     # CUDA kernel implementations
 ├── python/               # Python bindings (pybind11)
 ├── tests/                # Unit tests (pytest)
 ├── benchmark/            # Performance benchmarks
+│   ├── comprehensive_bench.py  # Complete benchmark suite
+│   ├── ivf-bench.py            # IVF-specific benchmarks
+│   ├── hnsw-bench.py           # HNSW-specific benchmarks
+│   ├── plot_tradeoff.py        # Visualization tools
+│   └── BENCHMARK_GUIDE.md      # Benchmarking documentation
+├── doc/                  # Technical documentation
+│   ├── cuda.md          # CUDA implementation guide
+│   └── cuda-fix.md      # CUDA k=100 fix documentation
 ├── extern/faiss/         # Faiss submodule
 └── Makefile              # Build configuration with multiple targets
 ```
 
-## Documentation
-
-- **uml.md** - Architecture diagrams (Mermaid)
+### Core Documentation
 - **tests/** - Usage examples in test files
 - **Makefile** - Run `make help` for build variant information
 
 ## Engineering Infrastructure
 
-- **Build**: GNU Make, CMake
+- **Build**: GNU Make
 - **Testing**: pytest
 - **CI/CD**: GitHub Actions (tests full variant)
 - **Version Control**: Git
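The "Benchmark Metrics" list added in this change (QPS, latency percentiles, Recall@k) can be illustrated with a minimal, stdlib-only sketch. This is not the code in `benchmark/comprehensive_bench.py`: the helper names `recall_at_k`, `percentile`, `latency_summary`, and `qps` are hypothetical. The sketch also assumes an intersection-based definition of Recall@k and nearest-rank percentiles, and the actual suite may define either differently.

```python
import math
import statistics


def recall_at_k(retrieved, groundtruth, k):
    """Mean fraction of the true top-k neighbor IDs found in the retrieved top-k.

    Assumption: intersection-based recall; `retrieved` and `groundtruth`
    are per-query lists of neighbor IDs, ordered best-first.
    """
    hits = sum(
        len(set(found[:k]) & set(truth[:k]))
        for found, truth in zip(retrieved, groundtruth)
    )
    return hits / (len(groundtruth) * k)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for pct in (0, 100]."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def latency_summary(latencies_ms):
    """Mean, p50, p95, and p99 of per-query latencies given in milliseconds."""
    ordered = sorted(latencies_ms)
    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


def qps(num_queries, elapsed_s):
    """Throughput of a batch: queries completed per second of wall time."""
    return num_queries / elapsed_s
```

In use, `elapsed_s` would be the wall-clock time around a single `search_batch` call, while the latency list would come from timing queries one at a time. The two measurements answer different questions, so batch QPS is generally not the reciprocal of mean latency.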