Merged
README.md: 137 changes (124 additions, 13 deletions)
@@ -9,6 +9,8 @@
- **openmp**: Multi-threaded parallelization using OpenMP
- **simd**: SIMD vectorization using AVX2 intrinsics
- **full**: Complete optimization with OpenMP + SIMD (default)
- **cuda**: GPU acceleration with CUDA kernels
- **profiling**: Full version with detailed performance profiling enabled

All variants provide the same API and functional correctness, differing only in performance characteristics.

@@ -27,11 +29,15 @@ Similarity search is a fundamental problem in many domains, including informatio
**Implemented Optimizations:**
- Multi-threading with OpenMP (centroid search, list probing, batch queries)
- SIMD vectorization with AVX2 (L2 distance calculations)
- GPU acceleration with CUDA (IVF-Flat search kernels)
- Conditional compilation for easy performance comparison

**Future Optimization Directions:**
- Cache-aware data layouts

**CUDA Implementation Highlights:**
- Full GPU-accelerated IVF-Flat search pipeline
- Hybrid kernel strategy: fast shared-memory kernel for small k, heap-based kernel for large k
- Automatic kernel selection based on shared memory requirements
- Supports k up to 100 with all nprobe values (1-64)
- Achieves 82K+ QPS on SIFT1M dataset (k=10, nprobe=1)
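
The automatic kernel selection can be pictured as a shared-memory budget check. The sketch below is illustrative only: the 48 KB per-block budget, the 256-thread block, the 8-byte candidate (float32 distance plus int32 id), and the names `shared_mem_needed` and `select_kernel` are all assumptions made for this example, not values or functions taken from `CudaUtils.cu`.

```python
# Illustrative only: every constant here is an assumption, not ZenANN's actual value.
SHARED_MEM_BUDGET_BYTES = 48 * 1024   # assumed per-block shared memory budget
THREADS_PER_BLOCK = 256               # assumed block size
BYTES_PER_CANDIDATE = 4 + 4           # float32 distance + int32 id


def shared_mem_needed(k: int) -> int:
    """Bytes needed if every thread keeps its own top-k list in shared memory."""
    return THREADS_PER_BLOCK * k * BYTES_PER_CANDIDATE


def select_kernel(k: int) -> str:
    """Pick the shared-memory kernel when its buffers fit, else the heap-based one."""
    if shared_mem_needed(k) <= SHARED_MEM_BUDGET_BYTES:
        return "shared_memory"
    return "heap"


for k in (1, 10, 100):
    print(k, shared_mem_needed(k), select_kernel(k))
```

Under these assumed numbers, small k fits in shared memory while k=100 falls through to the heap-based path, which matches the split described above. The real thresholds depend on the kernel's actual buffer layout and the GPU's shared memory size.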

## Target Users

@@ -141,16 +147,96 @@

```python
ivf_index.write_index("my_index.bin")
loaded_index = zenann.IVFFlatIndex.read_index("my_index.bin")
```

### Using CUDA Acceleration

The CUDA variant provides the same Python API with GPU acceleration:

```python
import zenann
import numpy as np

# Build the CUDA variant first: make cuda
data = np.random.rand(1000000, 128).astype('float32')
queries = np.random.rand(10000, 128).astype('float32')

# Same API, GPU-accelerated backend
ivf_index = zenann.IVFFlatIndex(dim=128, nlist=1024, nprobe=16)
ivf_index.build(data)

# GPU-accelerated search
results = ivf_index.search_batch(queries, k=10)

# Reference figure: the CUDA variant reaches 82K+ QPS on SIFT1M at k=10, nprobe=1
# (a lighter setting than the nprobe=16 used in this example)
```

**Note**: The CUDA variant automatically runs IVF-Flat search on the GPU. No API changes are required.

## Benchmarking

ZenANN provides comprehensive benchmarking tools to evaluate performance across different optimization variants.

### Quick Start

```bash
# Set library path
export LD_LIBRARY_PATH=extern/faiss/build/install/lib:$LD_LIBRARY_PATH

# Run comprehensive benchmark on SIFT1M
python3 benchmark/comprehensive_bench.py \
--base data/sift/sift_base.fvecs \
--query data/sift/sift_query.fvecs \
--groundtruth data/sift/sift_groundtruth.ivecs \
--nlist 1024 \
--nprobe-list "1,2,4,8,16,32,64" \
--k-list "1,10,100" \
--index-file sift_index.bin \
--output-dir benchmark_results

# Generate Recall-QPS trade-off plots
python3 benchmark/plot_tradeoff.py benchmark_results/*.json
```

### Benchmark Metrics

The benchmark suite measures:
- **QPS (Queries Per Second)**: Throughput for batch queries
- **Latency**: Mean, p50, p95, p99 response times
- **Recall@k**: Accuracy for k=1, 10, 100
- **Index Build Time**: Time to construct the index
- **Memory Usage**: Bytes per vector
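
As a rough illustration of how these metrics are usually computed, here is a minimal pure-Python sketch. It is not the code in `benchmark/comprehensive_bench.py`; the function names (`recall_at_k`, `percentile`, `measure`) are hypothetical, and it times queries one at a time, whereas the suite reports QPS for batch queries.

```python
import math
import time


def recall_at_k(found_ids, true_ids, k):
    """Fraction of the true top-k neighbors that appear in the returned top-k."""
    hits = sum(len(set(f[:k]) & set(t[:k])) for f, t in zip(found_ids, true_ids))
    return hits / (k * len(true_ids))


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure(search_fn, queries):
    """Time each query individually; return QPS and latency statistics in ms."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    total_s = sum(latencies) / 1000.0
    return {
        "qps": len(queries) / total_s if total_s > 0 else float("inf"),
        "mean_ms": sum(latencies) / len(latencies),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

`recall_at_k` expects one list of returned ids and one list of ground-truth ids per query; `measure` accepts any callable, so a wrapper around a single-query search can be passed in.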

### Comparing Variants

```bash
# Test OpenMP variant
make openmp
python3 benchmark/comprehensive_bench.py ... --output-dir results_openmp

# Test CUDA variant
make cuda
python3 benchmark/comprehensive_bench.py ... --output-dir results_cuda

# Compare results
python3 benchmark/plot_tradeoff.py results_*/*.json
```

See [benchmark/BENCHMARK_GUIDE.md](benchmark/BENCHMARK_GUIDE.md) for detailed instructions.

## Build and Test

### Requirements

**Base Requirements:**
- C++17 compiler (g++, clang++)
- Python >= 3.10
- CMake >= 3.17 (for Faiss)
- Ninja build system
- OpenBLAS

**Additional Requirements for CUDA variant:**
- CUDA Toolkit >= 10.0
- NVIDIA GPU with compute capability >= 6.0 (Pascal or newer)

### Build Instructions

```bash
# (earlier build steps fall outside this diff hunk: @@ -176,6 +262,8 @@)
make full # Same as above
make naive # Build naive version (no optimizations)
make openmp # Build OpenMP-only version
make simd # Build SIMD-only version
make cuda # Build CUDA version (GPU acceleration)
make profiling # Build profiling version (Full + timing)

# 4. Run tests
LD_LIBRARY_PATH=extern/faiss/build/install/lib pytest tests/
```

@@ -191,6 +279,14 @@

Choose the appropriate variant for your needs:

| Command | Optimizations | Use case |
|---------|---------------|----------|
| `make openmp` | Multi-threading only | Study OpenMP impact |
| `make simd` | SIMD (AVX2) only | Study vectorization impact |
| `make full` | OpenMP + SIMD | Production use (default) |
| `make cuda` | GPU kernels | GPU acceleration, highest QPS at k=1 and k=10 |
| `make profiling` | Full + timing | Performance analysis |

**CUDA Build Notes:**
- Ensure `nvcc` is on your `PATH` and the CUDA Toolkit is properly installed
- Set `CUDA_ARCH` in the Makefile to match your GPU (`sm_60` = Pascal, `sm_75` = Turing, `sm_86` = Ampere); note that `sm_86` requires CUDA Toolkit 11.1 or newer
- The CUDA variant runs the search entirely on the GPU (no OpenMP/SIMD)
- The hybrid kernel strategy automatically handles k values up to 100
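
If you know your GPU's compute capability (recent NVIDIA drivers report it via `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`), the matching `CUDA_ARCH` value follows mechanically. The helper below is a hypothetical convenience for illustration, not part of ZenANN.

```python
def cuda_arch_flag(compute_capability: str) -> str:
    """Turn a compute capability such as "8.6" into an arch name such as "sm_86"."""
    major, minor = compute_capability.strip().split(".")
    return f"sm_{int(major)}{int(minor)}"


# Examples for the architectures named in the build notes.
for cc in ("6.0", "7.5", "8.6"):
    print(cc, "->", cuda_arch_flag(cc))
```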

### Running Tests

@@ -213,20 +309,23 @@

All variants provide **correct results** with different performance profiles:

| Variant | Performance | Key Features |
|---------|-------------|--------------|
| **naive** | Baseline (1x) | Single-threaded, scalar operations |
| **openmp** | ~10x faster | Multi-threaded parallelization |
| **simd** | ~3x faster | AVX2 vectorized distance calculations |
| **full** | ~15-20x faster | Combined OpenMP + SIMD optimizations; highest QPS at k=100 |
| **cuda** | ~20-25x faster | GPU parallelization; highest QPS at k=1 and k=10 |

**Performance factors:**
- Actual speedup depends on dataset size, dimensionality, and hardware
- OpenMP scales with CPU core count (tested on 8-core systems)
- SIMD provides consistent 3x speedup for L2 distance calculations
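- Combining optimizations compounds the gains, though not fully multiplicatively: **full** measures ~15-20x, below the ~30x product of the OpenMP (~10x) and SIMD (~3x) speedups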
- CUDA achieves 82K+ QPS on SIFT1M (k=10, nprobe=1) with NVIDIA GPUs

**Optimization breakdown:**
- **Distance calculations**: SIMD provides ~3x speedup (processes 8 floats per instruction with AVX2)
- **Centroid search**: OpenMP parallelizes across centroids; CUDA assigns the work to GPU threads
- **List probing**: OpenMP parallelizes across probe lists with dynamic scheduling; CUDA uses a 2D grid mapping
- **Batch queries**: OpenMP parallelizes across queries; CUDA processes the whole batch on the GPU
- **Top-K selection**: CUDA uses a hybrid strategy (shared-memory kernel for small k, heap-based kernel for large k)
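
To make the stages above concrete, here is a minimal single-threaded sketch of an IVF-Flat query in pure Python. It shows only the generic algorithm (centroid search, list probing, bounded-heap top-k); the data layout and the names `l2_sq` and `ivf_flat_search` are assumptions for illustration and do not mirror ZenANN's C++ or CUDA code.

```python
import heapq


def l2_sq(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_flat_search(query, centroids, lists, nprobe, k):
    """Return the k nearest (distance, id) pairs found in the nprobe closest lists.

    centroids: one vector per inverted list
    lists:     lists[i] holds (vector_id, vector) pairs assigned to centroid i
    """
    # 1. Centroid search: rank inverted lists by distance to their centroid.
    order = sorted(range(len(centroids)), key=lambda i: l2_sq(query, centroids[i]))

    # 2. List probing and 3. top-k selection with a bounded max-heap
    #    (distances are negated because heapq is a min-heap).
    heap = []
    for i in order[:nprobe]:
        for vec_id, vec in lists[i]:
            d = l2_sq(query, vec)
            if len(heap) < k:
                heapq.heappush(heap, (-d, vec_id))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, vec_id))
    return sorted((-neg_d, vec_id) for neg_d, vec_id in heap)


centroids = [[0, 0], [10, 10]]
lists = [
    [(0, [0, 1]), (1, [1, 0]), (2, [2, 2])],
    [(3, [10, 9]), (4, [9, 9])],
]
print(ivf_flat_search([0, 0], centroids, lists, nprobe=1, k=2))
```

Each loop here corresponds to one parallelization point in the list above: the centroid ranking, the scan over probed lists, and (for batches) the outer loop over queries.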

## Project Structure

@@ -238,24 +337,36 @@

```
ZenANN/
│   ├── HNSWIndex.h
│   ├── KDTreeIndex.h
│   ├── VectorStore.h
│   ├── SimdUtils.h              # L2 distance with optional SIMD (conditional compilation)
│   └── CudaUtils.h              # CUDA kernel declarations
├── src/                         # C++ implementation (with conditional OpenMP pragmas)
│   ├── IndexBase.cpp
│   ├── IVFFlatIndex.cpp
│   ├── KDTreeIndex.cpp
│   ├── HNSWIndex.cpp
│   └── CudaUtils.cu             # CUDA kernel implementations
├── python/                      # Python bindings (pybind11)
├── tests/                       # Unit tests (pytest)
├── benchmark/                   # Performance benchmarks
│   ├── comprehensive_bench.py   # Complete benchmark suite
│   ├── ivf-bench.py             # IVF-specific benchmarks
│   ├── hnsw-bench.py            # HNSW-specific benchmarks
│   ├── plot_tradeoff.py         # Visualization tools
│   └── BENCHMARK_GUIDE.md       # Benchmarking documentation
├── doc/                         # Technical documentation
│   ├── cuda.md                  # CUDA implementation guide
│   └── cuda-fix.md              # CUDA k=100 fix documentation
├── extern/faiss/                # Faiss submodule
└── Makefile                     # Build configuration with multiple targets
```

## Documentation

- **uml.md** - Architecture diagrams (Mermaid)
### Core Documentation
- **tests/** - Usage examples in test files
- **Makefile** - Run `make help` for build variant information

## Engineering Infrastructure

- **Build**: GNU Make (CMake is required only to build the Faiss submodule)
- **Testing**: pytest
- **CI/CD**: GitHub Actions (tests full variant)
- **Version Control**: Git