Welcome to the CUDA Kernels repository! This project is a comprehensive collection of CUDA implementations ranging from fundamental concepts to advanced mathematical operations. It is designed for both beginners starting their CUDA journey and professionals looking for reference implementations.
The repository is organized into modules of increasing complexity:
- 01_kernel_basics: Introduction to writing your first CUDA kernel.
- 02_grid_block: Understanding the Grid-Block-Thread hierarchy.
- 03_hardware: Querying GPU device properties and capabilities.
- 01_vector_ops: Standard vector operations (Add, Sub, Mul, Dot) demonstrating global memory usage.
- 02_vector_dot: Dot product implementation using atomic operations.
- 03_constant_memory: Optimization using Constant Memory for read-only data.
- 04_unified_memory: Simplifies memory management using `cudaMallocManaged` and `cudaMemPrefetchAsync`.
- 01_matrix_vector_ops: High-performance Matrix-Vector multiplication (Standard, Banded, Symmetric, Triangular) and Rank-1/Rank-2 updates. Includes CPU verification.
- 02_fft: Fast Fourier Transform implementations (Radix-2 and Stockham algorithms).
- 01_tiled_matmul: The "Holy Grail" of CUDA optimizations. Tiled Matrix-Matrix multiplication using Shared Memory.
- 02_reduction: Highly optimized parallel reduction (Sum) using Warp Shuffle instructions.
- 01_streams: Demonstrates maximizing GPU throughput by overlapping Compute with Memory Transfers using CUDA Streams.
- 01_histogram: Optimized frequency counting using Privatized Shared Memory Atomics to reduce Global Memory contention.
- 01_cublas: Industry-standard Matrix Multiplication using NVIDIA's hand-tuned `cuBLAS` library.
- 02_thrust: High-level C++ template library ("STL for CUDA") for Sorting and Reducing without writing kernels.
- 03_curand: Parallel Random Number Generation (Monte Carlo Pi Estimation).
- 04_cusparse: Sparse Matrix-Vector multiplication using Compressed Sparse Row (CSR) format.
- 05_cusolver: Dense Cholesky Decomposition ($A = LL^T$).
- 06_nvtx: Profiling range markers for Nsight Systems.
- 07_dynamic_parallelism: Child kernel launches from the GPU (CDP).
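To give a flavor of the patterns the modules above cover, here is a minimal sketch of the unified-memory approach used in 04_unified_memory: a single `cudaMallocManaged` allocation is visible to both host and device, and `cudaMemPrefetchAsync` migrates it to the GPU ahead of the kernel launch. The kernel name and sizes are illustrative, not the module's actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scales every element in place.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation, accessible from both host and device code.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Optionally migrate the pages to the GPU before the kernel runs,
    // avoiding on-demand page faults during execution.
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();  // required before the host touches managed memory

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Note that no `cudaMemcpy` calls appear anywhere; the driver handles migration, which is exactly the simplification the module demonstrates.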
- NVIDIA GPU: Compute Capability 5.0 or higher recommended.
- CUDA Toolkit: Version 10.0 or higher.
- Compiler: `nvcc` (bundled with the CUDA Toolkit).
- Build Tool: `make` (or `nmake` on Windows, though the Makefiles are written for typical `make`).
Each module contains a Makefile. To build a specific module, navigate to its directory and run `make`.
Example: Running the Matrix-Vector Operations Demo

```sh
cd modules/03_advanced_math/01_matrix_vector_ops
make
./mv_app.exe
```

For a detailed learning path, check out A_BEGINNERS_GUIDE.md.
For a guide on the broader ecosystem (cuBLAS, Thrust, TensorRT, etc.), read CUDA_ECOSYSTEM_GUIDE.md.
All modules are self-contained and include verification mechanisms (comparing GPU results against CPU reference implementations) to ensure correctness.
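The verification pattern is straightforward: compute a CPU reference result and compare it elementwise against the GPU output with a small floating-point tolerance. A minimal sketch of the idea (the kernel and names here are illustrative, not the repository's actual helpers):

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: elementwise vector addition.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = i * 0.5f; b[i] = (float)(n - i); }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    // CPU reference, compared with a small tolerance to absorb
    // floating-point rounding differences.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (fabsf(c[i] - (a[i] + b[i])) > 1e-5f) { ok = false; break; }
    }
    printf("%s\n", ok ? "PASSED" : "FAILED");

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```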
To verify the entire repository at once, run the included PowerShell script:

```sh
./scripts/verify_all.ps1
```

Contributions are welcome! Please ensure code is formatted and includes verification logic.
MIT License
