CUDA Tutorials

A comprehensive collection of CUDA programming tutorials covering various parallel programming concepts, memory models, optimization techniques, and performance patterns for NVIDIA GPUs.

Overview

This repository contains practical examples and tutorials demonstrating CUDA programming fundamentals and advanced techniques. Each tutorial is self-contained and focuses on one specific aspect of GPU programming, so the concepts can be learned incrementally.

Tutorials

Getting Started

hello_world - Basic CUDA kernel execution with grid and block configuration
adding_integers - Simple integer addition using CUDA
device_properties - Query and display GPU device properties

Thread and Block Management

1_D_thread_indexing - Understanding 1D thread indexing in CUDA
2D_thread_indexing - Working with 2D thread blocks and grids
threads_access - Thread access patterns and organization
block_grid_access - Accessing data using block and grid dimensions
warp_indexing - Understanding warp-level thread organization
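
The indexing tutorials in this group revolve around a few lines of integer arithmetic. As a rough illustration (not the repository's code), the sketch below reproduces that arithmetic in plain host-side C++ so it can be checked without a GPU. The struct and function names are hypothetical; the fields stand in for CUDA's built-in `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, and a warp size of 32 is assumed.

```cpp
#include <cassert>

// Hypothetical stand-in for the CUDA built-ins blockIdx.x, blockDim.x, threadIdx.x.
struct LaunchCoord {
    unsigned blockIdx;
    unsigned blockDim;
    unsigned threadIdx;
};

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr unsigned kWarpSize = 32;

// Flattened 1D global index: blockIdx.x * blockDim.x + threadIdx.x.
inline unsigned globalIndex1D(const LaunchCoord& c) {
    assert(c.threadIdx < c.blockDim);  // a thread index always lies inside its block
    return c.blockIdx * c.blockDim + c.threadIdx;
}

// Which warp of its block a thread belongs to, and its lane inside that warp.
inline unsigned warpInBlock(unsigned threadIdxX) { return threadIdxX / kWarpSize; }
inline unsigned laneInWarp(unsigned threadIdxX) { return threadIdxX % kWarpSize; }

// Number of blocks needed to cover n elements (ceiling division).
inline unsigned blocksFor(unsigned n, unsigned blockDim) {
    assert(blockDim > 0);
    return (n + blockDim - 1) / blockDim;
}
```

For example, thread 5 of block 2 with 128 threads per block maps to global index 261, and covering 1000 elements with 256-thread blocks takes 4 blocks, which is why kernels usually guard with a bounds check on the global index.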

Array Operations

vector_addition_block_style - Vector addition using block-level parallelism
array_addition_thread_style - Array addition with thread-level parallelism
array_addition_with_threads_and_blocks - Combining threads and blocks for array operations
array_summation_profiling - Performance profiling for array summation

Memory Management

cuda_memory_model - Overview of CUDA memory hierarchy
shared_memory - Using shared memory for performance optimization
shared_memory_access - Optimizing shared memory access patterns
shared_mem_row_major_access - Row-major access patterns in shared memory
shared_memory_padding - Using padding to avoid bank conflicts
unified_memory - Unified memory for simplified memory management
pinned_memory - Using page-locked host memory for faster transfers
zero_copy_memory - Direct GPU access to host memory
global_memory_write_operations - Optimizing global memory writes
memory_access_patterns - Understanding efficient memory access patterns
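
To make the idea behind shared_memory_padding concrete, here is a small host-side C++ sketch (an illustration under stated assumptions, not code from the tutorials). It assumes 32 shared-memory banks, each one 4-byte word wide, and a 32x32 tile of floats; the function names are hypothetical. It counts how many distinct banks a warp touches when each thread reads one row of the same column.

```cpp
#include <cassert>
#include <set>

// Assumptions: 32 banks of 4-byte words, and a 32-row tile of floats.
constexpr int kBanks = 32;
constexpr int kTileRows = 32;

// Bank that holds element (row, col) when each row occupies `stride` words.
inline int bankOf(int row, int col, int stride) {
    assert(stride > 0);
    return (row * stride + col) % kBanks;
}

// Distinct banks touched when thread t of a warp reads element (t, col).
// 32 means conflict-free; 1 means all 32 accesses are serialized on one bank.
inline int distinctBanksForColumn(int col, int stride) {
    std::set<int> banks;
    for (int t = 0; t < kTileRows; ++t) {
        banks.insert(bankOf(t, col, stride));
    }
    return static_cast<int>(banks.size());
}
```

With a row stride of 32 words every thread lands in the same bank, while padding each row by one word (stride 33) spreads the column across all 32 banks. In a kernel this corresponds to declaring the tile as `[32][33]` instead of `[32][32]`.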

Matrix Operations

matrix_multiplication - Basic matrix multiplication on GPU
matrix_transpose - Matrix transposition implementations
matrix_transpose_shared_memory - Using shared memory for matrix transpose
matrix_transpose_padded_shared_memory - Padded shared memory for optimized transpose
matrix_transpose_with_padded_shared_memory - Enhanced matrix transpose with padding

Optimization Techniques

unrolling - Loop unrolling for improved performance
complete_unrolling - Complete loop unrolling techniques
warp_unrolling - Warp-level loop unrolling
unrolling_mat_transpose_shared_memory - Unrolled matrix transpose with shared memory
register_usage - Optimizing register usage in kernels
template_parameters - Using C++ templates for flexible kernel code

Warp-Level Programming

warp_divergence - Understanding and managing warp divergence
warp_shuffling - Using warp shuffle operations for communication

Parallel Reduction

parallel_reduction_with_shared_memory - Reduction using shared memory
parallel_reduction_warp_shuffling - Reduction with warp shuffle instructions
divergence_in_parellel_reduction - Managing divergence in reduction operations
parallel_reduction_using_dynamic_parallelism - Reduction with dynamic parallelism
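
The access pattern these reduction tutorials build on can be sketched on the host. The following C++ function is a sequential emulation of a tree reduction with sequential addressing, assuming a power-of-two input length; it is an illustration with a hypothetical name, not the repository's kernel. The inner loop plays the role of the threads in a block, and the comment marks where a kernel would need a barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a block-level tree reduction (sum).
// Assumption: the input length is a non-zero power of two.
inline int treeReduceSum(std::vector<int> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    // Halve the number of active "threads" at every step.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            // Active threads are contiguous (tid < stride), so whole warps
            // retire together instead of diverging within a warp.
            data[tid] += data[tid + stride];
        }
        // In a CUDA kernel, __syncthreads() would separate the steps here.
    }
    return data[0];
}
```

Calling `treeReduceSum({1, 2, 3, 4, 5, 6, 7, 8})` takes three steps (strides 4, 2, 1) and returns 36.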

Atomic Operations

atomic_operations - Basic atomic operations in CUDA
atomic_parrallel_dot_product - Parallel vector dot product using atomic operations and shared memory
custom_atomic_operation - Implementing custom atomic operations
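
Custom atomic operations are commonly built from a compare-and-swap retry loop; in CUDA the primitive for this is `atomicCAS`. As a host-side analogy only, the sketch below builds an atomic maximum from `std::atomic<int>::compare_exchange_weak`. Whether the custom_atomic_operation tutorial uses this exact pattern is an assumption, and the function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Atomic max built from a compare-and-swap loop.
// Returns the value that was stored before the call, mirroring the
// convention of CUDA's atomic functions.
inline int atomicMaxCas(std::atomic<int>& target, int value) {
    int old = target.load();
    // Retry while our value is still larger and the swap did not succeed.
    // On failure, compare_exchange_weak reloads `old` with the current value.
    while (old < value && !target.compare_exchange_weak(old, value)) {
    }
    return old;
}
```

The loop exits either because the swap succeeded or because another thread already stored a value at least as large, so no update is lost.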

Advanced Features

dynamic_parallelism - Launching kernels from within kernels
synchronization - Thread synchronization mechanisms
error_handling - Proper CUDA error handling techniques

Streams and Asynchronous Operations

cuda_events - Using CUDA events for timing and synchronization
async_functions - Asynchronous function execution
non_nuil_stream_with_async_functions - Non-null streams with async operations
stream_synchronization - Stream synchronization techniques
streams_interdependcies - Managing dependencies between streams
memory_transfer_overlap - Overlapping memory transfers with computation

Scan Operations

simple_parallel_inclusive_scan - Basic parallel inclusive scan implementation
efficient_parallel_scan - Optimized parallel scan algorithm
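
For readers new to scans, the sketch below emulates a simple step-doubling inclusive scan (the Hillis-Steele scheme) in host-side C++. It is an assumption that simple_parallel_inclusive_scan follows this scheme; the function name is hypothetical. Two buffers are used so that every "thread" in a step reads only values from the previous step, which is the same hazard a kernel has to avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a step-doubling (Hillis-Steele) inclusive scan.
inline std::vector<int> inclusiveScan(const std::vector<int>& input) {
    std::vector<int> in = input;
    std::vector<int> out(input.size());

    for (std::size_t offset = 1; offset < in.size(); offset *= 2) {
        // Each index i stands for one thread of the parallel version.
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = (i >= offset) ? in[i] + in[i - offset] : in[i];
        }
        in.swap(out);  // the result of this step feeds the next one
    }
    assert(in.size() == input.size());
    return in;
}
```

For the input `{3, 1, 7, 0, 4, 1, 6, 3}` the result is `{3, 4, 11, 11, 15, 16, 22, 25}`. This scheme performs O(n log n) additions, whereas work-efficient scans reduce that to O(n) at the cost of a more involved access pattern.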

Data Structures and Patterns

AOS_VS_SOA - Array of Structures vs Structure of Arrays comparison
stencil_algorithm - Stencil computation patterns
stencil_computations - Advanced stencil operations
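
The AoS versus SoA distinction is easiest to see in the type definitions. The host-side C++ sketch below uses hypothetical particle types (they are not taken from the AOS_VS_SOA tutorial) to show both layouts and a conversion between them.

```cpp
#include <cassert>
#include <vector>

// Array of Structures: the fields of one element are adjacent in memory.
struct ParticleAoS {
    float x, y, z;
};

// Structure of Arrays: each field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Convert an AoS container into the SoA layout.
inline ParticlesSoA toSoA(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    assert(soa.x.size() == aos.size());
    return soa;
}
```

In the SoA layout, consecutive threads reading `x[i]` touch consecutive addresses, which is the access pattern that coalesces well on a GPU. In the AoS layout the same reads are strided by the size of the struct.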

Utility and Mathematical Operations

distance_3D - Computing 3D distances on GPU
standard_and_intrinsic_functions - Using CUDA standard and intrinsic math functions

Image Processing

sharpen_rgb_image - RGB image sharpening using CUDA and CImg library

Physical Simulations

laplace_equation_for_temperature_equalibrim - Solving Laplace equation for temperature equilibrium with OpenGL visualization

Building and Running

Each tutorial directory contains:

main.cu - CUDA source code
CMakeLists.txt - CMake build configuration

To build and run a tutorial:

cd <tutorial_name>
mkdir build
cd build
cmake ..
make
./<executable_name>

Requirements

NVIDIA GPU with CUDA support
CUDA Toolkit installed
CMake build system
C++ compiler supporting C++11 or later

C++ and Modern C++ Features

These tutorials demonstrate practical use of C++ and modern C++ features alongside CUDA programming:

Core C++ Features

Templates - Generic programming with template parameters for flexible kernel code
Structures and Classes - Data organization using structs (AOS vs SOA patterns)
Standard Library - Using STL containers and utilities

Modern C++ (C++11/14/17) Features

Type Inference (auto) - Automatic type deduction for cleaner code
constexpr - Compile-time constant expressions for performance
Smart Pointers - std::unique_ptr and std::shared_ptr for automatic memory management
Range-based For Loops - Simplified iteration over containers
std::vector - Dynamic arrays for host-side data management
std::tuple - Multiple return values and structured data
Uniform Initialization - Modern initialization syntax
nullptr - Type-safe null pointer constant

Examples in the Tutorials

template_parameters demonstrates template-based kernel optimization
AOS_VS_SOA showcases smart pointers and structured data patterns
stencil_algorithm uses tuples, range-based loops, and modern headers
parallel_reduction_warp_shuffling uses std::unique_ptr for resource management
Various tutorials use auto (type inference) and constexpr throughout

Tutorial Structure

Each tutorial is designed to be:

Self-contained - Can be built and run independently
Focused - Demonstrates a specific concept or technique
Practical - Includes working code examples


Repository: muhammadtarek98/cuda_tutorials
