muhammadtarek98/cuda_tutorials
CUDA Tutorials

A comprehensive collection of CUDA programming tutorials covering various parallel programming concepts, memory models, optimization techniques, and performance patterns for NVIDIA GPUs.

Overview

This repository contains practical examples and tutorials demonstrating CUDA programming fundamentals and advanced techniques. Each tutorial is self-contained and focuses on specific aspects of GPU programming, making it easy to learn and understand CUDA concepts incrementally.

Tutorials

Getting Started

  • hello_world - Basic CUDA kernel execution with grid and block configuration
  • adding_integers - Simple integer addition using CUDA
  • device_properties - Query and display GPU device properties
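A minimal program in the spirit of hello_world — each thread reports where it sits in the launch grid:

```cuda
#include <cstdio>

// Every thread prints its coordinates in the launch grid.
__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // 2 blocks x 4 threads = 8 lines of output
    cudaDeviceSynchronize();  // wait for the kernel and flush device printf
    return 0;
}
```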

Thread and Block Management

  • 1_D_thread_indexing - Understanding 1D thread indexing in CUDA
  • 2D_thread_indexing - Working with 2D thread blocks and grids
  • threads_access - Thread access patterns and organization
  • block_grid_access - Accessing data using block and grid dimensions
  • warp_indexing - Understanding warp-level thread organization
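The indexing schemes above share a common arithmetic pattern. A sketch with illustrative names (`width` is the row length of a row-major 2D array):

```cuda
__global__ void index_demo(int *out, int width) {
    // 1D global index: block offset plus position within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 2D global coordinates, flattened to a row-major linear index
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = row * width + col;

    // Warp coordinates: a warp is warpSize (32) consecutive threads
    int lane    = threadIdx.x % warpSize;  // lane within the warp
    int warp_id = threadIdx.x / warpSize;  // warp number within the block

    out[idx] = gid + lane + warp_id;  // dummy use so nothing is optimized away
}
```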

Array Operations

  • vector_addition_block_style - Vector addition using block-level parallelism
  • array_addition_thread_style - Array addition with thread-level parallelism
  • array_addition_with_threads_and_blocks - Combining threads and blocks for array operations
  • array_summation_profiling - Performance profiling for array summation
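The array tutorials build on one canonical kernel shape; a sketch of element-wise vector addition with a bounds guard:

```cuda
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be only partially full
        c[i] = a[i] + b[i];
}

// Launch with enough blocks to cover all n elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;  // round up
//   vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
```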

Memory Management

  • cuda_memory_model - Overview of CUDA memory hierarchy
  • shared_memory - Using shared memory for performance optimization
  • shared_memory_access - Optimizing shared memory access patterns
  • shared_mem_row_major_access - Row-major access patterns in shared memory
  • shared_memory_padding - Using padding to avoid bank conflicts
  • unified_memory - Unified memory for simplified memory management
  • pinned_memory - Using page-locked host memory for faster transfers
  • zero_copy_memory - Direct GPU access to host memory
  • global_memory_write_operations - Optimizing global memory writes
  • memory_access_patterns - Understanding efficient memory access patterns
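As a taste of these memory topics, a hedged sketch of pinned (page-locked) host memory — buffer size and initialization are illustrative:

```cuda
#include <cstddef>

int main() {
    const size_t n = 1 << 20;
    float *h = nullptr, *d = nullptr;

    cudaMallocHost(&h, n * sizeof(float));  // page-locked host buffer: faster
                                            // DMA transfers, enables async copies
    cudaMalloc(&d, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d);
    cudaFreeHost(h);  // pinned memory has its own free call
    return 0;
}
```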

Matrix Operations

  • matrix_multiplication - Basic matrix multiplication on GPU
  • matrix_transpose - Matrix transposition implementations
  • matrix_transpose_shared_memory - Using shared memory for matrix transpose
  • matrix_transpose_padded_shared_memory - Padded shared memory for optimized transpose
  • matrix_transpose_with_padded_shared_memory - Enhanced matrix transpose with padding
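The padded-shared-memory transposes rest on one idea: stage a tile in shared memory, pad each row by one element so column reads do not hit the same bank, and swap coordinates on the way out. A sketch (square matrix of side n assumed):

```cuda
#define TILE 32

__global__ void transpose(const float *in, float *out, int n) {
    // +1 padding shifts each row by one bank, avoiding bank conflicts
    // when the tile is read column-wise below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read
    __syncthreads();

    // Swap block coordinates and read the tile transposed,
    // so the global write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```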

Optimization Techniques

  • unrolling - Loop unrolling for improved performance
  • complete_unrolling - Complete loop unrolling techniques
  • warp_unrolling - Warp-level loop unrolling
  • unrolling_mat_transpose_shared_memory - Unrolled matrix transpose with shared memory
  • register_usage - Optimizing register usage in kernels
  • template_parameters - Using C++ templates for flexible kernel code
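Templates and unrolling combine naturally: a compile-time factor lets the compiler fully unroll the inner loop. An illustrative sketch (names are not from the tutorials):

```cuda
// UNROLL is known at compile time, so #pragma unroll can expand the loop
// completely, removing loop overhead and exposing more instruction parallelism.
template <int UNROLL>
__global__ void axpy_unrolled(const float *x, float *y, float a, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
    #pragma unroll
    for (int k = 0; k < UNROLL; ++k) {
        int i = base + k;
        if (i < n) y[i] += a * x[i];
    }
}

// launch (each thread now handles 4 elements):
//   axpy_unrolled<4><<<(n / 4 + 255) / 256, 256>>>(d_x, d_y, 2.0f, n);
```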

Warp-Level Programming

  • warp_divergence - Understanding and managing warp divergence
  • warp_shuffling - Using warp shuffle operations for communication
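Warp shuffles let lanes exchange registers without shared memory. A common sketch, a butterfly-style warp sum using `__shfl_down_sync`:

```cuda
// Sum a value across the 32 lanes of a warp.
// After the loop, lane 0 holds the total for the whole warp.
__device__ float warp_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // full-warp mask
    return v;
}
```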

Parallel Reduction

  • parallel_reduction_with_shared_memory - Reduction using shared memory
  • parallel_reduction_warp_shuffling - Reduction with warp shuffle instructions
  • divergence_in_parellel_reduction - Managing divergence in reduction operations
  • parallel_reduction_using_dynamic_parallelism - Reduction with dynamic parallelism
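The shared-memory reduction these tutorials cover typically uses sequential addressing, which keeps the active threads contiguous and limits divergence. A sketch (block size assumed to be a power of two):

```cuda
__global__ void reduce(const float *in, float *out, int n) {
    extern __shared__ float s[];  // size = blockDim.x floats, set at launch
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the active range each step; active threads stay contiguous.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];  // one partial sum per block
}
```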

Atomic Operations

  • atomic_operations - Basic atomic operations in CUDA
  • atomic_parrallel_dot_product - Parallel vector dot product using atomic operations and shared memory
  • custom_atomic_operation - Implementing custom atomic operations
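A sketch of the atomics-plus-shared-memory pattern for a dot product: threads accumulate into a block-local value, and only one global atomic is issued per block:

```cuda
__global__ void dot(const float *a, const float *b, float *result, int n) {
    __shared__ float partial;
    if (threadIdx.x == 0) partial = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&partial, a[i] * b[i]);  // block-local accumulation
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(result, partial);  // one global atomic per block, not per thread
}
// *result must be zeroed (e.g. cudaMemset) before the launch.
```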

Advanced Features

  • dynamic_parallelism - Launching kernels from within kernels
  • synchronization - Thread synchronization mechanisms
  • error_handling - Proper CUDA error handling techniques
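A common error-handling idiom is a macro that wraps every runtime call and reports the file and line on failure — a sketch along those lines:

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call so failures are reported with context.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage: CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
```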

Streams and Asynchronous Operations

  • cuda_events - Using CUDA events for timing and synchronization
  • async_functions - Asynchronous function execution
  • non_nuil_stream_with_async_functions - Non-null streams with async operations
  • stream_synchronization - Stream synchronization techniques
  • streams_interdependcies - Managing dependencies between streams
  • memory_transfer_overlap - Overlapping memory transfers with computation
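The transfer/compute overlap idea can be sketched by splitting the data across two streams; copies in one stream overlap the kernel in the other. Names here are illustrative, and the host buffer must be pinned for the async copies to actually overlap:

```cuda
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void process(float *h, float *d, int n) {  // h: pinned host buffer, d: device
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    int half = n / 2;                      // assumes n is even
    for (int i = 0; i < 2; ++i) {
        int off = i * half;
        cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();               // wait for both streams
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```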

Scan Operations

  • simple_parallel_inclusive_scan - Basic parallel inclusive scan implementation
  • efficient_parallel_scan - Optimized parallel scan algorithm
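The simple variant is usually a Hillis–Steele scan: work-inefficient (O(n log n) additions) but short and easy to follow. A single-block sketch, assuming n equals the block size:

```cuda
// Inclusive scan within one block (Hillis–Steele).
__global__ void inclusive_scan(const int *in, int *out, int n) {
    extern __shared__ int s[];  // n ints, passed at launch
    int tid = threadIdx.x;
    if (tid < n) s[tid] = in[tid];
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        int add = (tid >= stride) ? s[tid - stride] : 0;  // read old value
        __syncthreads();                                  // ...before anyone writes
        s[tid] += add;
        __syncthreads();
    }
    if (tid < n) out[tid] = s[tid];
}
```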

Data Structures and Patterns

  • AOS_VS_SOA - Array of Structures vs Structure of Arrays comparison
  • stencil_algorithm - Stencil computation patterns
  • stencil_computations - Advanced stencil operations
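The AOS-vs-SOA contrast in a nutshell: with a struct per element, consecutive threads load strided fields; with an array per field, consecutive threads read consecutive floats and the access coalesces. An illustrative sketch:

```cuda
// AOS: one struct per element -> each thread's load is strided by sizeof(struct).
struct ParticleAOS { float x, y, z; };

// SOA: one array per field -> consecutive threads touch consecutive floats.
struct ParticlesSOA { float *x, *y, *z; };

__global__ void shift_soa(ParticlesSOA p, float dx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] += dx;  // fully coalesced access
}
```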

Utility and Mathematical Operations

  • distance_3D - Computing 3D distances on GPU
  • standard_and_intrinsic_functions - Using CUDA standard and intrinsic math functions
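The standard-vs-intrinsic trade-off in one kernel — the intrinsic maps to special-function hardware and is faster but less accurate:

```cuda
__global__ void sines(const float *in, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(in[i]);    // standard: full single-precision accuracy
        fast[i]     = __sinf(in[i]);  // intrinsic: faster, reduced precision
    }
}
// nvcc -use_fast_math rewrites sinf into __sinf (and similar) globally.
```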

Image Processing

  • sharpen_rgb_image - RGB image sharpening using CUDA and CImg library

Physical Simulations

  • laplace_equation_for_temperature_equalibrim - Solving the Laplace equation for temperature equilibrium, with OpenGL visualization

Building and Running

Each tutorial directory contains:

  • main.cu - CUDA source code
  • CMakeLists.txt - CMake build configuration

To build and run a tutorial:

cd <tutorial_name>
mkdir build
cd build
cmake ..
make
./<executable_name>

Requirements

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit installed
  • CMake build system
  • C++ compiler supporting C++11 or later

C++ and Modern C++ Features

These tutorials demonstrate practical use of C++ and modern C++ features alongside CUDA programming:

Core C++ Features

  • Templates - Generic programming with template parameters for flexible kernel code
  • Structures and Classes - Data organization using structs (AOS vs SOA patterns)
  • Standard Library - Using STL containers and utilities

Modern C++ (C++11/14/17) Features

  • Type Inference (auto) - Automatic type deduction for cleaner code
  • constexpr - Compile-time constant expressions for performance
  • Smart Pointers - std::unique_ptr and std::shared_ptr for automatic memory management
  • Range-based For Loops - Simplified iteration over containers
  • std::vector - Dynamic arrays for host-side data management
  • std::tuple - Multiple return values and structured data
  • Uniform Initialization - Modern initialization syntax
  • nullptr - Type-safe null pointer constant

Examples in the Tutorials

  • template_parameters demonstrates template-based kernel optimization
  • AOS_VS_SOA showcases smart pointers and structured data patterns
  • stencil_algorithm uses tuples, range-based loops, and modern headers
  • parallel_reduction_warp_shuffling implements std::unique_ptr for resource management
  • Various tutorials use auto, constexpr, and type inference throughout
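One of those patterns, RAII for device memory via std::unique_ptr with a custom deleter, can be sketched like this (names are illustrative, not taken from the tutorials):

```cuda
#include <cstddef>
#include <memory>

// cudaFree as a deleter: device memory is released automatically
// when the unique_ptr goes out of scope, even on early returns.
struct CudaDeleter {
    void operator()(float *p) const { cudaFree(p); }
};

std::unique_ptr<float, CudaDeleter> make_device_buffer(std::size_t n) {
    float *p = nullptr;
    cudaMalloc(&p, n * sizeof(float));
    return std::unique_ptr<float, CudaDeleter>(p);
}

// usage: auto d = make_device_buffer(1 << 20);  // freed at scope exit
```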

Tutorial Structure

Each tutorial is designed to be:

  • Self-contained - Can be built and run independently
  • Focused - Demonstrates a specific concept or technique
  • Practical - Includes working code examples
