This repository serves as a comprehensive guide to CUDA programming, covering fundamental concepts, progressive optimization techniques, and performance analysis methodologies. Whether you're new to GPU programming or looking to deepen your understanding of high-performance computing, this guide provides structured learning materials with practical examples.
Each section builds upon previous concepts, demonstrating how incremental improvements lead to significant performance gains. Real-world examples show the transition from basic implementations to highly optimized kernels.
Learn to use industry-standard tools like NVIDIA Nsight Compute to identify bottlenecks, measure performance metrics, and validate optimization efforts through:
- Roofline model analysis
- Memory access pattern evaluation
- Occupancy and throughput measurements
- Instruction-level profiling
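The profiling workflow above centers on the roofline model, which bounds attainable throughput by the lesser of peak compute and peak bandwidth times arithmetic intensity. As a minimal sketch (the peak numbers below are illustrative placeholders, not any specific GPU):

```python
# Roofline model: attainable FLOP/s = min(peak_flops, bandwidth * intensity).
# PEAK_FLOPS and PEAK_BW are hypothetical hardware limits for illustration.
PEAK_FLOPS = 19.5e12      # assumed peak FP32 throughput (FLOP/s)
PEAK_BW = 1.55e12         # assumed peak DRAM bandwidth (bytes/s)

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

def attainable_flops(intensity: float) -> float:
    """Roofline bound: memory-bound below the ridge point, compute-bound above."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

# Example: a simple SGEMM model for n x n matrices uses 2n^3 FLOPs over
# 3n^2 floats (read A and B, write C) of 4 bytes each.
n = 4096
ai = arithmetic_intensity(2 * n**3, 3 * n**2 * 4)
print(f"intensity = {ai:.1f} FLOP/B, bound = {attainable_flops(ai) / 1e12:.1f} TFLOP/s")
```

Kernels whose intensity falls left of the ridge point (PEAK_FLOPS / PEAK_BW) are memory-bound; Nsight Compute's roofline chart plots measured kernels against exactly these two ceilings.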
All concepts are accompanied by executable code examples that illustrate key principles and allow hands-on experimentation with CUDA optimization techniques.
Each directory contains self-contained examples with build instructions. Follow the sequential learning path for best understanding, or jump to specific topics based on your interests.
Explore the directories to begin your journey into CUDA programming and performance optimization.
Recommendations
- A C++ compiler (e.g., Clang or GCC)
- CMake
- CUDA Toolkit, including NVIDIA Nsight Compute (ncu) and cuda-gdb
- Python (for simulation and calculation)
- LibTorch (optional, but recommended for running some examples)
Foundational CUDA concepts including:
- Memory management and optimization
- Reduction operations and parallel algorithms
- Warp-level primitives and cooperative groups
- Performance profiling fundamentals
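To give a flavor of the warp-level primitives covered here, the sketch below reduces an array with shuffle intrinsics, so lanes exchange partial sums directly through registers instead of shared memory. The kernel and function names are illustrative, not the repository's actual code:

```cuda
#include <cuda_runtime.h>

// Warp-level sum: each step halves the number of live partial sums.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp total
}

// Block-level sum: reduce within each warp, stage one partial per warp in
// shared memory, then let the first warp reduce those partials.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float partial[32];  // one slot per warp (blockDim.x <= 1024)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (tid < n) ? in[tid] : 0.0f;

    val = warpReduceSum(val);
    if (threadIdx.x % warpSize == 0)
        partial[threadIdx.x / warpSize] = val;
    __syncthreads();

    if (threadIdx.x < warpSize) {
        int numWarps = blockDim.x / warpSize;
        val = (threadIdx.x < numWarps) ? partial[threadIdx.x] : 0.0f;
        val = warpReduceSum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);  // combine across blocks
    }
}
```

Note that every lane of a warp participates in each `__shfl_down_sync` call (the full 0xffffffff mask), which is why the final reduction pads inactive slots with zero rather than leaving lanes idle.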
A step-by-step optimization guide for GEMM (General Matrix Multiply):
- Naive implementation and performance bottlenecks
- Memory coalescing techniques
- Shared memory cache-blocking strategies
- Advanced optimizations with detailed profiling
- Roofline model analysis for performance characterization
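The shared-memory cache-blocking stage of that progression can be sketched as follows. This is a simplified version of the idea, assuming square row-major matrices with n a multiple of the tile size; it is not the repository's final optimized kernel:

```cuda
#define TILE 32

// Cache-blocked SGEMM sketch: C = A * B. Each block computes one TILE x TILE
// tile of C, staging tiles of A and B through shared memory so each global
// element is loaded once per tile pass instead of once per multiply-add.
__global__ void sgemmTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < n; k0 += TILE) {
        // Coalesced loads: consecutive threadIdx.x hits consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Compared with the naive kernel, each element of A and B is read from global memory n/TILE times instead of n times, which raises arithmetic intensity and moves the kernel up the roofline.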
Introduction to NVIDIA's CUTLASS library for high-performance GEMM operations:
- Core concepts and API usage
- Template-based programming patterns
- Performance tuning guidelines
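For orientation, CUTLASS's device-level API looks roughly like the sketch below, which follows the structure of the library's basic single-precision GEMM examples. Layouts, leading dimensions, and the helper name are assumptions for illustration; tile shapes and the epilogue fall back to CUTLASS's defaults:

```cuda
#include <cutlass/gemm/device/gemm.h>

// Single-precision GEMM via CUTLASS's device-level API. The template
// parameters select element types and layouts for A, B, and C.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::RowMajor,   // A
    float, cutlass::layout::RowMajor,   // B
    float, cutlass::layout::RowMajor>;  // C

// Hypothetical wrapper: D = alpha * A @ B + beta * C, with C also serving
// as the output D. Pointers refer to device memory.
cutlass::Status run_gemm(int M, int N, int K,
                         const float* A, const float* B, float* C,
                         float alpha, float beta) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {A, K},    // tensor ref: pointer + leading dimension
                         {B, N},
                         {C, N},    // source C
                         {C, N},    // destination D
                         {alpha, beta});
    return gemm_op(args);  // launches the kernel on the default stream
}
```

The template-based design means tuning decisions (threadblock and warp tile shapes, pipeline stages) are made at compile time, which is what the tuning-guideline material in this section explores.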
Implementation and optimization of attention mechanisms:
- Classical softmax computation
- Memory-efficient attention algorithms
- Performance comparison with standard approaches
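The key numerical idea behind memory-efficient attention is the online softmax: a running maximum and a running denominator are updated in one pass, rescaling the denominator whenever the maximum grows. A plain-Python sketch of both variants (function names are illustrative):

```python
import math

def softmax(xs):
    """Classical two-pass softmax: find the max, then normalize.
    Subtracting the max keeps exp() from overflowing."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def online_softmax(xs):
    """One-pass softmax: update a running max m and denominator d, rescaling
    d by exp(m_old - m_new) whenever the max grows. This is the trick that
    lets tiled attention kernels process keys block by block without ever
    materializing the full score row."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Both functions return identical probabilities; the online form simply defers normalization until the final max and denominator are known, which is what allows an attention kernel to stream over tiles of keys.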