JR-Wesley/CUDApunk

A Definitive Guide to CUDA Programming and Profiling

This repository serves as a comprehensive guide to CUDA programming, covering fundamental concepts, progressive optimization techniques, and performance analysis methodologies. Whether you're new to GPU programming or looking to deepen your understanding of high-performance computing, this guide provides structured learning materials with practical examples.

Key Features

Progressive Optimization Approach

Each section builds upon previous concepts, demonstrating how incremental improvements lead to significant performance gains. Real-world examples show the transition from basic implementations to highly optimized kernels.

Performance Analysis and Profiling

Learn to use industry-standard tools like NVIDIA Nsight Compute to identify bottlenecks, measure performance metrics, and validate optimization efforts through:

  • Roofline model analysis
  • Memory access pattern evaluation
  • Occupancy and throughput measurements
  • Instruction-level profiling
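
Before diving into Nsight Compute's detailed metrics, a quick wall-clock measurement with CUDA events gives a first throughput number to compare against. Below is a minimal sketch (the `vecAdd` kernel and all sizes are illustrative, not from this repository) that times a kernel launch and estimates effective memory bandwidth:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel used only as a timing target: element-wise vector add.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // Effective bandwidth: 2 loads + 1 store of 4 bytes per element.
    printf("kernel time: %.3f ms, ~%.1f GB/s\n",
           ms, 3.0 * n * sizeof(float) / (ms * 1e6));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same binary can then be handed to `ncu` to collect the detailed roofline, occupancy, and memory-pattern sections discussed above.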

Practical Examples

All concepts are accompanied by executable code examples that illustrate key principles and allow hands-on experimentation with CUDA optimization techniques.

Getting Started

Each directory contains self-contained examples with build instructions. Follow the sequential learning path for best understanding, or jump to specific topics based on your interests.

Explore the directories to begin your journey into CUDA programming and performance optimization.

Requirements

Recommendations

  • A C++ compiler (e.g., Clang)
  • CMake
  • CUDA Toolkit, including NCU (NVIDIA Nsight Compute) and cuda-gdb
  • Python (for simulation and calculation scripts)
  • LibTorch (optional, but recommended for running some examples)

Repository Structure and Learning Path

Basics

Foundational CUDA concepts including:

  • Memory management and optimization
  • Reduction operations and parallel algorithms
  • Warp-level primitives and cooperative groups
  • Performance profiling fundamentals
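
To give a flavor of the warp-level primitives covered here, the following sketch (my own illustrative code, not taken from this repository) combines a `__shfl_down_sync` warp reduction with a shared-memory staging step to sum an array:

```cuda
#include <cuda_runtime.h>

// Warp-level sum: each warp reduces its 32 values entirely in
// registers, with no shared-memory traffic.
__device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 ends up holding the warp's sum
}

// Block-level sum: warp-reduce, stage one partial per warp in shared
// memory, then reduce those partials with the first warp.
// `out` is assumed to be zero-initialized by the caller.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.f;
    v = warpReduceSum(v);
    if (threadIdx.x % 32 == 0) warpSums[threadIdx.x / 32] = v;
    __syncthreads();
    if (threadIdx.x < 32) {
        v = (threadIdx.x < blockDim.x / 32) ? warpSums[threadIdx.x] : 0.f;
        v = warpReduceSum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);
    }
}
```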

MMA (Matrix Multiply-Accumulate)

A step-by-step optimization guide for GEMM (General Matrix Multiply):

  • Naive implementation and performance bottlenecks
  • Memory coalescing techniques
  • Shared memory cache-blocking strategies
  • Advanced optimizations with detailed profiling
  • Roofline model analysis for performance characterization
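
As a taste of the shared-memory cache-blocking step in that progression, here is a minimal tiled SGEMM sketch (illustrative only; it assumes row-major storage and dimensions divisible by the tile size for brevity):

```cuda
#define TILE 32

// Cache-blocked C = A * B. Each 32x32 thread block computes one
// TILE x TILE tile of C; global loads from A and B are coalesced,
// and each loaded element is reused TILE times from shared memory.
__global__ void sgemmTiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;
    for (int t = 0; t < K; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();  // tile fully staged before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before overwriting
    }
    C[row * N + col] = acc;
}
```

The naive kernel reads A and B from global memory once per multiply; this version cuts that traffic by a factor of TILE, which the roofline analysis in this section makes visible.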

CUTLASS

Introduction to NVIDIA's CUTLASS library for high-performance GEMM operations:

  • Core concepts and API usage
  • Template-based programming patterns
  • Performance tuning guidelines
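
For orientation, a single-precision, row-major device-level GEMM in CUTLASS looks roughly like the following sketch (modeled on CUTLASS's basic GEMM example; `d_A`, `d_B`, `d_C` are device pointers the caller is assumed to have allocated and filled):

```cuda
#include <cutlass/gemm/device/gemm.h>

// Minimal CUTLASS device GEMM: D = alpha * A * B + beta * C,
// with C and D aliased so the result lands in d_C.
cutlass::Status runSgemm(int M, int N, int K, float alpha,
                         const float* d_A, const float* d_B,
                         float beta, float* d_C) {
    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::RowMajor,   // A: element type, layout
        float, cutlass::layout::RowMajor,   // B
        float, cutlass::layout::RowMajor>;  // C / D

    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},       // problem size
                         {d_A, K},        // A and its leading dimension
                         {d_B, N},        // B
                         {d_C, N},        // C (source for beta term)
                         {d_C, N},        // D (output)
                         {alpha, beta});  // epilogue scalars
    return gemm_op(args);                 // launches the kernel
}
```

The template parameters left defaulted here (tile shapes, architecture, epilogue) are exactly the tuning knobs this section's guidelines address.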

Flash Attention

Implementation and optimization of attention mechanisms:

  • Classical softmax computation
  • Memory-efficient attention algorithms
  • Performance comparison with standard approaches
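
The memory-efficient algorithms here hinge on the one-pass ("online") softmax recurrence, which lets attention scores be processed tile by tile without materializing a full row. A minimal host-side sketch of that recurrence (illustrative, not this repository's implementation):

```cpp
#include <cmath>

// Online softmax: a single pass maintains a running max `m` and a
// running sum `l` of exp(x[i] - m), rescaling `l` whenever the max
// grows. This is the rescaling trick FlashAttention applies per tile.
void onlineSoftmax(const float* x, float* y, int n) {
    float m = -INFINITY;  // running max
    float l = 0.f;        // running sum of exponentials
    for (int i = 0; i < n; ++i) {
        float mNew = fmaxf(m, x[i]);
        l = l * expf(m - mNew) + expf(x[i] - mNew);
        m = mNew;
    }
    for (int i = 0; i < n; ++i)
        y[i] = expf(x[i] - m) / l;
}
```

The classical two-pass softmax must see the whole row before normalizing; the online form needs only `(m, l)` carried between tiles, which is what removes the O(N²) memory traffic.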
