This repository serves as a comprehensive guide to CUDA programming, covering fundamental concepts, progressive optimization techniques, and performance analysis methodologies. Whether you're new to GPU programming or looking to deepen your understanding of high-performance computing, this guide provides structured learning materials with practical examples.
Each section builds upon previous concepts, demonstrating how incremental improvements lead to significant performance gains. Real-world examples show the transition from basic implementations to highly optimized kernels.
Learn to use industry-standard tools like NVIDIA Nsight Compute to identify bottlenecks, measure performance metrics, and validate optimization efforts through:
- Roofline model analysis
- Memory access pattern evaluation
- Occupancy and throughput measurements
- Instruction-level profiling
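The profiling workflow above centers on the roofline model, which bounds attainable throughput by the lesser of peak compute and peak bandwidth times arithmetic intensity. As a minimal sketch (the peak numbers below are illustrative placeholders, not any specific GPU):

```python
# Roofline model: attainable FLOP/s = min(peak_flops, bandwidth * intensity).
# PEAK_FLOPS and PEAK_BW are hypothetical hardware limits for illustration.
PEAK_FLOPS = 19.5e12      # assumed peak FP32 throughput (FLOP/s)
PEAK_BW = 1.55e12         # assumed peak DRAM bandwidth (bytes/s)

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

def attainable_flops(intensity: float) -> float:
    """Roofline bound: memory-bound below the ridge point, compute-bound above."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

# Example: a simple SGEMM model for n x n matrices uses 2n^3 FLOPs over
# 3n^2 floats (read A and B, write C) of 4 bytes each.
n = 4096
ai = arithmetic_intensity(2 * n**3, 3 * n**2 * 4)
print(f"intensity = {ai:.1f} FLOP/B, bound = {attainable_flops(ai) / 1e12:.1f} TFLOP/s")
```

Kernels whose intensity falls left of the ridge point (PEAK_FLOPS / PEAK_BW) are memory-bound; Nsight Compute's roofline chart plots measured kernels against exactly these two ceilings.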
All concepts are accompanied by executable code examples that illustrate key principles and allow hands-on experimentation with CUDA optimization techniques.
Each directory contains self-contained examples with build instructions. Follow the sequential learning path for best understanding, or jump to specific topics based on your interests.
Explore the directories to begin your journey into CUDA programming and performance optimization.
Recommendations
- A C++ compiler (e.g., Clang or GCC)
- CMake
- CUDA Toolkit, including NVIDIA Nsight Compute (ncu) and cuda-gdb
- Python (for simulation and calculation)
- LibTorch (optional, but recommended for running some examples)
Foundational CUDA concepts including:
- Memory management and optimization
- Reduction operations and parallel algorithms
- Warp-level primitives and cooperative groups
- Performance profiling fundamentals
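To give a flavor of the warp-level primitives covered here, the sketch below reduces an array with shuffle intrinsics, so lanes exchange partial sums directly through registers instead of shared memory. The kernel and function names are illustrative, not the repository's actual code:

```cuda
#include <cuda_runtime.h>

// Warp-level sum: each step halves the number of live partial sums.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp total
}

// Block-level sum: reduce within each warp, stage one partial per warp in
// shared memory, then let the first warp reduce those partials.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float partial[32];  // one slot per warp (blockDim.x <= 1024)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (tid < n) ? in[tid] : 0.0f;

    val = warpReduceSum(val);
    if (threadIdx.x % warpSize == 0)
        partial[threadIdx.x / warpSize] = val;
    __syncthreads();

    if (threadIdx.x < warpSize) {
        int numWarps = blockDim.x / warpSize;
        val = (threadIdx.x < numWarps) ? partial[threadIdx.x] : 0.0f;
        val = warpReduceSum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);  // combine across blocks
    }
}
```

Note that every lane of a warp participates in each `__shfl_down_sync` call (the full 0xffffffff mask), which is why the final reduction pads inactive slots with zero rather than leaving lanes idle.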
A step-by-step optimization guide for GEMM (General Matrix Multiply):
- Naive implementation and performance bottlenecks
- Memory coalescing techniques
- Shared memory cache-blocking strategies
- Advanced optimizations with detailed profiling
- Roofline model analysis for performance characterization
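The shared-memory cache-blocking stage of that progression can be sketched as follows. This is a simplified version of the idea, assuming square row-major matrices with n a multiple of the tile size; it is not the repository's final optimized kernel:

```cuda
#define TILE 32

// Cache-blocked SGEMM sketch: C = A * B. Each block computes one TILE x TILE
// tile of C, staging tiles of A and B through shared memory so each global
// element is loaded once per tile pass instead of once per multiply-add.
__global__ void sgemmTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < n; k0 += TILE) {
        // Coalesced loads: consecutive threadIdx.x hits consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Compared with the naive kernel, each element of A and B is read from global memory n/TILE times instead of n times, which raises arithmetic intensity and moves the kernel up the roofline.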
Introduction to NVIDIA's CUTLASS library for high-performance GEMM operations:
- Core concepts and API usage
- Template-based programming patterns
- Performance tuning guidelines
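For orientation, CUTLASS's device-level API looks roughly like the sketch below, which follows the structure of the library's basic single-precision GEMM examples. Layouts, leading dimensions, and the helper name are assumptions for illustration; tile shapes and the epilogue fall back to CUTLASS's defaults:

```cuda
#include <cutlass/gemm/device/gemm.h>

// Single-precision GEMM via CUTLASS's device-level API. The template
// parameters select element types and layouts for A, B, and C.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::RowMajor,   // A
    float, cutlass::layout::RowMajor,   // B
    float, cutlass::layout::RowMajor>;  // C

// Hypothetical wrapper: D = alpha * A @ B + beta * C, with C also serving
// as the output D. Pointers refer to device memory.
cutlass::Status run_gemm(int M, int N, int K,
                         const float* A, const float* B, float* C,
                         float alpha, float beta) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {A, K},    // tensor ref: pointer + leading dimension
                         {B, N},
                         {C, N},    // source C
                         {C, N},    // destination D
                         {alpha, beta});
    return gemm_op(args);  // launches the kernel on the default stream
}
```

The template-based design means tuning decisions (threadblock and warp tile shapes, pipeline stages) are made at compile time, which is what the tuning-guideline material in this section explores.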
Implementation and optimization of attention mechanisms:
- Classical softmax computation
- Memory-efficient attention algorithms
- Performance comparison with standard approaches
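The key numerical idea behind memory-efficient attention is the online softmax: a running maximum and a running denominator are updated in one pass, rescaling the denominator whenever the maximum grows. A plain-Python sketch of both variants (function names are illustrative):

```python
import math

def softmax(xs):
    """Classical two-pass softmax: find the max, then normalize.
    Subtracting the max keeps exp() from overflowing."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def online_softmax(xs):
    """One-pass softmax: update a running max m and denominator d, rescaling
    d by exp(m_old - m_new) whenever the max grows. This is the trick that
    lets tiled attention kernels process keys block by block without ever
    materializing the full score row."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Both functions return identical probabilities; the online form simply defers normalization until the final max and denominator are known, which is what allows an attention kernel to stream over tiles of keys.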