-
CUDA C++ Programming Guide
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- Comprehensive guide to CUDA programming
- Essential reading for understanding our CUDA backend
-
CUDA Runtime API Reference
- https://docs.nvidia.com/cuda/cuda-runtime-api/
- Complete API reference for all CUDA functions
- Used extensively in src/backends/cuda/CUDABackend.cpp
-
CUDA Toolkit Documentation
- https://docs.nvidia.com/cuda/
- Hub for all CUDA documentation
-
CUDA C++ Best Practices Guide
- https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
- Optimization techniques we use in our kernels
-
Parallel Reduction
- https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- Mark Harris's famous reduction optimization guide
- Direct inspiration for our reduction.cu kernel
-
Ampere Architecture Whitepaper (GeForce RTX 30 series)
- https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
-
CUDA GPU Compute Capability List
- https://developer.nvidia.com/cuda-gpus
- Find your GPU's compute capability
-
OpenCL 3.0 Specification
- https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html
- Official API specification
-
OpenCL C Language Specification
- https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html
- Kernel language syntax
-
OpenCL Quick Reference
- https://www.khronos.org/files/opencl30-reference-guide.pdf
- Handy PDF reference card
-
Hands On OpenCL
- https://handsonopencl.github.io/
- Excellent tutorial series
-
OpenCL Programming Guide (Book)
- By Aaftab Munshi, Benedict Gaster, et al.
- ISBN: 978-0321749642
-
DirectCompute Overview
- https://learn.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-advanced-stages-compute-shader
- Official Microsoft documentation
-
HLSL Reference
- https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-reference
- Complete HLSL language reference
-
Compute Shader Overview
- https://learn.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-advanced-stages-compute-create
- How to create and use compute shaders
- Direct3D 11 Documentation
- https://learn.microsoft.com/en-us/windows/win32/direct3d11/atoc-dx-graphics-direct3d-11
- Full Direct3D 11 API documentation
-
"Programming Massively Parallel Processors"
- Authors: David Kirk, Wen-mei Hwu
- ISBN: 978-0124159921
- Best for: Understanding GPU architecture fundamentals
- Used in this project: Matrix multiplication optimization insights
-
"CUDA by Example"
- Authors: Jason Sanders, Edward Kandrot
- ISBN: 978-0131387683
- Best for: Learning CUDA from scratch
- Used in this project: Vector addition patterns
-
"Professional CUDA C Programming"
- Authors: John Cheng, Max Grossman, Ty McKercher
- ISBN: 978-1118739327
- Best for: Advanced optimization techniques
- Used in this project: Warp shuffle primitives, bank conflict avoidance
-
"Heterogeneous Computing with OpenCL 2.0"
- Authors: David Kaeli, Perhaad Mistry, et al.
- ISBN: 978-0128014141
- Best for: Cross-platform GPU programming
- Used in this project: OpenCL backend design
-
"Effective Modern C++"
- Author: Scott Meyers
- ISBN: 978-1491903995
- Best for: Modern C++ patterns (C++11/14)
- Used in this project: Smart pointers, move semantics, RAII
-
"Design Patterns"
- Authors: Gang of Four (Gamma, Helm, Johnson, Vlissides)
- ISBN: 978-0201633610
- Best for: Software architecture patterns
- Used in this project: Strategy, Factory, Singleton patterns
-
NVIDIA CUDA Training
- https://www.nvidia.com/en-us/training/
- Official NVIDIA courses (a mix of free and paid)
-
Udacity: Intro to Parallel Programming (CS344)
- https://www.udacity.com/course/intro-to-parallel-programming--cs344
- Free course, excellent for beginners
-
Coursera: GPU Programming Specialization
- https://www.coursera.org/specializations/gpu-programming
- Johns Hopkins University
- Hands-On OpenCL Course
- https://handsonopencl.github.io/
- Free, interactive tutorial
- Microsoft Learn: DirectX
- https://learn.microsoft.com/en-us/windows/win32/directx
- Official tutorials and samples
-
"Optimizing Parallel Reduction in CUDA" (Mark Harris, 2007)
- https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- Foundational paper on reduction optimization
- Direct influence on our reduction.cu implementation
-
"Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Williams et al., 2009)
- https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf
- Performance analysis framework
- Used to tell whether a kernel is memory-bound or compute-bound
-
"Matrix Multiplication on GPUs" (Volkov & Demmel, 2008)
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-111.pdf
- Advanced matrix multiplication optimization
- Inspired our tiling strategy
- "NVIDIA GPU Architecture Whitepapers"
- Ampere: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
- Turing: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
- Essential for understanding modern GPU capabilities
-
Nsight Compute
- https://developer.nvidia.com/nsight-compute
- Kernel profiler (instruction-level analysis)
- Usage: Profile our CUDA kernels to find bottlenecks
-
Nsight Systems
- https://developer.nvidia.com/nsight-systems
- System-wide profiler (CPU+GPU timeline)
- Usage: Understand host-device interactions
-
CUDA-GDB
- https://docs.nvidia.com/cuda/cuda-gdb/
- GPU debugger
- Usage: Debug kernel crashes
- AMD CodeXL
- https://github.com/GPUOpen-Archive/CodeXL
- OpenCL profiler and debugger
- Usage: Profile OpenCL kernels on AMD GPUs
- PIX for Windows
- https://devblogs.microsoft.com/pix/download/
- DirectX profiler and debugger
- Usage: Profile DirectCompute shaders
-
GPU-Z
- https://www.techpowerup.com/gpuz/
- GPU monitoring tool (clocks, temperatures, utilization)
-
HWiNFO
- https://www.hwinfo.com/
- Comprehensive system monitoring
-
NVIDIA Developer Forums
- https://forums.developer.nvidia.com/c/gpu-graphics-and-game-dev/cuda/206
- Ask CUDA questions
-
Khronos OpenCL Forums
- https://community.khronos.org/c/opencl/13
- OpenCL discussions
-
Stack Overflow
- Tags: [cuda], [opencl], [directcompute], [hlsl]
- https://stackoverflow.com/questions/tagged/cuda
-
CUDA Samples
- https://github.com/NVIDIA/cuda-samples
- Official NVIDIA examples
-
OpenCL Samples
- https://github.com/KhronosGroup/OpenCL-Guide
- Official Khronos guide
-
DirectX Samples
- https://github.com/microsoft/DirectX-Graphics-Samples
- Official Microsoft samples
-
NVIDIA Developer Blog
- https://developer.nvidia.com/blog/
- Latest CUDA features and best practices
-
Parallel Forall (Archive)
- https://developer.nvidia.com/blog/category/parallel-forall/
- Classic GPU programming articles
- Colfax Research: CUDA Optimization
- https://colfaxresearch.com/blog/
- Deep-dive optimization guides
- Real-Time Rendering Blog
- http://www.realtimerendering.com/blog/
- Graphics and compute shader insights
- OpenCL Registry
- https://www.khronos.org/registry/OpenCL/
- All OpenCL specifications and extensions
- HLSL Specifications
- https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl
- Language specifications and shader models
- C++17 Standard
- https://isocpp.org/std/the-standard
- Language standard we use
-
NVIDIA Developer
- https://www.youtube.com/c/NVIDIADeveloper
- GTC talks, tutorials
-
CppCon Talks
- https://www.youtube.com/user/CppCon
- C++ best practices
-
"Intro to CUDA" - NVIDIA
- Basic CUDA programming concepts
-
"Optimizing Parallel Reduction in CUDA" - Mark Harris
- Our reduction kernel is based on this
-
"GPU Performance Analysis and Optimization" - NVIDIA GTC
- Profiling and optimization techniques
- CUDA C++ Programming Guide → Core backend architecture
- "CUDA by Example" → Vector addition implementation
- "Programming Massively Parallel Processors" → Matrix multiplication tiling
- Mark Harris's Reduction Paper → Reduction kernel optimization
- NVIDIA Best Practices Guide → Memory coalescing patterns
- Roofline Model Paper → Performance analysis framework
- "Design Patterns" (GoF) → Strategy and Factory patterns
- "Effective Modern C++" → RAII and smart pointers
- OpenCL Spec → Cross-platform API design
- "Professional CUDA C Programming" → Technical explanations
- NVIDIA Documentation Style → Code comments format
- GitHub Best Practices → README structure
- Read "CUDA by Example"
- Complete Udacity CS344 course
- Study our vector_add.cu kernel
- Modify and experiment
- Read "Programming Massively Parallel Processors"
- Study our matrix_mul.cu optimizations
- Profile with Nsight Compute
- Implement your own benchmark
- Read "Professional CUDA C Programming"
- Study advanced papers (Reduction, Roofline)
- Optimize for specific GPU architectures
- Contribute to this project!
Found a great resource? Add it here!
- Fork the repository
- Edit this file
- Submit a pull request
- Help others learn!
This is your roadmap to GPU programming mastery! 🎓🚀
Next: Apply this knowledge by reading our source code and documentation!
Curated by: Soham Dave
Date: January 2026
For: GPU Benchmark Suite v1.0
Purpose: Comprehensive learning resource collection