This repository contains the complete code and resources for benchmarking matrix multiplication on both CPU and GPU, using various implementations and optimizations. The project demonstrates the performance benefits of GPU acceleration using CuPy with FP32 and TensorCore support, alongside CPU-based approaches with NumPy, Numba, and custom CUDA kernels.
CPU Implementations:
- Naïve Python multiplication
- NumPy-based matrix multiplication
- Numba parallel-accelerated multiplication
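The three CPU variants above can be sketched as follows (sizes and function names are illustrative, not the repository's actual API; the Numba path is guarded since it is an optional dependency):

```python
# Sketch of the three CPU variants on a small illustrative matrix.
import numpy as np

def matmul_naive(a, b):
    """Pure-Python triple loop: O(n^3), no vectorization."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

n = 64
a = np.random.rand(n, n)
b = np.random.rand(n, n)

c_naive = np.array(matmul_naive(a.tolist(), b.tolist()))
c_numpy = a @ b  # BLAS-backed, typically orders of magnitude faster

# Numba variant: the same loop, JIT-compiled and parallelized.
try:
    from numba import njit, prange

    @njit(parallel=True)
    def matmul_numba(a, b):
        n, m, p = a.shape[0], a.shape[1], b.shape[1]
        c = np.zeros((n, p))
        for i in prange(n):
            for k in range(m):
                for j in range(p):
                    c[i, j] += a[i, k] * b[k, j]
        return c

    assert np.allclose(matmul_numba(a, b), c_numpy)
except ImportError:
    pass  # Numba not installed; the NumPy and naive paths still run

assert np.allclose(c_naive, c_numpy)
```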
GPU Implementations:
- CuPy FP32: Optimized with standard floating-point precision
- CuPy TensorCore: Faster, mixed-precision matrix multiplication
- Custom CUDA kernel for small matrices with shared memory optimization
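A hedged sketch of the two CuPy paths (matrix sizes are illustrative; the code falls back to NumPy so its structure is visible even without a GPU). On Volta-or-newer GPUs, FP16 inputs let cuBLAS dispatch the matmul to TensorCores:

```python
# Sketch: CuPy FP32 vs mixed-precision (TensorCore-eligible) matmul.
import numpy as np

try:
    import cupy as cp
    a = cp.random.rand(512, 512, dtype=cp.float32)
    b = cp.random.rand(512, 512, dtype=cp.float32)

    c_fp32 = a @ b  # cuBLAS SGEMM, standard FP32 precision
    # Mixed precision: FP16 inputs are TensorCore-eligible on Volta+
    # GPUs (exact accumulation behavior depends on the cuBLAS version).
    c_tc = a.astype(cp.float16) @ b.astype(cp.float16)

    cp.cuda.Stream.null.synchronize()  # GPU kernel launches are asynchronous
    shape = c_fp32.shape
except Exception:  # CuPy missing or no CUDA device: CPU fallback
    a = np.random.rand(512, 512).astype(np.float32)
    b = np.random.rand(512, 512).astype(np.float32)
    shape = (a @ b).shape

print(shape)
```

Note that timing GPU code requires synchronizing first, otherwise you measure only the (asynchronous) kernel launch.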
Performance Benchmarking:
- Execution time, throughput, and speedup measurements
- Comparison against NumPy baseline performance
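A minimal benchmarking harness capturing the three metrics above (function and variable names are illustrative; for a GPU candidate, the timed call must include a device synchronize):

```python
# Wall-clock time, GFLOP/s throughput, and speedup vs the NumPy baseline.
import time
import numpy as np

def bench(fn, *args, repeats=3):
    """Return best-of-repeats wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 256
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t_baseline = bench(np.matmul, a, b)
t_candidate = bench(np.matmul, a, b)  # replace with the implementation under test

flops = 2 * n**3                      # an n x n matmul does n^3 multiply-adds
gflops = flops / t_baseline / 1e9
speedup = t_baseline / t_candidate    # > 1 means the candidate is faster
print(f"NumPy: {t_baseline*1e3:.3f} ms, {gflops:.1f} GFLOP/s, speedup x{speedup:.2f}")
```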
Visualizations:
- Logarithmic scale performance graphs
- Speedup and time comparisons
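A sketch of the log-scale comparison plot (the timings below are placeholders, not measured results; the headless `Agg` backend is used so the script runs without a display):

```python
# Illustrative log-log plot of runtime vs matrix size.
import matplotlib
matplotlib.use("Agg")  # headless backend: write to file, no window needed
import matplotlib.pyplot as plt

sizes = [128, 256, 512, 1024]
numpy_ms = [0.1, 0.8, 6.0, 48.0]  # placeholder timings, not measurements
gpu_ms = [0.05, 0.1, 0.4, 2.0]

plt.figure()
plt.plot(sizes, numpy_ms, marker="o", label="NumPy (CPU)")
plt.plot(sizes, gpu_ms, marker="s", label="CuPy FP32 (GPU)")
plt.xscale("log", base=2)  # sizes double each step, so log2 spacing is even
plt.yscale("log")
plt.xlabel("Matrix size n (n x n)")
plt.ylabel("Time (ms)")
plt.legend()
plt.savefig("benchmark.png", dpi=120)
```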
- Install dependencies:
Ensure you have the required libraries installed:

    pip install numpy numba cupy matplotlib

(For prebuilt CuPy wheels, install the package matching your CUDA version, e.g. cupy-cuda12x.)
- Modify matrix sizes and block dimensions in the scripts for different benchmarks.
- Tune the CUDA kernel parameters to optimize performance for specific hardware.
- Experiment with different matrix sizes to observe scaling behavior.
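As a starting point for kernel tuning, here is a hedged sketch of a shared-memory tiled matmul with the tile size exposed as a tunable parameter (the kernel name, tile size, and matrix size are illustrative; the GPU path is guarded so the source remains inspectable without CUDA):

```python
# Tiled CUDA matmul via cupy.RawKernel; TILE is the tunable parameter.
TILE = 16  # try 8 / 16 / 32 to match your hardware's occupancy

kernel_src = f"""
extern "C" __global__ void matmul_tiled(const float* A, const float* B,
                                        float* C, int n) {{
    __shared__ float As[{TILE}][{TILE}];
    __shared__ float Bs[{TILE}][{TILE}];
    int row = blockIdx.y * {TILE} + threadIdx.y;
    int col = blockIdx.x * {TILE} + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / {TILE}; ++t) {{
        // Stage one tile of A and B into shared memory, then accumulate.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * {TILE} + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * {TILE} + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < {TILE}; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }}
    C[row * n + col] = acc;
}}
"""

try:
    import cupy as cp
    kernel = cp.RawKernel(kernel_src, "matmul_tiled")
    n = 64  # this simplified sketch assumes n is a multiple of TILE
    a = cp.random.rand(n, n, dtype=cp.float32)
    b = cp.random.rand(n, n, dtype=cp.float32)
    c = cp.zeros((n, n), dtype=cp.float32)
    kernel((n // TILE, n // TILE), (TILE, TILE), (a, b, c, n))
    ok = bool(cp.allclose(c, a @ b, atol=1e-3))
except Exception:
    ok = None  # CuPy missing or no CUDA device
```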
Feel free to contribute by suggesting further optimizations, adding new algorithms, or improving the visualizations. 🚀