Part-I: Use CUDA to accelerate the operations of a typical convolutional layer in often-used large-scale neural networks. (You can find the description slides here)
Part-II: Accelerate a sparse convolutional layer with CUDA. (You can find the description slides here)
This directory contains the input data for the base program
- /data/filt.txt - Store the values of filters
- /data/filt.coo - Store the values of filters in COO format
- /data/inNeu.txt - Store the values of input neurons
- /data/inNeu.coo - Store the values of input neurons in COO format
This example shows how to use CUDA to accelerate an inner product:
cd ./innerProduct
make
make run
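This is not the provided sample itself, but a sketch of the pattern such a kernel typically uses: grid-stride partial sums, a shared-memory tree reduction per block, and `atomicAdd` to combine block results.

```cuda
// Sketch of a dot-product kernel (illustrative, not the course's code).
// Assumes blockDim.x is a power of two.
__global__ void innerProduct(const float* a, const float* b,
                             float* out, int n) {
    extern __shared__ float partial[];   // one float per thread
    int tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates a partial sum.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One atomicAdd per block combines the block results.
    if (tid == 0) atomicAdd(out, partial[0]);
}
```

A launch would pass the dynamic shared-memory size and zero the output first, e.g. `innerProduct<<<blocks, threads, threads * sizeof(float)>>>(dA, dB, dOut, n);` (where `dA`, `dB`, `dOut` are hypothetical device buffers).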
The program in this directory prints the device information:
cd ./device
make
make run
git checkout -t origin/part2
make
make run
- Put the input data in sparse format and reimplement your CUDA kernels
- Use NVIDIA Visual Profiler to analyze and improve your code
- Optimize your CUDA kernels for the sparse format
- Improve the input data format (like using other sparse format rather than COO)
- convLayerCPU() performs the computation in C++ and stores the output in outCPU
- checker() verifies that the values stored in outCPU and outGPU match
- Store your result in outGPU in dense format
- You must pass the check to ensure your result is correct!
- Use nvvp (or nvprof) to measure the kernel execution time and data transfer time
- TA will use TotalExecTime to evaluate your performance
  - DataTransTime = DataHostToDeviceTime + DataDeviceToHostTime
  - TotalExecTime = GPUKernelsExecTime + DataTransTime
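The timing components above can be measured with CUDA events. A minimal sketch, where `dIn`, `hIn`, `dOut`, `hOut`, `bytes`, `outBytes`, and `myKernel` are hypothetical placeholders for your own buffers and kernels:

```cuda
cudaEvent_t t0, t1;
cudaEventCreate(&t0);
cudaEventCreate(&t1);
float h2dMs, kernelMs, d2hMs;

// DataHostToDeviceTime
cudaEventRecord(t0);
cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
cudaEventRecord(t1);
cudaEventSynchronize(t1);
cudaEventElapsedTime(&h2dMs, t0, t1);

// GPUKernelsExecTime
cudaEventRecord(t0);
myKernel<<<grid, block>>>(dIn, dOut);
cudaEventRecord(t1);
cudaEventSynchronize(t1);
cudaEventElapsedTime(&kernelMs, t0, t1);

// DataDeviceToHostTime
cudaEventRecord(t0);
cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost);
cudaEventRecord(t1);
cudaEventSynchronize(t1);
cudaEventElapsedTime(&d2hMs, t0, t1);

float dataTransMs = h2dMs + d2hMs;          // DataTransTime
float totalExecMs = kernelMs + dataTransMs; // TotalExecTime
```

These are the same quantities nvvp/nvprof report, so the events mainly serve as a quick in-program cross-check.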
- Completeness (30%)
- Your result is correct (Pass the check) - 5%
- You get speedup compared to convLayerCPU() - 5%
- You use NVIDIA Visual Profiler (NVVP) to help you - 5%
- You utilize the sparsity in either Neurons or Filters - 5%
- Improve the input data format (like using other sparse format rather than COO) - 10%
- Performance Ranking (30%)
- TA will rank your TotalExecTime on the provided server
- The fastest one will get 30% and the last one will get 1%
- Report (40%)
- Description of your implementation and results - 5%
- Show how NVVP helped you find and solve performance issues - 5%
- Discussion on your optimizations and innovations - 20%
- Comparison with Part-I - 5%
- Feedback of this project - 5%
- This is team work: 1 ~ 3 people per team
- Same team members as part-I
- Compress your code and report into one zip file and upload to E3
- Name your package as: LeaderID_FP2.zip
- Each team only needs to upload one package to E3
- Please name your report as: LeaderID_Report_FP2.pdf
- Make sure TA can compile and run your code on the provided server
- Using any CUDA library is forbidden in this project
- Delay is NOT acceptable
- Any plagiarism will result in zero points
- LeNet: Gradient Based Learning Applied to Document Recognition
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
- CNN: Stanford CS231n Convolutional Neural Networks for Visual Recognition
- CUDA Tutorial: CUDA C/C++ Basics
- CNN with CUDA: Optimizing Convolution Operations in CUDA with Adaptive Tiling
- GPU Profiling: GPU Performance Analysis and Optimisation
- GPU Profiling: CUDA Profiling Documentation
- Network pruning: Learning both Weights and Connections for Efficient Neural Networks
- Sparsity in Neurons: Cnvlutin: Ineffectual-neuron-free Deep Neural Network Computing
- Sparse data GPU: Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
- Sparse data with CUDA: Efficient Sparse Matrix-Vector Multiplication on CUDA
TA: Chien-Yu Lin
Email: myislin@gmail.com