A short example showing how to take a dummy encryption/decryption program from a basic CPU implementation to a GPU implementation that uses copy-compute overlap to run faster. The speedup depends on the hardware used.
First ensure that you have properly installed CUDA and Nsight Tools.
cd baseline_cipher
Type make baseline in a CLI to compile the code. The first run will be slow because the CPU does all the encoding and caches the encoded result in a file.
Type make clean to remove the results of compilation.
cd all_cuda_streams
Type make streams to compile. This code uses the GPU for both encoding and decoding the data, and it never reads from your cached file.
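
For readers new to copy-compute overlap, here is a minimal sketch of the general technique. It is not the code in all_cuda_streams: the byte-shift cipher, the names shift_kernel and run_streamed_roundtrip, and the chunk layout are all assumptions made for illustration. The idea is to split the buffer into chunks, give each chunk its own CUDA stream, and queue the host-to-device copy, the encode and decode kernels, and the device-to-host copy on that stream, so that copies for one chunk can overlap with kernels for another.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical dummy cipher (an assumption, not the repository's cipher):
// add a fixed shift to every byte; a negative shift undoes it.
__global__ void shift_kernel(unsigned char *data, size_t n, int shift) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = (unsigned char)(data[i] + shift);
}

// Encode and decode n bytes on the GPU using num_streams streams.
// Returns true if the decoded data matches the original input.
bool run_streamed_roundtrip(size_t n, int num_streams) {
    unsigned char *h_data = nullptr, *d_data = nullptr;
    // Pinned host memory lets cudaMemcpyAsync overlap with kernel execution.
    CUDA_CHECK(cudaMallocHost((void **)&h_data, n));
    CUDA_CHECK(cudaMalloc((void **)&d_data, n));
    for (size_t i = 0; i < n; ++i) h_data[i] = (unsigned char)(i % 251);

    cudaStream_t *streams =
        (cudaStream_t *)malloc(num_streams * sizeof(cudaStream_t));
    for (int s = 0; s < num_streams; ++s)
        CUDA_CHECK(cudaStreamCreate(&streams[s]));

    const int threads = 256;
    const size_t chunk = (n + num_streams - 1) / num_streams;
    for (int s = 0; s < num_streams; ++s) {
        size_t offset = (size_t)s * chunk;
        if (offset >= n) break;  // more streams than chunks
        size_t len = (offset + chunk <= n) ? chunk : n - offset;
        unsigned int blocks = (unsigned int)((len + threads - 1) / threads);

        // Work queued on one stream runs in order; different streams may overlap.
        CUDA_CHECK(cudaMemcpyAsync(d_data + offset, h_data + offset, len,
                                   cudaMemcpyHostToDevice, streams[s]));
        shift_kernel<<<blocks, threads, 0, streams[s]>>>(d_data + offset, len, 3);   // encode
        shift_kernel<<<blocks, threads, 0, streams[s]>>>(d_data + offset, len, -3);  // decode
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_data + offset, d_data + offset, len,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());  // wait for every stream to finish

    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        if (h_data[i] != (unsigned char)(i % 251)) { ok = false; break; }

    for (int s = 0; s < num_streams; ++s)
        CUDA_CHECK(cudaStreamDestroy(streams[s]));
    free(streams);
    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFreeHost(h_data));
    return ok;
}

int main() {
    bool ok = run_streamed_roundtrip((size_t)1 << 24, 4);
    printf("round trip %s\n", ok ? "OK" : "FAILED");
    return ok ? 0 : 1;
}
```

If saved under a hypothetical name such as streams_sketch.cu, this should build with nvcc streams_sketch.cu -o streams_sketch. How much overlap actually occurs depends on the GPU (for example, on how many copy engines it has), which is one reason the speedup varies between machines.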
Type make profile to generate an Nsight Systems report file that can be opened in Nsight Systems. The report's timeline should show your compute kernel calls executing in parallel.
Type make clean to remove the results of compilation.