Various CUDA Optimizations #1

Open
lilinitsy wants to merge 54 commits into scope-ml:main from lilinitsy:master
Conversation

@lilinitsy

I've added in a ton of CUDA optimizations for LS/CE/AOV.

The main changes are lightcurve batching for Lomb-Scargle and the use of CUDA asynchronous streams (including async memory transfers), which give some nice performance improvements.
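
For readers unfamiliar with the pattern, here is a minimal sketch of batching work across CUDA streams with asynchronous copies. It is not the code from this PR: `scale_kernel`, `run_stream_demo`, the batch sizes, and the stream count are all hypothetical placeholders, and the kernel just doubles its input so the result is easy to check.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Bail out of the enclosing bool function on any CUDA error.
// (Sketch only: buffers are not freed on the error path.)
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            return false;                                                  \
        }                                                                  \
    } while (0)

// Hypothetical stand-in for a per-batch periodogram kernel.
__global__ void scale_kernel(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Processes n_batches batches round-robin over n_streams streams, so the
// copies for one batch can overlap with the kernel of another.
// Returns true when every output element equals twice its input.
bool run_stream_demo(int n_batches = 8, int batch_len = 1 << 16,
                     int n_streams = 4) {
    const size_t total = static_cast<size_t>(n_batches) * batch_len;
    const size_t batch_bytes = batch_len * sizeof(float);

    float *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    // Pinned host memory lets cudaMemcpyAsync overlap with kernel execution.
    CUDA_CHECK(cudaMallocHost(&h_in, total * sizeof(float)));
    CUDA_CHECK(cudaMallocHost(&h_out, total * sizeof(float)));
    // One device slot per stream; each slot is reused by later batches.
    CUDA_CHECK(cudaMalloc(&d_in, n_streams * batch_bytes));
    CUDA_CHECK(cudaMalloc(&d_out, n_streams * batch_bytes));
    for (size_t i = 0; i < total; ++i) h_in[i] = static_cast<float>(i % 1000);

    std::vector<cudaStream_t> streams(n_streams);
    for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    const int threads = 256;
    const int blocks = (batch_len + threads - 1) / threads;
    for (int b = 0; b < n_batches; ++b) {
        const int s = b % n_streams;
        float* d_in_s = d_in + static_cast<size_t>(s) * batch_len;
        float* d_out_s = d_out + static_cast<size_t>(s) * batch_len;
        const size_t offset = static_cast<size_t>(b) * batch_len;
        // Work queued on one stream runs in order, so reusing slot s is safe.
        CUDA_CHECK(cudaMemcpyAsync(d_in_s, h_in + offset, batch_bytes,
                                   cudaMemcpyHostToDevice, streams[s]));
        scale_kernel<<<blocks, threads, 0, streams[s]>>>(d_in_s, d_out_s,
                                                         batch_len, 2.0f);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_out + offset, d_out_s, batch_bytes,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (size_t i = 0; i < total; ++i) {
        if (h_out[i] != 2.0f * h_in[i]) { ok = false; break; }
    }

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return ok;
}
```

Calling `run_stream_demo()` from a `main` and compiling with `nvcc` should exercise the pattern. How much the copies and kernels actually overlap depends on the device and on the host buffers being pinned.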

Smaller improvements come from using restrict on pointers where it is appropriate, and from replacing slow math calls with GPU intrinsic functions (e.g., using __sincosf).
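
As an illustration of these two smaller changes (again a sketch, not the PR's kernels), the hypothetical `phase_sincos` kernel below marks its pointers with `__restrict__`, which is how the qualifier is spelled in CUDA C++, and calls the `__sincosf` intrinsic. The helper `max_sincos_error` and the value of `omega` are assumptions made only so the accuracy trade-off can be measured against the host math library.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// __restrict__ tells the compiler these pointers never alias each other,
// which can enable better load scheduling and caching.
__global__ void phase_sincos(const float* __restrict__ times,
                             float* __restrict__ sin_out,
                             float* __restrict__ cos_out,
                             int n, float omega) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c;
        // Fast intrinsic: quicker than sincosf but less accurate,
        // particularly for arguments far outside [-pi, pi].
        __sincosf(omega * times[i], &s, &c);
        sin_out[i] = s;
        cos_out[i] = c;
    }
}

// Returns the largest absolute difference between the intrinsic results and
// the host std::sin / std::cos, or -1 if a CUDA call fails.
float max_sincos_error(int n = 1 << 12) {
    const float omega = 3.0f;  // keeps every phase inside [0, pi)
    std::vector<float> h_t(n), h_s(n), h_c(n);
    for (int i = 0; i < n; ++i) h_t[i] = static_cast<float>(i) / n;

    const size_t bytes = n * sizeof(float);
    float *d_t = nullptr, *d_s = nullptr, *d_c = nullptr;
    cudaError_t err = cudaMalloc(&d_t, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_s, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_c, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_t, h_t.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        phase_sincos<<<(n + 255) / 256, 256>>>(d_t, d_s, d_c, n, omega);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_s.data(), d_s, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_t);  // cudaFree(nullptr) is a no-op, so this is safe on failure
    cudaFree(d_s);
    cudaFree(d_c);
    if (err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float phase = omega * h_t[i];
        max_err = std::fmax(max_err, std::fabs(h_s[i] - std::sin(phase)));
        max_err = std::fmax(max_err, std::fabs(h_c[i] - std::cos(phase)));
    }
    return max_err;
}
```

Whether the reduced accuracy of the intrinsic is acceptable depends on the range of phases the real kernels see, so a check like this is worth running on representative inputs.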

All tests were run on a V100 on the SDSC Expanse GPU cluster.

TIMINGS

| METHOD | VERSION | MACHINE | DATA | TIME | % GAIN |
| --- | --- | --- | --- | --- | --- |
| LS | Baseline | EXPANSE | all time, all mags, 100k periods | 139.75323605537415 | |
| LS | Optimized | EXPANSE | all time, all mags, 100k periods | 122.99888670444 | +11.98% |
| CE | Baseline | EXPANSE | all time, all mags, all | 237.9152114391327 | |
| CE | Optimized | EXPANSE | all time, all mags, all | 224.75576400756836 | +5.53% |
| AOV | Baseline | EXPANSE | all time, all mags, 100k periods | 244.3776876926422 | |
| AOV | Optimized | EXPANSE | all time, all mags, 100k periods | 214.41236901283264 | +12.26% |

| METHOD | VERSION | MACHINE | DATA | TIME | % GAIN |
| --- | --- | --- | --- | --- | --- |
| LS | Baseline | EXPANSE | 1000 lightcurves, all periods | 17.119052171707153 | |
| LS | Optimized | EXPANSE | 1000 lightcurves, all periods | 16.61055612564087 | +2.97% |
| CE | Baseline | EXPANSE | 1000 lightcurves, all periods | 28.453676223754883 | |
| CE | Optimized | EXPANSE | 1000 lightcurves, all periods | 27.26391577720642 | +4.18% |
| AOV | Baseline | EXPANSE | 1000 lightcurves, all periods | 30.421194076538086 | |
| AOV | Optimized | EXPANSE | 1000 lightcurves, all periods | 26.356033086776733 | +13.36% |

The performance gains shrink as GPU memory bandwidth increases: on a GTX 1080, the Lomb-Scargle gains were in the mid-20% range.

@ejaszewski

Collaborator

Happy to review the changes this weekend if you would like another pair of eyes on it!

@ejaszewski

Collaborator

The changed .clang-format makes it very difficult to tell what has actually changed, because the diff picks up all of the whitespace changes. If possible, can you re-format this with the original .clang-format so the diff is meaningful?

@lilinitsy

Author

@ejaszewski Sure thing. I'll try and do that this weekend.

@lilinitsy

Author

@ejaszewski I reformatted with the original and did a scan. Hopefully it's easier to see the relevant changes now.

@ejaszewski

Collaborator

Thanks for doing that! The diff is much clearer now. I'm pretty busy this week but I should have time to review over the weekend.

@ejaszewski ejaszewski left a comment

Collaborator

Other than a few minor comments, looks good to me. You should either document the use of restrict in the Device... functions or remove them since they aren't called internally anymore AFAICT. The only things that need to be fixed are the typos in aov.cu and removing the tmp directory.

Comment thread CMakeLists.txt
Comment thread periodfind/cuda/aov.cu Outdated
Comment thread periodfind/cuda/ls.cu
Comment thread periodfind/cuda/aov.cu Outdated
Comment thread tmp/periodfind Outdated
3 participants