Various CUDA Optimizations #1
…es from 170s to 148s
Happy to review the changes this weekend if you would like another pair of eyes on it!
The changed
@ejaszewski Sure thing. I'll try and do that this weekend.
@ejaszewski I reformatted with the original and did a scan. Hopefully it's easier to see the relevant changes now.
Thanks for doing that! The diff is much clearer now. I'm pretty busy this week but I should have time to review over the weekend.
ejaszewski
left a comment
Other than a few minor comments, looks good to me. You should either document the use of restrict in the Device... functions or remove them since they aren't called internally anymore AFAICT. The only things that need to be fixed are the typos in aov.cu and removing the tmp directory.
I've added in a ton of CUDA optimizations for LS/CE/AOV.
The largest gains come from batching lightcurves in Lomb-Scargle and from using CUDA asynchronous streams (including async memory transfers), which together give some nice performance improvements.
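For readers unfamiliar with the pattern, here is a minimal sketch of batching work across CUDA streams with asynchronous transfers. It is not the code from this PR: `scale_kernel`, `run_batched_streams`, the stream count, and the buffer layout are all hypothetical stand-ins, and the kernel only doubles its input so the result is easy to check.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Bail out with -1 on any CUDA error (sketch only: buffers are not freed on failure).
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical per-batch kernel; stands in for the real periodogram work.
__global__ void scale_kernel(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Processes num_batches batches of batch_len floats. Each batch is queued on a
// stream as copy-in, kernel, copy-out, so transfers for one batch can overlap
// compute for another. Returns the number of wrong elements, or -1 on error.
int run_batched_streams(int num_batches, int batch_len) {
    const int kStreams = 4;  // assumed stream count, not taken from the PR
    const size_t total = (size_t)num_batches * batch_len;
    const size_t batch_bytes = (size_t)batch_len * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host memory lets cudaMemcpyAsync overlap with kernel execution.
    CUDA_TRY(cudaMallocHost(&h_in, total * sizeof(float)));
    CUDA_TRY(cudaMallocHost(&h_out, total * sizeof(float)));
    // One device slot per stream; work in a stream is ordered, so reuse is safe.
    CUDA_TRY(cudaMalloc(&d_in, kStreams * batch_bytes));
    CUDA_TRY(cudaMalloc(&d_out, kStreams * batch_bytes));
    for (size_t i = 0; i < total; ++i) h_in[i] = (float)i;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CUDA_TRY(cudaStreamCreate(&streams[s]));

    const int threads = 256;
    const int blocks = (batch_len + threads - 1) / threads;
    for (int b = 0; b < num_batches; ++b) {
        const int s = b % kStreams;
        float* h_src = h_in + (size_t)b * batch_len;
        float* h_dst = h_out + (size_t)b * batch_len;
        float* d_src = d_in + (size_t)s * batch_len;
        float* d_dst = d_out + (size_t)s * batch_len;
        CUDA_TRY(cudaMemcpyAsync(d_src, h_src, batch_bytes,
                                 cudaMemcpyHostToDevice, streams[s]));
        scale_kernel<<<blocks, threads, 0, streams[s]>>>(d_src, d_dst, batch_len, 2.0f);
        CUDA_TRY(cudaGetLastError());
        CUDA_TRY(cudaMemcpyAsync(h_dst, d_dst, batch_bytes,
                                 cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_TRY(cudaDeviceSynchronize());  // wait for every stream to drain

    int mismatches = 0;
    for (size_t i = 0; i < total; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return mismatches;
}
```

Calling `run_batched_streams(8, 1 << 16)` from a host `main` should return 0 on a working GPU. Whether the copies and kernels actually overlap depends on the device's copy engines and on the batch size, so any speedup has to be measured rather than assumed.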
Smaller improvements come from using restrict on pointers where it is appropriate, and from replacing slow math calls with GPU intrinsic functions (e.g., __sincosf).
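As an illustration of both techniques, here is a minimal sketch rather than the kernels from this PR. The names `phase_sincos` and `max_sincos_error` are hypothetical. `__restrict__` is a promise to the compiler that the pointers do not alias, which can let it schedule loads more freely, and `__sincosf` is the fast, reduced-accuracy intrinsic that computes sine and cosine together.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Bail out with -1.0f on any CUDA error (sketch only: no cleanup on failure).
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Hypothetical kernel: computes sin(omega * t) and cos(omega * t) per sample.
// __restrict__ asserts that t, s_out and c_out never overlap in memory.
__global__ void phase_sincos(const float* __restrict__ t,
                             float* __restrict__ s_out,
                             float* __restrict__ c_out,
                             int n, float omega) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c;
        __sincosf(omega * t[i], &s, &c);  // fast intrinsic, lower accuracy than sincosf
        s_out[i] = s;
        c_out[i] = c;
    }
}

// Runs the kernel on n samples with t in [0, 1) and returns the largest
// absolute deviation from the host's sinf/cosf, or -1.0f on a CUDA error.
float max_sincos_error(int n, float omega) {
    std::vector<float> h_t(n), h_s(n), h_c(n);
    for (int i = 0; i < n; ++i) h_t[i] = (float)i / (float)n;

    const size_t bytes = (size_t)n * sizeof(float);
    float *d_t, *d_s, *d_c;
    CUDA_TRY(cudaMalloc(&d_t, bytes));
    CUDA_TRY(cudaMalloc(&d_s, bytes));
    CUDA_TRY(cudaMalloc(&d_c, bytes));
    CUDA_TRY(cudaMemcpy(d_t, h_t.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    phase_sincos<<<blocks, threads>>>(d_t, d_s, d_c, n, omega);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaMemcpy(h_s.data(), d_s, bytes, cudaMemcpyDeviceToHost));
    CUDA_TRY(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost));

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float x = omega * h_t[i];
        max_err = fmaxf(max_err, fabsf(h_s[i] - sinf(x)));
        max_err = fmaxf(max_err, fabsf(h_c[i] - cosf(x)));
    }
    cudaFree(d_t);
    cudaFree(d_s);
    cudaFree(d_c);
    return max_err;
}
```

The intrinsic's error grows with the magnitude of its argument, so it is worth checking something like `max_sincos_error(1024, 3.0f)` against the accuracy the periodogram needs before swapping it in for the standard `sincosf`.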
All testing was done on a V100 on the SDSC Expanse GPU cluster.
TIMINGS
The performance gains decrease as the GPU memory bandwidth increases -- on a GTX 1080, the Lomb-Scargle gains were in the mid-20% range.