Workaround for groupby-min/max compile-time issue with thrust-1.17 #11467
davidwendt wants to merge 3 commits into
  bool const arg_min;

- __device__ inline auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
+ __noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
What is the difference between __noinline__ and __attribute__((noinline))?
Nothing as far as I can tell. I tried them both.
Also, reference: https://github.com/NVIDIA/thrust/issues/1344#issuecomment-1164676122
I asked @jrhemstad about this once and he gave a nice example of exactly how equivalent they are once nvcc is done preprocessing the files:
(rapids) rapids@compose:~/cudf/tmp$ echo "__global__ void kernel(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 1 "<command-line>" 2
# 1 "./test.cu"
__attribute__((global)) void kernel(){} int main(){}
So __inline__ will be converted into __attribute__((inline)), right? I guess so, as __something__ seems to be a CUDA-specific keyword, and I guess it will be converted into some standard C++ equivalent if possible.
I was wrong:
╰─ echo "__inline__ __device__ void f(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 0 "<command-line>" 2
# 1 "./test.cu"
__inline__ __attribute__((device)) void f(){} int main(){}
But:
╰─ echo "__noinline__ __device__ void f(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 0 "<command-line>" 2
# 1 "./test.cu"
__attribute__((noinline)) __attribute__((device)) void f(){} int main(){}
So __inline__ is not converted but __noinline__ is converted.
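The difference seen above is consistent with __noinline__ being an ordinary macro for the GNU attribute, while __inline__ is already a keyword in GCC-compatible compilers and so has nothing to expand to. A minimal host-only sketch of that behaviour follows. It assumes a GCC or Clang host compiler; SKETCH_NOINLINE, SKETCH_STR, and the helper functions are hypothetical names used for illustration, not part of CUDA.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the CUDA spelling (assumption: it acts like a plain macro).
#define SKETCH_NOINLINE __attribute__((noinline))

// Two-level stringification so the argument is macro-expanded before being quoted.
#define SKETCH_STR_IMPL(...) #__VA_ARGS__
#define SKETCH_STR(...) SKETCH_STR_IMPL(__VA_ARGS__)

// Text the preprocessor substitutes for the macro.
inline std::string expanded_noinline() { return SKETCH_STR(SKETCH_NOINLINE); }

// __inline__ is a compiler keyword rather than a macro, so it passes through unchanged.
inline std::string expanded_inline() { return SKETCH_STR(__inline__); }

// A function declared through the macro carries the noinline attribute.
SKETCH_NOINLINE int add_one(int x) { return x + 1; }

// Small usage check: the attribute changes inlining only, not the result.
inline void check_sketch() { assert(add_one(1) == 2); }
```

Calling expanded_noinline() returns the attribute spelling and expanded_inline() returns the keyword untouched, mirroring the two test.cpp1.ii outputs above.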
When we use __noinline__ for a specific reason like this that is affected by external code (CUB), we should add a comment to indicate why.
- __noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
+ // Must be __noinline__ for Thrust/CUB, to prevent long compile times
+ __noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
Already had it queued up. 👍

To clarify: This is a fix for sort-based groupby
The source files are named in the PR description: #11467 (comment)

And these files were compiled in 8 minutes (from 30m) 🚀

Let me try to run a benchmark for this. My recent PR for benchmarking groupby

Oh sorry, I just recognized that my benchmark was wrong. Generating again....

Benchmark groupby Benchmark groupby

I'd try patching the implementation to reduce the unrolling. That should alleviate the inflated compile time without negatively impacting performance.

Closing this as an unacceptable workaround.
Description
Fixes issue found in #11437 where compile-time for groupby/sort/group_argmax.cu and groupby/sort/group_argmin.cu more than doubles to over 30 minutes for each file:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11169/Build_20Metrics_20Report/
Baseline example from a different PR:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11215/Build_20Metrics_20Report/
The culprit appears to be thrust::reduce_by_key, and almost all source files using this function appear to have doubled in compile time.
The fix here forces the element_argminmax_fn functor's device operator to noinline.
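To make the shape of the workaround concrete, here is a host-only sketch of an index-comparing functor whose call operator is marked noinline, folded by a sequential stand-in for a reduce-by-key. It is not the cudf implementation: argminmax_fn_sketch and reduce_indices_by_key are hypothetical names, the values are plain doubles, nulls are ignored, and a GCC or Clang host compiler is assumed for the attribute syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using size_type = int;

// Hypothetical host-only stand-in for the functor discussed above; not the cudf code.
struct argminmax_fn_sketch {
  double const* values;  // values that the row indices refer to
  bool arg_min;          // true: keep the index of the smaller value; false: the larger

  // noinline keeps the body out of the caller, which is the point of the workaround.
  __attribute__((noinline)) size_type operator()(size_type const& lhs_idx,
                                                  size_type const& rhs_idx) const
  {
    bool const lhs_wins = arg_min ? values[lhs_idx] <= values[rhs_idx]
                                  : values[lhs_idx] >= values[rhs_idx];
    return lhs_wins ? lhs_idx : rhs_idx;
  }
};

// Sequential stand-in for a reduce-by-key over keys that are already grouped:
// returns one winning row index per run of equal keys.
inline std::vector<size_type> reduce_indices_by_key(std::vector<int> const& keys,
                                                    argminmax_fn_sketch const& op)
{
  assert(op.values != nullptr);
  std::vector<size_type> out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto const idx = static_cast<size_type>(i);
    if (i == 0 || keys[i] != keys[i - 1]) {
      out.push_back(idx);                // first row of a new group
    } else {
      out.back() = op(out.back(), idx);  // fold this row into the current group
    }
  }
  return out;
}
```

In this sketch the attribute only affects code generation, so the per-group results are the same with or without it; any run-time cost of the extra call is what the benchmarks in the conversation were meant to measure.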