Skip to content

Workaround for groupyby-min/max compile-time issue with thrust-1.17#11467

Closed
davidwendt wants to merge 3 commits into
rapidsai:branch-22.10from
davidwendt:t17-groupby-minmax
Closed

Workaround for groupyby-min/max compile-time issue with thrust-1.17#11467
davidwendt wants to merge 3 commits into
rapidsai:branch-22.10from
davidwendt:t17-groupby-minmax

Conversation

@davidwendt
Copy link
Copy Markdown
Contributor

Description

Fixes issue found in #11437 where compile-time for groupby/sort/group_argmax.cu and groupby/sort/group_argmin.cu more than doubles to over 30 minutes for each file:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11169/Build_20Metrics_20Report/
Baseline example from a different PR:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11215/Build_20Metrics_20Report/

The culprit appears to be thrust::reduce_by_key and almost all source files using this function appear to have doubled in compile time.
The fix here forces the element_argminmax_fn functor's device operator to noinline.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 4, 2022
@davidwendt davidwendt requested a review from a team as a code owner August 4, 2022 17:19
@davidwendt davidwendt self-assigned this Aug 4, 2022
@davidwendt davidwendt requested review from cwharris and vuule August 4, 2022 17:19
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 4, 2022
bool const arg_min;

__device__ inline auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
__noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between __noinline__ and __attribute__((noinline))?

Copy link
Copy Markdown
Contributor Author

@davidwendt davidwendt Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing as far as I can tell. I tried them both.
Also, reference: https://github.com/NVIDIA/thrust/issues/1344#issuecomment-1164676122

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked @jrhemstad about this once and he gave a nice example of exactly how equivalent they are once nvcc is done preprocessing the files:

(rapids) rapids@compose:~/cudf/tmp$ echo "__global__ void kernel(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 1 "<command-line>" 2
# 1 "./test.cu"
__attribute__((global)) void kernel(){} int main(){}

Copy link
Copy Markdown
Contributor

@ttnghia ttnghia Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So __inline__ will be converted into __attribute__((inline)) right? I guess so, as __something__ seems a CUDA specific keyword and I guess it will be converted into some C++ standard equivalent if possible.

Copy link
Copy Markdown
Contributor

@ttnghia ttnghia Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong:

╰─ echo "__inline__ __device__ void f(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 0 "<command-line>" 2
# 1 "./test.cu"
__inline__ __attribute__((device)) void f(){} int main(){}

But:

╰─ echo "__noinline__ __device__ void f(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 0 "<command-line>" 2
# 1 "./test.cu"
__attribute__((noinline)) __attribute__((device)) void f(){} int main(){}

So __inline__ is not converted but __noinline__ is converted.

Copy link
Copy Markdown
Contributor

@bdice bdice Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use __noinline__ for a specific reason like this that is affected by external code (CUB), we should add a comment to indicate why.

Suggested change
__noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
// Must be __noinline__ for Thrust/CUB, to prevent long compile times
__noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already had it queued up. 👍

@ttnghia
Copy link
Copy Markdown
Contributor

ttnghia commented Aug 4, 2022

To clarify: This is a fix for sort-based groupby arg_min/arg_max instead.

@davidwendt
Copy link
Copy Markdown
Contributor Author

To clarify: This is a fix for sort-based groupby arg_min/arg_max instead.

The source files are named in the PR description: #11467 (comment)

@ttnghia
Copy link
Copy Markdown
Contributor

ttnghia commented Aug 4, 2022

And these files were compiled in 8 minutes (from 30m) 🚀

@ttnghia
Copy link
Copy Markdown
Contributor

ttnghia commented Aug 4, 2022

Let me try to run a benchmark for this. My recent PR for benchmarking groupby max can be modified a little bit to run this easily.

@davidwendt davidwendt requested a review from bdice August 4, 2022 18:25
@ttnghia

This comment was marked as outdated.

@ttnghia
Copy link
Copy Markdown
Contributor

ttnghia commented Aug 4, 2022

Oh sorry, I just recognized that my benchmark was wrong. Generating again....

@ttnghia
Copy link
Copy Markdown
Contributor

ttnghia commented Aug 4, 2022

Benchmark groupby arg_max with Thrust 1.15 (current):

## [0] Quadro RTX 6000

|  T  |  num_rows  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-----|------------|------------|-------------|------------|-------------|-----------|---------|----------|
| I32 |    2^12    |  69.403 us |      61.73% |  76.430 us |      41.24% |  7.027 us |  10.12% |   PASS   |
| I32 |    2^18    |  72.317 us |      34.54% | 170.834 us |     103.27% | 98.518 us | 136.23% |   FAIL   |
| I32 |    2^24    |   1.469 ms |      10.33% |   9.017 ms |       7.23% |  7.549 ms | 513.99% |   FAIL   |

Benchmark groupby arg_max with Thrust 1.17 (merged with #11437):

|  T  |  num_rows  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-----|------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 |    2^12    |  69.474 us |      74.59% |  77.506 us |      55.57% |   8.032 us |  11.56% |   PASS   |
| I32 |    2^18    |  77.304 us |      61.53% | 182.766 us |      29.55% | 105.462 us | 136.43% |   FAIL   |
| I32 |    2^24    |   1.506 ms |      14.55% |   9.294 ms |       7.57% |   7.788 ms | 517.13% |   FAIL   |

@jrhemstad
Copy link
Copy Markdown
Contributor

I'd try patching the implementation to reduce the unrolling. That should alleviate the inflated compile time without negatively impacting performance.

@codecov

This comment was marked as off-topic.

@davidwendt
Copy link
Copy Markdown
Contributor Author

Closing this as an unacceptable workaround.

@davidwendt davidwendt closed this Aug 5, 2022
@davidwendt davidwendt deleted the t17-groupby-minmax branch August 5, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants