Workaround for groupby-min/max compile-time issue with thrust-1.17 by davidwendt · Pull Request #11467 · rapidsai/cudf

davidwendt · 2022-08-04T17:19:44Z

Description

Fixes issue found in #11437 where compile-time for groupby/sort/group_argmax.cu and groupby/sort/group_argmin.cu more than doubles to over 30 minutes for each file:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11169/Build_20Metrics_20Report/
Baseline example from a different PR:
https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11215/Build_20Metrics_20Report/

The culprit appears to be thrust::reduce_by_key: almost all source files that use this function appear to have doubled in compile time.
The fix here marks the device operator() of the element_argminmax_fn functor as __noinline__.
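
As a rough illustration of the kind of change described above, here is a minimal sketch of an arg-min/arg-max functor whose call operator is marked noinline. It is not the cudf implementation: `argminmax_sketch`, `SKETCH_NOINLINE`, and `reduce_indices_by_key` are hypothetical names, and the serial loop is only a host-side stand-in for thrust::reduce_by_key so the sketch builds without Thrust. The macro assumes nvcc when `__CUDACC__` is defined and otherwise a GCC/Clang-compatible host compiler.

```cpp
#include <cstddef>
#include <vector>

// __noinline__ is only available under nvcc; fall back to the GCC/Clang
// attribute so this sketch also builds as plain host C++ (assumption).
#ifdef __CUDACC__
#define SKETCH_NOINLINE __noinline__ __host__ __device__
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Hypothetical, simplified stand-in for an arg-min/arg-max reduction functor.
struct argminmax_sketch {
  int const* values;  // data that the row indices refer to
  bool arg_min;       // true: keep the index of the smaller value; false: the larger

  // Marked noinline so the compiler emits one call target instead of
  // expanding the body at every call site.
  SKETCH_NOINLINE int operator()(int const& lhs_idx, int const& rhs_idx) const
  {
    bool const less = values[lhs_idx] < values[rhs_idx];
    return less == arg_min ? lhs_idx : rhs_idx;
  }
};

// Serial host-side reduction over sorted keys, standing in for reduce_by_key:
// returns one winning row index per run of equal keys.
std::vector<int> reduce_indices_by_key(std::vector<int> const& keys, argminmax_sketch op)
{
  std::vector<int> out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    int const idx = static_cast<int>(i);
    if (i == 0 || keys[i] != keys[i - 1]) {
      out.push_back(idx);                // first row of a new group
    } else {
      out.back() = op(out.back(), idx);  // fold the row into the current group
    }
  }
  return out;
}
```

With keys {0, 0, 0, 1, 1} and values {5, 2, 9, 7, 3}, the arg-min variant selects one row index per group by comparing the referenced values rather than the indices themselves.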

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

ttnghia · 2022-08-04T17:25:03Z

  bool const arg_min;

-  __device__ inline auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const
+  __noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const


What is the difference between __noinline__ and __attribute__((noinline))?

Nothing as far as I can tell. I tried them both.
Also, reference: https://github.com/NVIDIA/thrust/issues/1344#issuecomment-1164676122

I asked @jrhemstad about this once and he gave a nice example of exactly how equivalent they are once nvcc is done preprocessing the files:

(rapids) rapids@compose:~/cudf/tmp$ echo "__global__ void kernel(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 1 "<command-line>" 2
# 1 "./test.cu"
__attribute__((global)) void kernel(){} int main(){}

So __inline__ will be converted into __attribute__((inline)), right? I guess so, since __something__ looks like a CUDA-specific keyword that would be converted into a standard C++ equivalent where possible.

I was wrong:

╰─ echo "__inline__ __device__ void f(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 0 "<command-line>" 2
# 1 "./test.cu"
__inline__ __attribute__((device)) void f(){} int main(){}

But:

╰─ echo "__noinline__ __device__ void f(){} int main(){}" > test.cu && nvcc ./test.cu --keep && tail -3 test.cpp1.ii
# 0 "<command-line>" 2
# 1 "./test.cu"
__attribute__((noinline)) __attribute__((device)) void f(){} int main(){}

So __inline__ is not converted but __noinline__ is converted.

When we use __noinline__ for a specific reason like this that is affected by external code (CUB), we should add a comment to indicate why.

Suggested change

__noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const

// Must be __noinline__ for Thrust/CUB, to prevent long compile times

__noinline__ __device__ auto operator()(size_type const& lhs_idx, size_type const& rhs_idx) const

Already had it queued up. 👍

ttnghia · 2022-08-04T17:47:26Z

To clarify: This is a fix for sort-based groupby arg_min/arg_max instead.

davidwendt · 2022-08-04T17:49:05Z

> To clarify: This is a fix for sort-based groupby arg_min/arg_max instead.

The source files are named in the PR description: #11467 (comment)

ttnghia · 2022-08-04T18:07:32Z

And these files now compile in 8 minutes (down from 30 minutes) 🚀

ttnghia · 2022-08-04T18:14:47Z

Let me try to run a benchmark for this. My recent PR for benchmarking groupby max can be modified a little bit to run this easily.

ttnghia · 2022-08-04T18:31:16Z

Oh sorry, I just realized that my benchmark was wrong. Generating again...

ttnghia · 2022-08-04T18:37:46Z

Benchmark groupby arg_max with Thrust 1.15 (current):

## [0] Quadro RTX 6000

|  T  |  num_rows  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-----|------------|------------|-------------|------------|-------------|-----------|---------|----------|
| I32 |    2^12    |  69.403 us |      61.73% |  76.430 us |      41.24% |  7.027 us |  10.12% |   PASS   |
| I32 |    2^18    |  72.317 us |      34.54% | 170.834 us |     103.27% | 98.518 us | 136.23% |   FAIL   |
| I32 |    2^24    |   1.469 ms |      10.33% |   9.017 ms |       7.23% |  7.549 ms | 513.99% |   FAIL   |

Benchmark groupby arg_max with Thrust 1.17 (merged with #11437):

|  T  |  num_rows  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-----|------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 |    2^12    |  69.474 us |      74.59% |  77.506 us |      55.57% |   8.032 us |  11.56% |   PASS   |
| I32 |    2^18    |  77.304 us |      61.53% | 182.766 us |      29.55% | 105.462 us | 136.43% |   FAIL   |
| I32 |    2^24    |   1.506 ms |      14.55% |   9.294 ms |       7.57% |   7.788 ms | 517.13% |   FAIL   |

jrhemstad · 2022-08-04T19:00:46Z

I'd try patching the implementation to reduce the unrolling. That should alleviate the inflated compile time without negatively impacting performance.

davidwendt · 2022-08-05T18:04:28Z

Closing this as an unacceptable workaround.

Workaround for groupyby-min/max compile-time issue with thrust-1.17

d70c074

davidwendt added the labels 2 - In Progress, libcudf, improvement, and non-breaking on Aug 4, 2022

davidwendt requested a review from a team as a code owner August 4, 2022 17:19

davidwendt self-assigned this Aug 4, 2022

davidwendt requested review from cwharris and vuule August 4, 2022 17:19

davidwendt added the label 3 - Ready for Review and removed the label 2 - In Progress on Aug 4, 2022

ttnghia reviewed Aug 4, 2022

add comment about reduce_by_key

0fb0cd9

ttnghia approved these changes Aug 4, 2022

Merge branch 'branch-22.10' into t17-groupby-minmax

7cc8594

davidwendt requested a review from bdice August 4, 2022 18:25

davidwendt closed this Aug 5, 2022

davidwendt deleted the t17-groupby-minmax branch August 5, 2022 18:04
