Asserting current device and CUB stream matches #9119

Open
thom-gg wants to merge 3 commits into NVIDIA:main from thom-gg:validate-device-and-stream-matches

@thom-gg thom-gg commented May 23, 2026

Description

closes #7782

This PR adds an assertion to every dispatch function to ensure that the current device matches the device associated with the CUB stream. The assertion is called at the very beginning of each dispatch function, hence the number of modified files.

The assertion itself uses cudaStreamGetDevice, which was introduced in CTK 12.8, so it is guarded by the macro _CCCL_CTK_AT_LEAST(12, 8).
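To make the control flow of the check concrete, here is a minimal standalone sketch. It is not the code from this PR: the CUDA runtime is replaced by small stand-ins so that it compiles without a GPU, and every `fake_` name as well as `validate_stream_device_sketch` is hypothetical. In the real helper, the two queries would be cudaStreamGetDevice and cudaGetDevice, and a mismatch is reported as cudaErrorInvalidDevice.

```cpp
// Standalone sketch: the CUDA runtime is replaced by small stand-ins so the
// control flow compiles and runs without a GPU. Every `fake_` name is
// hypothetical and is not part of CUDA or CUB.
#include <cassert>

enum fake_error_t
{
  fake_success = 0,
  fake_error_invalid_device = 1
};

// Stand-in for a stream handle: it only remembers the device it belongs to.
struct fake_stream_t
{
  int device;
};

// Stand-in for the calling thread's current device.
static int fake_current_device = 0;

// Stand-in for cudaGetDevice: reports the current device.
fake_error_t fake_get_device(int* device)
{
  *device = fake_current_device;
  return fake_success;
}

// Stand-in for cudaStreamGetDevice: reports the device that owns the stream.
fake_error_t fake_stream_get_device(fake_stream_t stream, int* device)
{
  *device = stream.device;
  return fake_success;
}

// Returns an error if either query fails or if the two devices differ.
fake_error_t validate_stream_device_sketch(fake_stream_t stream)
{
  int stream_device = -1;
  const fake_error_t stream_error = fake_stream_get_device(stream, &stream_device);
  if (stream_error != fake_success)
  {
    return stream_error;
  }

  int current_device = -1;
  const fake_error_t device_error = fake_get_device(&current_device);
  if (device_error != fake_success)
  {
    return device_error;
  }

  return stream_device == current_device ? fake_success : fake_error_invalid_device;
}
```

With `fake_current_device` set to 0, a stream on device 0 passes the check and a stream on device 1 is rejected.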

I'm new to the project, so I'm unsure whether there is a better place to call the assertion than in every dispatch file. I'm also unsure whether the assertion belongs in cub/cub/util_device.cuh, where I put it, or elsewhere. Please tell me if this issue should be addressed differently and I'll try to do it!

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@thom-gg thom-gg requested a review from a team as a code owner May 23, 2026 15:52
@thom-gg thom-gg requested a review from fbusato May 23, 2026 15:52
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 23, 2026
Contributor

copy-pr-bot Bot commented May 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 23, 2026
Contributor

coderabbitai Bot commented May 23, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aefb923b-860a-423b-b27f-76cd2d48894e

📥 Commits

Reviewing files that changed from the base of the PR and between a8d21f9 and 8de3755.

📒 Files selected for processing (2)
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/dispatch_segmented_reduce.cuh

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Added upfront CUDA stream ↔ device validation across many device dispatch paths to detect invalid stream-device associations early and return errors before dispatching work.
    • Introduced a centralized runtime stream validation utility to improve robustness and prevent downstream failures during sorting, scanning, reduction, selection, partitioning, transforms, unique/find/top‑K and related operations.

important:

Walkthrough

Stream-device validation is added: a new validate_stream_device(cudaStream_t) helper was introduced and many dispatch entrypoints now call it at function start, returning any cudaError before PTX/compute-capability queries or kernel launch planning.

Changes

Stream-device validation layer

| Layer / File(s) | Summary |
|---|---|
| Core validation utility: cub/cub/util_device.cuh | New validate_stream_device(cudaStream_t) implemented; queries cudaStreamGetDevice and cudaGetDevice and returns a cudaError_t (guarded by _CCCL_CTK_AT_LEAST(12, 8)). |
| Dispatch validation rollout: cub/cub/device/dispatch/*.cuh (adjacent_difference, batch_memcpy, batched_topk, find, for, histogram, merge, merge_sort, radix_sort, reduce, reduce_by_key, reduce_deterministic, reduce_nondeterministic, rle, scan, scan_by_key, segmented_radix_sort, segmented_reduce, segmented_scan, segmented_sort, select_if, three_way_partition, topk, transform, unique_by_key) | Inserted validate_stream_device(stream) at the start of dispatch entrypoints and related helper dispatch functions; each call returns early on error before PTX/compute-capability selection, temporary-storage sizing/aliasing, or kernel dispatch logic. |
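The early-return pattern described in the walkthrough can be illustrated with a small standalone sketch. It is only an assumption about the general shape of a dispatch entry point, not code from the PR: `dispatch_sketch`, `validate_sketch`, `later_steps_run`, and the `fake_` error values are hypothetical, and the counter merely stands in for the planning and launch work that follows validation.

```cpp
#include <cassert>

enum fake_error_t
{
  fake_success = 0,
  fake_error_invalid_device = 1
};

// Hypothetical stand-in for the validation helper: succeeds only when the
// stream's device equals the current device.
fake_error_t validate_sketch(int stream_device, int current_device)
{
  return stream_device == current_device ? fake_success : fake_error_invalid_device;
}

// Counts how often the work after validation was reached.
static int later_steps_run = 0;

// Hypothetical dispatch entry point: validation comes first, everything else after.
fake_error_t dispatch_sketch(int stream_device, int current_device)
{
  const fake_error_t error = validate_sketch(stream_device, current_device);
  if (error != fake_success)
  {
    return error; // bail out before any planning or launch work
  }

  // Placeholder for the PTX-version query, temporary-storage sizing, and kernel launch.
  ++later_steps_run;
  return fake_success;
}
```

Calling `dispatch_sketch(0, 1)` returns the error and leaves `later_steps_run` untouched, while `dispatch_sketch(0, 0)` proceeds to the placeholder work.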

Assessment against linked issues

Objective Addressed Explanation
Add stream-device validation to CUB dispatch functions [#7782]

Suggested labels

libcu++

Suggested reviewers

  • bernhardmgruber
  • srinivasyadav18
  • davebayer

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (16)
cub/cub/device/dispatch/dispatch_merge_sort.cuh (1)

406-406: ⚡ Quick win

suggestion: qualify validate_stream_device(stream) with its global namespace-qualified symbol (matching its declaration namespace) instead of using unqualified lookup.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 477-477

cub/cub/device/dispatch/dispatch_radix_sort.cuh (1)

1141-1141: ⚡ Quick win

suggestion: use the global namespace-qualified form of validate_stream_device(stream) at both dispatch entry points to satisfy the free-function qualification rule.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 1206-1206

cub/cub/device/dispatch/dispatch_reduce.cuh (1)

481-481: ⚡ Quick win

suggestion: qualify validate_stream_device(stream) from the global namespace in both locations rather than relying on unqualified lookup.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 754-754

cub/cub/device/dispatch/dispatch_reduce_by_key.cuh (1)

609-609: ⚡ Quick win

suggestion: switch both validate_stream_device(stream) calls to the fully qualified global-namespace symbol to comply with the free-function call rule.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 698-698

cub/cub/device/dispatch/dispatch_reduce_deterministic.cuh (1)

342-342: ⚡ Quick win

suggestion: qualify validate_stream_device(stream) from the global namespace instead of calling it unqualified.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

cub/cub/device/dispatch/dispatch_reduce_nondeterministic.cuh (1)

176-176: ⚡ Quick win

suggestion: call validate_stream_device(stream) via its global namespace-qualified symbol here.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

cub/cub/device/dispatch/dispatch_rle.cuh (1)

608-608: ⚡ Quick win

suggestion: make both validate_stream_device(stream) calls fully qualified from the global namespace to align with project call-qualification rules.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 666-666

cub/cub/device/dispatch/dispatch_scan.cuh (1)

865-865: ⚡ Quick win

suggestion: use the global namespace-qualified form for validate_stream_device(stream) in both locations rather than unqualified calls.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 933-933

cub/cub/device/dispatch/dispatch_scan_by_key.cuh (1)

599-599: ⚡ Quick win

suggestion: Qualify validate_stream_device from the global namespace in both dispatch entrypoints to match the repository call-style rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 737-737

cub/cub/device/dispatch/dispatch_segmented_radix_sort.cuh (1)

620-620: ⚡ Quick win

suggestion: Use a globally qualified call for validate_stream_device at both insertion points to keep dispatch code aligned with repository qualification rules.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 907-907

cub/cub/device/dispatch/dispatch_segmented_reduce.cuh (1)

424-424: ⚡ Quick win

suggestion: Fully qualify validate_stream_device from global scope in both dispatch paths for consistency with the project’s free-function call rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 531-531

cub/cub/device/dispatch/dispatch_segmented_scan.cuh (1)

132-132: ⚡ Quick win

suggestion: Qualify validate_stream_device from the global namespace here to satisfy the repository’s free-function qualification requirement.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

cub/cub/device/dispatch/dispatch_segmented_sort.cuh (1)

692-692: ⚡ Quick win

suggestion: Switch both validate_stream_device invocations to globally qualified form to match the enforced free-function qualification convention.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 1285-1285

cub/cub/device/dispatch/dispatch_select_if.cuh (1)

846-846: ⚡ Quick win

suggestion: Apply global qualification to validate_stream_device in both dispatch entrypoints to comply with the project-wide free-function call convention.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 1105-1105

cub/cub/device/dispatch/dispatch_three_way_partition.cuh (1)

367-367: ⚡ Quick win

suggestion: Use globally qualified validate_stream_device calls in both updated dispatch layers to align with the mandatory free-function qualification rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 438-438

cub/cub/device/dispatch/dispatch_topk.cuh (1)

478-478: ⚡ Quick win

suggestion: Qualify validate_stream_device from global scope in this dispatch entrypoint to satisfy the repository free-function qualification rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".
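As background for the qualification rule quoted in these nitpicks, the following standalone sketch shows one reason a library may insist on globally qualified calls: an unqualified call inside a template is subject to argument-dependent lookup (ADL), so an overload in the caller's namespace can be selected instead of the library's own function. The namespaces `lib` and `user` and all function names are hypothetical and unrelated to CUB.

```cpp
#include <cassert>

namespace lib
{
// Generic helper that the library intends to call internally.
template <class T>
int describe(const T&)
{
  return 1;
}

// Unqualified call: ADL also searches the namespaces associated with T.
template <class T>
int call_unqualified(const T& value)
{
  return describe(value);
}

// Globally qualified call: always resolves to ::lib::describe.
template <class T>
int call_qualified(const T& value)
{
  return ::lib::describe(value);
}
} // namespace lib

namespace user
{
struct widget
{};

// A non-template overload in the argument's namespace; ADL finds it and
// overload resolution prefers it over the library's template.
int describe(const widget&)
{
  return 2;
}
} // namespace user
```

Here `lib::call_unqualified(user::widget{})` ends up in `user::describe`, whereas `lib::call_qualified(user::widget{})` stays inside `lib`.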


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d183277-1a01-4666-9f1b-617c330bdabb

📥 Commits

Reviewing files that changed from the base of the PR and between c47f140 and e23c56c.

📒 Files selected for processing (26)
  • cub/cub/device/dispatch/dispatch_adjacent_difference.cuh
  • cub/cub/device/dispatch/dispatch_batch_memcpy.cuh
  • cub/cub/device/dispatch/dispatch_batched_topk.cuh
  • cub/cub/device/dispatch/dispatch_find.cuh
  • cub/cub/device/dispatch/dispatch_for.cuh
  • cub/cub/device/dispatch/dispatch_histogram.cuh
  • cub/cub/device/dispatch/dispatch_merge.cuh
  • cub/cub/device/dispatch/dispatch_merge_sort.cuh
  • cub/cub/device/dispatch/dispatch_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_reduce.cuh
  • cub/cub/device/dispatch/dispatch_reduce_by_key.cuh
  • cub/cub/device/dispatch/dispatch_reduce_deterministic.cuh
  • cub/cub/device/dispatch/dispatch_reduce_nondeterministic.cuh
  • cub/cub/device/dispatch/dispatch_rle.cuh
  • cub/cub/device/dispatch/dispatch_scan.cuh
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/dispatch_segmented_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
  • cub/cub/device/dispatch/dispatch_segmented_scan.cuh
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh
  • cub/cub/device/dispatch/dispatch_select_if.cuh
  • cub/cub/device/dispatch/dispatch_three_way_partition.cuh
  • cub/cub/device/dispatch/dispatch_topk.cuh
  • cub/cub/device/dispatch/dispatch_transform.cuh
  • cub/cub/device/dispatch/dispatch_unique_by_key.cuh
  • cub/cub/util_device.cuh

Comment thread cub/cub/device/dispatch/dispatch_adjacent_difference.cuh Outdated
Comment thread cub/cub/util_device.cuh
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2d2b79d9-a47e-40c6-99d9-5d9e96a54b55

📥 Commits

Reviewing files that changed from the base of the PR and between e23c56c and a8d21f9.

📒 Files selected for processing (26)
  • cub/cub/device/dispatch/dispatch_adjacent_difference.cuh
  • cub/cub/device/dispatch/dispatch_batch_memcpy.cuh
  • cub/cub/device/dispatch/dispatch_batched_topk.cuh
  • cub/cub/device/dispatch/dispatch_find.cuh
  • cub/cub/device/dispatch/dispatch_for.cuh
  • cub/cub/device/dispatch/dispatch_histogram.cuh
  • cub/cub/device/dispatch/dispatch_merge.cuh
  • cub/cub/device/dispatch/dispatch_merge_sort.cuh
  • cub/cub/device/dispatch/dispatch_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_reduce.cuh
  • cub/cub/device/dispatch/dispatch_reduce_by_key.cuh
  • cub/cub/device/dispatch/dispatch_reduce_deterministic.cuh
  • cub/cub/device/dispatch/dispatch_reduce_nondeterministic.cuh
  • cub/cub/device/dispatch/dispatch_rle.cuh
  • cub/cub/device/dispatch/dispatch_scan.cuh
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/dispatch_segmented_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
  • cub/cub/device/dispatch/dispatch_segmented_scan.cuh
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh
  • cub/cub/device/dispatch/dispatch_select_if.cuh
  • cub/cub/device/dispatch/dispatch_three_way_partition.cuh
  • cub/cub/device/dispatch/dispatch_topk.cuh
  • cub/cub/device/dispatch/dispatch_transform.cuh
  • cub/cub/device/dispatch/dispatch_unique_by_key.cuh
  • cub/cub/util_device.cuh
✅ Files skipped from review due to trivial changes (1)
  • cub/cub/device/dispatch/dispatch_histogram.cuh

Comment thread cub/cub/device/dispatch/dispatch_scan_by_key.cuh
Comment thread cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
Contributor

@bernhardmgruber bernhardmgruber left a comment

Thanks a lot for this contribution! Please add a unit test to at least one algorithm calling it with a stream that does not match the current device. This test must be written in a way that it also works if there is only one GPU/device in the system (just succeeding is fine I think). I can try it briefly on my machine where I have two GPUs.
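One possible structure for such a test is sketched below. It is a standalone illustration with a simulated runtime, so it is an assumption about the test's logic rather than an actual CUB test: `fake_runtime`, `validate_sketch`, and `mismatch_scenario` are hypothetical names. In a real test the device count, device switch, and stream creation would presumably come from cudaGetDeviceCount, cudaSetDevice, and cudaStreamCreate.

```cpp
#include <cassert>

enum fake_error_t
{
  fake_success = 0,
  fake_error_invalid_device = 1
};

// Simulated runtime state: how many devices exist and which one is current.
struct fake_runtime
{
  int device_count;
  int current_device;
};

// Hypothetical stand-in for the validation helper.
fake_error_t validate_sketch(const fake_runtime& rt, int stream_device)
{
  return stream_device == rt.current_device ? fake_success : fake_error_invalid_device;
}

// Test scenario: create the stream on device 0, then switch to another device
// only if one exists, and report what the validation returns.
fake_error_t mismatch_scenario(fake_runtime rt)
{
  rt.current_device = 0;
  const int stream_device = rt.current_device; // "create" the stream on device 0

  if (rt.device_count >= 2)
  {
    rt.current_device = 1; // a mismatch can only be constructed with two or more devices
  }

  return validate_sketch(rt, stream_device);
}
```

On a simulated single-device system the scenario simply succeeds, and with two devices it reports the mismatch, which is the behavior the test is asked to tolerate.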

Comment thread cub/cub/util_device.cuh
Comment on lines +467 to +471
error = cudaStreamGetDevice(stream, &streamDevice);
if (error != cudaSuccess)
{
return error;
}

Suggestion: Let's not reuse the error variable:

Suggested change
error = cudaStreamGetDevice(stream, &streamDevice);
if (error != cudaSuccess)
{
return error;
}
if (const auto error = cudaStreamGetDevice(stream, &streamDevice); error != cudaSuccess)
{
return error;
}

Comment thread cub/cub/util_device.cuh
Comment on lines +473 to +477
error = cudaGetDevice(&currentDevice);
if (error != cudaSuccess)
{
return error;
}

Suggested change
error = cudaGetDevice(&currentDevice);
if (error != cudaSuccess)
{
return error;
}
if (const auto error = cudaGetDevice(&currentDevice); error != cudaSuccess)
{
return error;
}

Comment thread cub/cub/util_device.cuh
return cudaErrorInvalidDevice;
}
# endif // _CCCL_CTK_AT_LEAST(12,8)
return error;

Suggested change
return error;
return cudaSuccess;

Comment thread cub/cub/util_device.cuh
CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t validate_stream_device(cudaStream_t stream)
{
cudaError_t error = cudaSuccess;
# if _CCCL_CTK_AT_LEAST(12, 8)

Important: sometimes users violate our API requirements, but their software ran fine for a long time. They would be upset if we suddenly enforce requirements, causing their software to break. Let's add a macro to disable this new feature:

Suggested change
# if _CCCL_CTK_AT_LEAST(12, 8)
# if _CCCL_CTK_AT_LEAST(12, 8) && !defined(CCCL_DISABLE_STREAM_DEVICE_CHECK)

If possible, add a unit test that calls a simple algorithm like DeviceFor with a stream and a different current device and define the CCCL_DISABLE_STREAM_DEVICE_CHECK macro, to see whether the escape hatch works.
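A minimal standalone sketch of how such an escape hatch behaves is shown below. It assumes the macro name from the suggestion above, leaves out the _CCCL_CTK_AT_LEAST(12, 8) part of the guard, and uses hypothetical stand-ins (`validate_sketch`, `fake_` error values) in place of the CUDA runtime. The macro is defined in the source file only to keep the sketch self-contained; in a real build it would more likely be passed on the compiler command line.

```cpp
#include <cassert>

// Defining the macro before the helper is compiled removes the check.
#define CCCL_DISABLE_STREAM_DEVICE_CHECK

enum fake_error_t
{
  fake_success = 0,
  fake_error_invalid_device = 1
};

// Hypothetical stand-in for the validation helper with an opt-out guard.
fake_error_t validate_sketch(int stream_device, int current_device)
{
#if !defined(CCCL_DISABLE_STREAM_DEVICE_CHECK)
  if (stream_device != current_device)
  {
    return fake_error_invalid_device;
  }
#else
  // Check compiled out: silence unused-parameter warnings.
  (void) stream_device;
  (void) current_device;
#endif
  return fake_success;
}
```

With the macro defined, even a mismatching pair such as `validate_sketch(0, 1)` reports success, which is what a test of the escape hatch would check.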
