Skip to content

Fix Arrow device string view variadic buffers#22620

Open
0ax1 wants to merge 4 commits into
rapidsai:mainfrom
0ax1:fix/arrow-device-string-view-variadic-buffers
Open

Fix Arrow device string view variadic buffers#22620
0ax1 wants to merge 4 commits into
rapidsai:mainfrom
0ax1:fix/arrow-device-string-view-variadic-buffers

Conversation

@0ax1
Copy link
Copy Markdown

@0ax1 0ax1 commented May 21, 2026

Description

Arrow C Device arrays can be produced by libraries other than Nanoarrow. In that case ArrowArray.private_data is producer-owned, so importing STRING_VIEW data must not call ArrowArrayBuffer() to discover variadic buffers via Nanoarrow's private_data layout.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Arrow C Device arrays can be produced by libraries other than Nanoarrow. In that case ArrowArray.private_data is producer-owned, so importing STRING_VIEW data must not call ArrowArrayBuffer() to discover variadic buffers via Nanoarrow's private_data layout.

Use the ArrowArrayView normalized variadic_buffers pointers populated by ArrowArrayViewSetArray() instead. This matches the host import path and accepts valid external producers such as the Vortex-based cudf-test-harness reproduction in commit 2b4e4d3 (Patch cuDF Arrow device StringView import).

Add a from_arrow_device regression test that constructs a STRING_VIEW Arrow C Device array with producer-owned private_data and verifies import via from_arrow_device_column().

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 requested a review from a team as a code owner May 21, 2026 10:05
@0ax1 0ax1 requested review from mythrocks and vyasr May 21, 2026 10:05
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the libcudf Affects libcudf (C++/CUDA) code. label May 21, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of string-view conversion from Arrow device arrays by correcting how device string buffers are sourced.
  • Tests
    • Added tests covering string-view conversion with producer-owned private data and explicit device-side Arrow array layouts.

Walkthrough

The PR changes how Arrow device STRING_VIEW arrays obtain variadic buffer pointers—using ArrowArrayView::variadic_buffers[i] during conversion—and updates tests to manually construct device ArrowArray buffers, adding a producer-owned private_data test.

Changes

Arrow STRING_VIEW conversion with producer-owned private data

Layer / File(s) Summary
Variadic buffer sourcing in conversion
cpp/src/interop/from_arrow_device.cu
The variadic buffer pointer gathering loop now uses view.variadic_buffers[i] directly for NANOARROW_TYPE_STRING_VIEW.
Device-side STRING_VIEW tests including producer-owned private_data
cpp/tests/interop/from_arrow_device_test.cpp
Replaces Nanoarrow builder usage with manual ArrowArray buffer assembly for STRING_VIEW in StringViewType, and adds StringViewTypeWithProducerOwnedPrivateData which sets ArrowArray.private_data and validates cudf::from_arrow_device_column produces the expected column.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix Arrow device string view variadic buffers' directly and concisely describes the main change: fixing how variadic buffers are sourced in STRING_VIEW Arrow-device conversion.
Description check ✅ Passed The description explains the core issue: ArrowArray.private_data is producer-owned in non-Nanoarrow libraries, and importing STRING_VIEW data must use ArrowArrayView's variadic_buffers instead of ArrowArrayBuffer().
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@davidwendt
Copy link
Copy Markdown
Contributor

/ok to test 529a049

@davidwendt davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels May 21, 2026
@0ax1
Copy link
Copy Markdown
Author

0ax1 commented May 21, 2026

Double checking on my end why the tests fail. Tests are fixed now. @davidwendt

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tests/interop/from_arrow_device_test.cpp (1)

489-499: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a UTF-8 value to these STRING_VIEW fixtures.

Both tests only use ASCII literals, so this path still misses multibyte string coverage. Please swap in at least one non-ASCII value here so the new STRING_VIEW regression also verifies UTF-8 handling. As per coding guidelines, "String tests must include non-ASCII UTF-8 characters to verify proper encoding handling."

Suggested fixture tweak
 auto data = std::vector<std::string>({"hello",
-                                      "worldy",
+                                      "Überwald",
                                       "much longer string",
                                       "",
                                       "another even longer string",
                                       "",
-                                      "other string"});
+                                      "naïve café"});

Apply the same change to both duplicated data fixtures.

Also applies to: 590-600

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 489 - 499, The
string test fixtures use only ASCII literals; update the
std::vector<std::string> named data in from_arrow_device_test.cpp (the fixture
that feeds into cudf::test::strings_column_wrapper and the expected_col) to
replace or add at least one non-ASCII UTF-8 value (e.g., include an accented or
multibyte character) so the STRING_VIEW regression verifies UTF-8 handling, and
apply the identical change to the other duplicated data fixture used later in
the file to ensure both tests cover multibyte UTF-8 strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 554-576: The test builds manual ArrowArray instances (child_array
and struct_array) with offset = 0 and therefore never exercises the sliced path
that ArrowArrayViewSetArray() normalizes; add an additional test case that
constructs a sliced STRING_VIEW variant by creating a second set of ArrowArray
instances (e.g., child_array_sliced and struct_array_sliced) with a non-zero
offset and adjusted length/null_count and matching buffers/children, then invoke
ArrowArrayViewSetArray() on that sliced array to cover the normalization logic;
apply the same change for the other duplicated block around the 652-668 region
so both locations include a non-zero-offset (sliced) test alongside the existing
zero-offset case.

---

Outside diff comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 489-499: The string test fixtures use only ASCII literals; update
the std::vector<std::string> named data in from_arrow_device_test.cpp (the
fixture that feeds into cudf::test::strings_column_wrapper and the expected_col)
to replace or add at least one non-ASCII UTF-8 value (e.g., include an accented
or multibyte character) so the STRING_VIEW regression verifies UTF-8 handling,
and apply the identical change to the other duplicated data fixture used later
in the file to ensure both tests cover multibyte UTF-8 strings.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a68cc39f-59b5-4f5b-b0b8-7b8c0ce9a47f

📥 Commits

Reviewing files that changed from the base of the PR and between 529a049 and 28ecf28.

📒 Files selected for processing (1)
  • cpp/tests/interop/from_arrow_device_test.cpp

Comment thread cpp/tests/interop/from_arrow_device_test.cpp
@davidwendt
Copy link
Copy Markdown
Contributor

/ok to test 0ax1@28ecf28

@davidwendt
Copy link
Copy Markdown
Contributor

davidwendt commented May 21, 2026

There are still some pending style issues it seems.

https://github.com/rapidsai/cudf/actions/runs/26254142222/job/77272618273#step:7:1

@0ax1
Copy link
Copy Markdown
Author

0ax1 commented May 22, 2026

There are still some pending style issues it seems.

https://github.com/rapidsai/cudf/actions/runs/26254142222/job/77272618273#step:7:1

Thanks for the heads up, I pushed a clang-format commit by now.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tests/interop/from_arrow_device_test.cpp (2)

489-500: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a non-ASCII UTF-8 case to these new STRING_VIEW regressions.

Both tests only use ASCII payloads, so they miss the encoding-sensitive path this file is required to cover. Replacing one of the longer strings with something like "Über" or "你好世界" would keep the regression focused while also validating byte-length handling through the variadic buffers.

As per coding guidelines, "String tests must include non-ASCII UTF-8 characters to verify proper encoding handling."

Also applies to: 590-601

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 489 - 500, The
string test vectors (data) used to build expected_col via
cudf::test::strings_column_wrapper only contain ASCII and miss
UTF-8/encoding-sensitive paths; update the data vector to replace one of the
longer ASCII entries (e.g., "much longer string" or "another even longer
string") with a non-ASCII UTF-8 example such as "Über" or "你好世界" so that
expected_col/expected_view validate multi-byte UTF-8 handling; apply the same
change to the other analogous test block referenced (the second
data/validity/expected_col/expected_view group).

554-576: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid constructing “already released” ArrowArray objects in from_arrow_device_test.cpp

cudf::from_arrow_device* doesn’t call the input ArrowArray’s release, but setting release == nullptr marks the ArrowArray as already released per the Arrow C Data interface—yet these tests still feed the arrays’ live buffers into the conversion. Replace those release = nullptr assignments with a small no-op release callback (that sets array->release = nullptr) for child_array, struct_array, and device_array (also applies to the later block around ~651–663).

Suggested fix
+namespace {
+void noop_arrow_array_release(ArrowArray* array) { array->release = nullptr; }
+}  // namespace
+
   ArrowArray child_array{};
   child_array.length     = input.length;
   child_array.null_count = expected_view.null_count();
   child_array.offset     = 0;
   child_array.n_buffers  = static_cast<int64_t>(child_buffers.size());
   child_array.n_children = 0;
   child_array.buffers    = child_buffers.data();
   child_array.children   = nullptr;
   child_array.dictionary = nullptr;
-  child_array.release    = nullptr;
+  child_array.release    = noop_arrow_array_release;
...
   ArrowArray struct_array{};
   struct_array.length     = input.length;
   struct_array.null_count = 0;
   struct_array.offset     = 0;
   struct_array.n_buffers  = 1;
   struct_array.n_children = 1;
   struct_array.buffers    = struct_buffers;
   struct_array.children   = children;
   struct_array.dictionary = nullptr;
-  struct_array.release    = nullptr;
+  struct_array.release    = noop_arrow_array_release;
...
   device_array.n_buffers    = static_cast<int64_t>(device_buffers.size());
   device_array.n_children   = 0;
   device_array.buffers      = device_buffers.data();
   device_array.children     = nullptr;
   device_array.dictionary   = nullptr;
-  device_array.release      = nullptr;
+  device_array.release      = noop_arrow_array_release;
   device_array.private_data = &producer_private_data;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 554 - 576, The
test constructs ArrowArray objects (child_array, struct_array, and device_array)
with release == nullptr which marks them as already released per the Arrow C
Data interface; instead provide a small no-op release callback that, when
invoked, sets array->release = nullptr so the arrays are considered valid until
released; update the assignments for child_array.release, struct_array.release,
and device_array.release (and the later analogous block) to point to that no-op
function so the buffers can be safely consumed by cudf::from_arrow_device*
during the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 489-500: The string test vectors (data) used to build expected_col
via cudf::test::strings_column_wrapper only contain ASCII and miss
UTF-8/encoding-sensitive paths; update the data vector to replace one of the
longer ASCII entries (e.g., "much longer string" or "another even longer
string") with a non-ASCII UTF-8 example such as "Über" or "你好世界" so that
expected_col/expected_view validate multi-byte UTF-8 handling; apply the same
change to the other analogous test block referenced (the second
data/validity/expected_col/expected_view group).
- Around line 554-576: The test constructs ArrowArray objects (child_array,
struct_array, and device_array) with release == nullptr which marks them as
already released per the Arrow C Data interface; instead provide a small no-op
release callback that, when invoked, sets array->release = nullptr so the arrays
are considered valid until released; update the assignments for
child_array.release, struct_array.release, and device_array.release (and the
later analogous block) to point to that no-op function so the buffers can be
safely consumed by cudf::from_arrow_device* during the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 68e6ec58-f580-4475-8a0e-d0520ba16f38

📥 Commits

Reviewing files that changed from the base of the PR and between 28ecf28 and 001c4cf.

📒 Files selected for processing (1)
  • cpp/tests/interop/from_arrow_device_test.cpp

@davidwendt
Copy link
Copy Markdown
Contributor

/ok to test 001c4cf

@0ax1
Copy link
Copy Markdown
Author

0ax1 commented May 22, 2026

The conda-python-other-tests failing seems unrelated to my PR?

@davidwendt davidwendt removed their assignment May 22, 2026
@davidwendt
Copy link
Copy Markdown
Contributor

/ok to test e42b85d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants