Fix Arrow device string view variadic buffers by 0ax1 · Pull Request #22620 · rapidsai/cudf

0ax1 · 2026-05-21T10:05:46Z

Description

Arrow C Device arrays can be produced by libraries other than Nanoarrow. In that case ArrowArray.private_data is producer-owned, so importing STRING_VIEW data must not call ArrowArrayBuffer() to discover variadic buffers via Nanoarrow's private_data layout.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Arrow C Device arrays can be produced by libraries other than Nanoarrow. In that case ArrowArray.private_data is producer-owned, so importing STRING_VIEW data must not call ArrowArrayBuffer() to discover variadic buffers via Nanoarrow's private_data layout.

Use the ArrowArrayView normalized variadic_buffers pointers populated by ArrowArrayViewSetArray() instead. This matches the host import path and accepts valid external producers such as the Vortex-based cudf-test-harness reproduction in commit 2b4e4d3 (Patch cuDF Arrow device StringView import).

Add a from_arrow_device regression test that constructs a STRING_VIEW Arrow C Device array with producer-owned private_data and verifies import via from_arrow_device_column().

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
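To illustrate the idea in the commit message, here is a minimal sketch of locating variadic buffers purely from the public `ArrowArray` fields, without ever touching `private_data`. It is not the code from this PR and does not use Nanoarrow. It assumes the string-view layout of the Arrow C Data Interface (validity, views, the variadic data buffers, then one trailing buffer of `int64_t` sizes). `VariadicBuffers` and `collect_variadic_buffers` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>

// ArrowArray as laid out by the Arrow C Data Interface ABI.
struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;  // producer-owned: a consumer must not interpret it
};

// Hypothetical result type: pointers only, nothing is dereferenced here,
// so the same arithmetic applies when the pointers are device addresses.
struct VariadicBuffers {
  std::int64_t count;         // number of variadic data buffers
  const void* const* data;    // points at buffers[2]
  const std::int64_t* sizes;  // trailing buffer holding one size per data buffer
};

// Assumed layout: [validity, views, data_0 .. data_{n-1}, sizes].
inline VariadicBuffers collect_variadic_buffers(const ArrowArray& array)
{
  assert(array.n_buffers >= 3);  // validity + views + sizes are always present
  VariadicBuffers out{};
  out.count = array.n_buffers - 3;
  out.data  = array.buffers + 2;
  out.sizes = static_cast<const std::int64_t*>(array.buffers[array.n_buffers - 1]);
  return out;
}
```

Because the helper reads only `n_buffers` and `buffers`, it behaves the same whatever the producer stored in `private_data`.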

copy-pr-bot · 2026-05-21T10:05:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-21T10:08:17Z

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Improved reliability of string-view conversion from Arrow device arrays by correcting how device string buffers are sourced.
Tests
- Added tests covering string-view conversion with producer-owned private data and explicit device-side Arrow array layouts.

Walkthrough

The PR changes how Arrow device STRING_VIEW arrays obtain variadic buffer pointers: conversion now reads ArrowArrayView::variadic_buffers[i]. It also updates the tests to construct device ArrowArray buffers manually and adds a producer-owned private_data test.

Changes

Arrow STRING_VIEW conversion with producer-owned private data

| Layer / File(s) | Summary |
| --- | --- |
| Variadic buffer sourcing in conversion `cpp/src/interop/from_arrow_device.cu` | The variadic buffer pointer gathering loop now uses `view.variadic_buffers[i]` directly for `NANOARROW_TYPE_STRING_VIEW`. |
| Device-side STRING_VIEW tests including producer-owned private_data `cpp/tests/interop/from_arrow_device_test.cpp` | Replaces Nanoarrow builder usage with manual `ArrowArray` buffer assembly for `STRING_VIEW` in `StringViewType`, and adds `StringViewTypeWithProducerOwnedPrivateData` which sets `ArrowArray.private_data` and validates `cudf::from_arrow_device_column` produces the expected column. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fix Arrow device string view variadic buffers' directly and concisely describes the main change: fixing how variadic buffers are sourced in STRING_VIEW Arrow-device conversion. |
| Description check | ✅ Passed | The description explains the core issue: ArrowArray.private_data is producer-owned in non-Nanoarrow libraries, and importing STRING_VIEW data must use ArrowArrayView's variadic_buffers instead of ArrowArrayBuffer(). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


davidwendt · 2026-05-21T10:47:19Z

/ok to test 529a049

0ax1 · 2026-05-21T12:00:49Z

~~Double checking on my end why the tests fail.~~ Tests are fixed now. @davidwendt

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tests/interop/from_arrow_device_test.cpp (1)

489-499: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a UTF-8 value to these STRING_VIEW fixtures.

Both tests only use ASCII literals, so this path still misses multibyte string coverage. Please swap in at least one non-ASCII value here so the new STRING_VIEW regression also verifies UTF-8 handling. As per coding guidelines, "String tests must include non-ASCII UTF-8 characters to verify proper encoding handling."

Suggested fixture tweak

 auto data = std::vector<std::string>({"hello",
-                                      "worldy",
+                                      "Überwald",
                                       "much longer string",
                                       "",
                                       "another even longer string",
                                       "",
-                                      "other string"});
+                                      "naïve café"});

Apply the same change to both duplicated data fixtures.

Also applies to: 590-600

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 489 - 499, The
string test fixtures use only ASCII literals; update the
std::vector<std::string> named data in from_arrow_device_test.cpp (the fixture
that feeds into cudf::test::strings_column_wrapper and the expected_col) to
replace or add at least one non-ASCII UTF-8 value (e.g., include an accented or
multibyte character) so the STRING_VIEW regression verifies UTF-8 handling, and
apply the identical change to the other duplicated data fixture used later in
the file to ensure both tests cover multibyte UTF-8 strings.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 554-576: The test builds manual ArrowArray instances (child_array
and struct_array) with offset = 0 and therefore never exercises the sliced path
that ArrowArrayViewSetArray() normalizes; add an additional test case that
constructs a sliced STRING_VIEW variant by creating a second set of ArrowArray
instances (e.g., child_array_sliced and struct_array_sliced) with a non-zero
offset and adjusted length/null_count and matching buffers/children, then invoke
ArrowArrayViewSetArray() on that sliced array to cover the normalization logic;
apply the same change for the other duplicated block around the 652-668 region
so both locations include a non-zero-offset (sliced) test alongside the existing
zero-offset case.
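As a rough illustration of what a non-zero `offset` means for the sliced case described above, the sketch below resolves logical element `i` of a slice to physical bit `offset + i` in an Arrow validity bitmap (bit-packed, least significant bit first). `is_valid` and `count_nulls` are hypothetical helpers written for this example and operate on host memory only; they are not part of cuDF or Nanoarrow.

```cpp
#include <cassert>
#include <cstdint>

// Returns whether logical element `i` of a slice starting at `array_offset`
// is valid. A null bitmap pointer means every element is valid.
inline bool is_valid(const std::uint8_t* validity, std::int64_t array_offset, std::int64_t i)
{
  assert(array_offset >= 0 && i >= 0);
  if (validity == nullptr) { return true; }
  const std::int64_t bit = array_offset + i;  // physical bit index
  return ((validity[bit / 8] >> (bit % 8)) & 1) != 0;
}

// Counts nulls among the `length` logical elements of the slice.
inline std::int64_t count_nulls(const std::uint8_t* validity,
                                std::int64_t array_offset,
                                std::int64_t length)
{
  std::int64_t nulls = 0;
  for (std::int64_t i = 0; i < length; ++i) {
    if (!is_valid(validity, array_offset, i)) { ++nulls; }
  }
  return nulls;
}
```

A sliced fixture would need its `length` and `null_count` to describe only the window starting at `offset`, which is why the comment asks for both to be adjusted.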



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a68cc39f-59b5-4f5b-b0b8-7b8c0ce9a47f

📥 Commits

Reviewing files that changed from the base of the PR and between 529a049 and 28ecf28.

📒 Files selected for processing (1)

cpp/tests/interop/from_arrow_device_test.cpp

davidwendt · 2026-05-21T21:27:59Z

/ok to test 0ax1@28ecf28

davidwendt · 2026-05-21T21:30:15Z

There are still some pending style issues it seems.

https://github.com/rapidsai/cudf/actions/runs/26254142222/job/77272618273#step:7:1

0ax1 · 2026-05-22T07:16:22Z

> There are still some pending style issues it seems.
>
> https://github.com/rapidsai/cudf/actions/runs/26254142222/job/77272618273#step:7:1

Thanks for the heads-up. I've now pushed a clang-format commit.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cpp/tests/interop/from_arrow_device_test.cpp (2)

489-500: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a non-ASCII UTF-8 case to these new STRING_VIEW regressions.

Both tests only use ASCII payloads, so they miss the encoding-sensitive path this file is required to cover. Replacing one of the longer strings with something like "Über" or "你好世界" would keep the regression focused while also validating byte-length handling through the variadic buffers.

As per coding guidelines, "String tests must include non-ASCII UTF-8 characters to verify proper encoding handling."

Also applies to: 590-601
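When picking UTF-8 fixtures for STRING_VIEW, it may help to check byte length rather than character count. In the Arrow string-view layout each view is 16 bytes, and a value of at most 12 bytes is stored inline in the view, so only longer values reference a variadic buffer. The sketch below is a small host-side check under that assumption; `kMaxInlineBytes`, `utf8_code_points`, `stored_inline`, and `view_length` are hypothetical names, not cuDF or Nanoarrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Assumed from the Arrow string-view layout: 4-byte length + 12 payload bytes.
constexpr std::size_t kMaxInlineBytes = 12;

// Counts code points by skipping UTF-8 continuation bytes (0b10xxxxxx).
constexpr std::size_t utf8_code_points(std::string_view s)
{
  std::size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) { ++n; }
  }
  return n;
}

// True when the value fits entirely inside the 16-byte view.
constexpr bool stored_inline(std::string_view s) { return s.size() <= kMaxInlineBytes; }

// The view's length field counts bytes, not characters.
inline std::int32_t view_length(std::string_view s)
{
  assert(s.size() <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
  return static_cast<std::int32_t>(s.size());
}

// An 18-byte ASCII fixture value is too long to be inlined.
static_assert(!stored_inline("much longer string"));
```

By this measure "Überwald" (9 bytes), "naïve café" (12 bytes), "Über" (5 bytes), and "你好世界" (12 bytes) would all be stored inline, so a fixture meant to push multibyte data through the variadic buffers would need a non-ASCII value longer than 12 bytes.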

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 489 - 500, The
string test vectors (data) used to build expected_col via
cudf::test::strings_column_wrapper only contain ASCII and miss
UTF-8/encoding-sensitive paths; update the data vector to replace one of the
longer ASCII entries (e.g., "much longer string" or "another even longer
string") with a non-ASCII UTF-8 example such as "Über" or "你好世界" so that
expected_col/expected_view validate multi-byte UTF-8 handling; apply the same
change to the other analogous test block referenced (the second
data/validity/expected_col/expected_view group).

554-576: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid constructing “already released” ArrowArray objects in from_arrow_device_test.cpp

cudf::from_arrow_device* doesn’t call the input ArrowArray’s release, but setting release == nullptr marks the ArrowArray as already released per the Arrow C Data interface—yet these tests still feed the arrays’ live buffers into the conversion. Replace those release = nullptr assignments with a small no-op release callback (that sets array->release = nullptr) for child_array, struct_array, and device_array (also applies to the later block around ~651–663).

Suggested fix

+namespace {
+void noop_arrow_array_release(ArrowArray* array) { array->release = nullptr; }
+}  // namespace
+
   ArrowArray child_array{};
   child_array.length     = input.length;
   child_array.null_count = expected_view.null_count();
   child_array.offset     = 0;
   child_array.n_buffers  = static_cast<int64_t>(child_buffers.size());
   child_array.n_children = 0;
   child_array.buffers    = child_buffers.data();
   child_array.children   = nullptr;
   child_array.dictionary = nullptr;
-  child_array.release    = nullptr;
+  child_array.release    = noop_arrow_array_release;
...
   ArrowArray struct_array{};
   struct_array.length     = input.length;
   struct_array.null_count = 0;
   struct_array.offset     = 0;
   struct_array.n_buffers  = 1;
   struct_array.n_children = 1;
   struct_array.buffers    = struct_buffers;
   struct_array.children   = children;
   struct_array.dictionary = nullptr;
-  struct_array.release    = nullptr;
+  struct_array.release    = noop_arrow_array_release;
...
   device_array.n_buffers    = static_cast<int64_t>(device_buffers.size());
   device_array.n_children   = 0;
   device_array.buffers      = device_buffers.data();
   device_array.children     = nullptr;
   device_array.dictionary   = nullptr;
-  device_array.release      = nullptr;
+  device_array.release      = noop_arrow_array_release;
   device_array.private_data = &producer_private_data;
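For context on the convention behind this suggestion, the sketch below shows the release lifecycle on a standalone `ArrowArray`: a null `release` pointer means "already released", and a release callback is expected to set it to null when it runs. `is_released` is a hypothetical helper named for this example, and the no-op callback mirrors the one proposed in the diff above.

```cpp
#include <cassert>
#include <cstdint>

// ArrowArray as laid out by the Arrow C Data Interface ABI.
struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// No-op release for fixtures whose buffers are owned by the test itself:
// nothing is freed, the array is only marked as released.
inline void noop_arrow_array_release(ArrowArray* array)
{
  assert(array != nullptr);
  array->release = nullptr;
}

// Hypothetical helper: a null release callback marks a released array.
inline bool is_released(const ArrowArray& array) { return array.release == nullptr; }
```

A value-initialized `ArrowArray{}` therefore starts out looking released until a callback is assigned.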

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 554 - 576, The
test constructs ArrowArray objects (child_array, struct_array, and device_array)
with release == nullptr which marks them as already released per the Arrow C
Data interface; instead provide a small no-op release callback that, when
invoked, sets array->release = nullptr so the arrays are considered valid until
released; update the assignments for child_array.release, struct_array.release,
and device_array.release (and the later analogous block) to point to that no-op
function so the buffers can be safely consumed by cudf::from_arrow_device*
during the test.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 68e6ec58-f580-4475-8a0e-d0520ba16f38

📥 Commits

Reviewing files that changed from the base of the PR and between 28ecf28 and 001c4cf.

📒 Files selected for processing (1)

cpp/tests/interop/from_arrow_device_test.cpp

davidwendt · 2026-05-22T11:40:05Z

/ok to test 001c4cf

0ax1 · 2026-05-22T12:34:13Z

The conda-python-other-tests failure seems unrelated to my PR?

davidwendt · 2026-05-22T12:42:17Z

/ok to test e42b85d

0ax1 requested a review from a team as a code owner May 21, 2026 10:05

0ax1 requested review from mythrocks and vyasr May 21, 2026 10:05

github-actions Bot assigned 0ax1 May 21, 2026

github-actions Bot added the libcudf Affects libcudf (C++/CUDA) code. label May 21, 2026

0ax1 mentioned this pull request May 21, 2026

chore: polish ArrowDeviceArray vortex-data/vortex#8023

Merged

davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels May 21, 2026

fix tests

28ecf28

coderabbitai Bot reviewed May 21, 2026


Comment thread cpp/tests/interop/from_arrow_device_test.cpp

clang-format

001c4cf

coderabbitai Bot reviewed May 22, 2026


Merge branch 'main' into fix/arrow-device-string-view-variadic-buffers

e42b85d

github-actions Bot assigned davidwendt May 22, 2026

davidwendt removed their assignment May 22, 2026

davidwendt approved these changes May 22, 2026

