Fix Arrow device string view variadic buffers#22620
Conversation
Arrow C Device arrays can be produced by libraries other than Nanoarrow. In that case ArrowArray.private_data is producer-owned, so importing STRING_VIEW data must not call ArrowArrayBuffer() to discover variadic buffers via Nanoarrow's private_data layout. Use the ArrowArrayView normalized variadic_buffers pointers populated by ArrowArrayViewSetArray() instead. This matches the host import path and accepts valid external producers such as the Vortex-based cudf-test-harness reproduction in commit 2b4e4d3 (Patch cuDF Arrow device StringView import). Add a from_arrow_device regression test that constructs a STRING_VIEW Arrow C Device array with producer-owned private_data and verifies import via from_arrow_device_column(). Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe PR changes how Arrow device STRING_VIEW arrays obtain variadic buffer pointers—using ChangesArrow STRING_VIEW conversion with producer-owned private data
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/ok to test 529a049 |
|
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tests/interop/from_arrow_device_test.cpp (1)
489-499:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a UTF-8 value to these STRING_VIEW fixtures.
Both tests only use ASCII literals, so this path still misses multibyte string coverage. Please swap in at least one non-ASCII value here so the new
STRING_VIEWregression also verifies UTF-8 handling. As per coding guidelines, "String tests must include non-ASCII UTF-8 characters to verify proper encoding handling."Suggested fixture tweak
auto data = std::vector<std::string>({"hello", - "worldy", + "Überwald", "much longer string", "", "another even longer string", "", - "other string"}); + "naïve café"});Apply the same change to both duplicated
datafixtures.Also applies to: 590-600
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 489 - 499, The string test fixtures use only ASCII literals; update the std::vector<std::string> named data in from_arrow_device_test.cpp (the fixture that feeds into cudf::test::strings_column_wrapper and the expected_col) to replace or add at least one non-ASCII UTF-8 value (e.g., include an accented or multibyte character) so the STRING_VIEW regression verifies UTF-8 handling, and apply the identical change to the other duplicated data fixture used later in the file to ensure both tests cover multibyte UTF-8 strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 554-576: The test builds manual ArrowArray instances (child_array
and struct_array) with offset = 0 and therefore never exercises the sliced path
that ArrowArrayViewSetArray() normalizes; add an additional test case that
constructs a sliced STRING_VIEW variant by creating a second set of ArrowArray
instances (e.g., child_array_sliced and struct_array_sliced) with a non-zero
offset and adjusted length/null_count and matching buffers/children, then invoke
ArrowArrayViewSetArray() on that sliced array to cover the normalization logic;
apply the same change for the other duplicated block around the 652-668 region
so both locations include a non-zero-offset (sliced) test alongside the existing
zero-offset case.
---
Outside diff comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 489-499: The string test fixtures use only ASCII literals; update
the std::vector<std::string> named data in from_arrow_device_test.cpp (the
fixture that feeds into cudf::test::strings_column_wrapper and the expected_col)
to replace or add at least one non-ASCII UTF-8 value (e.g., include an accented
or multibyte character) so the STRING_VIEW regression verifies UTF-8 handling,
and apply the identical change to the other duplicated data fixture used later
in the file to ensure both tests cover multibyte UTF-8 strings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a68cc39f-59b5-4f5b-b0b8-7b8c0ce9a47f
📒 Files selected for processing (1)
cpp/tests/interop/from_arrow_device_test.cpp
|
/ok to test 0ax1@28ecf28 |
|
There are still some pending style issues it seems. https://github.com/rapidsai/cudf/actions/runs/26254142222/job/77272618273#step:7:1 |
Thanks for the heads up, I pushed a clang-format commit by now. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tests/interop/from_arrow_device_test.cpp (2)
489-500:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a non-ASCII UTF-8 case to these new
STRING_VIEWregressions.Both tests only use ASCII payloads, so they miss the encoding-sensitive path this file is required to cover. Replacing one of the longer strings with something like
"Über"or"你好世界"would keep the regression focused while also validating byte-length handling through the variadic buffers.As per coding guidelines, "String tests must include non-ASCII UTF-8 characters to verify proper encoding handling."
Also applies to: 590-601
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 489 - 500, The string test vectors (data) used to build expected_col via cudf::test::strings_column_wrapper only contain ASCII and miss UTF-8/encoding-sensitive paths; update the data vector to replace one of the longer ASCII entries (e.g., "much longer string" or "another even longer string") with a non-ASCII UTF-8 example such as "Über" or "你好世界" so that expected_col/expected_view validate multi-byte UTF-8 handling; apply the same change to the other analogous test block referenced (the second data/validity/expected_col/expected_view group).
554-576:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid constructing “already released”
ArrowArrayobjects infrom_arrow_device_test.cpp
cudf::from_arrow_device*doesn’t call the inputArrowArray’srelease, but settingrelease == nullptrmarks theArrowArrayas already released per the Arrow C Data interface—yet these tests still feed the arrays’ live buffers into the conversion. Replace thoserelease = nullptrassignments with a small no-op release callback (that setsarray->release = nullptr) forchild_array,struct_array, anddevice_array(also applies to the later block around ~651–663).Suggested fix
+namespace { +void noop_arrow_array_release(ArrowArray* array) { array->release = nullptr; } +} // namespace + ArrowArray child_array{}; child_array.length = input.length; child_array.null_count = expected_view.null_count(); child_array.offset = 0; child_array.n_buffers = static_cast<int64_t>(child_buffers.size()); child_array.n_children = 0; child_array.buffers = child_buffers.data(); child_array.children = nullptr; child_array.dictionary = nullptr; - child_array.release = nullptr; + child_array.release = noop_arrow_array_release; ... ArrowArray struct_array{}; struct_array.length = input.length; struct_array.null_count = 0; struct_array.offset = 0; struct_array.n_buffers = 1; struct_array.n_children = 1; struct_array.buffers = struct_buffers; struct_array.children = children; struct_array.dictionary = nullptr; - struct_array.release = nullptr; + struct_array.release = noop_arrow_array_release; ... device_array.n_buffers = static_cast<int64_t>(device_buffers.size()); device_array.n_children = 0; device_array.buffers = device_buffers.data(); device_array.children = nullptr; device_array.dictionary = nullptr; - device_array.release = nullptr; + device_array.release = noop_arrow_array_release; device_array.private_data = &producer_private_data;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/interop/from_arrow_device_test.cpp` around lines 554 - 576, The test constructs ArrowArray objects (child_array, struct_array, and device_array) with release == nullptr which marks them as already released per the Arrow C Data interface; instead provide a small no-op release callback that, when invoked, sets array->release = nullptr so the arrays are considered valid until released; update the assignments for child_array.release, struct_array.release, and device_array.release (and the later analogous block) to point to that no-op function so the buffers can be safely consumed by cudf::from_arrow_device* during the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/tests/interop/from_arrow_device_test.cpp`:
- Around line 489-500: The string test vectors (data) used to build expected_col
via cudf::test::strings_column_wrapper only contain ASCII and miss
UTF-8/encoding-sensitive paths; update the data vector to replace one of the
longer ASCII entries (e.g., "much longer string" or "another even longer
string") with a non-ASCII UTF-8 example such as "Über" or "你好世界" so that
expected_col/expected_view validate multi-byte UTF-8 handling; apply the same
change to the other analogous test block referenced (the second
data/validity/expected_col/expected_view group).
- Around line 554-576: The test constructs ArrowArray objects (child_array,
struct_array, and device_array) with release == nullptr which marks them as
already released per the Arrow C Data interface; instead provide a small no-op
release callback that, when invoked, sets array->release = nullptr so the arrays
are considered valid until released; update the assignments for
child_array.release, struct_array.release, and device_array.release (and the
later analogous block) to point to that no-op function so the buffers can be
safely consumed by cudf::from_arrow_device* during the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 68e6ec58-f580-4475-8a0e-d0520ba16f38
📒 Files selected for processing (1)
cpp/tests/interop/from_arrow_device_test.cpp
|
/ok to test 001c4cf |
|
The conda-python-other-tests failing seems unrelated to my PR? |
|
/ok to test e42b85d |
Description
Arrow C Device arrays can be produced by libraries other than Nanoarrow. In that case ArrowArray.private_data is producer-owned, so importing STRING_VIEW data must not call ArrowArrayBuffer() to discover variadic buffers via Nanoarrow's private_data layout.
Checklist