Field extraction for Parquet VARIANT columns (scalar + nested objects) #22416

Draft
vuule wants to merge 20 commits into rapidsai:main from vuule:pr2-variant-field-extraction-core

Contributor

@vuule vuule commented May 7, 2026

Description

Depends on #22310
Contributes to #22312

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the libcudf, Python, and CMake labels May 7, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 7, 2026
@vuule vuule force-pushed the pr2-variant-field-extraction-core branch from c0386bf to c02b0c3 May 11, 2026 18:01
@vuule vuule added the feature request and non-breaking labels May 11, 2026
@vuule vuule changed the title from "Variant field extraction APIs and implementation for simple cases" to "Variant field extraction APIs for scalars" May 11, 2026
Adds the scalar/object-descent slice of cudf::extract_variant_field on top of
the variant reader infra in rapidsai#22310. Provides three public APIs in
cudf::io::parquet::experimental:
  * get_variant_field     - extract raw VARIANT bytes at an object-key path
  * cast_variant          - decode VARIANT value blobs to STRING or INT8/16/32/64
  * extract_variant_field - convenience composition of the two
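
To illustrate the kind of decoding cast_variant performs for the integer targets, here is a minimal host-side sketch. It assumes the Parquet VARIANT primitive layout (a header byte with basic_type 0 in the low two bits and the type id in the upper six, with ids 3 to 6 for INT8/16/32/64, followed by a little-endian payload). `decode_variant_int` is a hypothetical helper, not one of the PR's APIs, and it makes no claim about how the GPU implementation is structured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper: decodes one VARIANT primitive integer into int64_t.
// Returns nullopt for non-integer values or a truncated buffer.
std::optional<std::int64_t> decode_variant_int(std::uint8_t const* data, std::size_t size)
{
  if (size == 0) { return std::nullopt; }
  std::uint8_t const header = data[0];
  if ((header & 0x3u) != 0) { return std::nullopt; }  // basic_type 0 = primitive
  unsigned const type_id = header >> 2;
  if (type_id < 3 || type_id > 6) { return std::nullopt; }  // not INT8..INT64

  std::size_t const width = std::size_t{1} << (type_id - 3);  // 1, 2, 4, or 8 bytes
  if (size < 1 + width) { return std::nullopt; }

  // Assemble the little-endian payload byte by byte.
  std::uint64_t raw = 0;
  for (std::size_t i = 0; i < width; ++i) {
    raw |= std::uint64_t{data[1 + i]} << (8 * i);
  }
  // Sign-extend narrower integers to 64 bits.
  if (width < 8) {
    std::uint64_t const sign_bit = std::uint64_t{1} << (8 * width - 1);
    if (raw & sign_bit) { raw |= ~((sign_bit << 1) - 1); }
  }
  std::int64_t value = 0;
  std::memcpy(&value, &raw, sizeof(value));  // bit-preserving reinterpretation
  return value;
}
```

A caller would pass the raw bytes returned for one row; any non-integer value, such as a string, yields an empty optional in this sketch.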

Path grammar is intentionally limited to $?(.name)+ for Phase A; bracket steps
and array indexing are reserved for a follow-on phase. The parser rejects
bracket/quoted/index/wildcard syntax with std::invalid_argument.
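
As a rough illustration of the $?(.name)+ grammar, the sketch below parses a path into its key steps on the host. `parse_variant_path` is a hypothetical name, and the set of characters it treats as unsupported syntax (`[`, `]`, quotes, `*`) is an assumption made for the example; the PR's parser may draw the line differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits a path matching $?(.name)+ into its names.
// Throws std::invalid_argument for anything outside that grammar.
std::vector<std::string> parse_variant_path(std::string_view path)
{
  std::size_t pos = 0;
  if (!path.empty() && path.front() == '$') { pos = 1; }  // optional root marker

  std::vector<std::string> keys;
  while (pos < path.size()) {
    if (path[pos] != '.') { throw std::invalid_argument("expected '.' before a field name"); }
    ++pos;
    std::size_t const start = pos;
    while (pos < path.size() && path[pos] != '.') {
      char const c = path[pos];
      // Assumed rejection set for bracket, quoted, index, and wildcard steps.
      if (c == '[' || c == ']' || c == '\'' || c == '"' || c == '*') {
        throw std::invalid_argument("bracket, quoted, index, and wildcard steps are unsupported");
      }
      ++pos;
    }
    if (pos == start) { throw std::invalid_argument("empty field name"); }
    keys.emplace_back(path.substr(start, pos - start));
  }
  if (keys.empty()) { throw std::invalid_argument("path needs at least one .name step"); }
  return keys;
}
```

With this sketch, `$.a.b` and `.a.b` both yield the steps `a` and `b`, while `$`, `a`, `$.a.`, and `$.a[0]` throw.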

Implementation:
  * Two-pass kernel: sizing pass parses the path per row, memoizes the
    intra-blob source offset, and writes the per-row null mask; the copy
    pass is a pure gather using the memoized offsets.
  * Object-field lookup scans all n+1 field offsets to find the tightest end,
    i.e. the smallest offset past the field's start (Variant field values can
    be stored out of order).
  * extract_variant_field validates only once and goes through the detail::
    overloads to avoid revalidating the well-formed intermediate.
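
The sizing/copy split and the tightest-end lookup can be sketched on the CPU as follows. This is a simplified model, not the PR's kernel: `ObjectView` is a hypothetical, already-decoded view of one object per row (the real code reads the binary encoding and resolves keys through the metadata dictionary), the loops are serial where the PR uses GPU threads, and only a single-step path is handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

// Hypothetical decoded view of one VARIANT object (not the wire format).
struct ObjectView {
  std::vector<std::uint32_t> field_ids;      // n dictionary ids
  std::vector<std::uint32_t> field_offsets;  // n + 1 offsets into `values`
  std::string values;                        // concatenated value bytes
};

struct Span { std::uint32_t begin; std::uint32_t end; };

// Values may be stored out of order, so a field's end is the smallest offset
// strictly greater than its start, searched across all n + 1 offsets.
std::optional<Span> find_field(ObjectView const& obj, std::uint32_t id)
{
  constexpr auto none = std::numeric_limits<std::uint32_t>::max();
  for (std::size_t i = 0; i < obj.field_ids.size(); ++i) {
    if (obj.field_ids[i] != id) { continue; }
    std::uint32_t const begin = obj.field_offsets[i];
    std::uint32_t end = none;
    for (std::uint32_t off : obj.field_offsets) {
      if (off > begin && off < end) { end = off; }
    }
    if (end == none) { return std::nullopt; }  // malformed: no offset past start
    return Span{begin, end};
  }
  return std::nullopt;  // key absent in this row
}

struct Extracted {
  std::vector<std::uint32_t> offsets;  // rows + 1 output offsets
  std::vector<bool> valid;             // per-row validity (null mask)
  std::string bytes;                   // gathered value bytes
};

Extracted gather_field(std::vector<ObjectView> const& rows, std::uint32_t id)
{
  std::size_t const n = rows.size();
  std::vector<std::uint32_t> src_begin(n, 0), sizes(n, 0);
  Extracted out;
  out.valid.assign(n, false);

  // Pass 1 (sizing): locate the field, memoize its source offset, set validity.
  for (std::size_t i = 0; i < n; ++i) {
    if (auto const span = find_field(rows[i], id)) {
      src_begin[i] = span->begin;
      sizes[i]     = span->end - span->begin;
      out.valid[i] = true;
    }
  }

  // Sizes become output offsets via a prefix sum.
  out.offsets.assign(n + 1, 0);
  std::partial_sum(sizes.begin(), sizes.end(), out.offsets.begin() + 1);
  out.bytes.resize(out.offsets[n]);

  // Pass 2 (copy): a pure gather that reuses the memoized offsets.
  for (std::size_t i = 0; i < n; ++i) {
    std::copy_n(rows[i].values.begin() + src_begin[i], sizes[i],
                out.bytes.begin() + out.offsets[i]);
  }
  return out;
}
```

Because the copy pass only reads the memoized offsets and sizes, it never re-parses the path; rows where the key is missing contribute zero bytes and a cleared validity bit.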

Includes 40 GTests covering null handling, multi-row mixed shapes, the Apache
parquet-testing object fixtures, dictionary/object scans 10x larger than the
unit fixtures, and a 128-row deep-path null-mask race test.
@vuule vuule force-pushed the pr2-variant-field-extraction-core branch from cd1ab15 to 7b39901 May 12, 2026 01:51
@vuule vuule changed the title from "Variant field extraction APIs for scalars" to "Add GPU-accelerated field extraction for Parquet VARIANT columns (scalar + nested objects)" May 12, 2026
@vuule vuule changed the title from "Add GPU-accelerated field extraction for Parquet VARIANT columns (scalar + nested objects)" to "Field extraction for Parquet VARIANT columns (scalar + nested objects)" May 12, 2026
Labels

CMake: CMake build issue; feature request: New feature or request; libcudf: Affects libcudf (C++/CUDA) code; non-breaking: Non-breaking change; Python: Affects Python cuDF API

Projects

Status: In Progress

2 participants