Field extraction for Parquet VARIANT columns (scalar + nested objects) #22416
Draft
vuule wants to merge 20 commits into
c0386bf to c02b0c3
Adds the scalar/object-descent slice of cudf::extract_variant_field on top of the variant reader infra in rapidsai#22310.

Provides three public APIs in cudf::io::parquet::experimental:
* get_variant_field - extract raw VARIANT bytes at an object-key path
* cast_variant - decode VARIANT value blobs to STRING or INT8/16/32/64
* extract_variant_field - convenience composition of the two

Path grammar is intentionally limited to $?(.name)+ for Phase A; bracket steps and array indexing are reserved for a follow-on phase. The parser rejects bracket/quoted/index/wildcard syntax with std::invalid_argument.

Implementation:
* Two-pass kernel: the sizing pass parses the path per row, memoizes the intra-blob source offset, and writes the per-row null mask; the copy pass is a pure gather using the memoized offsets.
* Object-field lookup walks all n+1 field offsets to find the tightest end (Variant fields can be stored out-of-order).
* extract_variant_field validates only once and goes through the detail:: overloads to avoid revalidating the well-formed intermediate.

Includes 40 GTests covering null handling, multi-row mixed shapes, the Apache parquet-testing object fixtures, dictionary/object scans 10x larger than the unit fixtures, and a 128-row deep-path null-mask race test.
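For reviewers, the path grammar and the "tightest end" lookup described above can be sketched on the host roughly as follows. This is an illustrative sketch only, not the code in this PR: parse_variant_path and tightest_field_end are hypothetical names, the grammar is read literally (optional $, then one or more .name steps), and the set of characters allowed in a name is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: split a path matching $?(.name)+ into its key names.
// Assumption: a name is any non-empty run of characters other than '.',
// brackets, quotes, and '*'. Everything else throws std::invalid_argument.
std::vector<std::string> parse_variant_path(std::string_view path)
{
  std::size_t pos = 0;
  if (!path.empty() && path[0] == '$') { pos = 1; }  // optional root marker
  std::vector<std::string> keys;
  while (pos < path.size()) {
    if (path[pos] != '.') { throw std::invalid_argument("expected '.' before a field name"); }
    ++pos;
    auto const start = pos;
    while (pos < path.size() && path[pos] != '.') {
      char const c = path[pos];
      if (c == '[' || c == ']' || c == '\'' || c == '"' || c == '*') {
        throw std::invalid_argument("bracket, quoted, index, and wildcard steps are unsupported");
      }
      ++pos;
    }
    if (pos == start) { throw std::invalid_argument("empty field name"); }
    keys.emplace_back(path.substr(start, pos - start));
  }
  assert(pos == path.size());  // the whole path was consumed
  if (keys.empty()) { throw std::invalid_argument("path needs at least one .name step"); }
  return keys;
}

// Hypothetical helper: offsets holds n field starts followed by one
// end-of-values offset (n+1 entries in total). Because values may be stored
// out of order, the end of field i is the smallest offset strictly greater
// than offsets[i], found by scanning every entry.
std::uint32_t tightest_field_end(std::vector<std::uint32_t> const& offsets, std::size_t i)
{
  if (offsets.size() < 2 || i + 1 >= offsets.size()) {
    throw std::out_of_range("field index out of range");
  }
  auto const start = offsets[i];
  auto end         = std::numeric_limits<std::uint32_t>::max();
  bool found       = false;
  for (auto const off : offsets) {
    if (off > start && off < end) {
      end   = off;
      found = true;
    }
  }
  if (!found) { throw std::invalid_argument("malformed offsets: no end for field"); }
  return end;
}
```

Under these assumptions, parse_variant_path("$.a.b") yields the keys a and b, while "$.a[0]" and a bare "$" throw. The linear scan in tightest_field_end trades an O(n) walk per lookup for not having to sort or assume monotonic offsets.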
cd1ab15 to 7b39901
Description
Depends on #22310
Contributes to #22312
Checklist