feat: support Spark-compatible json_tuple function #20412

Merged
comphead merged 2 commits into apache:main from CuteChuanChuan:raymond/spark-func-json-tuple
Feb 20, 2026

Contributor

@CuteChuanChuan commented Feb 17, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  • Add Spark-compatible json_tuple function in datafusion-spark crate
  • Function signature: json_tuple(json_string, key1, key2, ...) -> Struct<c0: Utf8, c1: Utf8, ...>
    • json_string: The JSON string to extract fields from
    • key1, key2, ...: Top-level field names to extract
    • Returns a Struct because DataFusion ScalarUDFs return one value per row; caller (Comet) destructures the fields

Examples

SELECT json_tuple('{"f1":"value1","f2":"value2","f3":3}', 'f1', 'f2', 'f3');
-- {c0: value1, c1: value2, c2: 3}

SELECT json_tuple('{"f1":"value1"}', 'f1', 'f2');
-- {c0: value1, c1: NULL}

SELECT json_tuple(NULL, 'f1');
-- NULL
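
The row-level behavior in the examples above can be modeled with a small std-only Rust sketch. This is not the PR's implementation: JSON parsing is elided, so the input object is represented as an already-parsed map whose values are rendered as text, and the names `field_names` and `json_tuple_row` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical helper: output field names `c0`, `c1`, ... as in the
/// signature `Struct<c0: Utf8, c1: Utf8, ...>`.
fn field_names(num_keys: usize) -> Vec<String> {
    (0..num_keys).map(|i| format!("c{i}")).collect()
}

/// Hypothetical row-level model of `json_tuple`.
/// `json` is `None` for a SQL NULL input. `Some(map)` stands in for an
/// already-parsed top-level JSON object (assumption: parsing is out of scope).
/// A NULL input yields a NULL struct; a missing key yields a NULL field.
fn json_tuple_row(
    json: Option<&HashMap<&str, &str>>,
    keys: &[&str],
) -> Option<Vec<Option<String>>> {
    let obj = json?;
    Some(
        keys.iter()
            .map(|key| obj.get(key).map(|value| value.to_string()))
            .collect(),
    )
}

fn main() {
    let obj = HashMap::from([("f1", "value1"), ("f2", "value2"), ("f3", "3")]);

    // All keys present: every field is populated, numbers included as text.
    let full = json_tuple_row(Some(&obj), &["f1", "f2", "f3"]);
    assert_eq!(
        full,
        Some(vec![
            Some("value1".to_string()),
            Some("value2".to_string()),
            Some("3".to_string()),
        ])
    );

    // Missing key: the corresponding field is NULL.
    let only_f1 = HashMap::from([("f1", "value1")]);
    let partial = json_tuple_row(Some(&only_f1), &["f1", "f2"]);
    assert_eq!(partial, Some(vec![Some("value1".to_string()), None]));

    // NULL JSON input: the whole struct is NULL.
    assert_eq!(json_tuple_row(None, &["f1"]), None);

    println!("{:?} -> {:?}", field_names(3), full);
}
```

The number of struct fields depends only on the number of key arguments, which is why the return shape can be checked without evaluating any rows.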

Are these changes tested?

  • Unit tests: return_field_from_args shape validation and too-few-args error
  • sqllogictest: test_files/spark/json/json_tuple.slt, test cases derived from Spark JsonExpressionsSuite

Are there any user-facing changes?

Yes.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels Feb 17, 2026
@CuteChuanChuan changed the title from "feat(spark): support Spark-compatible json_tuple function" to "feat: support Spark-compatible json_tuple function" Feb 17, 2026
@CuteChuanChuan force-pushed the raymond/spark-func-json-tuple branch from 8babbe9 to e0cf25c February 17, 2026 10:03
Contributor

@comphead left a comment

Thanks @CuteChuanChuan, this is a great PR. Let me go through the tests soon.

Contributor

@comphead left a comment

Double-checked, all 14 tests passed.

Please add some edge cases:

  • json_tuple(null, null)
  • json_tuple("", "")
  • json_tuple() with mixed upper/lower-case keys
  • JSON with UTF-8 content: Chinese, Cyrillic, etc.

Contributor

@Jefffrey left a comment

Returns a Struct because DataFusion ScalarUDFs return one value per row; caller (Comet) destructures the fields

Probably good to include this reasoning in the function's docstring so it's more visible (than looking through the history/commits)


let json_array = args[0]
    .as_any()
    .downcast_ref::<StringArray>()
Contributor

Can use as_string_array here for downcasting

Contributor Author

Thanks. Updated to use as_string_array.

    .collect::<Result<Vec<_>>>()?;

let mut builders: Vec<StringBuilder> = (0..num_fields)
    .map(|_| StringBuilder::with_capacity(num_rows, num_rows * 32))
Contributor

Why * 32 here?

Contributor Author

Removed the arbitrary * 32 heuristic and switched to StringBuilder::new() to let Arrow manage buffer growth

use datafusion_expr::ReturnFieldArgs;

#[test]
fn test_return_field_shape() {
Contributor

We could test this in SLT using arrow_typeof() function

Contributor Author

Added arrow_typeof() test in SLT.

- Use `as_string_array` instead of manual downcast
- Replace `StringBuilder::with_capacity(num_rows, num_rows * 32)` with `StringBuilder::new()` to let Arrow manage buffer growth
- Add `arrow_typeof()` SLT test to verify return type
- Add Struct return rationale to docstring
- Add edge case SLT tests: null/null, empty/empty, mixed case keys, UTF-8 Chinese and Cyrillic characters

@CuteChuanChuan
Contributor Author

Hi @comphead and @Jefffrey,
I appreciate the review. I added more edge cases and revised the places you pointed out.
PTAL when you have a chance. Thanks.

Comment on lines +38 to +40
/// Note: In Spark, `json_tuple` is a Generator that produces multiple columns directly.
/// In DataFusion, a ScalarUDF can only return one value per row, so the result is wrapped
/// in a Struct. The caller (e.g. Comet) is expected to destructure the struct fields.
Contributor

I wonder if this means this is technically a UDTF instead of a scalar UDF? Something to potentially explore in the future if we want closer compatibility with Spark (i.e., not requiring Comet to do the destructuring to handle this)

Contributor Author

Good point. In Spark, json_tuple is indeed a Generator (UDTF) that produces multiple columns directly. Using a UDTF in DataFusion would remove the need for Comet to destructure the Struct.
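
The destructuring step discussed here can be sketched in std-only Rust as a transpose from struct rows to per-field columns. This is a hypothetical model, not Comet's or DataFusion's code: the `StructRow` alias and the `destructure` function are illustrative names, and mapping a NULL struct to NULL in every output column is an assumption about how a caller might handle it.

```rust
/// One struct value per row: `None` models a NULL struct,
/// an inner `None` models a NULL field.
type StructRow = Option<Vec<Option<String>>>;

/// Hypothetical destructuring step: turn struct rows into one column per field.
/// Assumption: a NULL struct becomes NULL in every output column.
fn destructure(rows: &[StructRow], num_fields: usize) -> Vec<Vec<Option<String>>> {
    let mut columns: Vec<Vec<Option<String>>> =
        vec![Vec::with_capacity(rows.len()); num_fields];
    for row in rows {
        for (i, column) in columns.iter_mut().enumerate() {
            // Pick field `i` if the struct is non-NULL, otherwise emit NULL.
            let value = row
                .as_ref()
                .and_then(|fields| fields.get(i).cloned().flatten());
            column.push(value);
        }
    }
    columns
}

fn main() {
    let rows: Vec<StructRow> = vec![
        Some(vec![Some("value1".to_string()), None]),
        None,
    ];
    let columns = destructure(&rows, 2);
    assert_eq!(columns.len(), 2);
    assert_eq!(columns[0], vec![Some("value1".to_string()), None]);
    assert_eq!(columns[1], vec![None, None]);
    println!("{columns:?}");
}
```

A table function that produced the columns directly would make this extra pass unnecessary, which is the trade-off raised in the comment above.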

Contributor

@comphead left a comment

Thanks @CuteChuanChuan all 19 tests passed, thanks @Jefffrey for the review

@comphead added this pull request to the merge queue Feb 20, 2026
Merged via the queue into apache:main with commit 0f7a405 Feb 20, 2026
31 checks passed

@CuteChuanChuan
Contributor Author

Thanks @comphead and @Jefffrey for the valuable review.
