feat: support Spark-compatible `json_tuple` function by CuteChuanChuan · Pull Request #20412 · apache/datafusion

CuteChuanChuan · 2026-02-17T09:58:44Z

Which issue does this PR close?

Part of [EPIC] Complete datafusion-spark Spark Compatible Functions #15914
Related comet issue: [Feature] Support Spark expression: json_tuple datafusion-comet#3160

Rationale for this change

Apache Spark's json_tuple extracts top-level fields from a JSON string.
This function is used in Spark SQL and needed for DataFusion-Comet compatibility.
Reference: https://spark.apache.org/docs/latest/api/sql/index.html#json_tuple

What changes are included in this PR?

Add Spark-compatible json_tuple function in datafusion-spark crate
Function signature: json_tuple(json_string, key1, key2, ...) -> Struct<c0: Utf8, c1: Utf8, ...>
- json_string: The JSON string to extract fields from
- key1, key2, ...: Top-level field names to extract
- Returns a Struct because DataFusion ScalarUDFs return one value per row; caller (Comet) destructures the fields
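
The return shape described above can be sketched without any DataFusion types. In this minimal sketch, `struct_field_names` and `return_type_label` are hypothetical helpers written for illustration only (they are not the PR's code); they just show how N requested keys map to fields `c0` through `c{N-1}`, each of type Utf8.

```rust
/// Hypothetical helper: struct field names for `num_keys` requested keys.
fn struct_field_names(num_keys: usize) -> Vec<String> {
    (0..num_keys).map(|i| format!("c{i}")).collect()
}

/// Hypothetical helper: renders the return type in the notation used above.
fn return_type_label(num_keys: usize) -> String {
    let fields: Vec<String> = struct_field_names(num_keys)
        .into_iter()
        .map(|name| format!("{name}: Utf8"))
        .collect();
    format!("Struct<{}>", fields.join(", "))
}

fn main() {
    // json_tuple(json_string, 'f1', 'f2', 'f3') requests three keys.
    println!("{}", return_type_label(3));
}
```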

Examples

SELECT json_tuple('{"f1":"value1","f2":"value2","f3":3}', 'f1', 'f2', 'f3');
-- {c0: value1, c1: value2, c2: 3}

SELECT json_tuple('{"f1":"value1"}', 'f1', 'f2');
-- {c0: value1, c1: NULL}

SELECT json_tuple(NULL, 'f1');
-- NULL
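
The NULL handling in these examples can be modeled per row as follows. This is only a sketch under stated assumptions: JSON parsing is out of scope, so a `HashMap` stands in for the already-parsed top-level object, and conversion of non-string values (such as the number `3` above) to strings is not modeled. `json_tuple_row` is a hypothetical name, not a function from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical per-row model. `None` for `fields` models a NULL JSON input.
/// The map is an assumed stand-in for the parsed top-level JSON object.
fn json_tuple_row(
    fields: Option<&HashMap<&str, &str>>,
    keys: &[&str],
) -> Option<Vec<Option<String>>> {
    // A NULL input yields a NULL struct for the whole row.
    let fields = fields?;
    // Each requested key yields its value, or NULL when the key is absent.
    Some(
        keys.iter()
            .map(|key| fields.get(key).map(|value| (*value).to_string()))
            .collect(),
    )
}

fn main() {
    let mut parsed = HashMap::new();
    parsed.insert("f1", "value1");

    // Mirrors json_tuple('{"f1":"value1"}', 'f1', 'f2').
    println!("{:?}", json_tuple_row(Some(&parsed), &["f1", "f2"]));
    // Mirrors json_tuple(NULL, 'f1').
    println!("{:?}", json_tuple_row(None, &["f1"]));
}
```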

Are these changes tested?

Unit tests: return_field_from_args shape validation and too-few-args error
sqllogictest: test_files/spark/json/json_tuple.slt, test cases derived from Spark JsonExpressionsSuite

Are there any user-facing changes?

Yes.

comphead

Thanks @CuteChuanChuan, this is a great PR. Let me go through the tests soon.

comphead

Double-checked: all 14 tests passed.

Please add some edge cases:

- json_tuple(null, null)
- json_tuple("", "")
- json_tuple() with mixed upper/lower-case keys
- JSON with UTF-8 content: Chinese, Cyrillic, etc.

Jefffrey

Returns a Struct because DataFusion ScalarUDFs return one value per row; caller (Comet) destructures the fields

Probably good to include this reasoning in the docstring of the function so it's more visible (than looking at history/commit)

Jefffrey · 2026-02-20T02:39:19Z

datafusion/spark/src/function/json/json_tuple.rs

+
+    let json_array = args[0]
+        .as_any()
+        .downcast_ref::<StringArray>()


Can use as_string_array here for downcasting

Thanks. Updated to use as_string_array.

Jefffrey · 2026-02-20T02:39:27Z

datafusion/spark/src/function/json/json_tuple.rs

+        .collect::<Result<Vec<_>>>()?;
+
+    let mut builders: Vec<StringBuilder> = (0..num_fields)
+        .map(|_| StringBuilder::with_capacity(num_rows, num_rows * 32))


Why * 32 here?

Removed the arbitrary * 32 heuristic and switched to StringBuilder::new() to let Arrow manage buffer growth

Jefffrey · 2026-02-20T02:41:02Z

datafusion/spark/src/function/json/json_tuple.rs

+    use datafusion_expr::ReturnFieldArgs;
+
+    #[test]
+    fn test_return_field_shape() {


We could test this in SLT using arrow_typeof() function

Added arrow_typeof() test in SLT.

- Use `as_string_array` instead of manual downcast
- Replace `StringBuilder::with_capacity(num_rows, num_rows * 32)` with `StringBuilder::new()` to let Arrow manage buffer growth
- Add `arrow_typeof()` SLT test to verify return type
- Add Struct return rationale to docstring
- Add edge case SLT tests: null/null, empty/empty, mixed case keys, UTF-8 Chinese and Cyrillic characters

CuteChuanChuan · 2026-02-20T09:25:04Z

Hi @comphead and @Jefffrey,
I appreciate the review. I added more edge cases and revised the places you pointed out.
PTAL when you have a chance. Thanks.

Jefffrey · 2026-02-20T09:57:52Z

datafusion/spark/src/function/json/json_tuple.rs

+/// Note: In Spark, `json_tuple` is a Generator that produces multiple columns directly.
+/// In DataFusion, a ScalarUDF can only return one value per row, so the result is wrapped
+/// in a Struct. The caller (e.g. Comet) is expected to destructure the struct fields.


I wonder if this means this is technically a UDTF instead of a scalar UDF? Something to potentially explore in future if we want closer compatibility with Spark (i.e. don't require comet to do the destructuring to handle this)

Good point. In Spark, json_tuple is indeed a Generator (UDTF) that produces multiple columns directly. Using a UDTF in DataFusion would remove the need for Comet to destructure the Struct.
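
To make the destructuring step concrete, here is a minimal sketch of how a caller might split the struct result into separate columns. It uses plain Rust vectors rather than Arrow arrays, `destructure` is a hypothetical helper, and the choice to turn a NULL struct into NULL in every output column is an assumption for illustration, not a statement of what Comet does.

```rust
/// One struct value per row; `None` models a NULL struct.
type StructRow = Option<Vec<Option<String>>>;

/// Hypothetical helper: split a struct column into `num_fields` columns.
fn destructure(rows: &[StructRow], num_fields: usize) -> Vec<Vec<Option<String>>> {
    let mut columns: Vec<Vec<Option<String>>> = vec![Vec::new(); num_fields];
    for row in rows {
        for (i, column) in columns.iter_mut().enumerate() {
            // Assumption: a NULL struct or a missing field becomes NULL here.
            let value = row
                .as_ref()
                .and_then(|fields| fields.get(i).cloned().flatten());
            column.push(value);
        }
    }
    columns
}

fn main() {
    let rows: Vec<StructRow> = vec![
        Some(vec![Some("value1".to_string()), None]), // {c0: value1, c1: NULL}
        None,                                         // NULL struct
    ];
    for (i, column) in destructure(&rows, 2).iter().enumerate() {
        println!("c{i}: {column:?}");
    }
}
```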

comphead

Thanks @CuteChuanChuan, all 19 tests passed. Thanks @Jefffrey for the review.

CuteChuanChuan · 2026-02-20T18:42:56Z

Thanks @comphead and @Jefffrey for the valuable review.

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels Feb 17, 2026

CuteChuanChuan changed the title ~~feat(spark): support Spark-compatible json_tuple function~~ feat: support Spark-compatible json_tuple function Feb 17, 2026

CuteChuanChuan force-pushed the raymond/spark-func-json-tuple branch from 8babbe9 to e0cf25c Compare February 17, 2026 10:03

feat: support Spark-compatible json_tuple function

6d6c2a0

CuteChuanChuan force-pushed the raymond/spark-func-json-tuple branch from e0cf25c to 6d6c2a0 Compare February 17, 2026 10:12

davidlghellin mentioned this pull request Feb 18, 2026

feat: add json_tuple function lakehq/sail#1224

Draft

comphead reviewed Feb 19, 2026

comphead reviewed Feb 20, 2026

Jefffrey reviewed Feb 20, 2026

Jefffrey approved these changes Feb 20, 2026

comphead approved these changes Feb 20, 2026

comphead added this pull request to the merge queue Feb 20, 2026

Merged via the queue into apache:main with commit 0f7a405 Feb 20, 2026
31 checks passed

3 participants
