[SPARK-55242][PYSPARK] Handle np.ndarray elements in list-valued columns when converting from pandas #55196

Status: Open

azmatsiddique wants to merge 7 commits into apache:master from
azmatsiddique:SPARK-55242-pyspark-pandas-np-array-from-list

Conversation

@azmatsiddique

### What changes were proposed in this pull request?
In `DataTypeOps.prepare()` (`python/pyspark/pandas/data_type_ops/base.py`),
this PR adds a pre-processing step that detects object-dtype pandas Series
whose elements are `np.ndarray` objects and converts them to plain Python
lists via `.tolist()` before the existing `col.replace({np.nan: None})` call.

This is a targeted, minimal fix: the ndarray-to-list conversion only fires
when all three conditions hold:

  1. The Series dtype is `object`
  2. The Series is non-empty
  3. The first non-null element is an `np.ndarray`

### Why are the changes needed?
In pandas 3, when a DataFrame column is created from a list-of-lists
(e.g. [[e] for e in ...]), each element is stored internally as a
np.ndarray object rather than a plain Python list.

`DataTypeOps.prepare()` calls `col.replace({np.nan: None})`, which
internally compares every element with `np.nan` using `==`. Comparing an
`np.ndarray` with a scalar via `==` returns an element-wise boolean array,
not a single bool, so pandas raises:

    ValueError: The truth value of an array is ambiguous.
    Use a.any() or a.all()

This makes `ps.from_pandas()` (and related entry points such as
`ps.DataFrame()`) crash whenever the input contains list-valued columns
in a pandas 3 environment.
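The failure mode can be reproduced in isolation, without Spark or pandas (a minimal sketch of the underlying numpy behavior):

```python
import numpy as np

# Comparing an ndarray with a scalar yields an element-wise boolean array,
# not a single bool.
cmp = np.array([4, 5]) == np.nan
print(type(cmp).__name__)  # ndarray

# Using that array where a single bool is expected raises the ValueError
# quoted above.
try:
    if cmp:
        pass
except ValueError as exc:
    print(exc)
```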

Reproducer:

    import numpy as np
    import pandas as pd
    import pyspark.pandas as ps

    pdf = pd.DataFrame(
        {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
         "b": [[e] for e in [4, 5, 6, 3, 2, 1, 0, 0, 0]]},
        index=np.random.rand(9),
    )
    psdf = ps.from_pandas(pdf)  # raises ValueError on pandas 3

### Does this PR introduce _any_ user-facing change?
Yes. This is a bug fix.

Before: `ps.from_pandas(pdf)` with a list-valued column raised
`ValueError: The truth value of an array is ambiguous` on pandas 3.

After: the call succeeds and the DataFrame is created correctly, with
the list column properly inferred as `ArrayType` in the Spark schema.

This affects pandas 3 users only; the fix is backward-compatible with
earlier pandas versions.

### How was this patch tested?
Added `test_from_pandas_with_np_array_elements` in
`python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py`.

The test reproduces the exact scenario from SPARK-55242:

  • Creates a pandas DataFrame with integer column "a" and a
    list-valued column "b" (one list per row) with a float index.
  • Calls ps.from_pandas(pdf) — this previously raised ValueError.
  • Asserts that column "a" round-trips correctly.
  • Asserts that column "b" has the expected number of rows.

### Was this patch authored or co-authored using generative AI tooling?
No.

…umns when converting from pandas

When a pandas DataFrame contains list-valued columns (e.g. a column
created via `[[e] for e in ...]`), pandas 3 stores each list element
internally as a `np.ndarray` object rather than a plain Python list.

The existing `DataTypeOps.prepare()` method calls:

    col.replace({np.nan: None})

on the pandas Series before passing it to Spark's `createDataFrame`.
When the Series has dtype "object" and its elements are `np.ndarray`
objects, pandas 3 raises:

    ValueError: The truth value of an array is ambiguous.
    Use a.any() or a.all()

because numpy arrays cannot be compared with `==` in the way that
`replace` needs.

Fix: detect object-dtype columns whose non-null first element is a
`np.ndarray` and convert each such element to a plain Python list via
`.tolist()` before performing the NaN-to-None substitution.  This also
ensures PyArrow correctly infers the column type as `ArrayType` for the
resulting Spark schema.

### Does this PR introduce _any_ user-facing change?
No; this is a regression fix. Previously `ps.from_pandas(pdf)` with a
list-valued column raised an error; after the fix it succeeds and the
data round-trips correctly.

### How was this patch tested?
Added `test_from_pandas_with_np_array_elements` in
`pyspark/pandas/tests/data_type_ops/test_complex_ops.py`, which
reproduces the exact scenario reported in SPARK-55242.

Closes #SPARK-55242
@azmatsiddique force-pushed the SPARK-55242-pyspark-pandas-np-array-from-list branch from faadf8d to 1fc2051 on April 4, 2026 at 11:45