[SPARK-55242][PYSPARK] Handle np.ndarray elements in list-valued columns when converting from pandas #55196

Status: Open

azmatsiddique wants to merge 7 commits into apache:master from
azmatsiddique:SPARK-55242-pyspark-pandas-np-array-from-list

Conversation

@azmatsiddique

### What changes were proposed in this pull request?
In `DataTypeOps.prepare()` (`python/pyspark/pandas/data_type_ops/base.py`),
this PR adds a pre-processing step that detects object-dtype pandas Series
whose elements are `np.ndarray` objects and converts them to plain Python
lists via `.tolist()` before the existing `col.replace({np.nan: None})` call.

This is a targeted, minimal fix: the ndarray-to-list conversion only fires
when all three conditions hold:

  1. The Series dtype is `object`
  2. The Series is non-empty
  3. The first non-null element is an `np.ndarray`

### Why are the changes needed?
In pandas 3, when a DataFrame column is created from a list-of-lists
(e.g. [[e] for e in ...]), each element is stored internally as a
np.ndarray object rather than a plain Python list.

`DataTypeOps.prepare()` calls `col.replace({np.nan: None})`, which
internally compares every element with `np.nan` using `==`. Comparing an
`np.ndarray` with a scalar via `==` returns an element-wise boolean array,
not a single bool, so pandas raises:

    ValueError: The truth value of an array is ambiguous.
    Use a.any() or a.all()

This makes `ps.from_pandas()` (and related entry points such as
`ps.DataFrame()`) crash whenever the input contains list-valued columns
in a pandas 3 environment.
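The failure mode can be reproduced in isolation, without Spark or pandas (a minimal sketch of the underlying numpy behavior):

```python
import numpy as np

# Comparing an ndarray with a scalar yields an element-wise boolean array,
# not a single bool.
cmp = np.array([4, 5]) == np.nan
print(type(cmp).__name__)  # ndarray

# Using that array where a single bool is expected raises the ValueError
# quoted above.
try:
    if cmp:
        pass
except ValueError as exc:
    print(exc)
```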

Reproducer:

    import numpy as np
    import pandas as pd
    import pyspark.pandas as ps

    pdf = pd.DataFrame(
        {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
         "b": [[e] for e in [4, 5, 6, 3, 2, 1, 0, 0, 0]]},
        index=np.random.rand(9),
    )
    psdf = ps.from_pandas(pdf)  # raises ValueError on pandas 3

### Does this PR introduce _any_ user-facing change?
Yes. This is a bug fix.

Before: `ps.from_pandas(pdf)` with a list-valued column raised
`ValueError: The truth value of an array is ambiguous` on pandas 3.

After: the call succeeds and the DataFrame is created correctly, with
the list column properly inferred as `ArrayType` in the Spark schema.

This affects pandas 3 users only; the fix is backward-compatible with
earlier pandas versions.

### How was this patch tested?
Added `test_from_pandas_with_np_array_elements` in
`python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py`.

The test reproduces the exact scenario from SPARK-55242:

  • Creates a pandas DataFrame with integer column "a" and a
    list-valued column "b" (one list per row) with a float index.
  • Calls ps.from_pandas(pdf) — this previously raised ValueError.
  • Asserts that column "a" round-trips correctly.
  • Asserts that column "b" has the expected number of rows.

### Was this patch authored or co-authored using generative AI tooling?
No.

…umns when converting from pandas

When a pandas DataFrame contains list-valued columns (e.g. a column
created via `[[e] for e in ...]`), pandas 3 stores each list element
internally as a `np.ndarray` object rather than a plain Python list.

The existing `DataTypeOps.prepare()` method calls:

    col.replace({np.nan: None})

on the pandas Series before passing it to Spark's `createDataFrame`.
When the Series has dtype "object" and its elements are `np.ndarray`
objects, pandas 3 raises:

    ValueError: The truth value of an array is ambiguous.
    Use a.any() or a.all()

because numpy arrays cannot be compared with `==` in the way that
`replace` needs.

Fix: detect object-dtype columns whose non-null first element is a
`np.ndarray` and convert each such element to a plain Python list via
`.tolist()` before performing the NaN-to-None substitution.  This also
ensures PyArrow correctly infers the column type as `ArrayType` for the
resulting Spark schema.

### Does this PR introduce _any_ user-facing change?
No; this is a regression fix. Previously `ps.from_pandas(pdf)` with a
list-valued column raised an error; after the fix it succeeds and the
data round-trips correctly.

### How was this patch tested?
Added `test_from_pandas_with_np_array_elements` in
`pyspark/pandas/tests/data_type_ops/test_complex_ops.py`, which
reproduces the exact scenario reported in SPARK-55242.

Closes #SPARK-55242
@azmatsiddique force-pushed the SPARK-55242-pyspark-pandas-np-array-from-list branch from faadf8d to 1fc2051 on April 4, 2026 at 11:45