[SPARK-55242][PYSPARK] Handle np.ndarray elements in list-valued columns when converting from pandas #55196
Open
azmatsiddique wants to merge 7 commits into apache:master
When a pandas DataFrame contains list-valued columns (e.g. a column
created via `[[e] for e in ...]`), pandas 3 stores each list element
internally as a `np.ndarray` object rather than a plain Python list.
The existing `DataTypeOps.prepare()` method calls `col.replace({np.nan: None})`
on the pandas Series before passing it to Spark's `createDataFrame`.
When the Series has dtype "object" and its elements are `np.ndarray`
objects, pandas 3 raises:

```
ValueError: The truth value of an array is ambiguous. Use a.any() or a.all()
```

because numpy arrays cannot be compared with `==` in the way that `replace` needs.
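The ambiguity is easy to reproduce with plain NumPy (a minimal illustration, not code from this PR; the two-element array just makes the ambiguity unavoidable):

```python
import numpy as np

cell = np.array([1.0, 2.0])  # an ndarray stored as a cell value

# Elementwise comparison yields an array of booleans, not a single bool:
print(cell == np.nan)  # [False False]

# Anything that needs a single truth value from that array fails:
try:
    if cell == np.nan:
        print("unreachable")
except ValueError as e:
    print(e)  # "The truth value of an array with more than one element is ambiguous..."
```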
Fix: detect object-dtype columns whose non-null first element is a
`np.ndarray` and convert each such element to a plain Python list via
`.tolist()` before performing the NaN-to-None substitution. This also
ensures PyArrow correctly infers the column type as `ArrayType` for the
resulting Spark schema.
### Does this PR introduce _any_ user-facing change?
Yes, this is a bug fix: previously `ps.from_pandas(pdf)` with a
list-valued column raised an error; after the fix the call succeeds and
the data round-trips correctly.
### How was this patch tested?
Added `test_from_pandas_with_np_array_elements` in
`pyspark/pandas/tests/data_type_ops/test_complex_ops.py`, which
reproduces the exact scenario reported in SPARK-55242.
Closes #SPARK-55242
Force-pushed faadf8d to 1fc2051; added 6 commits on April 4, 2026 19:17:
- …ress Connect parity test warnings
- …eOps.prepare for better Connect serialization
- …ving numpy array elements
### What changes were proposed in this pull request?

In `DataTypeOps.prepare()` (`python/pyspark/pandas/data_type_ops/base.py`),
added a pre-processing step that detects object-dtype pandas Series whose
elements are `np.ndarray` objects and converts them to plain Python lists
via `.tolist()` before the existing `col.replace({np.nan: None})` call.

This is a targeted, minimal fix: the ndarray-to-list conversion only fires
when all three conditions hold:

- the Series dtype is `object`,
- the Series contains at least one non-null element, and
- the first non-null element is an `np.ndarray`.

### Why are the changes needed?
In pandas 3, when a DataFrame column is created from a list-of-lists
(e.g. `[[e] for e in ...]`), each element is stored internally as an
`np.ndarray` object rather than a plain Python list.
`DataTypeOps.prepare()` calls `col.replace({np.nan: None})`, which
internally compares every element with `np.nan` using `==`. Comparing an
`np.ndarray` with a scalar via `==` returns an array, not a bool, so
pandas raises:

```
ValueError: The truth value of an array is ambiguous. Use a.any() or a.all()
```

This makes `ps.from_pandas()` (and `ps.DataFrame()`, `ps.from_pandas(series)`,
etc.) crash whenever the input contains list-valued columns in a pandas 3
environment.
Reproducer (the DataFrame values below are illustrative; any list-valued
column triggers the bug):

```python
import numpy as np
import pandas as pd

import pyspark.pandas as ps

# Illustrative data: a list-valued column, as in the reported scenario
pdf = pd.DataFrame({"b": [[e] for e in [1.0, 2.0, 3.0]]}, index=[0.5, 1.5, 2.5])

ps.from_pandas(pdf)  # raises ValueError under pandas 3 without this fix
```
### Does this PR introduce _any_ user-facing change?

Yes. This is a bug fix.

Before: `ps.from_pandas(pdf)` with a list-valued column raised
`ValueError: The truth value of an array is ambiguous` on pandas 3.

After: the call succeeds and the DataFrame is created correctly, with
the list column properly inferred as `ArrayType` in the Spark schema.

This affects pandas 3 users only; the fix is backward-compatible with
earlier pandas versions.
### How was this patch tested?

Added `test_from_pandas_with_np_array_elements` in
`python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py`.
The test reproduces the exact scenario from SPARK-55242: a list-valued
column "b" (one list per row) with a float index, passed to
`ps.from_pandas(pdf)`, which previously raised `ValueError`.

### Was this patch authored or co-authored using generative AI tooling?

No