Skip to content

Conversation

@fjosw
Copy link
Owner

@fjosw fjosw commented Feb 3, 2026

Fix pandas 3 compatibility for DataFrame serialization/deserialization

  • Handle new str dtype for string columns (pandas 3 uses str instead of object)
  • Use .iloc[i] for positional indexing (fixes deprecation warnings)
  • Use pd.isna() instead of x is None for null checks (pandas 3 string columns store nulls as nan)
  • Add _is_null() helper to safely handle null checks on columns containing lists
  • Convert NA values back to Python None after deserialization for API compatibility

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request addresses pandas 3 compatibility issues in the DataFrame serialization/deserialization functionality. The changes handle the new string dtype behavior in pandas 3 and update null checking mechanisms to work with both pandas NA values and Python None.

Changes:

  • Updated DataFrame column iteration to include both 'object' and 'str' dtypes for pandas 3+ compatibility
  • Replaced direct indexing (col[i]) with positional indexing (col.iloc[i]) throughout to avoid deprecation warnings
  • Replaced x is None checks with pd.isna(x) for null value detection and added _is_null() helper function for safe list handling
  • Added conversion of pandas NA values back to Python None after deserialization for API compatibility
  • Removed -Werror flag from pytest workflow for non-3.14 Python versions

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

File Description
pyerrors/input/pandas.py Updates serialization/deserialization functions to handle pandas 3 string types and NA values, adds _is_null() helper, converts NA to None post-deserialization
.github/workflows/pytest.yml Removes -Werror flag from pytest for Python versions other than 3.14

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if isinstance(df[column][i], list):
df[column].apply(lambda x: [o.gm() if o is not None else x for o in x])
if isinstance(df[column].iloc[i], list):
df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in line 186 is incorrect for applying .gm() to items in a list. When checking 'not pd.isna(o)' for list elements, this will raise an error because 'o' is an Obs object, not a pandas value. The pd.isna() check should be changed to handle Obs objects properly, or should use 'o is not None' instead.

Copilot uses AI. Check for mistakes.

def _is_null(val):
"""Check if a value is null (None or NA), handling list/array values."""
return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _is_null function returns False for lists and numpy arrays, which means that empty lists ([]) will be treated as non-null. This could lead to unexpected behavior if a column contains empty lists as placeholders for null values. Consider whether empty lists should be treated as null values based on the use case.

Suggested change
return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)
# Treat empty lists/arrays (and containers whose elements are all null) as null.
if isinstance(val, list):
if len(val) == 0:
return True
# A list is null only if all its elements are null.
return all(_is_null(v) for v in val)
if isinstance(val, np.ndarray):
if val.size == 0:
return True
# For object-dtype arrays, check elementwise using _is_null.
if val.dtype == object:
return all(_is_null(v) for v in val)
# For non-object arrays, rely on pandas/numpy NA detection.
return bool(np.all(pd.isna(val)))
return pd.isna(val)

Copilot uses AI. Check for mistakes.
out[column] = out[column].transform(lambda x: create_json_string(x, indent=0) if not _is_null(x) else None)
if gz is True:
out[column] = out[column].transform(lambda x: gzip.compress((x if x is not None else '').encode('utf-8')))
out[column] = out[column].transform(lambda x: gzip.compress(x.encode('utf-8')) if not _is_null(x) else gzip.compress(b''))
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When x is None/null, gzip.compress(b'') is called which compresses an empty byte string. This differs from the original behavior where None would be passed through. This change should be verified against the deserialization logic to ensure that compressed empty strings are correctly handled as null values when deserializing.

Copilot uses AI. Check for mistakes.
for column in df.select_dtypes(include=string_like_dtypes):
if isinstance(df[column].iloc[0], bytes):
if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition 'if x else' for checking null values in gzip decompression is imprecise. In pandas 3 with string dtypes, x could be pd.NA which would evaluate to an error in a boolean context. Use 'if not pd.isna(x)' instead of 'if x' for consistency with the rest of the changes.

Suggested change
df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if not pd.isna(x) else '')

Copilot uses AI. Check for mistakes.
Comment on lines +172 to +174
if isinstance(df[column].iloc[0], bytes):
if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError when the column is empty or when all values are NaN/None. Before accessing iloc[0], verify that the column has at least one non-null value. This can happen when processing empty DataFrames or columns with all null values.

Suggested change
if isinstance(df[column].iloc[0], bytes):
if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
col_series = df[column]
# Skip empty columns to avoid IndexError on iloc[0]
if col_series.empty:
continue
# Find the first non-null value for type inspection
first_valid_index = col_series.first_valid_index()
first_value = col_series.loc[first_valid_index] if first_valid_index is not None else None
if isinstance(first_value, bytes):
if first_value.startswith(b"\x1f\x8b\x08\x00"):
df[column] = col_series.transform(
lambda x: gzip.decompress(x).decode('utf-8')
if isinstance(x, (bytes, bytearray)) and x
else ''
)

Copilot uses AI. Check for mistakes.
Comment on lines +185 to +188
if isinstance(df[column].iloc[i], list):
df[column].apply(lambda x: [o.gm() if not pd.isna(o) else x for o in x])
else:
df[column].apply(lambda x: x.gm() if x is not None else x)
df[column].apply(lambda x: x.gm() if not pd.isna(x) else x)
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The apply() function returns a new Series but the result is not assigned back to df[column]. This means the .gm() calls have no effect. The code should be: df[column] = df[column].apply(...) to store the result.

Copilot uses AI. Check for mistakes.

if not all([e is None for e in df[column]]):
# In pandas 3+, string columns use 'str' dtype instead of 'object'
string_like_dtypes = ["object", "str"] if int(pd.__version__.split(".")[0]) >= 3 else ["object"]
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version parsing could fail if the pandas version string doesn't follow the expected format (e.g., development versions like '3.0.0rc1' or '3.0.0.dev0'). Consider using a more robust version parsing approach such as packaging.version.Version or handle potential exceptions from the int() conversion.

Copilot uses AI. Check for mistakes.
- name: Run tests with -Werror
if: matrix.python-version != '3.14'
run: pytest --cov=pyerrors -vv -Werror
run: pytest --cov=pyerrors -vv
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -Werror flag was removed from pytest for all Python versions except 3.14. According to the PR description, this change aims to fix deprecation warnings. If the warnings are truly fixed by this PR, the -Werror flag should remain to catch future regressions. Removing -Werror means new warnings won't cause test failures, which could allow issues to accumulate.

Suggested change
run: pytest --cov=pyerrors -vv
run: python -Werror -m pytest --cov=pyerrors -vv

Copilot uses AI. Check for mistakes.
Comment on lines +179 to 180
while pd.isna(df[column].iloc[i]):
i += 1
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential IndexError when the while loop reaches the end of the column without finding a non-null value. The notna().any() check at line 176 ensures there's at least one non-null value, but after the regex replace at line 177, some of those values might have been replaced with None. This can cause the while loop to run past the end of the DataFrame. Add a bounds check before accessing iloc[i] at line 181.

Suggested change
while pd.isna(df[column].iloc[i]):
i += 1
col_len = len(df[column])
while i < col_len and pd.isna(df[column].iloc[i]):
i += 1
if i == col_len:
# All values are NA after replacement; nothing to deserialize in this column
continue

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant