[Fix] Pandas 3 string type changes #278
base: develop
@@ -145,9 +145,9 @@ def _serialize_df(df, gz=False):
         serialize = _need_to_serialize(out[column])

         if serialize is True:
-            out[column] = out[column].transform(lambda x: create_json_string(x, indent=0) if x is not None else None)
+            out[column] = out[column].transform(lambda x: create_json_string(x, indent=0) if not _is_null(x) else None)
             if gz is True:
-                out[column] = out[column].transform(lambda x: gzip.compress((x if x is not None else '').encode('utf-8')))
+                out[column] = out[column].transform(lambda x: gzip.compress(x.encode('utf-8')) if not _is_null(x) else gzip.compress(b''))

     return out
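To illustrate why the guard moves from `x is not None` to `_is_null(x)`, here is a minimal sketch. It assumes `_is_null` behaves like `pd.isna` for scalars and returns `False` for lists and arrays, as the one-line version quoted later in this review does. The names `is_null` and `compress_cell` are hypothetical stand-ins, not the PR's code.

```python
import gzip

import numpy as np
import pandas as pd


def is_null(val):
    """Hypothetical stand-in for the PR's _is_null helper."""
    # pd.isna returns an elementwise array for containers, so they are
    # reported as "not null" instead of being passed to pd.isna.
    if isinstance(val, (list, np.ndarray)):
        return False
    return bool(pd.isna(val))


def compress_cell(x):
    # Mirrors the new gz branch: missing values become a gzipped empty payload.
    return gzip.compress(x.encode("utf-8")) if not is_null(x) else gzip.compress(b"")


# NaN and pd.NA are not None, so they slip past an `x is not None` guard.
old_guard_lets_nan_through = float("nan") is not None

cells = ['{"a": 1}', None, float("nan"), pd.NA]
compressed = [compress_cell(c) for c in cells]
```

With the old guard, a NaN cell would reach `.encode('utf-8')` and fail, because a float has no `encode` method.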
@@ -166,37 +166,47 @@ def _deserialize_df(df, auto_gamma=False):
     ------
     In case any column of the DataFrame is gzipped it is gunzipped in the process.
     """
-    for column in df.select_dtypes(include="object"):
-        if isinstance(df[column][0], bytes):
-            if df[column][0].startswith(b"\x1f\x8b\x08\x00"):
-                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8'))

-        if not all([e is None for e in df[column]]):
+    # In pandas 3+, string columns use 'str' dtype instead of 'object'
+    string_like_dtypes = ["object", "str"] if int(pd.__version__.split(".")[0]) >= 3 else ["object"]
+
+    for column in df.select_dtypes(include=string_like_dtypes):
+        if isinstance(df[column].iloc[0], bytes):
+            if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
+                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
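The version gate in this hunk can be exercised in isolation. The sketch below pulls the same expression into a hypothetical function, `string_like_dtypes_for`, that takes the version string as an argument (in the PR it comes from `pd.__version__`), so both branches can be checked without installing two pandas versions.

```python
def string_like_dtypes_for(pandas_version):
    """Hypothetical wrapper around the version check used in the diff."""
    # Only the major component matters; "3.0.0rc1" still parses because the
    # split happens on "." before the suffix is reached.
    major = int(pandas_version.split(".")[0])
    return ["object", "str"] if major >= 3 else ["object"]


dtypes_v2 = string_like_dtypes_for("2.2.3")
dtypes_v3 = string_like_dtypes_for("3.0.0")
dtypes_rc = string_like_dtypes_for("3.0.0rc1")
```

Passing `"str"` to `select_dtypes` only on pandas 3 and later keeps older installations on the `"object"`-only path they used before.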
Suggested change:
-                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
+                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if not pd.isna(x) else '')
Copilot AI, Feb 3, 2026
Potential IndexError when the column is empty or when all values are NaN/None. Before accessing iloc[0], verify that the column has at least one non-null value. This can happen when processing empty DataFrames or columns with all null values.
Suggested change:
-        if isinstance(df[column].iloc[0], bytes):
-            if df[column].iloc[0].startswith(b"\x1f\x8b\x08\x00"):
-                df[column] = df[column].transform(lambda x: gzip.decompress(x).decode('utf-8') if x else '')
+        col_series = df[column]
+        # Skip empty columns to avoid IndexError on iloc[0]
+        if col_series.empty:
+            continue
+        # Find the first non-null value for type inspection
+        first_valid_index = col_series.first_valid_index()
+        first_value = col_series.loc[first_valid_index] if first_valid_index is not None else None
+        if isinstance(first_value, bytes):
+            if first_value.startswith(b"\x1f\x8b\x08\x00"):
+                df[column] = col_series.transform(
+                    lambda x: gzip.decompress(x).decode('utf-8')
+                    if isinstance(x, (bytes, bytearray)) and x
+                    else ''
+                )
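The suggestion can be tried on a small column. This is a sketch only: `gunzip_column` is a hypothetical helper that wraps the same steps for a single Series, and it assumes a default integer index so that `.loc` returns a scalar.

```python
import gzip

import pandas as pd

GZIP_MAGIC = b"\x1f\x8b\x08\x00"  # same prefix the diff checks for


def gunzip_column(col):
    """Hypothetical helper: decompress a gzipped column, tolerating nulls."""
    first_idx = col.first_valid_index()  # None for empty or all-null columns
    if first_idx is None:
        return col
    first = col.loc[first_idx]
    if isinstance(first, bytes) and first.startswith(GZIP_MAGIC):
        return col.transform(
            lambda x: gzip.decompress(x).decode("utf-8")
            if isinstance(x, (bytes, bytearray)) and x
            else ""
        )
    return col


packed = pd.Series([None, gzip.compress(b"hello"), gzip.compress(b"")], dtype=object)
unpacked = gunzip_column(packed)
empty = gunzip_column(pd.Series([], dtype=object))
```

Because the type check uses the first non-null entry, a leading `None` no longer hides the fact that the column holds gzipped bytes.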
Copilot AI, Feb 3, 2026
Potential IndexError when the while loop reaches the end of the column without finding a non-null value. The notna().any() check at line 176 ensures there's at least one non-null value, but after the regex replace at line 177, some of those values might have been replaced with None. This can cause the while loop to run past the end of the DataFrame. Add a bounds check before accessing iloc[i] at line 181.
Suggested change:
-            while pd.isna(df[column].iloc[i]):
-                i += 1
+            col_len = len(df[column])
+            while i < col_len and pd.isna(df[column].iloc[i]):
+                i += 1
+            if i == col_len:
+                # All values are NA after replacement; nothing to deserialize in this column
+                continue
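The bounds-checked scan can be shown without pandas. The sketch below uses a plain list and a hypothetical `is_missing` predicate in place of `pd.isna`; the loop structure is the same as in the suggestion.

```python
import math


def is_missing(value):
    """Hypothetical stand-in for pd.isna on scalars (None or float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_present_position(values):
    """Return the position of the first non-missing entry, or None if there is none."""
    i = 0
    n = len(values)
    # Checking i < n first keeps the index in range when every entry is missing.
    while i < n and is_missing(values[i]):
        i += 1
    return i if i < n else None


pos_mixed = first_present_position([None, float("nan"), "payload"])
pos_all_missing = first_present_position([None, None])
pos_empty = first_present_position([])
```

Returning `None` for the all-missing case plays the role of the `continue` in the suggestion: the caller skips the column instead of indexing past its end.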
Copilot AI, Feb 3, 2026
The logic in line 186 is incorrect for applying .gm() to items in a list. When checking 'not pd.isna(o)' for list elements, this will raise an error because 'o' is an Obs object, not a pandas value. The pd.isna() check should be changed to handle Obs objects properly, or should use 'o is not None' instead.
Copilot AI, Feb 3, 2026
The apply() function returns a new Series but the result is not assigned back to df[column]. This means the .gm() calls have no effect. The code should be: df[column] = df[column].apply(...) to store the result.
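Whether the assignment matters depends on what `.gm()` does, which this excerpt does not show. The sketch below uses a hypothetical `Measurement` class with an in-place `process()` method to separate the two effects: the Series returned by `apply()` is discarded unless assigned, but mutations made to the objects stored in the column persist either way.

```python
import pandas as pd


class Measurement:
    """Hypothetical stand-in for an object with an in-place method."""

    def __init__(self, raw):
        self.raw = raw
        self.processed = False

    def process(self):
        # Mutates the object and, like many in-place methods, returns None.
        self.processed = True


col = pd.Series([Measurement(1), None, Measurement(2)], dtype=object)

# The return value is a new Series; col itself is not rebound.
result = col.apply(lambda x: x.process() if x is not None else x)

mutated_in_place = col.iloc[0].processed and col.iloc[2].processed
returned_all_null = bool(result.isna().all())
```

If the method works in place and returns `None`, as in this sketch, assigning the result back to the column would replace the objects with nulls. If it returns a new object, the assignment is required.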
Copilot AI, Feb 3, 2026
The _is_null function returns False for lists and numpy arrays, which means that empty lists ([]) will be treated as non-null. This could lead to unexpected behavior if a column contains empty lists as placeholders for null values. Consider whether empty lists should be treated as null values based on the use case.
Suggested change:
-    return False if isinstance(val, (list, np.ndarray)) else pd.isna(val)
+    # Treat empty lists/arrays (and containers whose elements are all null) as null.
+    if isinstance(val, list):
+        if len(val) == 0:
+            return True
+        # A list is null only if all its elements are null.
+        return all(_is_null(v) for v in val)
+    if isinstance(val, np.ndarray):
+        if val.size == 0:
+            return True
+        # For object-dtype arrays, check elementwise using _is_null.
+        if val.dtype == object:
+            return all(_is_null(v) for v in val)
+        # For non-object arrays, rely on pandas/numpy NA detection.
+        return bool(np.all(pd.isna(val)))
+    return pd.isna(val)
The -Werror flag was removed from pytest for all Python versions except 3.14. According to the PR description, this change aims to fix deprecation warnings. If the warnings are truly fixed by this PR, the -Werror flag should remain to catch future regressions. Removing -Werror means new warnings won't cause test failures, which could allow issues to accumulate.
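The effect of running with warnings promoted to errors can be reproduced with the standard `warnings` module. In this sketch, `legacy_call` is a hypothetical function that emits a `DeprecationWarning`, and the `"error"` filter stands in for what `-Werror` does to the test run.

```python
import warnings


def legacy_call():
    """Hypothetical function that still works but warns about deprecation."""
    warnings.warn("legacy_call is deprecated", DeprecationWarning, stacklevel=2)
    return 42


# With the "error" action, the warning is raised as an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        legacy_call()
        raised = False
    except DeprecationWarning:
        raised = True

# Without it, the call completes and the warning is only reported (here: ignored).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    value = legacy_call()
```

This is the trade-off the comment points at: with the error filter a new deprecation fails the test immediately, while without it the same call passes silently.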