Skip to content

[python] Update ASCII storage for dataframes#273

Merged
johnkerl merged 7 commits intomain-oldfrom
kerl/ascii
Sep 23, 2022
Merged

[python] Update ASCII storage for dataframes#273
johnkerl merged 7 commits intomain-oldfrom
kerl/ascii

Conversation

@johnkerl
Copy link
Copy Markdown
Contributor

@johnkerl johnkerl commented Aug 31, 2022

On collaboration with @nguyenv on TileDB-Inc/TileDB-Py#1304 -- new/upated feedback (thanks @nguyenv !) is that all along we can obtain reduce type-conversion at readback by a one-line change in this package.

Note this means that while before this PR we can store (encode/decode) user-supplied Unicode data in dataframes, as of this PR we no longer can. That ability will be restored in a future core release -- perhaps early 2023. Meanwhile unit tests on this PR that used to validate the ability to encode/decode user-supplied Unicode data in dataframes have been commented out, to accommodate the feature loss.

Summary:

  • Write string dims/attrs as "ascii" per se, not bytes and not str
  • Since old SOMAs written before this PR won't have all "ascii", conditionally retain the decode-on-readback logic which is essential since otherwise b"foo" != "foo"
  • Please see also the clear presentation at Queryability of dataframe attribute columns #99

@johnkerl
Copy link
Copy Markdown
Contributor Author

Status note: the CI failure is puzzling me as unit tests are passing for me locally 👀

@johnkerl johnkerl force-pushed the kerl/ascii branch 2 times, most recently from 3edca86 to ad10f3e Compare September 2, 2022 21:18
@johnkerl johnkerl changed the base branch from main to main-old September 16, 2022 00:36
@johnkerl johnkerl force-pushed the kerl/ascii branch 3 times, most recently from 1b487b5 to f3a0d62 Compare September 16, 2022 01:00
@johnkerl
Copy link
Copy Markdown
Contributor Author

johnkerl commented Sep 16, 2022

Sorry for the delay.

@Shelnutt2 @nguyenv this is ready for review.

Failure of Windows R-CMD-check.yaml is an unrelated KP.

@johnkerl johnkerl force-pushed the kerl/ascii branch 9 times, most recently from a75f33e to 4da886a Compare September 16, 2022 20:58
@johnkerl
Copy link
Copy Markdown
Contributor Author

@nguyenv @ihnorton @Shelnutt2 ping 🙏

Comment on lines +78 to +90

# TileDB string dims are ASCII not UTF-8. Decode them so they readback not like
# `b"AKR1C3"` but rather like `"AKR1C3"`. Update as of
# https://github.com/TileDB-Inc/TileDB-Py/pull/1304 these dims will read back OK.
retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name].tolist()
return [e.decode() for e in retval]

retval = [e.decode() for e in retval]

if len(retval) > 0 and isinstance(retval[0], bytes):
return [e.decode() for e in retval]
else:
# list(...) is there to appease the linter which thinks we're returning `Any`
return list(retval)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could something like this work?

retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name]
return np.frombuffer(retval, dtype="U").tolist()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no sorry @nguyenv !

    return np.frombuffer(retval, dtype="U").tolist()
ValueError: itemsize cannot be zero in type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But regardless all I'm doing is putting an if around known-good code, invoking it when needed for older arrays which predate this PR!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nguyenv what works remains to accept or reject this PR?

@johnkerl
Copy link
Copy Markdown
Contributor Author

Thanks @nguyenv ! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants