[python] Use true ASCII attributes in dataframes#359
Merged
Conversation
This was referenced Oct 3, 2022
5820757 to
c89409a
Compare
4fa2285 to
8193c21
Compare
973dc7d to
9215877
Compare
f260372 to
ae2bde1
Compare
ae2bde1 to
5ccef0e
Compare
0661600 to
681b743
Compare
5ccef0e to
9d2a8f4
Compare
46b4ae3 to
a8bb90b
Compare
Contributor
Author
|
fyi, now that python unit-test cases are passing, there are C++ unit-test failures which i am currently debugging |
Contributor
Author
fixed on latest commit: |
Contributor
Author
|
Note: tests will fail with until TileDB-Inc/TileDB-Py#1337 wends its way into a |
gspowley
approved these changes
Oct 11, 2022
This was referenced Oct 12, 2022
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
This is a forward-port of #273 from the
main-oldbranch to themainbranch.Why
Lack of backward compatibility
main-old(see [python] Update ASCII storage for dataframes #273) we conditionally retained the decode-on-readback logic which was essential since otherwise b"foo" != "foo"main(greenfield) branch we do not keep backward compatibility -- any pre-existing datasets with non-ASCII attributes will need to be regeneratedPrerequisites
TileDB-Py0.17.5 which is here as of [python] Depend on TileDB-Py 0.17.5 #394PR context
This is the second in a group of three related PRs:
readreturnpyarrow.Tablenotpyarrow.RecordBatch(as in an outdated version of that spec)main-oldwhich will truly have ASCII columns, obviating the need for ourutil_arrow.ascii_to_unicode_pyarrow_readbackreadmethods, which will go in cleanly nowpyarrow.Tableand with the first PR our unit tests will be ready to gopyarrow.LargeBinaryArray(needing decode) but when we are properly writing ASCII cells via the Python write path then the C++ code will read ASCII cells and return them as strings (no longer needing decoding)These three changesets could be done in a single PR, but that would be unmerciful to the reviewers.
Problem
Issue I'm seeing on this PR which confuses me:
Script:
TileDB schema:
Reading back from C++ or from Python:
In summary:
"ascii"attributeslarge_stringnotstringin the Arrow readback schemacc @gspowley and @nguyenv
@nguyenv I'll get you a simpler reprex tomorrow
Update 2022-10-04: simpler repro https://gist.github.com/johnkerl/278bb246c3aa05cc4dfb1694503744c6