[python] Update ASCII storage for dataframes by johnkerl · Pull Request #273 · single-cell-data/TileDB-SOMA

johnkerl · 2022-08-31T19:00:45Z

In collaboration with @nguyenv on TileDB-Inc/TileDB-Py#1304 -- new/updated feedback (thanks @nguyenv !) is that all along we could have reduced type conversion at readback by a one-line change in this package.

Note this means that while before this PR we could store (encode/decode) user-supplied Unicode data in dataframes, as of this PR we no longer can. That ability will be restored in a future core release -- perhaps early 2023. Meanwhile, unit tests on this PR that used to validate the ability to encode/decode user-supplied Unicode data in dataframes have been commented out, to accommodate the feature loss.
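
To make the feature loss concrete, here is a minimal sketch in plain Python, independent of TileDB, of which strings survive an ASCII-only encoding. The helper name `can_store_as_ascii` and the sample labels are hypothetical and are not part of this PR.

```python
def can_store_as_ascii(value: str) -> bool:
    """Hypothetical helper: True if every character is in the ASCII range."""
    return value.isascii()


# Illustrative labels only; the last two contain non-ASCII characters.
labels = ["AKR1C3", "naïve T cell", "β-cell"]
storable = [s for s in labels if can_store_as_ascii(s)]

# ASCII strings round-trip cleanly through an ASCII encode/decode.
assert "AKR1C3".encode("ascii").decode("ascii") == "AKR1C3"

# Non-ASCII strings cannot be encoded as ASCII at all.
try:
    "β-cell".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```

A check like this could be used to warn about values that would be affected until Unicode support is restored.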

Summary:

Write string dims/attrs as "ascii" per se, not bytes and not str
Since old SOMAs written before this PR won't have all "ascii", conditionally retain the decode-on-readback logic, which is essential since otherwise b"foo" != "foo"
Please see also the clear presentation at Queryability of dataframe attribute columns #99
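
The conditional decode-on-readback described in the summary can be sketched roughly as follows. This is a standalone illustration, not the code in this package: the function name `decode_if_bytes` and the sample values are assumptions, and the lists stand in for what an array read would return.

```python
from typing import List, Sequence, Union


def decode_if_bytes(values: Sequence[Union[bytes, str]]) -> List[str]:
    """Hypothetical helper: decode only when the readback produced bytes."""
    if len(values) > 0 and isinstance(values[0], bytes):
        # Older arrays: values come back as bytes and need decoding.
        return [v.decode() for v in values]
    # Newer arrays: values are already str, so pass them through.
    return list(values)


old_style = [b"AKR1C3", b"ESR1"]  # stand-in for a pre-PR array readback
new_style = ["AKR1C3", "ESR1"]    # stand-in for a post-PR array readback

# Without decoding, bytes and str never compare equal.
assert b"AKR1C3" != "AKR1C3"

# With the conditional decode, both readbacks agree.
assert decode_if_bytes(old_style) == decode_if_bytes(new_style)
```

Checking the element type first matters because calling `.decode()` unconditionally would raise `AttributeError` on values that are already `str`.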

johnkerl · 2022-08-31T21:09:32Z

Status note: the CI failure is puzzling me as unit tests are passing for me locally 👀

johnkerl · 2022-09-16T02:38:59Z

Sorry for the delay.

@Shelnutt2 @nguyenv this is ready for review.

Failure of Windows R-CMD-check.yaml is an unrelated KP.

johnkerl · 2022-09-17T16:33:38Z

@nguyenv @ihnorton @Shelnutt2 ping 🙏

nguyenv · 2022-09-19T14:40:53Z

apis/python/src/tiledbsoma/annotation_dataframe.py

+
+            # TileDB string dims are ASCII not UTF-8. Decode them so they readback not like
+            # `b"AKR1C3"` but rather like `"AKR1C3"`. Update as of
+            # https://github.com/TileDB-Inc/TileDB-Py/pull/1304 these dims will read back OK.
            retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name].tolist()
-            return [e.decode() for e in retval]
+
+            retval = [e.decode() for e in retval]
+
+            if len(retval) > 0 and isinstance(retval[0], bytes):
+                return [e.decode() for e in retval]
+            else:
+                # list(...) is there to appease the linter which thinks we're returning `Any`
+                return list(retval)


Could something like this work?

retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name]
return np.frombuffer(retval, dtype="U").tolist()

no sorry @nguyenv !

return np.frombuffer(retval, dtype="U").tolist()
ValueError: itemsize cannot be zero in type

But regardless all I'm doing is putting an if around known-good code, invoking it when needed for older arrays which predate this PR!

@nguyenv what work remains to accept or reject this PR?
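
The `np.frombuffer` error quoted in this thread can be reproduced outside TileDB. The sketch below assumes only NumPy; the buffer contents and the helper name `frombuffer_unsized_fails` are illustrative. It shows that `dtype="U"` carries no item width, whereas a fixed-width bytes dtype such as `"S6"` does.

```python
import numpy as np


def frombuffer_unsized_fails() -> bool:
    """Hypothetical check: does an unsized 'U' dtype get rejected?"""
    try:
        np.frombuffer(b"AKR1C3", dtype="U")
    except ValueError:
        # NumPy cannot split a buffer into items of width zero.
        return True
    return False


# A fixed-width bytes dtype works: two 6-byte items, the second NUL-padded.
fixed = np.frombuffer(b"AKR1C3ESR1\x00\x00", dtype="S6")

# Decode the fixed-width bytes array to str elementwise.
decoded = np.char.decode(fixed, "ascii").tolist()
```

Because the stored strings are variable-length, a single fixed width is not generally available, which is one reason a per-element decode is the simpler option here.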

johnkerl · 2022-09-23T20:50:59Z

Thanks @nguyenv ! :)

johnkerl requested a review from nguyenv August 31, 2022 19:02

johnkerl marked this pull request as ready for review August 31, 2022 19:02

johnkerl force-pushed the kerl/ascii branch from 9b88641 to 4abbe3f Compare August 31, 2022 19:03

johnkerl requested review from Shelnutt2 and aaronwolen August 31, 2022 19:27

johnkerl mentioned this pull request Aug 31, 2022

Restore ability to store and query non-ASCII dataframe attributes #274

Closed

johnkerl force-pushed the kerl/ascii branch from 3c8b6d4 to a730221 Compare August 31, 2022 21:04

johnkerl force-pushed the kerl/ascii branch 2 times, most recently from 3edca86 to ad10f3e Compare September 2, 2022 21:18

johnkerl force-pushed the kerl/ascii branch from ad10f3e to 25927cc Compare September 15, 2022 23:43

johnkerl changed the base branch from main to main-old September 16, 2022 00:36

johnkerl force-pushed the kerl/ascii branch 3 times, most recently from 1b487b5 to f3a0d62 Compare September 16, 2022 01:00

johnkerl force-pushed the kerl/ascii branch 9 times, most recently from a75f33e to 4da886a Compare September 16, 2022 20:58

nguyenv reviewed Sep 19, 2022

johnkerl added 3 commits September 21, 2022 23:49

tiledbsc-py stats experiment

d46dbae

ingestor --stats/--corestats for performance-analysis collaboration

5ed75bd

finer detail on X/data ingest

846138a

johnkerl added 4 commits September 21, 2022 23:49

typofix

2a93db6

Update ASCII storage for dataframes

226a856

Update unit tests to reflect feature change

63666d6

fix failing unit test

eda0866

johnkerl force-pushed the kerl/ascii branch from 4da886a to eda0866 Compare September 22, 2022 03:49

nguyenv approved these changes Sep 23, 2022

johnkerl merged commit bd41346 into main-old Sep 23, 2022

johnkerl deleted the kerl/ascii branch September 23, 2022 20:51

johnkerl mentioned this pull request Oct 12, 2022

[python/c++] Connect C++ reader for dataframes #400

Merged

johnkerl changed the title ~~Update ASCII storage for dataframes~~ [python] Update ASCII storage for dataframes Oct 26, 2022

aaronwolen mentioned this pull request Nov 4, 2022

Don't use default assay name for Seurat objects #507

Merged

aaronwolen mentioned this pull request Nov 17, 2022

[r] Expand batched reader support to AnnotationMatrix arrays #548

Merged

aaronwolen mentioned this pull request Dec 2, 2022

[r] Make SOMAs and X layers selectable when converting to a Seurat object #571

Merged

aaronwolen mentioned this pull request Dec 19, 2022

[r] Release 0.1.19 #621

Merged

aaronwolen mentioned this pull request Jan 23, 2023

[r] Add pbmc3k dataset helper #792

Merged

johnkerl merged 7 commits into main-old from kerl/ascii
