[python/c++] Connect C++ reader for dataframes by johnkerl · Pull Request #400 · single-cell-data/TileDB-SOMA

johnkerl · 2022-10-12T22:56:27Z

Status

SOMADataFrame and SOMAIndexedDataFrame are on this PR
For reviewer mercy, SOMASparseNdArray and SOMADenseNdArray will be on a separate PR
Re-does [python/c++] Connect libtiledbsoma to tiledbsoma readers for dataframes [WIP] #360 on top of Refine soma_* column handling in Python API #397. This PR is significantly smaller since there was some code overlap with Refine soma_* column handling in Python API #397. However, alas, the typeguard errors persist:

RuntimeError: Static type (UINT64) does not match expected type (INT64)

PR context

This is the third in a group of three related PRs:

[python] Conform to spec by reading as pyarrow.Table not pyarrow.RecordBatch #355: Conform to https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md by having read return pyarrow.Table, not pyarrow.RecordBatch (as in an outdated version of that spec) -- now merged
[python] Use true ASCII attributes in dataframes #359: Forward-port [python] Update ASCII storage for dataframes #273 from main-old which will truly have ASCII columns, obviating the need for our util_arrow.ascii_to_unicode_pyarrow_readback -- now merged
This PR: drop in C++ acceleration-library support for read methods, which will go in cleanly now
- The C++ code returns pyarrow.Table and with the first PR our unit tests will be ready to go
- When the C++ code reads Unicode cells it returns them as pyarrow.LargeBinaryArray (needing decode) but when we are properly writing ASCII cells via the Python write path then the C++ code will read ASCII cells and return them as strings (no longer needing decoding)

apis/python/src/tiledbsoma/util.py

johnkerl · 2022-10-13T17:14:47Z

@gspowley @bkmartinjr @Shelnutt2 everything seems to be working now, especially after some nice C++ mods which were included in #397, which was merged last evening.

The remaining failure is in Windows CI, where we do not build libtiledbsoma. And (AFAICT) the early abort in MacOS CI is just because Windows CI failed (I think but am not certain).

Now that we have a hard dependency on libtiledbsoma, and since we've made the decision to entirely remove these lines which do work on Windows (rather than using them as fallback when import tiledbsoma.libtiledbsoma raises ModuleNotFoundError) -- it seems we now have the decision to formally drop support for Windows, including taking it out of the CI job.

Thoughts?
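For reference, the fallback that was decided against would look roughly like the sketch below: try the native module and fall back to a pure-Python path when the import raises ModuleNotFoundError. The function name and the "native" / "python-fallback" labels are hypothetical; only the module name `tiledbsoma.libtiledbsoma` comes from this thread.

```python
import importlib


def load_reader_backend(module_name: str = "tiledbsoma.libtiledbsoma"):
    """Return (module, label); hypothetical sketch of an optional native import."""
    try:
        return importlib.import_module(module_name), "native"
    except ModuleNotFoundError:
        # The native extension is not built on this platform (for example Windows).
        return None, "python-fallback"
```

With a hard dependency, the `except` branch disappears and a missing build surfaces as an import error, which is why Windows CI fails.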

gspowley · 2022-10-13T18:02:49Z

it seems we now have the decision to formally drop support for Windows, including taking it out of the CI job.

Thoughts?

I vote we delay supporting Windows. I believe it's more important to focus on developing SOMA features on Linux and MacOS now. We can revisit supporting Windows later.

bkmartinjr · 2022-10-13T18:47:35Z

CC @ambrosejcarr @aaronwolen

I will check with our team and comment if the short-term deprioritization of Windows is of concern.

Can I assume the proposal is to revisit and support by the time we release an "alpha" (ie, feature complete) version? We definitely have users on Windows, and we will want to enable them by the time we release data in this format.

gspowley · 2022-10-22T21:33:45Z

apis/python/src/tiledbsoma/soma_dataframe.py

-                    attrs=attr_names,
-                )
+            if ids is not None:
+                sr.set_dim_points(SOMA_ROWID, util.ids_to_list(ids))


If ids is an Arrow array, we should pass the Arrow array to set_dim_points instead of converting it to a list. This will reduce memory usage and improve performance by avoiding creating a copy.

gspowley · 2022-10-22T21:34:30Z

apis/python/src/tiledbsoma/soma_indexed_dataframe.py

-            for table in iterator:
-                yield table
+            if ids is not None:
+                sr.set_dim_points(A.schema.domain.dim(0).name, util.ids_to_list(ids))


Same comment as above.

gspowley

Let's sync up on Monday.

gspowley · 2022-10-23T21:30:52Z

apis/python/src/tiledbsoma/util.py

+    """
+    For the interface between ``SOMADataFrame::read`` et al. (Python) and ``SOMAReader`` (C++): the
+    ``ids`` argument to the former can be slice or list; the argument to
+    ``SOMAReader::set_dim_points`` must be a list.


When setting a slice for a SOMAReader query, we should use SOMAReader::set_dim_ranges instead of converting the slice to a list.

This test shows an example:
https://github.com/single-cell-data/TileDB-SOMA/blob/main/libtiledbsoma/test/test_soma_reader.py#L82
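A minimal sketch of the suggestion, assuming (as the `stop = ids.stop + step` line in this diff implies) that both ends of the slice are inclusive. `slice_to_ranges` is a hypothetical helper, and the exact argument shape expected by `set_dim_ranges` should be checked against the linked test rather than taken from here.

```python
from typing import List, Tuple


def slice_to_ranges(ids: slice) -> List[Tuple[int, int]]:
    """Convert a doubly-inclusive slice into one (lo, hi) pair (hypothetical helper)."""
    if ids.start is None or ids.stop is None:
        raise ValueError("open-ended slices need the dimension's domain to resolve")
    if ids.step not in (None, 1, -1):
        raise ValueError("a strided slice is not a single contiguous range")
    # A descending slice selects the same cells as the ascending one.
    lo, hi = sorted((ids.start, ids.stop))
    return [(lo, hi)]
```

The result is a single pair regardless of how many ids the slice covers, whereas materializing the slice as a list grows with its length.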

gspowley · 2022-10-23T21:40:02Z

apis/python/src/tiledbsoma/util.py

+                step = -1
+        stop = ids.stop + step
+        return pa.chunked_array(pa.array(list(range(ids.start, stop, step))))
+    if isinstance(ids, pa.Array):


The intention of supporting Arrow arrays is captured in this test (currently in a PR):
https://github.com/single-cell-data/TileDB-SOMA/blob/gspowley/obs-slice-x-test/libtiledbsoma/test/test_soma_reader.py#L162

In this test, the "ids" will be of type pa.ChunkedArray, so it will fall through the if isinstance(...) checks and raise an exception.

gspowley · 2022-10-23T21:46:00Z

apis/python/src/tiledbsoma/util.py

+    ``SOMAReader::set_dim_points`` must be a list.
+    """
+    if isinstance(ids, list):
+        return pa.chunked_array(pa.array(ids))


We don't need to convert a list to an Arrow array; it can remain a list.

gspowley

Looks great!

johnkerl requested a review from gspowley October 12, 2022 22:56

johnkerl changed the base branch from main to bkmartinjr/386-soma-columns October 12, 2022 22:57

johnkerl mentioned this pull request Oct 12, 2022

[python/c++] Connect libtiledbsoma to tiledbsoma readers for dataframes [WIP] #360

Closed

johnkerl force-pushed the kerl/temp2 branch from 467b1f1 to 0dbceb5 Compare October 12, 2022 23:06

Base automatically changed from bkmartinjr/386-soma-columns to main October 12, 2022 23:19

johnkerl force-pushed the kerl/temp2 branch from 0dbceb5 to af5a520 Compare October 12, 2022 23:23

bkmartinjr reviewed Oct 12, 2022

apis/python/src/tiledbsoma/util.py (outdated)

johnkerl force-pushed the kerl/temp2 branch from 3f6bc8b to df03c98 Compare October 13, 2022 15:31

johnkerl marked this pull request as ready for review October 13, 2022 15:41

johnkerl changed the title ~~Connect libtiledbsoma to tiledbsoma readers for dataframes (#360 re-do) [WIP]~~ Connect libtiledbsoma to tiledbsoma readers for dataframes Oct 13, 2022

johnkerl force-pushed the kerl/temp2 branch 17 times, most recently from 7aa4f6d to 7566501 Compare October 15, 2022 22:13

johnkerl mentioned this pull request Oct 21, 2022

[python/c++] Connect C++ nnz to Python #446

Merged

johnkerl added a commit that referenced this pull request Oct 21, 2022

Remove badly rebase bits of #400

fbc6199

johnkerl added a commit that referenced this pull request Oct 21, 2022

Remove badly rebase bits of #400

3e9c9f9

johnkerl force-pushed the kerl/temp2 branch 3 times, most recently from 27f6d43 to a832ee4 Compare October 21, 2022 19:48

johnkerl marked this pull request as ready for review October 21, 2022 19:53

johnkerl added a commit that referenced this pull request Oct 21, 2022

Connect C++ nnz to Python (#446)

36cd078

* temp double-dylib workaround * title goes here * fix ci * code-review feedback * Remove badly rebase bits of #400

johnkerl force-pushed the kerl/temp2 branch 2 times, most recently from 8c98b2d to ad14bb0 Compare October 21, 2022 20:34

johnkerl mentioned this pull request Oct 21, 2022

[python] Apply black/isort/flake8/mypy to all repo .py files #453

Closed

gspowley suggested changes Oct 22, 2022

johnkerl force-pushed the kerl/temp2 branch from a604735 to 4a16e0f Compare October 23, 2022 16:10

johnkerl added 5 commits October 23, 2022 12:12

temp double-dylib workaround

f5ee484

temp

20e548f

fix ci

fd4d971

Support indexing by pyarrow.Array

e32fae5

code-review feedback

c72fc11

johnkerl force-pushed the kerl/temp2 branch from 2f0447f to c72fc11 Compare October 23, 2022 16:12

johnkerl requested a review from gspowley October 23, 2022 16:13

gspowley suggested changes Oct 23, 2022

Merge branch 'main' into kerl/temp2

2ca1541

johnkerl force-pushed the kerl/temp2 branch from 2edde90 to 01a1293 Compare October 24, 2022 15:36

johnkerl requested a review from gspowley October 24, 2022 15:37

code-review feedback

d046c81

johnkerl force-pushed the kerl/temp2 branch from 01a1293 to d046c81 Compare October 24, 2022 15:38

gspowley approved these changes Oct 24, 2022

johnkerl merged commit 293e4cd into main Oct 24, 2022

johnkerl deleted the kerl/temp2 branch October 24, 2022 16:03

johnkerl changed the title ~~Connect C++ reader for dataframes~~ [python/c++] Connect C++ reader for dataframes Oct 26, 2022

johnkerl merged 7 commits into main from kerl/temp2

3 participants
