[REVIEW] Add MultiIndex support for Dataframes and Series #1301
rerun tests
@thomcom given we're still working on the correct approach for
I'm writing the first implementation dumb/slow/naive, on the assumption that we can use a libcudf gather call for the efficient method in the near future. Once I have 100% of tests passing we can talk about the more efficient solution and add bindings for gather to cudf. Sound good? More progress coming today.
Sounds perfect, thanks for leading the charge on this!
…instead of a list of lists.
…ring.py test that now passes.
Tests aren't passing because of obscure circular dependency problems (I think), but I'm able to continue development and run tests locally without any issues. I'm asking for comments at this time. Before the end of the 0.6 dev cycle I intend to add proper MultiIndex output to groupby results, and I hope to add slicing. Slicing is the most likely to get bumped.
This PR adds the MultiIndex class to cudf. MultiIndex (MI) is used for slicing and manipulating dataframes with more than two dimensions. It is of particular importance for groupby, which uses it in both an index form and a column form.
This PR creates the class and adds 90% of the public methods from Pandas to the MultiIndex. The core functionality (MI codes) is implemented using cudf dataframes, as one code row must be created for each row or column in the dataframe the index is attached to.
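To make the "one code row per dataframe row" idea concrete, here is a minimal pure-Python sketch of the levels/codes layout that pandas-style MultiIndexes use. It is an illustration only, not the cudf implementation from this PR: the helper names `build_levels_and_codes` and `take_keys` are hypothetical, and plain lists stand in for the cudf dataframe that holds the codes.

```python
def build_levels_and_codes(keys):
    """Split a list of key tuples into per-level unique values and integer codes.

    Hypothetical helper for illustration; not part of cudf or pandas.
    """
    if not keys:
        return [], []
    n_levels = len(keys[0])
    # One sorted list of unique values per level.
    levels = [sorted({key[i] for key in keys}) for i in range(n_levels)]
    # Map each value to its position within its level.
    lookup = [{value: pos for pos, value in enumerate(level)} for level in levels]
    # One code row per input row: each entry is a position into that level.
    codes = [[lookup[i][key[i]] for i in range(n_levels)] for key in keys]
    return levels, codes


def take_keys(levels, codes):
    """Rebuild the key tuples by looking each code up in its level (a gather)."""
    return [tuple(levels[i][c] for i, c in enumerate(row)) for row in codes]


keys = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
levels, codes = build_levels_and_codes(keys)
print(levels)  # [['a', 'b'], [1, 2]]
print(codes)   # [[0, 0], [0, 1], [1, 0], [1, 1]]
```

The lookup in `take_keys` is the same access pattern as a gather (`out[i] = source[index[i]]`), which is why a libcudf gather binding is the natural replacement for a naive per-row loop.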
I've left a few important tasks incomplete, since the basic functionality and groupby support are both available now for 0.7. The MI work checklist is below:
MultiIndex checklist
This will fix #483
Fixes #1337
Fixes rapidsai/dask-cudf#191
Fixes rapidsai/dask-cudf#125
Fixes rapidsai/dask-cudf#132