FEAT - Add interdependence score to column associations #1724
Conversation
Hi @JadeAffolabi, a few points about this PR. As I was expecting, this will be a very complex PR that will need various iterations of discussion and implementation, and will take a long time to review. If your expectation is for this PR to be merged in the short term (e.g., for some kind of assignment...), that will most likely not be the case. Then, there is no need to open and close the PR multiple times. If you need to re-run the CI, you need to push a new commit to trigger the execution of the CI again. The PR is missing all tests (which is also why coverage is not passing), so they should be added. You can refer to #1310 for a PR that adds a similar feature, and how it is tested. Similarly, the PR is missing proper examples in the docstrings, and a relative example in the documentation. Again, check #1310 for more info on how this should be done. So this PR needs a lot more work before it can be merged, and if you happen to need to meet a deadline with it you might want to reconsider working on this. Very often, contributions to open source software can take many months to be merged, and this looks like one that won't be merged anytime soon.
Hi @rcap107, thank you for your feedback. I will keep working on the PR. Have a nice day.
That sounds good! What matters to me is that working on this may take a long time, but if that is ok with you then you are of course more than welcome to continue. Thanks for the effort, and feel free to ask if you need further help.
Hi @rcap107, I was also wondering if I should add the interdependence score (IdS) to the output of the `column_associations` function or if I should make the function that implements the IdS public.
Hello, what do you think?
In the docstring there should be a snippet of code that shows how you're supposed to call the function and what the result looks like on an example dataset. Then, there should be a script in the "examples" folder that gives some additional context on the operation. #1310 is a very good example to take inspiration from because it's a very similar feature and has both the docstring examples and the full script to check out. You can also use this document to find more info on how to write an example. It should be fairly short, and it's important to have one in the gallery.
The function should be public, and the output should also be added to the Associations tab in the TableReport (although this part can be done in a separate PR). From a quick look at the code, it seems like the output of the ids function is a square matrix. If that is the case, it should be modified so that it is a dataframe that contains the list of all the associations between the columns. Check out the
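For illustration, converting a square score matrix into a long-format dataframe with one row per column pair could look like this (a sketch; the column names and matrix values are made up for the example, not taken from the PR):

```python
import numpy as np
import pandas as pd

cols = ["v0", "v1", "v2"]
# Hypothetical symmetric IDS matrix (diagonal = self-dependence).
ids = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

# Flatten the upper triangle into one row per column pair.
i, j = np.triu_indices(len(cols), k=1)
table = pd.DataFrame({
    "left_column_name": [cols[a] for a in i],
    "right_column_name": [cols[b] for b in j],
    "interdependence_score": ids[i, j],
})
```

This yields one row per unordered pair, which matches the long-format layout that `column_associations` produces for its other metrics.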
Small note, this PR should get an entry in the changelog.
…abi/skrub into enhancement-issue-1576 # Conflicts: # skrub/__init__.py
Hi @JadeAffolabi, I really appreciate the effort you're putting in this PR, especially considering it is quite a complex issue. Unfortunately, as a result of it being a complex issue it is also something that takes quite a bit of time to review, to make sure it is working as intended and properly implements the original paper. For your information, I won't be able to allocate the time necessary for a while more (likely, not until January), though if I can I'll try to leave some feedback before then.
Hello, |
rcap107 left a comment
Here is a first review. As I expected, this will take time and iterations to get right.
Overall, I have two main points I am concerned by:
- A lot of this code is taken directly from the original repository (https://github.com/aradha/interdependence_scores), and it's not clear to me whether the original author is aware of this, and how we're going to deal with the licenses.
- The function `interdependence_score` is fit_transforming a `OneHotEncoder`, meaning that there is a state that should be kept track of. However, since this function is just meant to give a metric, it may be fine to discard it.
Some other points:
- For consistency with the code base, I removed the type hints.
- `feature_map_function` is never used as a parameter, so it should be removed and `_gaussian_feature_map` should be used directly instead.
There are some other cosmetic fixes that need to be handled.
```python
np.random.seed(42)
n = 1000
```
Here it would be better to state explicitly that some of the variables depend on each other (v1 depends on v0, etc.), and that we would expect them to have a higher IDS with each other than with the other variables.
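As a sketch of what such an explicit setup could look like (variable names follow the ones discussed in this review; the exact formulas and noise levels are my assumptions, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
v0 = rng.normal(size=n)
v1 = 2.0 * v0 + 0.1 * rng.normal(size=n)    # v1 depends linearly on v0
v2 = rng.normal(size=n)
v3 = np.sin(v2) + 0.1 * rng.normal(size=n)  # v3 depends non-linearly on v2
v5 = rng.normal(size=n)                     # v5 is independent of the others
```

With comments like these in the example, a reader knows in advance which pairs should show a high IDS and which should not.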
> Its basic idea is to approximate the HSIC (Hilbert-Schmidt Independence Criterion)
> by computing the first k terms of an infinite-dimensional feature map
> for the universal Gaussian kernel.
Add here a reference to the paper this was taken from
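For context, the finite truncation mentioned in the docstring can be illustrated with the Taylor feature map of the Gaussian kernel. This is a sketch under my own assumptions about the construction, not the PR's actual `_gaussian_feature_map`; the bandwidth and number of terms are arbitrary:

```python
import math
import numpy as np

def taylor_feature_map(x, k=8, sigma=1.0):
    # phi_n(x) = exp(-x^2 / (2 sigma^2)) * x^n / (sigma^n * sqrt(n!)),
    # so that sum_n phi_n(x) * phi_n(y) -> exp(-(x - y)^2 / (2 sigma^2))
    # as k -> infinity, i.e. the Gaussian (RBF) kernel.
    return np.array([
        math.exp(-x * x / (2 * sigma**2)) * x**n
        / (sigma**n * math.sqrt(math.factorial(n)))
        for n in range(k)
    ])

x, y = 0.3, -0.2
approx = taylor_feature_map(x) @ taylor_feature_map(y)   # truncated kernel value
exact = math.exp(-((x - y) ** 2) / 2)                    # exact Gaussian kernel
```

Keeping only the first `k` terms is what makes the HSIC approximation tractable; the truncation error shrinks factorially in `k`.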
```python
plt.tight_layout()
plt.show()
# %%
# First of all each variable have perfect dependence with itself (ids = 1).
```
Suggested change:
```diff
- # First of all each variable have perfect dependence with itself (ids = 1).
+ # First of all, each variable has perfect dependence with itself (IDS = 1).
```
```python
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
```
Rewording a bit, formatting variables
Suggested change:
```diff
- # The linearly dependent variables (v0, v1) have ids greater than 0.90,
- # and the non-linear variables as well (v2, v3, v4).
+ # Both linearly dependent (``v1`` wrt ``v0``) and non-linearly dependent
+ # variables (``v3`` and ``v4`` wrt ``v2``) have IDS greater than 0.90, signaling
+ # a strong dependency on the starting variable.
```
```python
# First of all each variable have perfect dependence with itself (ids = 1).
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
# The variable v6 which is tanh(v0*v2), has a high score with v0 and v2,
```
Suggested change:
```diff
- # The variable v6 which is tanh(v0*v2), has a high score with v0 and v2,
+ # ``v6``, which is defined as ``tanh(v0*v2)``, has a high score with ``v0`` and ``v2``,
```
```python
def _ids_matrix(
    X: DataFrame,
    feature_map_function=_gaussian_feature_map,
    k_terms=6,
    p_val=False,
    num_tests=100,
    bandwidth_term=1 / 2,
) -> tuple[np.ndarray, np.ndarray] | tuple[np.ndarray, None]:
```
Suggested change:
```diff
- def _ids_matrix(
-     X: DataFrame,
-     feature_map_function=_gaussian_feature_map,
-     k_terms=6,
-     p_val=False,
-     num_tests=100,
-     bandwidth_term=1 / 2,
- ) -> tuple[np.ndarray, np.ndarray] | tuple[np.ndarray, None]:
+ def _ids_matrix(
+     X,
+     feature_map_function=_gaussian_feature_map,
+     k_terms=6,
+     p_val=False,
+     num_tests=100,
+     bandwidth_term=1 / 2,
+ ):
```
```python
def interdependence_score(
    X: DataFrame,
    k_terms=6,
    p_val=False,
    num_tests=100,
    bandwidth_term=1 / 2,
    return_matrix=False,
) -> DataFrame | tuple[DataFrame, DataFrame]:
```
type hints
Suggested change:
```diff
- def interdependence_score(
-     X: DataFrame,
-     k_terms=6,
-     p_val=False,
-     num_tests=100,
-     bandwidth_term=1 / 2,
-     return_matrix=False,
- ) -> DataFrame | tuple[DataFrame, DataFrame]:
+ def interdependence_score(
+     X,
+     k_terms=6,
+     p_val=False,
+     num_tests=100,
+     bandwidth_term=1 / 2,
+     return_matrix=False,
+ ):
```
```python
Notes
-----
The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'cramer_v', 'pearson_corr']``
```
Suggested change:
```diff
- The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'cramer_v', 'pearson_corr']``
+ The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'interdependence_score', 'pvalue']``
```
```python
ids_table = _join_utils.left_join(
    ids_table, pval_table, right_on=on, left_on=on
)
ids_table = drop_columns_with_substring(ids_table, "skrub")
```
This can be replaced by skrub selectors. Add `import skrub.selectors as s` at the top of the file, then:
```diff
- ids_table = drop_columns_with_substring(ids_table, "skrub")
+ ids_table = s.select(ids_table, ~s.glob("*skrub*"))
```
`drop_columns_with_substring` can be removed.
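For reference, `s.select` with `~s.glob(...)` keeps every column whose name does not match the pattern. A plain-pandas equivalent of that filtering (illustrative only, not the actual skrub selector call; the dataframe contents are made up):

```python
import pandas as pd

df = pd.DataFrame({"v0": [1.0], "v1": [2.0], "_skrub_idx": [0]})
# Keep only the columns whose name does not contain "skrub",
# mimicking s.select(df, ~s.glob("*skrub*")).
filtered = df.loc[:, [c for c in df.columns if "skrub" not in c]]
```

The selector version has the advantage of working across the dataframe backends skrub supports, rather than being pandas-specific.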
```python
)
ids_table = drop_columns_with_substring(ids_table, "skrub")
```
```python
ids_table = drop_columns_with_substring(ids_table, "idx")
```
Same here as above:
```diff
- ids_table = drop_columns_with_substring(ids_table, "idx")
+ ids_table = s.select(ids_table, ~s.glob("*idx*"))
```
```python
"""Detect which columns have strong statistical dependence using InterDependenceScore"""
```
Please add here a link to the original repository and explain that this is an adaptation of the original implementation so that it can be used in skrub.
A couple of additional notes after IRL discussions:
- Given that the original license is MIT, it should be fine to reuse the code, provided that it gets referenced properly in the code.
- Since this is a metric, fitting the OneHotEncoder is not a problem.
Hi @JadeAffolabi, I'm discussing with other maintainers whether the PR should be merged after all, and for the time being it's better if you don't spend any more time on this particular problem. I should have stopped you before you embarked on this; I apologize for the mess-up. In any case, thank you very much for the effort you put into this.
Description
This pull request is linked to issue #1576.
I implemented the NumPy version of the Interdependence Score (IDS) from the paper *Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores*.
New file
The new functions are in the file `skrub/_interdependence_score.py`. The main function of the file is `_ids_matrix()`. It can be called directly.
TO DO
If the PR is accepted, I can modify the file `skrub/_column_associations.py` so that the IDS can be displayed in the output of `column_associations`, along with the p-values associated with the IDS.