Conversation

@JadeAffolabi

@JadeAffolabi JadeAffolabi commented Nov 2, 2025

Description
This pull request is linked to issue #1576 .

I implemented the numpy version of the Interdependence Score (IDS) from the paper Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores.

New file
The new functions are in the file skrub/_interdependence_score.py.
The main function of the file is _ids_matrix(). It can be called directly.
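A minimal usage sketch (hypothetical data; the call follows the signature shown later in the review):

import numpy as np
import pandas as pd

from skrub._interdependence_score import _ids_matrix

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({"v0": x, "v1": 2 * x + rng.normal(scale=0.1, size=500)})

# Returns a square matrix of scores and, when p_val=True, a second matrix
# holding the associated p-values (None otherwise).
ids, pvals = _ids_matrix(df, k_terms=6, p_val=True, num_tests=100)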

TO DO
If the PR is accepted, I can modify the file skrub/_column_associations.py so that the IDS can be displayed in the output of column_associations, along with the p-values associated with it.

@JadeAffolabi JadeAffolabi reopened this Nov 3, 2025
@rcap107
Member

rcap107 commented Nov 3, 2025

Hi @JadeAffolabi, a few points about this PR.

As I was expecting, this will be a very complex PR that will need several iterations of discussion and implementation, and it will take a long time to review. If your expectation is for this PR to be merged in the short term (e.g., for some kind of assignment...), that will most likely not be the case.

Also, there is no need to open and close the PR multiple times. If you need to re-run the CI, push a new commit to trigger it again.

The PR is missing all tests (which is also why coverage is not passing), so they should be added. You can refer to #1310 for a PR that adds a similar feature, and how it is tested.

Similarly, the PR is missing proper examples in the docstrings, and a corresponding example in the documentation. Again, check #1310 for more info on how this should be done.

So this PR needs a lot more work before it can be merged, and if you happen to need to meet a deadline with it you might want to reconsider working on this. Very often, contributions to open source software can take many months to be merged, and this looks like one that won't be merged anytime soon.

@rcap107 rcap107 changed the title from "Enhancement issue 1576" to "FEAT - Add interdependence score to column associations" Nov 3, 2025
@JadeAffolabi
Author

Hi @rcap107,

Thank you for your feedback. I will keep working on the PR.
This work was indeed part of an assignment, but that doesn't matter; I will do my best to complete the PR.

Have a nice day.

@rcap107
Member

rcap107 commented Nov 5, 2025

Hi @rcap107,

Thank you for your feedback. I will keep working on the PR. This work was indeed part of an assignment, but that doesn't matter; I will do my best to complete the PR.

Have a nice day.

That sounds good! What matters to me is that working on this may take a long time, but if that is ok with you then you are of course more than welcome to continue. Thanks for the effort, and feel free to ask if you need further help.

@JadeAffolabi
Author

Hi @rcap107,
You said that the "PR is missing proper examples in the docstrings, and a corresponding example in the documentation."
It seems you are talking about two types of examples. I assume one is the kind that shows a snippet of code demonstrating how to use the new function, and that it goes in the main function's docstring. But I don't understand what the other type of example is.

I was also wondering if I should add the interdependence score (IDS) to the output of the column_associations function, or if I should make the function that implements the IDS public.

@JadeAffolabi
Author

Hello,
I realized that using the 1-norm or the 2-norm to compute the IDS does not respect the following property: IDS(x, x) = 1, i.e., a variable has perfect dependence with itself. That's why I propose implementing only the infinity norm (I had implemented all three).
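For illustration, the property can be checked directly on the matrix output (a hypothetical check, reusing the df from the sketch in the description):

ids, _ = _ids_matrix(df)
# With the infinity norm, every variable should have perfect dependence
# with itself, so the diagonal of the score matrix should be all ones.
assert np.allclose(np.diag(ids), 1.0)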

What do you think?

@rcap107
Member

rcap107 commented Nov 18, 2025

Hi @rcap107, You said that the "PR is missing proper examples in the docstrings, and a corresponding example in the documentation." It seems you are talking about two types of examples. I assume one is the kind that shows a snippet of code demonstrating how to use the new function, and that it goes in the main function's docstring. But I don't understand what the other type of example is.

In the docstring there should be a snippet of code that shows how you're supposed to call the function and what the result looks like on an example dataset. Then, there should be a script in the "examples" folder that gives some additional context on the operation. #1310 is a very good example to take inspiration from, because it's a very similar feature and has both the docstring examples and the full script to check out.

You can also use this document to find more info on how to write an example.

It should be fairly short, and it's important to have one in the gallery.
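For instance, the Examples section of the docstring could look roughly like this (a sketch only; the printed output is omitted because the exact values depend on the implementation):

Examples
--------
>>> import pandas as pd
>>> from skrub import interdependence_score
>>> df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})
>>> interdependence_score(df)  # doctest: +SKIP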

I was also wondering if I should add the interdependence score (IDS) to the output of the column_associations function, or if I should make the function that implements the IDS public.

The function should be public, and the output should also be added to the Associations tab in the TableReport (although this part can be done in a separate PR).

From a quick look at the code, it seems like the output of the ids function is a square matrix. If that is the case, it should be modified so that it is a dataframe that contains the list of all the associations between the columns. Check out the _column_associations.py file, where the _melt operation is done, to see what I mean.
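As a sketch of that reshaping in plain pandas (the actual PR should mirror what _column_associations.py does; the column names follow the Notes suggestion further down):

import numpy as np
import pandas as pd

cols = ["v0", "v1", "v2"]
ids = np.eye(len(cols))  # stand-in for the square IDS matrix

# Melt the square matrix into one row per (left, right) column pair.
long_format = (
    pd.DataFrame(ids, index=cols, columns=cols)
    .rename_axis("left_column_name")
    .reset_index()
    .melt(
        id_vars="left_column_name",
        var_name="right_column_name",
        value_name="interdependence_score",
    )
)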

@rcap107
Member

rcap107 commented Nov 18, 2025

Small note: this PR should get an entry in the changelog.

@rcap107
Member

rcap107 commented Nov 24, 2025

Hi @JadeAffolabi, I really appreciate the effort you're putting into this PR, especially considering it is quite a complex issue.

Unfortunately, as a result of it being a complex issue, it is also something that takes quite a bit of time to review, to make sure it works as intended and properly implements the original paper.

For your information, I won't be able to allocate the time necessary for a while more (likely, not until January), though if I can I'll try to leave some feedback before then.

@JadeAffolabi
Author

Hello,
No worries, take your time.
If there are any other changes to be made, let me know.

Member

@rcap107 rcap107 left a comment

Here is a first review. As I expected, this will take time and iterations to get right.

Overall, I have two main points I am concerned by:

  • A lot of this code is taken directly from the original repository (https://github.com/aradha/interdependence_scores), and it's not clear to me whether the original author is aware of this, and how we're going to deal with the licenses.
  • The function interdependence_score is fit_transforming a OneHotEncoder, meaning that there is a state that should be kept track of. However, since this function is just meant to give a metric, it may be fine to discard it.

Some other points:

  • For consistency with the code base, the type hints should be removed (see the suggestions below).
  • feature_map_function is never used as a parameter, so it should be removed and _gaussian_feature_map should be used directly instead.

There are some other cosmetic fixes that need to be handled.


np.random.seed(42)
n = 1000

Member

Here it would be better to write explicitly that some of the variables are dependent on each other (v1 depends on v0, etc.), and that we would expect them to have a higher IDS with each other than with the other variables.
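For example, something along these lines (hypothetical definitions, just to illustrate the kind of explicit setup meant here):

rng = np.random.default_rng(42)
n = 1000

v0 = rng.normal(size=n)                      # independent base variable
v1 = 2 * v0 + rng.normal(scale=0.1, size=n)  # linearly dependent on v0
v2 = rng.normal(size=n)
v3 = v2**2 + rng.normal(scale=0.1, size=n)   # non-linearly dependent on v2
# We would expect v1 to have a much higher IDS with v0 (and v3 with v2)
# than with the unrelated variables.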


Its basic idea is to approximate the HSIC (Hilbert Schmidt Independence Criterion)
by computing the first k terms of an infinite dimensional feature map
for the universal gaussian kernel.
Member

Add here a reference to the paper this was taken from
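For context, the truncation relies on the classical explicit feature map of the one-dimensional Gaussian kernel (a sketch of the standard construction, not the PR's actual _gaussian_feature_map):

import numpy as np
from math import factorial

def truncated_gaussian_feature_map(x, k_terms=6, sigma=1.0):
    # exp(-(x - y)**2 / (2 * sigma**2)) equals sum_n phi_n(x) * phi_n(y)
    # with phi_n(x) = exp(-x**2 / (2 * sigma**2)) * x**n / (sigma**n * sqrt(n!));
    # keeping only the first k_terms terms gives a finite approximation.
    x = np.asarray(x, dtype=float)
    envelope = np.exp(-(x**2) / (2 * sigma**2))
    return np.stack(
        [envelope * x**n / (sigma**n * np.sqrt(factorial(n))) for n in range(k_terms)],
        axis=-1,
    )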

plt.tight_layout()
plt.show()
# %%
# First of all each variable have perfect dependence with itself (ids = 1).
Member

Suggested change
# First of all each variable have perfect dependence with itself (ids = 1).
# First of all, each variable has perfect dependence with itself (IDS = 1).

Comment on lines +93 to +94
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
Member

Rewording a bit, formatting variables

Suggested change
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
# Both linearly dependent (``v1`` wrt ``v0``) and non-linearly dependent
# variables (``v3`` and ``v4`` wrt ``v2``) have IDS greater than 0.90, signaling
# a strong dependency on the starting variable.

# First of all each variable have perfect dependence with itself (ids = 1).
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
# The variable v6 which is tanh(v0*v2), has a high score with v0 and v2,
Member

Suggested change
# The variable v6 which is tanh(v0*v2), has a high score with v0 and v2,
# ``v6``, which is defined as ``tanh(v0*v2)``, has a high score with ``v0`` and ``v2``,

Comment on lines +286 to +293
def _ids_matrix(
X: DataFrame,
feature_map_function=_gaussian_feature_map,
k_terms=6,
p_val=False,
num_tests=100,
bandwidth_term=1 / 2,
) -> tuple[np.ndarray, np.ndarray] | tuple[np.ndarray, None]:
Member

Suggested change
def _ids_matrix(
X: DataFrame,
feature_map_function=_gaussian_feature_map,
k_terms=6,
p_val=False,
num_tests=100,
bandwidth_term=1 / 2,
) -> tuple[np.ndarray, np.ndarray] | tuple[np.ndarray, None]:
def _ids_matrix(
X,
feature_map_function=_gaussian_feature_map,
k_terms=6,
p_val=False,
num_tests=100,
bandwidth_term=1 / 2,
):

Comment on lines +344 to +351
def interdependence_score(
X: DataFrame,
k_terms=6,
p_val=False,
num_tests=100,
bandwidth_term=1 / 2,
return_matrix=False,
) -> DataFrame | tuple[DataFrame, DataFrame]:
Member

The type hints should be removed here as well.

Suggested change
def interdependence_score(
X: DataFrame,
k_terms=6,
p_val=False,
num_tests=100,
bandwidth_term=1 / 2,
return_matrix=False,
) -> DataFrame | tuple[DataFrame, DataFrame]:
def interdependence_score(
X,
k_terms=6,
p_val=False,
num_tests=100,
bandwidth_term=1 / 2,
return_matrix=False,
):


Notes
-----
The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'cramer_v', 'pearson_corr']``
Member

Suggested change
The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'cramer_v', 'pearson_corr']``
The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'interdependence_score', 'pvalue']``

ids_table = _join_utils.left_join(
ids_table, pval_table, right_on=on, left_on=on
)
ids_table = drop_columns_with_substring(ids_table, "skrub")
Member

This can be replaced by skrub selectors: add import skrub.selectors as s at the top of the file, then

Suggested change
ids_table = drop_columns_with_substring(ids_table, "skrub")
ids_table = s.select(ids_table, ~s.glob("*skrub*"))

drop_columns_with_substring can be removed

)
ids_table = drop_columns_with_substring(ids_table, "skrub")

ids_table = drop_columns_with_substring(ids_table, "idx")
Member

Same as above: this can also be replaced with skrub selectors.

Suggested change
ids_table = drop_columns_with_substring(ids_table, "idx")
ids_table = s.select(ids_table, ~s.glob("*idx*"))

@@ -0,0 +1,492 @@
"""Detect which columns have strong statistical dependence using InterDependenceScore"""
Member

Please add here a link to the original repository and explain that this is an adaptation of the original implementation so that it can be used in skrub.

@rcap107
Member

rcap107 commented Jan 13, 2026

A couple of additional notes after IRL discussions:

  • A lot of this code is taken directly from the original repository (aradha/interdependence_scores), and it's not clear to me whether the original author is aware of this, and how we're going to deal with the licenses.

Given that the original license is MIT, it should be fine to reuse the code, provided that it gets referenced properly in the code.

  • The function interdependence_score is fit_transforming a OneHotEncoder, meaning that there is a state that should be kept track of. However, since this function is just meant to give a metric, it may be fine to discard it.

Since this is a metric, fitting the OneHotEncoder is not a problem.

@rcap107
Member

rcap107 commented Jan 13, 2026

Hi @JadeAffolabi,
The more I think about this PR, the less convinced I am that it should be merged after all. It's unclear how useful it would be in practice, and besides the comments in my review, there would need to be extensive performance benchmarking. Basically, it's a lot of work left to be done for very unclear gains.

I'm discussing with the other maintainers whether the PR should be merged after all, and for the time being it's better if you don't spend any more time on this particular problem. I should have stopped you before you embarked on this; I apologize for the mess-up. In any case, thank you very much for the effort you put into this.
