FEAT - Add interdependence score to column associations #1724
Conversation
Hi @JadeAffolabi, a few points about this PR. As I was expecting, this will be a very complex PR that will need various iterations of discussion and implementation, and will take a long time to review. If your expectation is for this PR to be merged in the short term (e.g., for some kind of assignment...), that will most likely not be the case. Then, there is no need to open and close the PR multiple times. If you need to re-run the CI, you need to push a new commit to trigger the execution of the CI again. The PR is missing all tests (which is also why coverage is not passing), so they should be added. You can refer to #1310 for a PR that adds a similar feature, and how it is tested. Similarly, the PR is missing proper examples in the docstrings, and a relative example in the documentation. Again, check #1310 for more info on how this should be done. So this PR needs a lot more work before it can be merged, and if you happen to need to meet a deadline with it you might want to reconsider working on this. Very often, contributions to open source software can take many months to be merged, and this looks like one that won't be merged anytime soon.
Hi @rcap107, thank you for your feedback. I will keep working on the PR. Have a nice day.
That sounds good! What matters to me is that working on this may take a long time, but if that is ok with you then you are of course more than welcome to continue. Thanks for the effort, and feel free to ask if you need further help.
Hi @rcap107, I was also wondering if I should add the interdependence score (IdS) to the output of the `column_associations` function or if I should make the function that implements the IdS public.
Hello, what do you think?
In the docstring there should be a snippet of code that shows how you're supposed to call the function and what the result looks like on an example dataset. Then, there should be a script in the "examples" folder that gives some additional context on the operation. #1310 is a very good example to take inspiration from because it's a very similar feature and has both the docstring examples and the full script to check out. You can also use this document to find more info on how to write an example. It should be fairly short, and it's important to have one in the gallery.
The function should be public, and the output should also be added to the Associations tab in the TableReport (although this part can be done in a separate PR). From a quick look at the code, it seems like the output of the ids function is a square matrix. If that is the case, it should be modified so that it is a dataframe that contains the list of all the associations between the columns. Check out the
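For illustration, converting a square score matrix into a long-format dataframe with one row per column pair could look like this (a sketch; the column names and matrix values are made up for the example, not taken from the PR):

```python
import numpy as np
import pandas as pd

cols = ["v0", "v1", "v2"]
# Hypothetical symmetric IDS matrix (diagonal = self-dependence).
ids = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

# Flatten the upper triangle into one row per column pair.
i, j = np.triu_indices(len(cols), k=1)
table = pd.DataFrame({
    "left_column_name": [cols[a] for a in i],
    "right_column_name": [cols[b] for b in j],
    "interdependence_score": ids[i, j],
})
```

This yields one row per unordered pair, which matches the long-format layout that `column_associations` produces for its other metrics.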
Small note, this PR should get an entry in the changelog.
…abi/skrub into enhancement-issue-1576 # Conflicts: # skrub/__init__.py
Hi @JadeAffolabi, I really appreciate the effort you're putting in this PR, especially considering it is quite a complex issue. Unfortunately, as a result of it being a complex issue it is also something that takes quite a bit of time to review, to make sure it is working as intended and properly implements the original paper. For your information, I won't be able to allocate the time necessary for a while more (likely, not until January), though if I can I'll try to leave some feedback before then.
Hello, |
rcap107 left a comment
Here is a first review. As I expected, this will take time and iterations to get right.
Overall, I have two main points I am concerned by:
- A lot of this code is taken directly from the original repository (https://github.com/aradha/interdependence_scores), and it's not clear to me whether the original author is aware of this, and how we're going to deal with the licenses.
- The function `interdependence_score` is fit_transforming a `OneHotEncoder`, meaning that there is a state that should be kept track of. However, since this function is just meant to give a metric, it may be fine to discard it.
Some other points:
- For consistency with the code base, I removed the type hints.
- `feature_map_function` is never used as a parameter, so it should be removed and `_gaussian_feature_map` should be used directly instead.
There are some other cosmetic fixes that need to be handled.
```python
np.random.seed(42)
n = 1000
```
Here it would be better to state explicitly that some of the variables depend on each other (v1 depends on v0, etc.), and that we would expect them to have a higher IDS with each other than with the other variables.
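As a sketch of what such an explicit setup could look like (variable names follow the ones discussed in this review; the exact formulas and noise levels are my assumptions, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
v0 = rng.normal(size=n)
v1 = 2.0 * v0 + 0.1 * rng.normal(size=n)    # v1 depends linearly on v0
v2 = rng.normal(size=n)
v3 = np.sin(v2) + 0.1 * rng.normal(size=n)  # v3 depends non-linearly on v2
v5 = rng.normal(size=n)                     # v5 is independent of the others
```

With comments like these in the example, a reader knows in advance which pairs should show a high IDS and which should not.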
> Its basic idea is to approximate the HSIC (Hilbert-Schmidt Independence Criterion)
> by computing the first k terms of an infinite-dimensional feature map
> for the universal Gaussian kernel.
Add here a reference to the paper this was taken from
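For context, the finite truncation mentioned in the docstring can be illustrated with the Taylor feature map of the Gaussian kernel. This is a sketch under my own assumptions about the construction, not the PR's actual `_gaussian_feature_map`; the bandwidth and number of terms are arbitrary:

```python
import math
import numpy as np

def taylor_feature_map(x, k=8, sigma=1.0):
    # phi_n(x) = exp(-x^2 / (2 sigma^2)) * x^n / (sigma^n * sqrt(n!)),
    # so that sum_n phi_n(x) * phi_n(y) -> exp(-(x - y)^2 / (2 sigma^2))
    # as k -> infinity, i.e. the Gaussian (RBF) kernel.
    return np.array([
        math.exp(-x * x / (2 * sigma**2)) * x**n
        / (sigma**n * math.sqrt(math.factorial(n)))
        for n in range(k)
    ])

x, y = 0.3, -0.2
approx = taylor_feature_map(x) @ taylor_feature_map(y)   # truncated kernel value
exact = math.exp(-((x - y) ** 2) / 2)                    # exact Gaussian kernel
```

Keeping only the first `k` terms is what makes the HSIC approximation tractable; the truncation error shrinks factorially in `k`.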
```python
plt.tight_layout()
plt.show()
# %%
# First of all each variable have perfect dependence with itself (ids = 1).
```
Suggested change:
```diff
- # First of all each variable have perfect dependence with itself (ids = 1).
+ # First of all, each variable has perfect dependence with itself (IDS = 1).
```
```python
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
```
Rewording a bit, formatting variables
Suggested change:
```diff
- # The linearly dependent variables (v0, v1) have ids greater than 0.90,
- # and the non-linear variables as well (v2, v3, v4).
+ # Both linearly dependent (``v1`` wrt ``v0``) and non-linearly dependent
+ # variables (``v3`` and ``v4`` wrt ``v2``) have IDS greater than 0.90, signaling
+ # a strong dependency on the starting variable.
```
```python
# First of all each variable have perfect dependence with itself (ids = 1).
# The linearly dependent variables (v0, v1) have ids greater than 0.90,
# and the non-linear variables as well (v2, v3, v4).
# The variable v6 which is tanh(v0*v2), has a high score with v0 and v2,
```
Suggested change:
```diff
- # The variable v6 which is tanh(v0*v2), has a high score with v0 and v2,
+ # ``v6``, which is defined as ``tanh(v0*v2)``, has a high score with ``v0`` and ``v2``,
```
```python
def _ids_matrix(
    X: DataFrame,
    feature_map_function=_gaussian_feature_map,
    k_terms=6,
    p_val=False,
    num_tests=100,
    bandwidth_term=1 / 2,
) -> tuple[np.ndarray, np.ndarray] | tuple[np.ndarray, None]:
```
Suggested change:
```diff
- def _ids_matrix(
-     X: DataFrame,
-     feature_map_function=_gaussian_feature_map,
-     k_terms=6,
-     p_val=False,
-     num_tests=100,
-     bandwidth_term=1 / 2,
- ) -> tuple[np.ndarray, np.ndarray] | tuple[np.ndarray, None]:
+ def _ids_matrix(
+     X,
+     feature_map_function=_gaussian_feature_map,
+     k_terms=6,
+     p_val=False,
+     num_tests=100,
+     bandwidth_term=1 / 2,
+ ):
```
```python
def interdependence_score(
    X: DataFrame,
    k_terms=6,
    p_val=False,
    num_tests=100,
    bandwidth_term=1 / 2,
    return_matrix=False,
) -> DataFrame | tuple[DataFrame, DataFrame]:
```
type hints
Suggested change:
```diff
- def interdependence_score(
-     X: DataFrame,
-     k_terms=6,
-     p_val=False,
-     num_tests=100,
-     bandwidth_term=1 / 2,
-     return_matrix=False,
- ) -> DataFrame | tuple[DataFrame, DataFrame]:
+ def interdependence_score(
+     X,
+     k_terms=6,
+     p_val=False,
+     num_tests=100,
+     bandwidth_term=1 / 2,
+     return_matrix=False,
+ ):
```
```python
Notes
-----
The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'cramer_v', 'pearson_corr']``
```
Suggested change:
```diff
- The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'cramer_v', 'pearson_corr']``
+ The result is a dataframe with columns: ``['left_column_name', 'right_column_name', 'interdependence_score', 'pvalue']``
```
```python
ids_table = _join_utils.left_join(
    ids_table, pval_table, right_on=on, left_on=on
)
ids_table = drop_columns_with_substring(ids_table, "skrub")
```
This can be replaced by skrub selectors. Add `import skrub.selectors as s` at the top of the file, then:
```diff
- ids_table = drop_columns_with_substring(ids_table, "skrub")
+ ids_table = s.select(ids_table, ~s.glob("*skrub*"))
```
`drop_columns_with_substring` can be removed.
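For reference, `s.select` with `~s.glob(...)` keeps every column whose name does not match the pattern. A plain-pandas equivalent of that filtering (illustrative only, not the actual skrub selector call; the dataframe contents are made up):

```python
import pandas as pd

df = pd.DataFrame({"v0": [1.0], "v1": [2.0], "_skrub_idx": [0]})
# Keep only the columns whose name does not contain "skrub",
# mimicking s.select(df, ~s.glob("*skrub*")).
filtered = df.loc[:, [c for c in df.columns if "skrub" not in c]]
```

The selector version has the advantage of working across the dataframe backends skrub supports, rather than being pandas-specific.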
```python
)
ids_table = drop_columns_with_substring(ids_table, "skrub")
```
```python
ids_table = drop_columns_with_substring(ids_table, "idx")
```
Same here as above:
```diff
- ids_table = drop_columns_with_substring(ids_table, "idx")
+ ids_table = s.select(ids_table, ~s.glob("*idx*"))
```
```python
"""Detect which columns have strong statistical dependence using InterDependenceScore"""
```
Please add here a link to the original repository and explain that this is an adaptation of the original implementation so that it can be used in skrub.
A couple of additional notes after IRL discussions:
- Given that the original license is MIT, it should be fine to reuse the code, provided that it gets referenced properly in the code.
- Since this is a metric, fitting the OneHotEncoder is not a problem.
Hi @JadeAffolabi, I'm discussing with other maintainers whether the PR should be merged after all, and for the time being it's better if you don't spend any more time on this particular problem. I should have stopped you before you embarked on this; I apologize for the mess-up. In any case, thank you very much for the effort you put into this.
Description
This pull request is linked to issue #1576.
I implemented the NumPy version of the Interdependence Score (IDS) from the paper *Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores*.
New file
The new functions are in the file `skrub/_interdependence_score.py`. The main function of the file is `_ids_matrix()`. It can be called directly.
TO DO
If the PR is accepted, I can modify the file `skrub/_column_associations.py` so that the IDS can be displayed in the output of `column_associations`, along with the p-values associated with the IDS.