fix: replace columns with many nulls with missing indicators #1723
Conversation
```python
def test_missing_indicator_replacement(df_module):
    """Check that columns with too many nulls are replaced with missing indicators."""
    pytest.importorskip("pyarrow")
```
Why was this skip added?
Hi @honghanhh, thanks for the PR. I think it's already in good shape, but I am not fully convinced by this implementation. I don't think we should replace the "drop column" functionality entirely: very often, if the columns in my dataframe are mostly nulls, I might just want to get rid of them, because I don't care about their content at all. I would rather let the user choose between dropping the columns (by selecting the fraction) and replacing null values with the indicator, something like `DropUninformative(drop_if_null_fraction="replace")` as an additional option besides the fraction.

In general, is there a reason for using a threshold to decide which columns should use the missingness indicator, rather than using it everywhere? I am not familiar with its usage, so I don't know how it's normally done.

Paging @jeromedockes as well.
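A hedged sketch of what this dual behavior could look like; `drop_if_null_fraction="replace"` is the hypothetical option floated above, not an existing skrub argument:

```python
from skrub import DropUninformative

# Existing behavior: drop columns whose null fraction exceeds 0.9.
dropper = DropUninformative(drop_null_fraction=0.9)

# Hypothetical API sketch from the suggestion above (not implemented):
# keep a missingness indicator instead of dropping the column.
# replacer = DropUninformative(drop_if_null_fraction="replace")
```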
@rcap107 I see your point and agree. The current implementation doesn't let you drop columns when …

I would propose adding a … where …

What do you think? Let's move this discussion to #1714 and finalize the solution there; then I would love to update this PR.
for columns we keep, typically either …

but here the issue is about the behavior of `DropUninformative`, which allows dropping the column altogether if it contains many nulls. That option can be problematic because it is quite frequent that a column is mostly empty, but the rare event that a value is present is actually very informative. So in that case, I think a good tradeoff can be keeping just the missingness mask: it preserves the information of whether the value was there or not, without wasting many dimensions on encoding values for which we have too few examples to use them effectively, or on a column that might not be useful for prediction. In the `TableVectorizer`, for example, I think it might be a 'safer' default behavior.

I haven't thought it through any further than that.
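A minimal illustration of that tradeoff in plain pandas (column name and variable names are illustrative only):

```python
import pandas as pd

# A mostly-empty column where the rare presence of a value is itself
# informative; keeping only the missingness mask preserves that signal
# as a single float32 feature instead of encoding the raw values.
df = pd.DataFrame({"rare_feature": [None, None, 32.0, None, None, None]})
mask = df["rare_feature"].isna().astype("float32")  # 1.0 = missing, 0.0 = present
```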
Hi @honghanhh, we can continue talking here, no problem (and I apologize if my previous comment sounded rude).

About the strategy, I think we should distinguish between using `DropUninformative` in the `Cleaner` and in the `TableVectorizer`. In the first case (more likely to be encountered by the `Cleaner`), a user might want to use `DropUninformative` to remove the columns directly: the transformer becomes a convenience function for removing columns with a lot of null values. However, this might not be the best solution if we actually want to encode the column (more of a `TableVectorizer` use case), because then the missing values become informative on their own, and for that we should use the missingness indicator. Wouldn't replacing the data with the missingness mask remove information anyway, though?

I am wondering if we should have a transformer that is separate from `DropUninformative`, and that handles encoding columns with missing indicators. In this way, we could keep the functionality of `DropUninformative` clear and limited to "dropping columns we consider uninformative" according to various requirements, and have a new transformer (e.g., a `MissingnessEncoder`) for the encoding.

We could also extend the current selector so that something like this becomes possible:

```python
ApplyToCols(MissingnessEncoder(), cols=s.has_nulls(fraction=0.5))
```

This clearly depends on the interface we're going with, and modifying the selector should be done in a separate PR.

What do you think? @honghanhh @jeromedockes
I agree with your point about the distinction between cleaning / removing poorly-filled columns and encoding / finding a good representation for columns with many nulls. In a way, that reflects the two reasons a field might be NULL: the information is missing due to poor data quality, or it is a true NULL / the field does not apply (e.g., "pool surface" for a house that doesn't have a pool).

Maybe, as you say, the best option is to have a selector for this null-proportion threshold and keep the simple "drop" option in the cleaner …

Note there is no need for a new MissingnessEncoder, as it already exists in …
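The comment above is truncated, so it does not say where the existing encoder lives; if it refers to scikit-learn's `MissingIndicator`, a minimal usage sketch looks like this:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Boolean missingness mask, one indicator column per input feature.
X = np.array([[1.0, np.nan], [np.nan, 3.0], [np.nan, np.nan]])
mask = MissingIndicator(features="all").fit_transform(X)
# mask[i, j] is True where X[i, j] was missing
```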
Fixes #1714
Description
This PR replaces columns with many nulls with missing indicators. Instead of dropping columns that exceed the `drop_null_fraction` threshold, it replaces them with missing-indicator columns (`float32`: `1.0` for missing, `0.0` for present). This preserves the information of whether values were present while avoiding spending feature dimensions on sparse columns.

- Columns whose null fraction exceeds the threshold are replaced with indicators (`drop_null_fraction < 1.0`)
- `drop_null_fraction=1.0` only drops entirely null columns
- Applies to `DropUninformative`, `Cleaner`, and `TableVectorizer`
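A hedged illustration of the described behavior (the outcome noted in the comments reflects this PR's proposal, not released skrub):

```python
import pandas as pd
from skrub import Cleaner

df = pd.DataFrame({
    "mostly_null": [None, None, None, 7.0],
    "full": [1, 2, 3, 4],
})

# With this PR, a column whose null fraction exceeds the threshold
# would become a float32 indicator (1.0 = missing, 0.0 = present)
# instead of being dropped outright.
out = Cleaner(drop_null_fraction=0.5).fit_transform(df)
```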