ColumnTransformer horizontally stacks the output #1000

@reinierstorm

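The dask-ml ColumnTransformer appends the outputs of the different transformers as extra rows, padded with NaN, instead of placing them side by side as columns. The following code (essentially the example from #365) produces undesirable output: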

import pandas as pd
import dask.dataframe as dd

import dask_ml.compose
import dask_ml.preprocessing

df = pd.DataFrame({"A": pd.Categorical(["a", "a", "b", "a"]), "B": [1.0, 2, 4, 5]})
ddf = dd.from_pandas(df, npartitions=2).reset_index(drop=True)

ct = dask_ml.compose.ColumnTransformer([
    ("A", dask_ml.preprocessing.OneHotEncoder(dtype="uint8"), ["A"]),  # Example categorical feature
    ("B", dask_ml.preprocessing.RobustScaler(), ["B"]),  # Numeric features
])
ct.fit_transform(ddf).compute()

The output I get is:

   A_a  A_b         B
0  1.0    0       NaN
1  1.0    0       NaN
0    0  1.0       NaN
1  1.0    0       NaN
0  NaN  NaN -1.000000
1  NaN  NaN -0.666667
0  NaN  NaN  0.000000
1  NaN  NaN  0.333333

The output should match that of #365:

   A_a  A_b         B
0  1.0  0.0 -1.000000
1  1.0  0.0 -0.666667
0  0.0  1.0  0.000000
1  1.0  0.0  0.333333
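
The NaN blocks in the actual output look like what a row-wise concatenation of two frames with disjoint columns produces. As a pandas-only sketch of that difference (the frames `encoded` and `scaled` are hypothetical stand-ins for the two transformer outputs, and this is not a claim about what dask-ml does internally):

```python
import pandas as pd

# Hypothetical stand-ins for the two transformer outputs (not produced by dask-ml here).
encoded = pd.DataFrame({"A_a": [1.0, 1.0, 0.0, 1.0], "A_b": [0.0, 0.0, 1.0, 0.0]})
scaled = pd.DataFrame({"B": [-1.0, -0.666667, 0.0, 0.333333]})

# Row-wise concatenation: columns are unioned, missing cells become NaN.
stacked_rows = pd.concat([encoded, scaled], axis=0)

# Column-wise concatenation: rows are aligned on the index, columns sit side by side.
stacked_cols = pd.concat([encoded, scaled], axis=1)

print(stacked_rows)
print(stacked_cols)
```

A quick sanity check on a real result would be to compare its row count with the input's: the output reported above has 8 rows for a 4-row input.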

Environment:

  • dask-ml version: 2024.4.4
  • dask version: 2024.8.1
  • Python version: 3.10.14
  • Operating System: Ubuntu 23.04
  • Install method (conda, pip, source): pip
