ColumnTransformer does not work with Dask dataframes #993

@nprihodko

Describe the issue:

dask_ml.compose.ColumnTransformer does not work with objects of types dask_expr._collection.DataFrame or dask.dataframe.core.DataFrame.

Minimal Complete Verifiable Example:

import numpy as np
import pandas as pd
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler

import dask.dataframe as dd
from dask.distributed import Client

client = Client()

# Create a sample dataframe
df = pd.DataFrame({"A": np.random.rand(1000)})
ddf = dd.from_pandas(df, npartitions=2)

ColumnTransformer, specifying the columns using strings:

scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), ["A"])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)  # or scaler.fit_transform(ddf.to_legacy_dataframe())

Out:

ValueError: Specifying the columns using strings is only supported for dataframes.

ColumnTransformer, specifying the columns using integers:

scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), [0])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)  # or scaler.fit_transform(ddf.to_legacy_dataframe())

Out:

AttributeError: 'DataFrame' object has no attribute 'take'

Anything else we need to know?:

Passing a pandas DataFrame instead, i.e.

scaler.fit_transform(ddf.compute())

works as expected.

This could be related to #962 and #887. If it is indeed the same issue, and there are no plans to fix it in the foreseeable future, would it be better to remove ColumnTransformer from the Dask ML API?

Environment:

  • Dask version: 2024.4.1
  • Dask ML version: 2024.4.1
  • Scikit-learn version: 1.4.0
  • Python version: 3.10.13
  • Operating System: macOS
  • Install method (conda, pip, source): pip
