Add Spark 4.0 support via deequ:2.0.14-spark-4.0 #258

@m-aciek

Description

The upstream deequ library released 2.0.14-spark-4.0 on March 23, 2026, adding official Apache Spark 4.0 support (see awslabs/deequ#676, awslabs/deequ#678).

pydeequ currently does not support Spark 4 because:

  1. SPARK_TO_DEEQU_COORD_MAPPING in pydeequ/configs.py only maps up to Spark 3.5
  2. The PySpark optional dependency in pyproject.toml is capped at <3.4.0
  3. Spark 4 uses Scala 2.13, which removed scala.collection.JavaConversions and changed how Seq.empty is accessed via reflection — both used in pydeequ internals
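
The first gap can be illustrated with a minimal lookup sketch. The mapping name and the 4.0 coordinate come from this issue; `resolve_deequ_coord` is a hypothetical helper written for illustration and is not pydeequ's actual resolution function, and the existing 3.x entries are omitted.

```python
# Sketch only: SPARK_TO_DEEQU_COORD_MAPPING is the name used in this issue;
# resolve_deequ_coord is a hypothetical helper, not pydeequ's real API.
SPARK_TO_DEEQU_COORD_MAPPING = {
    # Entry proposed in this issue; existing 3.x entries are omitted here.
    "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0",
}


def resolve_deequ_coord(spark_version: str) -> str:
    """Map a full Spark version such as '4.0.0' to a Deequ Maven coordinate."""
    # Only the major.minor part is used as the lookup key.
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return SPARK_TO_DEEQU_COORD_MAPPING[major_minor]
    except KeyError:
        # Without a matching entry, no Deequ jar can be selected for that Spark version.
        raise RuntimeError(
            f"No Deequ coordinate known for Spark {major_minor}"
        ) from None
```

With the entry present, `resolve_deequ_coord("4.0.0")` returns the new coordinate; a version with no entry raises `RuntimeError`, which mirrors the situation described in item 1.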

Required changes

  • pydeequ/configs.py: add "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0" to SPARK_TO_DEEQU_COORD_MAPPING
  • pyproject.toml: widen pyspark optional dep from >=2.4.7,<3.4.0 to >=2.4.7,<5.0.0
  • pydeequ/scala_utils.py: replace removed JavaConversions with JavaConverters (iterableAsScalaIterableConverter, mapAsJavaMapConverter)
  • pydeequ/profiles.py: same JavaConversions-to-JavaConverters fix
  • pydeequ/analyzers.py + pydeequ/checks.py: replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13) with an empty java.util.ArrayList converted via to_scala_seq
  • .github/workflows/base.yml: add Spark 4.0.0 to the test matrix with Java 17 (required by Spark 4); use include: matrix style to pair each Spark version with its Java version
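
The Scala 2.13 changes to the conversion helpers might look roughly like the sketch below. It assumes `to_scala_seq` takes a Py4J JVM view and a Java iterable (the signature is an assumption), and `empty_scala_seq` is a hypothetical helper name used only for illustration. It has not been run against a live Spark 4 JVM.

```python
# Sketch only: assumes `jvm` is a Py4J JVM view (e.g. obtained from a SparkSession).
# The to_scala_seq signature is assumed; empty_scala_seq is a hypothetical helper.

def to_scala_seq(jvm, iterable):
    """Convert a Java Iterable into a Scala Seq using JavaConverters."""
    converters = jvm.scala.collection.JavaConverters
    # JavaConverters wraps the iterable in a decorator; asScala() unwraps it
    # into a Scala Iterable, which is then materialized as a Seq.
    return converters.iterableAsScalaIterableConverter(iterable).asScala().toSeq()


def empty_scala_seq(jvm):
    """Build an empty Scala Seq from an empty java.util.ArrayList."""
    # Avoids calling scala.collection.Seq.empty() through Py4J.
    return to_scala_seq(jvm, jvm.java.util.ArrayList())
```

Call sites in analyzers.py and checks.py that currently rely on `Seq.empty()` could then pass the result of such a helper instead.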

Issue authored with assistance from Claude Code
