
Add Spark 4.0 support via deequ:2.0.14-spark-4.0 #259

Open

m-aciek wants to merge 1 commit into awslabs:master from m-aciek:spark-4-support

Conversation


m-aciek commented Mar 26, 2026

Closes #258

Summary

  • Add "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0" to SPARK_TO_DEEQU_COORD_MAPPING in configs.py
  • Widen pyspark optional dep from >=2.4.7,<3.4.0 to >=2.4.7,<5.0.0 in pyproject.toml
  • Replace scala.collection.JavaConversions (removed in Scala 2.13) with JavaConverters in scala_utils.py and profiles.py
  • Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13) with an empty Java list converted via to_scala_seq in analyzers.py and checks.py
  • Add Spark 4.0.0 to the CI matrix with Java 17; restructure matrix to use include: style so each Spark version carries its required Java version

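The new mapping entry can be sketched as a version-to-coordinate lookup. This is a minimal illustration: only the "4.0" coordinate comes from this PR; the "3.5" entry and the `deequ_maven_coord` helper are assumptions for the example, not pydeequ's exact code in configs.py.

```python
# Illustrative subset of SPARK_TO_DEEQU_COORD_MAPPING; only the "4.0"
# coordinate is taken from this PR, the "3.5" entry is a placeholder.
SPARK_TO_DEEQU_COORD_MAPPING = {
    "3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",   # illustrative
    "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0",  # added by this PR
}

def deequ_maven_coord(spark_version: str) -> str:
    """Resolve a full Spark version (e.g. "4.0.0") to a deequ Maven
    coordinate by matching on its major.minor prefix."""
    key = ".".join(spark_version.split(".")[:2])
    if key not in SPARK_TO_DEEQU_COORD_MAPPING:
        raise RuntimeError(f"No deequ release found for Spark {spark_version}")
    return SPARK_TO_DEEQU_COORD_MAPPING[key]
```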
Root causes fixed

Spark 4 uses Scala 2.13, which introduced two breaking changes affecting pydeequ:

  1. scala.collection.JavaConversions was removed — replaced by JavaConverters with explicit .asScala()/.asJava() calls
  2. scala.collection.Seq.empty() is not accessible via Py4J reflection — replaced with to_scala_seq(jvm, jvm.java.util.ArrayList()) which constructs an empty Scala Seq via the already-fixed converter

Test plan

  • All 99 existing tests pass with SPARK_VERSION=4.0.0 / pyspark==4.0.0
  • CI matrix extended to cover Spark 4.0.0 with Java 17
  • Existing Spark 3.x matrix entries unchanged
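An include:-style matrix pairing each Spark version with its required Java version might look like the fragment below (illustrative only; the key names and version list are assumptions, not the PR's actual workflow file):

```yaml
# Illustrative GitHub Actions fragment; not the PR's exact workflow.
strategy:
  matrix:
    include:
      - spark: "3.5.0"
        java: "11"
      - spark: "4.0.0"
        java: "17"   # Spark 4.0 requires Java 17
```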

PR authored with assistance from Claude Code

- Add "4.0" entry to SPARK_TO_DEEQU_COORD_MAPPING in configs.py
- Widen pyspark optional dep bound to <5.0.0 in pyproject.toml
- Replace scala.collection.JavaConversions (removed in Scala 2.13) with
  JavaConverters in scala_utils.py and profiles.py
- Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13)
  with to_scala_seq(jvm, jvm.java.util.ArrayList()) in analyzers.py and checks.py
- Add Spark 4.0.0 to CI matrix with Java 17; use include: style to pair
  each Spark version with its required Java version

Fixes awslabs#258

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>