Description
The upstream deequ library released 2.0.14-spark-4.0 on March 23, 2026, adding official Apache Spark 4.0 support (see awslabs/deequ#676, awslabs/deequ#678).
pydeequ currently does not support Spark 4 because:
- `SPARK_TO_DEEQU_COORD_MAPPING` in `pydeequ/configs.py` only maps up to Spark 3.5
- The PySpark optional dependency in `pyproject.toml` is capped at `<3.4.0`
- Spark 4 uses Scala 2.13, which removed `scala.collection.JavaConversions` and changed how `Seq.empty` is accessed via reflection; both are used in pydeequ internals
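To make the first two blockers concrete, here is a minimal, self-contained sketch. It is not pydeequ's actual code: `resolve_deequ_coord` and `in_range` are hypothetical helpers, and the Spark 3.5 coordinate is a placeholder. Only the `"4.0"` entry and the version bounds come from this issue.

```python
# Illustrative sketch only; not copied from pydeequ/configs.py.
# The 3.5 value is a placeholder, the 4.0 value is the artifact named in this issue.
MAPPING_BEFORE = {"3.5": "<existing deequ coordinate for Spark 3.5>"}
MAPPING_AFTER = {**MAPPING_BEFORE, "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0"}


def resolve_deequ_coord(spark_version, mapping):
    """Hypothetical helper: look up a Maven coordinate by Spark major.minor."""
    key = ".".join(spark_version.split(".")[:2])
    if key not in mapping:
        raise RuntimeError(f"No deequ artifact mapped for Spark {spark_version}")
    return mapping[key]


def version_tuple(version):
    """Turn '4.0.0' into (4, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower, upper):
    """Hypothetical helper: True if lower <= version < upper."""
    return version_tuple(lower) <= version_tuple(version) < version_tuple(upper)


if __name__ == "__main__":
    # Spark 4.0.0 falls outside the current cap but inside the proposed one.
    print(in_range("4.0.0", "2.4.7", "3.4.0"))
    print(in_range("4.0.0", "2.4.7", "5.0.0"))
    print(resolve_deequ_coord("4.0.0", MAPPING_AFTER))
```

With a mapping that stops at 3.5, the lookup for `4.0.0` raises instead of returning a coordinate, which is the kind of failure a Spark 4 user would hit before any Scala-level incompatibility comes into play.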
Required changes
- `pydeequ/configs.py`: add `"4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0"` to `SPARK_TO_DEEQU_COORD_MAPPING`
- `pyproject.toml`: widen pyspark optional dep from `>=2.4.7,<3.4.0` to `>=2.4.7,<5.0.0`
- `pydeequ/scala_utils.py`: replace removed `JavaConversions` with `JavaConverters` (`iterableAsScalaIterableConverter`, `mapAsJavaMapConverter`)
- `pydeequ/profiles.py`: same `JavaConversions` → `JavaConverters` fix
- `pydeequ/analyzers.py` + `pydeequ/checks.py`: replace `scala.collection.Seq.empty()` (inaccessible via Py4J in Scala 2.13) with an empty `java.util.ArrayList` converted via `to_scala_seq`
- `.github/workflows/base.yml`: add Spark 4.0.0 to the test matrix with Java 17 (required by Spark 4); use `include:` matrix style to pair each Spark version with its Java version
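The `JavaConverters` and empty-`Seq` changes could look roughly like the sketch below. It rests on several assumptions: `jvm` is a Py4J JVM view such as `spark._jvm`, the function bodies are a guess at the shape of the fix rather than the code in the PR, and `scala_map_to_java_map` and `empty_scala_seq` are hypothetical names. `to_scala_seq` is the helper named in the list above.

```python
# Sketch of Py4J helpers that avoid the removed scala.collection.JavaConversions.
# Assumption: `jvm` is a Py4J JVMView (for example spark._jvm in a live session).


def to_scala_seq(jvm, iterable):
    """Convert a Java iterable to a Scala Seq through JavaConverters."""
    converters = jvm.scala.collection.JavaConverters
    return converters.iterableAsScalaIterableConverter(iterable).asScala().toSeq()


def scala_map_to_java_map(jvm, scala_map):
    """Hypothetical helper: convert a Scala Map to a java.util.Map."""
    converters = jvm.scala.collection.JavaConverters
    return converters.mapAsJavaMapConverter(scala_map).asJava()


def empty_scala_seq(jvm):
    """Hypothetical helper: build an empty Scala Seq without touching Seq.empty.

    An empty java.util.ArrayList is created on the JVM side and converted,
    which sidesteps the reflection problem described for Scala 2.13.
    """
    return to_scala_seq(jvm, jvm.java.util.ArrayList())
```

In a live session these would be called as, for example, `empty_scala_seq(spark._jvm)`. Routing every conversion through one helper module keeps the Scala-version-sensitive calls in a single place, which should make a future Scala upgrade easier to absorb.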
Additional context
- deequ Spark 4 feature request: [FEATURE] Support for Spark 4 deequ#670
- Maven artifact: `com.amazon.deequ:deequ:2.0.14-spark-4.0`
- Spark 4 requires Java 17 (not Java 11 used for Spark 3.x)
- A working implementation with all 99 tests passing on Spark 4.0.0 is available as a PR
Issue authored with assistance from Claude Code