feat(spark): add some date functions #373
6b061aa to
47d020a
Compare
Blizzara
left a comment
There was a problem hiding this comment.
Thanks! Makes sense for the extract parts, but for the DateAdd and DateSub I don't think the signature is correct (output type would be wrong). I think we should instead add the correct signatures.
}
}

private def validateOutputType(
There was a problem hiding this comment.
I'd just get rid of these calls, maybe leave a TODO in place instead
}

// spotless:off
val failingSQL: Set[String] = Set(

val opTypesStr = operands.map {
  case e: SExpression => e.getType.accept(ToTypeString.INSTANCE)
  case t: Type => t.accept(ToTypeString.INSTANCE)
There was a problem hiding this comment.
do you know if we have any tests for this (t: Type) case?
There was a problem hiding this comment.
Yes, this gets driven by the last test in DateTimeSuite.scala and by three of the TPC-H tests (those that extract the year component, which was previously handled by an internal definition in spark.yaml).
There was a problem hiding this comment.
Don't they use the EnumArg case, not the Type case?
There was a problem hiding this comment.
Yes, sorry. The Type case is there for completeness. As far as I can see, there are no currently supported functions that use type arguments. I suppose we could throw an unsupported exception in this case, if you'd prefer.
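The "throw an unsupported exception" option could look roughly like the sketch below. It uses a simplified, hypothetical operand model (`ExprOperand`, `EnumOperand`, `TypeOperand`, and the `"enum"` placeholder string are all assumptions made for illustration), not the actual Substrait classes matched in the diff:

```scala
object OperandTypeStrings {
  // Hypothetical, simplified operand model; the real code matches on
  // Substrait's expression, type and enum-argument classes instead.
  sealed trait Operand
  final case class ExprOperand(typeString: String) extends Operand
  final case class EnumOperand(value: String) extends Operand
  final case class TypeOperand(typeString: String) extends Operand

  // Returns the string used to build a function signature for one operand.
  def typeString(op: Operand): String = op match {
    case ExprOperand(ts) => ts
    case EnumOperand(_)  => "enum" // placeholder, not the real signature token
    case TypeOperand(ts) =>
      // No supported function takes a type argument yet, so fail loudly
      // rather than keep an untested branch.
      throw new UnsupportedOperationException(s"Type arguments are not supported: $ts")
  }
}
```

Failing fast here trades completeness for the guarantee that every reachable branch is covered by a test.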
}

def unapply(name_args: (String, Seq[Expression])): Option[Expression] = name_args match {
  case ("add:date_iday", Seq(startDate, i @ Literal(_, DayTimeIntervalType.DEFAULT))) =>
There was a problem hiding this comment.
I don't think these are correct: add:date_iday has a return type of timestamp, while Spark returns date. We should instead add new definitions, either in Substrait itself or in spark.yaml here, for add:date_i32 -> date.
There was a problem hiding this comment.
Shouldn't we try to ensure that as many of the default Substrait functions as possible are mapped to Spark, and try to keep the Spark-specific mappings as minimal as possible? Especially for the Substrait -> Spark direction?
There was a problem hiding this comment.
When they match, yes, but in this case they don't. In Substrait you can give here an interval like 1d 12h, so if the date is let's say today 2025-04-09, the result should be 2025-04-10T12:00:00, while this in Spark would return 2025-04-10.
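To make the mismatch concrete, here is a minimal sketch using only `java.time`, not the Substrait or Spark APIs. The names `substraitStyleAdd` and `sparkStyleDateAdd` are hypothetical stand-ins for the two semantics described above:

```scala
import java.time.{Duration, LocalDate, LocalDateTime}

object DateAddSemantics {
  // Hypothetical stand-in for add:date_iday semantics:
  // date + day-time interval yields a timestamp.
  def substraitStyleAdd(date: LocalDate, interval: Duration): LocalDateTime =
    date.atStartOfDay().plus(interval)

  // Hypothetical stand-in for Spark's DateAdd semantics:
  // date + whole number of days yields a date.
  def sparkStyleDateAdd(date: LocalDate, days: Int): LocalDate =
    date.plusDays(days.toLong)

  val today: LocalDate = LocalDate.of(2025, 4, 9)
  val interval: Duration = Duration.ofDays(1).plusHours(12) // 1d 12h

  // Keeps the 12 hours: 2025-04-10T12:00
  val asTimestamp: LocalDateTime = substraitStyleAdd(today, interval)

  // Only whole days survive: 2025-04-10
  val asDate: LocalDate = sparkStyleDateAdd(today, interval.toDays().toInt)
}
```

The two results differ both in type and in value, which is why reusing the existing signature for Spark's date-returning expression would be incorrect.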
There was a problem hiding this comment.
So we agree we would need to ensure that for this mapping the behavior in Spark matches the behavior in Substrait for the Substrait -> Spark direction?
I do understand that when we talk about the other direction Spark -> Substrait that we should make sure we are not losing something that can be expressed in Spark when mapping to Substrait.
I'm just a bit concerned that if we keep adding custom extensions for the substrait-spark mappings, we lose some of the utility of having a common format for query plans independent of engines.
There was a problem hiding this comment.
> would need to ensure that for this mapping the behavior in Spark matches the behavior in Substrait for the Substrait -> Spark direction?

Yes, which is currently not the case.

> I do understand that when we talk about the other direction Spark -> Substrait that we should make sure we are not losing something that can be expressed in Spark when mapping to Substrait.

In this case the Spark expression is more limited (it takes a number of days, while Substrait takes an interval), so nothing is lost. However, the return type is wrong: the Spark expr returns date, while the Substrait definition claims a timestamp.

> I'm just a bit concerned if we keep adding custom extensions for the substrait-spark mappings that we are losing some of the utility of having a common format for query plans independent of engines.
Yeah, that's tricky. It's a nice goal, and worth going for when possible, but it just isn't always doable (or at least not easy) since the engines support different things and in different ways.
Internally, we've found it most useful to map whatever maps nicely onto Substrait extensions, and for the rest do Spark-specific mappings (and then replicate those in other engines we want to use). Not saying that's what this repo should do necessarily.
One way to solve this could be to map add:date_iday into Add(Cast(X as TimestampNTZ), Y), I guess that should match for types.
There was a problem hiding this comment.
To move this forward (we don't use an internal fork), I've added add:date_i32 to spark.yaml for now. I'll work separately on pushing this down into the core substrait library.
import org.apache.spark.sql.catalyst.expressions.{LeafExpression, Unevaluable}
import org.apache.spark.sql.types.{DataType, NullType}

case class Enum(value: String) extends LeafExpression with Unevaluable {
There was a problem hiding this comment.
makes sense, mind adding a docstring to explain what this is for though?
Util
  .seqToOption(children.map(translateUp))
  .seqToOption(children.map {
    case Enum(value) => Some(ImmutableEnumArg.builder.value(Optional.of(value)).build)
There was a problem hiding this comment.
nit: if EnumArg doesn't have the builder on itself, you could add it there so that you could do just EnumArg.builder...
  case _ => None
}
val tz =
  if (Cast.needsTimeZone(childExp.dataType, tt))
FWIW, in our fork I've done something similar, but maybe even more generic still (yeah, I should pull it upstream too..): this then allows you to do both simple and more complex mapping logic:
The date/time functions in Spark don’t map directly to the Substrait equivalents. E.g.
- `date ± interval-days` is handled by the `DateAdd` & `DateSub` functions in Spark, but as a variant of the arithmetic `add` function in Substrait.
- The date/time component extraction functions are all handled by different functions in Spark, but by a single `extract` function in Substrait with an `enum` argument to specify which component.

Neither of these could be handled using the existing function mapping capabilities in the `spark` module. This commit extends this capability so that it can now handle these two scenarios in (I hope) a generic way.

I’ve added a few variants of the `extract` function - more can follow.

Adding this will give us a 100% pass rate for all the TPC-DS queries. The README is updated accordingly.
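As an illustration of the second scenario, the sketch below shows how a single `extract`-style entry point with an enum argument can dispatch to per-component functions. It is a toy model built on `java.time`; `ExtractSketch`, `componentFunctions`, and the enum strings are hypothetical and are not the mapping code added in this PR:

```scala
import java.time.LocalDate

object ExtractSketch {
  // Hypothetical table: enum value -> per-component accessor, mirroring
  // the "one function per component" style on the Spark side.
  val componentFunctions: Map[String, LocalDate => Int] = Map(
    "YEAR"  -> ((d: LocalDate) => d.getYear()),
    "MONTH" -> ((d: LocalDate) => d.getMonthValue()),
    "DAY"   -> ((d: LocalDate) => d.getDayOfMonth())
  )

  // Single entry point taking the component as an enum-like argument.
  // Returns None for components that have no mapping yet.
  def extract(component: String, date: LocalDate): Option[Int] =
    componentFunctions.get(component).map(f => f(date))
}
```

Returning `None` for an unmapped component mirrors the idea that more `extract` variants can be added later without changing the dispatch logic.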
47d020a to
13db2bc
Compare