fix(isthmus): handle subqueries with outer field references#426
Conversation
|
For the sake of reviewability could I ask you split this into 3-ish PRs. One for your initial cleanups/improvements, one for just the visitor changes, and one for subquery changes. |
Sure, I already thought that this might be necessary. |
|
First extracted PR for the visitor changes: #427 |
7b6ef99 to
764206a
Compare
|
I removed the cleanup from this PR and rebased it on #427 |
a2f0c8e to
86545b6
Compare
|
updated to use the new |
vbarua
left a comment
There was a problem hiding this comment.
Did a very cursory first pass.
Something that I think would be helpful would be some tests based directly on Substrait plans, and not just the TPC based tests. This would help contributors, myself included, understand what exactly the structure of these subqueries looks like in Substrait, and how your processing is being applied on top of it.
vbarua
left a comment
There was a problem hiding this comment.
Did a very cursory first pass.
Something that I think would be helpful would be some tests based directly on Substrait plans, and not just the TPC based tests. This would help contributors, myself included, understand what exactly the structure of these subqueries looks like in Substrait, and how your processing is being applied on top of it.
will spend some time on tests towards end of the week when I have some spare time |
f574de2 to
1e09ea6
Compare
8d8efb8 to
8b771b9
Compare
24ae01c to
cbd7e88
Compare
vbarua
left a comment
There was a problem hiding this comment.
Now that we've merged the precursor PR, this is much easier to look at. Your comments and tests help a lot as well.
We've only have tests that go from Substrait -> Calcite. Does this work going the other way? Not something that I think needs to happen as part of this PR, but possibly something to document as an issue.
To your points about
Subqueries are an expression element so they only appear within relations that are configured using expressions e.g. FilterRel, ProjectRel, JoinRel.
Calcite only supports providing a variablesSet for Project, Filter and Join if you search for variablesSet in the Calcite RelBuilder.
SortRels also contain expressions which could contain subqueries. Would the conversion just blow up now if we encountered that?
AggregateRels also contains expressions, but a because of the https://github.com/substrait-io/substrait-java/blob/main/isthmus/src/main/java/io/substrait/isthmus/PreCalciteAggregateValidator.java that may Just Work (TM) because they would get moved into a Project before the Aggregate.
I don't expect subqueries in those relations to be that common, but it might be worth a quick check to see what happens and filing a follow-up issue.
| * FROM orders | ||
| */ | ||
| final Rel root = | ||
| substraitBuilder.project( |
There was a problem hiding this comment.
minor: I often alias substraitBuilder to b in order to minimize it's presence when building plans like this and making them easier to read.
| public class TpcdsQueryTest extends PlanTestBase { | ||
| private static final Set<Integer> toSubstraitExclusions = Set.of(9, 27, 36, 70, 86); | ||
| private static final Set<Integer> fromSubstraitExclusions = Set.of(6, 8, 67); | ||
| private static final Set<Integer> fromSubstraitExclusions = Set.of(1, 8, 30, 67, 81); |
There was a problem hiding this comment.
Did 1, 30 and 81 work before, and now they don't?
There was a problem hiding this comment.
They did convert without throwing an exception because they were converted incorrectly. Those queries also have subqueries with outer field references which previously were not converted to outer field references (correlation variables) but as direct field references and they just happened to not throw an exception during conversion.
If you look at e.g. TPC-DS query 1:
with customer_total_return as
(select
sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,sum(SR_RETURN_AMT_INC_TAX) as ctr_total_return
from store_returns
,date_dim
where sr_returned_date_sk = d_date_sk
and d_year =2001
group by sr_customer_sk
,sr_store_sk)
select c_customer_id
from customer_total_return ctr1
,store
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 <--- has an aggregate
from customer_total_return ctr2
where ctr1.ctr_store_sk = ctr2.ctr_store_sk) <--- filter has an outer reference to ctr1
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'distmember(fips_county, 61, 3)'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
LIMIT 100Calcite explain after going from Substrait back to Calcite:
LogicalProject(C_CUSTOMER_ID=[$0])
LogicalSort(sort0=[$0], dir0=[ASC], fetch=[100])
LogicalProject(C_CUSTOMER_ID0=[$33])
LogicalFilter(condition=[AND(>(CAST($2):DECIMAL(9, 3), $SCALAR_QUERY({
LogicalProject($f1=[*($0, 1.2:DECIMAL(2, 1))])
LogicalAggregate(group=[{}], agg#0=[AVG($0)])
LogicalProject($f20=[$2])
LogicalFilter(condition=[=($cor0.SR_STORE_SK0, $1)]) <-- LogicalFilter with correl var
LogicalAggregate(group=[{0, 1}], agg#0=[SUM($2)]) <-- directly on top of LogicalAggregate
LogicalProject(SR_CUSTOMER_SK0=[$3], SR_STORE_SK0=[$7], SR_RETURN_AMT_INC_TAX0=[$13])
LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
LogicalJoin(condition=[true], joinType=[inner])
LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
LogicalTableScan(table=[[tpcds, DATE_DIM]])
})), =($3, $1), =(CAST($27):VARCHAR(30) NOT NULL, 'distmember(fips_county, 61, 3)'), =($0, $32))], variablesSet=[[$cor0]])
LogicalJoin(condition=[true], joinType=[inner])
LogicalJoin(condition=[true], joinType=[inner])
LogicalAggregate(group=[{0, 1}], agg#0=[SUM($2)])
LogicalProject(SR_CUSTOMER_SK0=[$3], SR_STORE_SK0=[$7], SR_RETURN_AMT_INC_TAX0=[$13])
LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
LogicalJoin(condition=[true], joinType=[inner])
LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
LogicalTableScan(table=[[tpcds, DATE_DIM]])
LogicalTableScan(table=[[tpcds, STORE]])
LogicalTableScan(table=[[tpcds, CUSTOMER]])
This is hitting some bug in Calcite since the queries turn out to have a LogicalFilter directly on top of a LogicalAggregate where the filter now contains a correlation variable and Calcite has special handling of a LogicalFilter on top of a LogicalAggregate which treats the filter as a HAVING clause when converting Rel to SQL. For some reason unknown to me right now the org.apache.calcite.rel.rel2sql.SqlImplementor$Context used here is not a org.apache.calcite.rel.rel2sql.SqlImplementor$AliasContext but a context which has no implementation of the getAliasContext() method. I was planning do dig into the Calcite fix separately.
Caused by: java.lang.UnsupportedOperationException
at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.getAliasContext(SqlImplementor.java:1074)
at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.toSql(SqlImplementor.java:682)
at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.toSql(SqlImplementor.java:1196)
at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.callToSql(SqlImplementor.java:880)
at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.toSql(SqlImplementor.java:853)
at org.apache.calcite.rel.rel2sql.RelToSqlConverter.visit(RelToSqlConverter.java:477)
...
I did check whether the Calcite RelNode structure produced by SqlToRelNodeConverter can be translated back to SQL and it does. It only breaks when we go through Substrait and the reason seems to be the missing intermediate field names are leading to missing intermediate LogicalProject rels compared to the Calcite explain before the Substrait conversion:
LogicalProject(C_CUSTOMER_ID=[$0])
LogicalSort(sort0=[$0], dir0=[ASC], fetch=[100])
LogicalProject(C_CUSTOMER_ID=[$33])
LogicalFilter(condition=[AND(>(CAST($2):DECIMAL(9, 3), $SCALAR_QUERY({
LogicalProject(EXPR$0=[*($0, 1.2:DECIMAL(2, 1))])
LogicalAggregate(group=[{}], agg#0=[AVG($0)])
LogicalProject(CTR_TOTAL_RETURN=[$2])
LogicalFilter(condition=[=($cor0.CTR_STORE_SK, $1)]) <-- HERE is our LogicalFilter
LogicalProject(CTR_CUSTOMER_SK=[$0], CTR_STORE_SK=[$1], CTR_TOTAL_RETURN=[$2])
LogicalAggregate(group=[{0, 1}], CTR_TOTAL_RETURN=[SUM($2)]) <-- HERE is the LogicalAggregate
LogicalProject(SR_CUSTOMER_SK=[$3], SR_STORE_SK=[$7], SR_RETURN_AMT_INC_TAX=[$13])
LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
LogicalJoin(condition=[true], joinType=[inner])
LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
LogicalTableScan(table=[[tpcds, DATE_DIM]])
})), =($3, $1), =(CAST($27):VARCHAR(30) NOT NULL, 'distmember(fips_county, 61, 3)'), =($0, $32))], variablesSet=[[$cor0]])
LogicalJoin(condition=[true], joinType=[inner])
LogicalJoin(condition=[true], joinType=[inner])
LogicalProject(CTR_CUSTOMER_SK=[$0], CTR_STORE_SK=[$1], CTR_TOTAL_RETURN=[$2])
LogicalAggregate(group=[{0, 1}], CTR_TOTAL_RETURN=[SUM($2)])
LogicalProject(SR_CUSTOMER_SK=[$3], SR_STORE_SK=[$7], SR_RETURN_AMT_INC_TAX=[$13])
LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
LogicalJoin(condition=[true], joinType=[inner])
LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
LogicalTableScan(table=[[tpcds, DATE_DIM]])
LogicalTableScan(table=[[tpcds, STORE]])
LogicalTableScan(table=[[tpcds, CUSTOMER]])
For TPC-H query 17 which works now the Calcite explain coming from Substrait looks like this. It has a different pattern not hitting the Calcite bug:
LogicalProject(AVG_YEARLY=[$0])
LogicalProject($f1=[/($0, 7.0:DECIMAL(2, 1))])
LogicalAggregate(group=[{}], agg#0=[SUM($0)])
LogicalProject(L_EXTENDEDPRICE0=[$5])
LogicalFilter(condition=[AND(=($16, $1), =(CAST($19):VARCHAR(10), 'Brand#13'), =(CAST($22):VARCHAR(10), 'JUMBO CAN'), <($4, CAST($SCALAR_QUERY({
LogicalProject($f1=[*(0.2:DECIMAL(2, 1), $0)])
LogicalAggregate(group=[{}], agg#0=[AVG($0)]) <-- LogicalAggregate indirectly on top of
LogicalProject(L_QUANTITY0=[$4])
LogicalFilter(condition=[=($1, $cor0.P_PARTKEY)]) <-- LogicalFilter with correl var
LogicalTableScan(table=[[LINEITEM]])
})):DECIMAL(38, 0)))], variablesSet=[[$cor0]])
LogicalJoin(condition=[true], joinType=[inner])
LogicalTableScan(table=[[LINEITEM]])
LogicalTableScan(table=[[PART]])
There was a problem hiding this comment.
the Calcite fix has been merged meanwhile so we should get the TPC-DS queries working again with the next Calcite release
There was a problem hiding this comment.
but as direct field references and they just happened to not throw an exception during conversion.
Ah. Classic. Thanks for digging into this 🙇
| RelNode node = relBuilder.push(input).filter(filterCondition).build(); | ||
| RelNode node = | ||
| relBuilder.push(input).filter(context.popCorrelationIds(), filterCondition).build(); | ||
| context.popParentRelNodes(); |
There was a problem hiding this comment.
This might be clearer if we called pushParentRelNodes something like pushExpressionBase* and added some docs to the method to indicate that this is the Calcite relation on which the expressions we are converting are built**, and to which subquery references are resolved.
The converted input is a parent of relations in any subqueries we might encounter when converting expressions. But that's the uncommon case, and more commonly I think people will see this and think of it as the parent of expression, which is what I did at first and found somewhat confusing. The name also made me think that we had to push it parent in every relation.
This also comes up when looking at the join which has context.pushParentRelNodes(left, right) which was weird in a "how can these expression have 2 parents" kind of way, and when we call getParentRelation and get multiple parents it's not immediately clear that these are 2 parents at the same level vs a chain of parents.
* I don't know if this is actually a better name, I just find the use of "Parent" here confusing.
** The work you're doing here will actually provide a mechanism to verify the agreement of Calcite and Substrait type inference because we can extract the Calcite schema from the context #379.
There was a problem hiding this comment.
I changed the code by moving the RangeMap creation into the Context class. We don't need the full outer input relations but only the row types so I changed the method name to pushOuterRowType which is more accurate. I also added javadoc comments for all the methods in the Context class which should make it a bit easier to understand what's going on.
vbarua
left a comment
There was a problem hiding this comment.
Now that we've merged the precursor PR, this is much easier to look at. Your comments and tests help a lot as well.
We've only have tests that go from Substrait -> Calcite. Does this work going the other way? Not something that I think needs to happen as part of this PR, but possibly something to document as an issue.
To your points about
Subqueries are an expression element so they only appear within relations that are configured using expressions e.g. FilterRel, ProjectRel, JoinRel.
Calcite only supports providing a variablesSet for Project, Filter and Join if you search for variablesSet in the Calcite RelBuilder.
SortRels also contain expressions which could contain subqueries. Would the conversion just blow up now if we encountered that?
AggregateRels also contains expressions, but a because of the https://github.com/substrait-io/substrait-java/blob/main/isthmus/src/main/java/io/substrait/isthmus/PreCalciteAggregateValidator.java that may Just Work (TM) because they would get moved into a Project before the Aggregate.
I don't expect subqueries in those relations to be that common, but it might be worth a quick check to see what happens and filing a follow-up issue.
we have test coverage since we round trip for the TPC-H / TPC-DS queries but we can also add unit tests from Calcite to Substrait |
Good question. My goal was to get to the bottom of why TPC-H query 17 didn't convert back to Calcite in order to wrap up converting all 22 TPC-H queries successfully. Turns out I opened a huge can of worms which I would like to address step by step. In my opinion this PR is a step in the right direction but definitely not a fix-it-all. |
4b20605 to
4707efc
Compare
vbarua
left a comment
There was a problem hiding this comment.
Once thing I would like changed before merging is the name of popParentRelNodes, which should probably be popOuterRowType. But other than that, these changes look good. As you point out
this PR is a step in the right direction but definitely not a fix-it-all.
and I'm happy to have this merged at this point and iterate upon it in future PRs.
| * <p>Row types are stored as a {@link RangeMap} with field indices as keys and the {@link | ||
| * RelDataType} row type containing the field at the field index by continuously numbering the | ||
| * field indices from 0 across all provided row types in the order the row types are passed as | ||
| * arguments. |
There was a problem hiding this comment.
The RangeMap is super nice for this ✨
| * | ||
| * @param inputs the row types to add | ||
| */ | ||
| public void pushOuterRowType(final RelDataType... inputs) { |
There was a problem hiding this comment.
pushOuterRowType is a good name for this. We should update the popParentRelNodes to match.
| public class TpcdsQueryTest extends PlanTestBase { | ||
| private static final Set<Integer> toSubstraitExclusions = Set.of(9, 27, 36, 70, 86); | ||
| private static final Set<Integer> fromSubstraitExclusions = Set.of(6, 8, 67); | ||
| private static final Set<Integer> fromSubstraitExclusions = Set.of(1, 8, 30, 67, 81); |
There was a problem hiding this comment.
but as direct field references and they just happened to not throw an exception during conversion.
Ah. Classic. Thanks for digging into this 🙇
fixes substrait-io#382 Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
4707efc to
e1f5573
Compare
Signed-off-by: Niels Pardon <par@zurich.ibm.com>

changes
SubstraitRelNodeConverterandExpressionRexConverterto support handling subqueries with outer field referencesfixes #382
replaces #383