Skip to content

fix(isthmus): handle subqueries with outer field references#426

Merged
nielspardon merged 4 commits into
substrait-io:mainfrom
nielspardon:par-tpch-17
Aug 1, 2025
Merged

fix(isthmus): handle subqueries with outer field references#426
nielspardon merged 4 commits into
substrait-io:mainfrom
nielspardon:par-tpch-17

Conversation

@nielspardon
Copy link
Copy Markdown
Member

@nielspardon nielspardon commented Jun 27, 2025

changes SubstraitRelNodeConverter and ExpressionRexConverter to support handling subqueries with outer field references

fixes #382

replaces #383

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Jun 27, 2025

For the sake of reviewability could I ask you split this into 3-ish PRs. One for your initial cleanups/improvements, one for just the visitor changes, and one for subquery changes.

@nielspardon
Copy link
Copy Markdown
Member Author

For the sake of reviewability could I ask you split this into 3-ish PRs. One for your initial cleanups/improvements, one for just the visitor changes, and one for subquery changes.

Sure, I already thought that this might be necessary.

@nielspardon
Copy link
Copy Markdown
Member Author

First extracted PR for the visitor changes: #427

@nielspardon nielspardon force-pushed the par-tpch-17 branch 5 times, most recently from 7b6ef99 to 764206a Compare June 27, 2025 19:24
@nielspardon
Copy link
Copy Markdown
Member Author

I removed the cleanup from this PR and rebased it on #427

@nielspardon nielspardon force-pushed the par-tpch-17 branch 2 times, most recently from a2f0c8e to 86545b6 Compare July 1, 2025 17:33
@nielspardon
Copy link
Copy Markdown
Member Author

updated to use the new VisitationContext interface from #427

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a very cursory first pass.

Something that I think would be helpful would be some tests based directly on Substrait plans, and not just the TPC based tests. This would help contributors, myself included, understand what exactly the structure of these subqueries looks like in Substrait, and how your processing is being applied on top of it.

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a very cursory first pass.

Something that I think would be helpful would be some tests based directly on Substrait plans, and not just the TPC based tests. This would help contributors, myself included, understand what exactly the structure of these subqueries looks like in Substrait, and how your processing is being applied on top of it.

@nielspardon
Copy link
Copy Markdown
Member Author

nielspardon commented Jul 7, 2025

Something that I think would be helpful would be some tests based directly on Substrait plans, and not just the TPC based tests. This would help contributors, myself included, understand what exactly the structure of these subqueries looks like in Substrait, and how your processing is being applied on top of it.

I drew this diagram for myself to reason about how outer field references work in Substrait vs. Calcite:
image

  • Subqueries are an expression element so they only appear within relations that are configured using expressions e.g. FilterRel, ProjectRel, JoinRel.
  • Substrait addresses field references to the parent relation operations by a number of subquery boundaries one has to cross to find the parent relation of the referenced field which means we need to keep track of the subquery depth.
  • Calcite uses correlation variables e.g. $cor0 to address a field from a parent relation e.g. $cor0.P_PARTKEY. The correlation variable needs to be added to the variablesSet of the parent relation the field is coming from variablesSet=[$cor0]. Additionally when creating the correlation variable we need to get the row type of the parent relation providing the field. When creating the field access in the subquery we need to have access to input relation row type of the parent relation and we need to ensure we add all the correlation variables pointing to this parent relation to the variablesSet of this parent relation.
  • We do not need to track all parent relations but we just need to track the parent relations which support expressions. Like in the example above we need to know the 3 FilterRels and don't really care about the other relations in the tree.
  • Calcite only supports providing a variablesSet for Project, Filter and Join if you search for variablesSet in the Calcite RelBuilder.

I hope this helps.

@nielspardon
Copy link
Copy Markdown
Member Author

Something that I think would be helpful would be some tests based directly on Substrait plans, and not just the TPC based tests.

will spend some time on tests towards end of the week when I have some spare time

@nielspardon nielspardon force-pushed the par-tpch-17 branch 2 times, most recently from f574de2 to 1e09ea6 Compare July 11, 2025 06:25
@nielspardon nielspardon force-pushed the par-tpch-17 branch 8 times, most recently from 8d8efb8 to 8b771b9 Compare July 11, 2025 21:40
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we've merged the precursor PR, this is much easier to look at. Your comments and tests help a lot as well.

We've only have tests that go from Substrait -> Calcite. Does this work going the other way? Not something that I think needs to happen as part of this PR, but possibly something to document as an issue.

To your points about

Subqueries are an expression element so they only appear within relations that are configured using expressions e.g. FilterRel, ProjectRel, JoinRel.

Calcite only supports providing a variablesSet for Project, Filter and Join if you search for variablesSet in the Calcite RelBuilder.

SortRels also contain expressions which could contain subqueries. Would the conversion just blow up now if we encountered that?

AggregateRels also contains expressions, but a because of the https://github.com/substrait-io/substrait-java/blob/main/isthmus/src/main/java/io/substrait/isthmus/PreCalciteAggregateValidator.java that may Just Work (TM) because they would get moved into a Project before the Aggregate.

I don't expect subqueries in those relations to be that common, but it might be worth a quick check to see what happens and filing a follow-up issue.

Comment thread core/src/main/java/io/substrait/dsl/SubstraitBuilder.java Outdated
Comment thread isthmus/src/test/java/io/substrait/isthmus/expression/SubqueryConversionTest.java Outdated
* FROM orders
*/
final Rel root =
substraitBuilder.project(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I often alias substraitBuilder to b in order to minimize it's presence when building plans like this and making them easier to read.

public class TpcdsQueryTest extends PlanTestBase {
private static final Set<Integer> toSubstraitExclusions = Set.of(9, 27, 36, 70, 86);
private static final Set<Integer> fromSubstraitExclusions = Set.of(6, 8, 67);
private static final Set<Integer> fromSubstraitExclusions = Set.of(1, 8, 30, 67, 81);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did 1, 30 and 81 work before, and now they don't?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They did convert without throwing an exception because they were converted incorrectly. Those queries also have subqueries with outer field references which previously were not converted to outer field references (correlation variables) but as direct field references and they just happened to not throw an exception during conversion.

If you look at e.g. TPC-DS query 1:

with customer_total_return as
    (select 
        sr_customer_sk as ctr_customer_sk
        ,sr_store_sk as ctr_store_sk
        ,sum(SR_RETURN_AMT_INC_TAX) as ctr_total_return
        from store_returns
        ,date_dim
    where sr_returned_date_sk = d_date_sk
        and d_year =2001
    group by sr_customer_sk
        ,sr_store_sk)
        
select  c_customer_id
from customer_total_return ctr1
,store
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2     <--- has an aggregate
      from customer_total_return ctr2
      where ctr1.ctr_store_sk = ctr2.ctr_store_sk)     <--- filter has an outer reference to ctr1
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'distmember(fips_county, 61, 3)'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
LIMIT 100

Calcite explain after going from Substrait back to Calcite:

LogicalProject(C_CUSTOMER_ID=[$0])
  LogicalSort(sort0=[$0], dir0=[ASC], fetch=[100])
    LogicalProject(C_CUSTOMER_ID0=[$33])
      LogicalFilter(condition=[AND(>(CAST($2):DECIMAL(9, 3), $SCALAR_QUERY({
LogicalProject($f1=[*($0, 1.2:DECIMAL(2, 1))])
  LogicalAggregate(group=[{}], agg#0=[AVG($0)])
    LogicalProject($f20=[$2])
      LogicalFilter(condition=[=($cor0.SR_STORE_SK0, $1)])         <-- LogicalFilter with correl var
        LogicalAggregate(group=[{0, 1}], agg#0=[SUM($2)])           <-- directly on top of LogicalAggregate
          LogicalProject(SR_CUSTOMER_SK0=[$3], SR_STORE_SK0=[$7], SR_RETURN_AMT_INC_TAX0=[$13])
            LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
              LogicalJoin(condition=[true], joinType=[inner])
                LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
                LogicalTableScan(table=[[tpcds, DATE_DIM]])
})), =($3, $1), =(CAST($27):VARCHAR(30) NOT NULL, 'distmember(fips_county, 61, 3)'), =($0, $32))], variablesSet=[[$cor0]])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalAggregate(group=[{0, 1}], agg#0=[SUM($2)])
              LogicalProject(SR_CUSTOMER_SK0=[$3], SR_STORE_SK0=[$7], SR_RETURN_AMT_INC_TAX0=[$13])
                LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
                  LogicalJoin(condition=[true], joinType=[inner])
                    LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
                    LogicalTableScan(table=[[tpcds, DATE_DIM]])
            LogicalTableScan(table=[[tpcds, STORE]])
          LogicalTableScan(table=[[tpcds, CUSTOMER]])

This is hitting some bug in Calcite since the queries turn out to have a LogicalFilter directly on top of a LogicalAggregate where the filter now contains a correlation variable and Calcite has special handling of a LogicalFilter on top of a LogicalAggregate which treats the filter as a HAVING clause when converting Rel to SQL. For some reason unknown to me right now the org.apache.calcite.rel.rel2sql.SqlImplementor$Context used here is not a org.apache.calcite.rel.rel2sql.SqlImplementor$AliasContext but a context which has no implementation of the getAliasContext() method. I was planning do dig into the Calcite fix separately.

Caused by: java.lang.UnsupportedOperationException
	at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.getAliasContext(SqlImplementor.java:1074)
	at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.toSql(SqlImplementor.java:682)
	at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.toSql(SqlImplementor.java:1196)
	at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.callToSql(SqlImplementor.java:880)
	at org.apache.calcite.rel.rel2sql.SqlImplementor$Context.toSql(SqlImplementor.java:853)
	at org.apache.calcite.rel.rel2sql.RelToSqlConverter.visit(RelToSqlConverter.java:477)
	...

https://github.com/apache/calcite/blob/3719447e125982aaa47f0a7b2564f94b9900539e/core/src/main/java/org/apache/calcite/rel/rel2sql/RelToSqlConverter.java#L465-L485

I did check whether the Calcite RelNode structure produced by SqlToRelNodeConverter can be translated back to SQL and it does. It only breaks when we go through Substrait and the reason seems to be the missing intermediate field names are leading to missing intermediate LogicalProject rels compared to the Calcite explain before the Substrait conversion:

LogicalProject(C_CUSTOMER_ID=[$0])
  LogicalSort(sort0=[$0], dir0=[ASC], fetch=[100])
    LogicalProject(C_CUSTOMER_ID=[$33])
      LogicalFilter(condition=[AND(>(CAST($2):DECIMAL(9, 3), $SCALAR_QUERY({
LogicalProject(EXPR$0=[*($0, 1.2:DECIMAL(2, 1))])
  LogicalAggregate(group=[{}], agg#0=[AVG($0)])
    LogicalProject(CTR_TOTAL_RETURN=[$2])
      LogicalFilter(condition=[=($cor0.CTR_STORE_SK, $1)])                 <-- HERE is our LogicalFilter
        LogicalProject(CTR_CUSTOMER_SK=[$0], CTR_STORE_SK=[$1], CTR_TOTAL_RETURN=[$2])
          LogicalAggregate(group=[{0, 1}], CTR_TOTAL_RETURN=[SUM($2)])         <-- HERE is the LogicalAggregate
            LogicalProject(SR_CUSTOMER_SK=[$3], SR_STORE_SK=[$7], SR_RETURN_AMT_INC_TAX=[$13])
              LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
                LogicalJoin(condition=[true], joinType=[inner])
                  LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
                  LogicalTableScan(table=[[tpcds, DATE_DIM]])
})), =($3, $1), =(CAST($27):VARCHAR(30) NOT NULL, 'distmember(fips_county, 61, 3)'), =($0, $32))], variablesSet=[[$cor0]])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalProject(CTR_CUSTOMER_SK=[$0], CTR_STORE_SK=[$1], CTR_TOTAL_RETURN=[$2])
              LogicalAggregate(group=[{0, 1}], CTR_TOTAL_RETURN=[SUM($2)])
                LogicalProject(SR_CUSTOMER_SK=[$3], SR_STORE_SK=[$7], SR_RETURN_AMT_INC_TAX=[$13])
                  LogicalFilter(condition=[AND(=($0, $20), =($26, 2001))])
                    LogicalJoin(condition=[true], joinType=[inner])
                      LogicalTableScan(table=[[tpcds, STORE_RETURNS]])
                      LogicalTableScan(table=[[tpcds, DATE_DIM]])
            LogicalTableScan(table=[[tpcds, STORE]])
          LogicalTableScan(table=[[tpcds, CUSTOMER]])

For TPC-H query 17 which works now the Calcite explain coming from Substrait looks like this. It has a different pattern not hitting the Calcite bug:

LogicalProject(AVG_YEARLY=[$0])
  LogicalProject($f1=[/($0, 7.0:DECIMAL(2, 1))])
    LogicalAggregate(group=[{}], agg#0=[SUM($0)])
      LogicalProject(L_EXTENDEDPRICE0=[$5])
        LogicalFilter(condition=[AND(=($16, $1), =(CAST($19):VARCHAR(10), 'Brand#13'), =(CAST($22):VARCHAR(10), 'JUMBO CAN'), <($4, CAST($SCALAR_QUERY({
LogicalProject($f1=[*(0.2:DECIMAL(2, 1), $0)])
  LogicalAggregate(group=[{}], agg#0=[AVG($0)])              <-- LogicalAggregate indirectly on top of
    LogicalProject(L_QUANTITY0=[$4])
      LogicalFilter(condition=[=($1, $cor0.P_PARTKEY)])       <-- LogicalFilter with correl var
        LogicalTableScan(table=[[LINEITEM]])
})):DECIMAL(38, 0)))], variablesSet=[[$cor0]])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalTableScan(table=[[LINEITEM]])
            LogicalTableScan(table=[[PART]])

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the issue: apache/calcite#4481

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Calcite fix has been merged meanwhile so we should get the TPC-DS queries working again with the next Calcite release

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but as direct field references and they just happened to not throw an exception during conversion.

Ah. Classic. Thanks for digging into this 🙇

RelNode node = relBuilder.push(input).filter(filterCondition).build();
RelNode node =
relBuilder.push(input).filter(context.popCorrelationIds(), filterCondition).build();
context.popParentRelNodes();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be clearer if we called pushParentRelNodes something like pushExpressionBase* and added some docs to the method to indicate that this is the Calcite relation on which the expressions we are converting are built**, and to which subquery references are resolved.

The converted input is a parent of relations in any subqueries we might encounter when converting expressions. But that's the uncommon case, and more commonly I think people will see this and think of it as the parent of expression, which is what I did at first and found somewhat confusing. The name also made me think that we had to push it parent in every relation.

This also comes up when looking at the join which has context.pushParentRelNodes(left, right) which was weird in a "how can these expression have 2 parents" kind of way, and when we call getParentRelation and get multiple parents it's not immediately clear that these are 2 parents at the same level vs a chain of parents.

* I don't know if this is actually a better name, I just find the use of "Parent" here confusing.

** The work you're doing here will actually provide a mechanism to verify the agreement of Calcite and Substrait type inference because we can extract the Calcite schema from the context #379.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code by moving the RangeMap creation into the Context class. We don't need the full outer input relations but only the row types so I changed the method name to pushOuterRowType which is more accurate. I also added javadoc comments for all the methods in the Context class which should make it a bit easier to understand what's going on.

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we've merged the precursor PR, this is much easier to look at. Your comments and tests help a lot as well.

We've only have tests that go from Substrait -> Calcite. Does this work going the other way? Not something that I think needs to happen as part of this PR, but possibly something to document as an issue.

To your points about

Subqueries are an expression element so they only appear within relations that are configured using expressions e.g. FilterRel, ProjectRel, JoinRel.

Calcite only supports providing a variablesSet for Project, Filter and Join if you search for variablesSet in the Calcite RelBuilder.

SortRels also contain expressions which could contain subqueries. Would the conversion just blow up now if we encountered that?

AggregateRels also contains expressions, but a because of the https://github.com/substrait-io/substrait-java/blob/main/isthmus/src/main/java/io/substrait/isthmus/PreCalciteAggregateValidator.java that may Just Work (TM) because they would get moved into a Project before the Aggregate.

I don't expect subqueries in those relations to be that common, but it might be worth a quick check to see what happens and filing a follow-up issue.

@nielspardon
Copy link
Copy Markdown
Member Author

We've only have tests that go from Substrait -> Calcite. Does this work going the other way? Not something that I think needs to happen as part of this PR, but possibly something to document as an issue.

we have test coverage since we round trip for the TPC-H / TPC-DS queries but we can also add unit tests from Calcite to Substrait

@nielspardon
Copy link
Copy Markdown
Member Author

SortRels also contain expressions which could contain subqueries. Would the conversion just blow up now if we encountered that?

Good question. My goal was to get to the bottom of why TPC-H query 17 didn't convert back to Calcite in order to wrap up converting all 22 TPC-H queries successfully. Turns out I opened a huge can of worms which I would like to address step by step. In my opinion this PR is a step in the right direction but definitely not a fix-it-all.

@nielspardon nielspardon force-pushed the par-tpch-17 branch 2 times, most recently from 4b20605 to 4707efc Compare July 28, 2025 08:37
@nielspardon nielspardon requested a review from vbarua July 28, 2025 09:04
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once thing I would like changed before merging is the name of popParentRelNodes, which should probably be popOuterRowType. But other than that, these changes look good. As you point out

this PR is a step in the right direction but definitely not a fix-it-all.

and I'm happy to have this merged at this point and iterate upon it in future PRs.

* <p>Row types are stored as a {@link RangeMap} with field indices as keys and the {@link
* RelDataType} row type containing the field at the field index by continuously numbering the
* field indices from 0 across all provided row types in the order the row types are passed as
* arguments.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RangeMap is super nice for this ✨

*
* @param inputs the row types to add
*/
public void pushOuterRowType(final RelDataType... inputs) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushOuterRowType is a good name for this. We should update the popParentRelNodes to match.

public class TpcdsQueryTest extends PlanTestBase {
private static final Set<Integer> toSubstraitExclusions = Set.of(9, 27, 36, 70, 86);
private static final Set<Integer> fromSubstraitExclusions = Set.of(6, 8, 67);
private static final Set<Integer> fromSubstraitExclusions = Set.of(1, 8, 30, 67, 81);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but as direct field references and they just happened to not throw an exception during conversion.

Ah. Classic. Thanks for digging into this 🙇

fixes substrait-io#382

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
@nielspardon nielspardon merged commit 7cf1ccf into substrait-io:main Aug 1, 2025
12 checks passed
@nielspardon nielspardon deleted the par-tpch-17 branch August 5, 2025 06:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[isthmus] subqueries/set predicates with field references outside of the subquery fail

2 participants