Skip to content

fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery#383

Closed
mbwhite wants to merge 2 commits into
substrait-io:mainfrom
mbwhite:tpc-h-fixes
Closed

fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery#383
mbwhite wants to merge 2 commits into
substrait-io:mainfrom
mbwhite:tpc-h-fixes

Conversation

@mbwhite
Copy link
Copy Markdown
Contributor

@mbwhite mbwhite commented Apr 10, 2025

This is resolving issue #382 found with TPC-H 17, when converting back to SQL from Substrait
The subquery references fields in an outer scope, the calcite correlation variables where referenced by Subsrtrait's 'outer reference'

However when converting back to calcite from a plan with these 'outer references' they were ignored.

Notes::

  • I've added two new TPC-x tests, these are similar the existing tests, but do conversions SQL to Substrait and Substrait to SQL
  • Additional SQL using subqueries has been added; would like to add some more. Suggestions welcome.
  • Added these as 1xx.sql files but could move to a separate folder?

@mbwhite mbwhite changed the title Handle Subqueries/set predicates with field references outside of the subquery fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery Apr 10, 2025
@mbwhite mbwhite marked this pull request as ready for review April 10, 2025 11:57
Comment thread isthmus/src/test/resources/tpch/queries/101.sql Outdated
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions

Comment thread isthmus/src/main/java/io/substrait/isthmus/SubstraitToSql.java Outdated
Comment thread isthmus/src/test/java/io/substrait/isthmus/TestTpcdsQuery.java
Comment thread isthmus/src/main/java/io/substrait/isthmus/SubstraitRelNodeConverter.java Outdated
@mbwhite mbwhite force-pushed the tpc-h-fixes branch 4 times, most recently from eaa231c to 074c9ef Compare April 11, 2025 09:47
@mbwhite
Copy link
Copy Markdown
Contributor Author

mbwhite commented Apr 11, 2025

@vbarua questions have been answered; I've updated the test cases to better separate concerns.

found the hidden setting in vscode to change the formatted back to spotless!

Comment thread isthmus/src/test/java/io/substrait/isthmus/TestTpcdsQuery.java
io.substrait.plan.Plan plan = new ProtoPlanConverter().from(possible);
SubstraitToCalcite substraitToCalcite = new SubstraitToCalcite(extensions, typeFactory);
RelNode relRoot = substraitToCalcite.convert(plan.getRoots().get(0)).project(true);
System.out.println(SubstraitToSql.toSql(relRoot));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: avoid printing during test. I suggest assertNotNull instead

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also changed the other existing tests where this occurred.

typeConverter.toCalcite(typeFactory, expr.getType()), expr.input().accept(this), safeCast);
}

AtomicInteger correlIdCount = new AtomicInteger(0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? It's never used.

if (outerref.isPresent()) {
if (segment instanceof FieldReference.StructField) {
FieldReference.StructField f = (FieldReference.StructField) segment;
var node = referenceRelList.get(outerref.get() - 1).get();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to track or handle the fact the there might be multiple Filters that can add correlation variables? We only ever add to this list.

How would we know which ones came from which input?

This is one the issue I had in mind when I mentioned
https://github.com/substrait-io/substrait-java/pull/383/files#r2038429378

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly the Calcite docs don't really help with the scope of these variables. There is some concept of 'namespace' eg from this error .

All correlation variables should resolve to the same namespace. Prev ns=org.apache.calcite.sql.validate.IdentifierNamespace@d36c1c3, new ns=org.apache.calcite.sql.validate.IdentifierNamespace@96abc7

which came from

select
    c1.c_name,
    o1.o_orderstatus,
    o1.o_totalprice
from
    customer c1,
    orders o1
where
    o1.o_custkey = c1.c_custkey
    and o1.o_totalprice > (
        select
            avg(o_totalprice)
        from
            orders o2, customer c2
        where
            o2.o_totalprice < c1.c_acctbal
            and o2.o_totalprice > (
                select
                    avg(c3.c_acctbal)
                from
                    customer c3
                where
                    c3.c_custkey = o2.o_custkey
                    and c3.c_address = o1.o_clerk
            )
    );

change the last line to c3.c_address = o2.o_clerk and it's ok..

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implications is that each relation and it's immediate subexpression is the same namespace.

mbwhite added 2 commits April 11, 2025 18:18
Signed-off-by: MBWhite <whitemat@uk.ibm.com>
Signed-off-by: MBWhite <whitemat@uk.ibm.com>
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of things I'm not sure about in this PR.

When processing the Filter

@Override
  public RelNode visit(Filter filter) throws RuntimeException {
    RelNode input = filter.getInput().accept(this);
    final Holder<RexCorrelVariable> v = Holder.empty();
    expressionRexConverter.addCorrelVariable(v);

    RelBuilder r1 = relBuilder.push(input).variable(v::set);
    RexNode filterCondition = filter.getCondition().accept(expressionRexConverter);
    RelNode node = r1.filter(List.of(v.get().id), filterCondition).build();
    return applyRemap(node, filter.getRemap());
  }

How does this work if there is more than 1 FieldReference to a variable outside the subquery? It's perfectly valid to construct a Substrait Filter like OUTER_REF_1.foo = 'a' AND OUTER_REF_2.bar = bar in Substrait, and I believe you can do something similar in SQL if you reference multiple fields from outside the subquery boundary.

Additionally, when constructing a Substrait subquery you could have multiple Filters. Your resolution logic uses

var node = referenceRelList.get(outerref.get() - 1).get();

but we only ever append to referenceRelList. If multiple Filters have outer references with outer step 1, the first one processed would point to referenceRelList[0] which would be correct, but the second Filter processed would also point to referenceRelList[0], which would be wrong.

What happens if you have multiple subquery boundaries and you have an outer reference more than 1 step out? For example if we have 2 subqueries and we only add 1 correlation variable to referenceRelList, then you would try to fetch referenceRelList[1], which would throw a NullPointerException.

I'm not confident that these changes are able to handle outer references more generally beyond the specific 01.sql test you've added.

What I would like to see is more testing, both from the SQL side but also from the Substrait side directly.

import org.junit.jupiter.params.provider.ValueSource;

/**
* Updated TPC-H test to convert SQL to Substrait and replay those plans back to SQL Validating that
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean TPC-D here?

*/
@TestMethodOrder(OrderAnnotation.class)
@TestInstance(Lifecycle.PER_CLASS)
public class TestTpcdsQuery extends PlanTestBase {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.

*/
@TestMethodOrder(OrderAnnotation.class)
@TestInstance(Lifecycle.PER_CLASS)
public class TestTpchQuery extends PlanTestBase {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.

@mbwhite
Copy link
Copy Markdown
Contributor Author

mbwhite commented Apr 22, 2025

Hi @vbarua - thanks for the additional thoughts; would share the concern that there might be cases such that this change isn't general enough.

Though I do admit I've found it hard to create SQL queries that use these references!

based on the last comments - is it worth doing that separate PR updating the testing and come back to this fix?

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Apr 26, 2025

Amusingly, I was able to craft an example

select
    c1.c_name,
    o1.o_orderstatus,
    o1.o_totalprice
from
    CUSTOMER c1,
    ORDERS o1
where
    o1.o_custkey = c1.c_custkey
    and o1.o_totalprice > (select avg(o_totalprice) from ORDERS o2 where o2.o_totalprice < c1.c_acctbal)
    and o1.o_totalprice < (select max(c_acctbal) from CUSTOMER c2 where c2.c_custkey = o1.o_custkey);

that is parseable and plannable in both MySql and Postgres (dbfiddle) but fails in Calcite with

All correlation variables should resolve to the same namespace.

which is a new one for me.

I've found it hard to create SQL queries that use these references!

It's definitely tricky, and it's probably easier to write tests directly as Substrait plans to check various things. Strictly speaking as well, the API you're modifying here is the Substrait -> Calcite conversion. The TPCH query that you were fixing just happens to generate a plan that uses the Substrait feature we're trying to test.

is it worth doing that separate PR updating the testing and come back to this fix?

If you want to get the tests in that's what I recommend. I think it makes sense generally to be able to run those tests and IMO that's a much more straightforward change generally.

@bestbeforetoday
Copy link
Copy Markdown
Member

Closing since this PR has been superseded by #426.

@mbwhite mbwhite deleted the tpc-h-fixes branch December 16, 2025 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants