fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery#383
fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery#383mbwhite wants to merge 2 commits into
Conversation
eaa231c to
074c9ef
Compare
|
@vbarua questions have been answered; I've updated the test cases to better separate concerns. found the hidden setting in vscode to change the formatted back to spotless! |
| io.substrait.plan.Plan plan = new ProtoPlanConverter().from(possible); | ||
| SubstraitToCalcite substraitToCalcite = new SubstraitToCalcite(extensions, typeFactory); | ||
| RelNode relRoot = substraitToCalcite.convert(plan.getRoots().get(0)).project(true); | ||
| System.out.println(SubstraitToSql.toSql(relRoot)); |
There was a problem hiding this comment.
minor: avoid printing during test. I suggest assertNotNull instead
There was a problem hiding this comment.
also changed the other existing tests where this occurred.
| typeConverter.toCalcite(typeFactory, expr.getType()), expr.input().accept(this), safeCast); | ||
| } | ||
|
|
||
| AtomicInteger correlIdCount = new AtomicInteger(0); |
There was a problem hiding this comment.
What is this for? It's never used.
| if (outerref.isPresent()) { | ||
| if (segment instanceof FieldReference.StructField) { | ||
| FieldReference.StructField f = (FieldReference.StructField) segment; | ||
| var node = referenceRelList.get(outerref.get() - 1).get(); |
There was a problem hiding this comment.
Do we need to track or handle the fact the there might be multiple Filters that can add correlation variables? We only ever add to this list.
How would we know which ones came from which input?
This is one the issue I had in mind when I mentioned
https://github.com/substrait-io/substrait-java/pull/383/files#r2038429378
There was a problem hiding this comment.
Honestly the Calcite docs don't really help with the scope of these variables. There is some concept of 'namespace' eg from this error .
All correlation variables should resolve to the same namespace. Prev ns=org.apache.calcite.sql.validate.IdentifierNamespace@d36c1c3, new ns=org.apache.calcite.sql.validate.IdentifierNamespace@96abc7
which came from
select
c1.c_name,
o1.o_orderstatus,
o1.o_totalprice
from
customer c1,
orders o1
where
o1.o_custkey = c1.c_custkey
and o1.o_totalprice > (
select
avg(o_totalprice)
from
orders o2, customer c2
where
o2.o_totalprice < c1.c_acctbal
and o2.o_totalprice > (
select
avg(c3.c_acctbal)
from
customer c3
where
c3.c_custkey = o2.o_custkey
and c3.c_address = o1.o_clerk
)
);
change the last line to c3.c_address = o2.o_clerk and it's ok..
There was a problem hiding this comment.
Implications is that each relation and it's immediate subexpression is the same namespace.
Signed-off-by: MBWhite <whitemat@uk.ibm.com>
Signed-off-by: MBWhite <whitemat@uk.ibm.com>
vbarua
left a comment
There was a problem hiding this comment.
There's a couple of things I'm not sure about in this PR.
When processing the Filter
@Override
public RelNode visit(Filter filter) throws RuntimeException {
RelNode input = filter.getInput().accept(this);
final Holder<RexCorrelVariable> v = Holder.empty();
expressionRexConverter.addCorrelVariable(v);
RelBuilder r1 = relBuilder.push(input).variable(v::set);
RexNode filterCondition = filter.getCondition().accept(expressionRexConverter);
RelNode node = r1.filter(List.of(v.get().id), filterCondition).build();
return applyRemap(node, filter.getRemap());
}How does this work if there is more than 1 FieldReference to a variable outside the subquery? It's perfectly valid to construct a Substrait Filter like OUTER_REF_1.foo = 'a' AND OUTER_REF_2.bar = bar in Substrait, and I believe you can do something similar in SQL if you reference multiple fields from outside the subquery boundary.
Additionally, when constructing a Substrait subquery you could have multiple Filters. Your resolution logic uses
var node = referenceRelList.get(outerref.get() - 1).get();but we only ever append to referenceRelList. If multiple Filters have outer references with outer step 1, the first one processed would point to referenceRelList[0] which would be correct, but the second Filter processed would also point to referenceRelList[0], which would be wrong.
What happens if you have multiple subquery boundaries and you have an outer reference more than 1 step out? For example if we have 2 subqueries and we only add 1 correlation variable to referenceRelList, then you would try to fetch referenceRelList[1], which would throw a NullPointerException.
I'm not confident that these changes are able to handle outer references more generally beyond the specific 01.sql test you've added.
What I would like to see is more testing, both from the SQL side but also from the Substrait side directly.
| import org.junit.jupiter.params.provider.ValueSource; | ||
|
|
||
| /** | ||
| * Updated TPC-H test to convert SQL to Substrait and replay those plans back to SQL Validating that |
| */ | ||
| @TestMethodOrder(OrderAnnotation.class) | ||
| @TestInstance(Lifecycle.PER_CLASS) | ||
| public class TestTpcdsQuery extends PlanTestBase { |
There was a problem hiding this comment.
We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.
| */ | ||
| @TestMethodOrder(OrderAnnotation.class) | ||
| @TestInstance(Lifecycle.PER_CLASS) | ||
| public class TestTpchQuery extends PlanTestBase { |
There was a problem hiding this comment.
We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.
|
Hi @vbarua - thanks for the additional thoughts; would share the concern that there might be cases such that this change isn't general enough. Though I do admit I've found it hard to create SQL queries that use these references! based on the last comments - is it worth doing that separate PR updating the testing and come back to this fix? |
|
Amusingly, I was able to craft an example select
c1.c_name,
o1.o_orderstatus,
o1.o_totalprice
from
CUSTOMER c1,
ORDERS o1
where
o1.o_custkey = c1.c_custkey
and o1.o_totalprice > (select avg(o_totalprice) from ORDERS o2 where o2.o_totalprice < c1.c_acctbal)
and o1.o_totalprice < (select max(c_acctbal) from CUSTOMER c2 where c2.c_custkey = o1.o_custkey);that is parseable and plannable in both MySql and Postgres (dbfiddle) but fails in Calcite with which is a new one for me.
It's definitely tricky, and it's probably easier to write tests directly as Substrait plans to check various things. Strictly speaking as well, the API you're modifying here is the Substrait -> Calcite conversion. The TPCH query that you were fixing just happens to generate a plan that uses the Substrait feature we're trying to test.
If you want to get the tests in that's what I recommend. I think it makes sense generally to be able to run those tests and IMO that's a much more straightforward change generally. |
|
Closing since this PR has been superseded by #426. |
This is resolving issue #382 found with TPC-H 17, when converting back to SQL from Substrait
The subquery references fields in an outer scope, the calcite correlation variables where referenced by Subsrtrait's 'outer reference'
However when converting back to calcite from a plan with these 'outer references' they were ignored.
Notes::