-
Notifications
You must be signed in to change notification settings - Fork 78
fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery #383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some questions
isthmus/src/main/java/io/substrait/isthmus/SubstraitRelNodeConverter.java
Outdated
Show resolved
Hide resolved
isthmus/src/main/java/io/substrait/isthmus/SubstraitRelNodeConverter.java
Show resolved
Hide resolved
isthmus/src/main/java/io/substrait/isthmus/SubstraitRelNodeConverter.java
Show resolved
Hide resolved
eaa231c
to
074c9ef
Compare
@vbarua questions have been answered; I've updated the test cases to better separate concerns. found the hidden setting in vscode to change the formatted back to spotless! |
io.substrait.plan.Plan plan = new ProtoPlanConverter().from(possible); | ||
SubstraitToCalcite substraitToCalcite = new SubstraitToCalcite(extensions, typeFactory); | ||
RelNode relRoot = substraitToCalcite.convert(plan.getRoots().get(0)).project(true); | ||
System.out.println(SubstraitToSql.toSql(relRoot)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: avoid printing during test. I suggest assertNotNull
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also changed the other existing tests where this occurred.
isthmus/src/main/java/io/substrait/isthmus/expression/ExpressionRexConverter.java
Show resolved
Hide resolved
@@ -487,25 +492,44 @@ public RexNode visit(Expression.Cast expr) throws RuntimeException { | |||
typeConverter.toCalcite(typeFactory, expr.getType()), expr.input().accept(this), safeCast); | |||
} | |||
|
|||
AtomicInteger correlIdCount = new AtomicInteger(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this for? It's never used.
if (outerref.isPresent()) { | ||
if (segment instanceof FieldReference.StructField) { | ||
FieldReference.StructField f = (FieldReference.StructField) segment; | ||
var node = referenceRelList.get(outerref.get() - 1).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to track or handle the fact the there might be multiple Filters that can add correlation variables? We only ever add to this list.
How would we know which ones came from which input?
This is one the issue I had in mind when I mentioned
https://github.com/substrait-io/substrait-java/pull/383/files#r2038429378
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly the Calcite docs don't really help with the scope of these variables. There is some concept of 'namespace' eg from this error .
All correlation variables should resolve to the same namespace. Prev ns=org.apache.calcite.sql.validate.IdentifierNamespace@d36c1c3, new ns=org.apache.calcite.sql.validate.IdentifierNamespace@96abc7
which came from
select
c1.c_name,
o1.o_orderstatus,
o1.o_totalprice
from
customer c1,
orders o1
where
o1.o_custkey = c1.c_custkey
and o1.o_totalprice > (
select
avg(o_totalprice)
from
orders o2, customer c2
where
o2.o_totalprice < c1.c_acctbal
and o2.o_totalprice > (
select
avg(c3.c_acctbal)
from
customer c3
where
c3.c_custkey = o2.o_custkey
and c3.c_address = o1.o_clerk
)
);
change the last line to c3.c_address = o2.o_clerk
and it's ok..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implications is that each relation and it's immediate subexpression is the same namespace.
Signed-off-by: MBWhite <[email protected]>
Signed-off-by: MBWhite <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a couple of things I'm not sure about in this PR.
When processing the Filter
@Override
public RelNode visit(Filter filter) throws RuntimeException {
RelNode input = filter.getInput().accept(this);
final Holder<RexCorrelVariable> v = Holder.empty();
expressionRexConverter.addCorrelVariable(v);
RelBuilder r1 = relBuilder.push(input).variable(v::set);
RexNode filterCondition = filter.getCondition().accept(expressionRexConverter);
RelNode node = r1.filter(List.of(v.get().id), filterCondition).build();
return applyRemap(node, filter.getRemap());
}
How does this work if there is more than 1 FieldReference to a variable outside the subquery? It's perfectly valid to construct a Substrait Filter like OUTER_REF_1.foo = 'a' AND OUTER_REF_2.bar = bar
in Substrait, and I believe you can do something similar in SQL if you reference multiple fields from outside the subquery boundary.
Additionally, when constructing a Substrait subquery you could have multiple Filters. Your resolution logic uses
var node = referenceRelList.get(outerref.get() - 1).get();
but we only ever append to referenceRelList
. If multiple Filters have outer references with outer step 1, the first one processed would point to referenceRelList[0]
which would be correct, but the second Filter processed would also point to referenceRelList[0]
, which would be wrong.
What happens if you have multiple subquery boundaries and you have an outer reference more than 1 step out? For example if we have 2 subqueries and we only add 1 correlation variable to referenceRelList
, then you would try to fetch referenceRelList[1]
, which would throw a NullPointerException.
I'm not confident that these changes are able to handle outer references more generally beyond the specific 01.sql
test you've added.
What I would like to see is more testing, both from the SQL side but also from the Substrait side directly.
isthmus/src/main/java/io/substrait/isthmus/expression/ExpressionRexConverter.java
Show resolved
Hide resolved
import org.junit.jupiter.params.provider.ValueSource; | ||
|
||
/** | ||
* Updated TPC-H test to convert SQL to Substrait and replay those plans back to SQL Validating that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you mean TPC-D here?
*/ | ||
@TestMethodOrder(OrderAnnotation.class) | ||
@TestInstance(Lifecycle.PER_CLASS) | ||
public class TestTpcdsQuery extends PlanTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.
*/ | ||
@TestMethodOrder(OrderAnnotation.class) | ||
@TestInstance(Lifecycle.PER_CLASS) | ||
public class TestTpchQuery extends PlanTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.
Hi @vbarua - thanks for the additional thoughts; would share the concern that there might be cases such that this change isn't general enough. Though I do admit I've found it hard to create SQL queries that use these references! based on the last comments - is it worth doing that separate PR updating the testing and come back to this fix? |
Amusingly, I was able to craft an example select
c1.c_name,
o1.o_orderstatus,
o1.o_totalprice
from
CUSTOMER c1,
ORDERS o1
where
o1.o_custkey = c1.c_custkey
and o1.o_totalprice > (select avg(o_totalprice) from ORDERS o2 where o2.o_totalprice < c1.c_acctbal)
and o1.o_totalprice < (select max(c_acctbal) from CUSTOMER c2 where c2.c_custkey = o1.o_custkey); that is parseable and plannable in both MySql and Postgres (dbfiddle) but fails in Calcite with
which is a new one for me.
It's definitely tricky, and it's probably easier to write tests directly as Substrait plans to check various things. Strictly speaking as well, the API you're modifying here is the Substrait -> Calcite conversion. The TPCH query that you were fixing just happens to generate a plan that uses the Substrait feature we're trying to test.
If you want to get the tests in that's what I recommend. I think it makes sense generally to be able to run those tests and IMO that's a much more straightforward change generally. |
This is resolving issue #382 found with TPC-H 17, when converting back to SQL from Substrait
The subquery references fields in an outer scope, the calcite correlation variables where referenced by Subsrtrait's 'outer reference'
However when converting back to calcite from a plan with these 'outer references' they were ignored.
Notes::