Skip to content

fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery #383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mbwhite
Copy link
Contributor

@mbwhite mbwhite commented Apr 10, 2025

This is resolving issue #382 found with TPC-H 17, when converting back to SQL from Substrait
The subquery references fields in an outer scope, the calcite correlation variables where referenced by Subsrtrait's 'outer reference'

However when converting back to calcite from a plan with these 'outer references' they were ignored.

Notes::

  • I've added two new TPC-x tests, these are similar the existing tests, but do conversions SQL to Substrait and Substrait to SQL
  • Additional SQL using subqueries has been added; would like to add some more. Suggestions welcome.
  • Added these as 1xx.sql files but could move to a separate folder?

@mbwhite mbwhite changed the title Handle Subqueries/set predicates with field references outside of the subquery fix(isthmus): handle Subqueries/set predicates with field references outside of the subquery Apr 10, 2025
@mbwhite mbwhite marked this pull request as ready for review April 10, 2025 11:57
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions

@mbwhite mbwhite force-pushed the tpc-h-fixes branch 4 times, most recently from eaa231c to 074c9ef Compare April 11, 2025 09:47
@mbwhite
Copy link
Contributor Author

mbwhite commented Apr 11, 2025

@vbarua questions have been answered; I've updated the test cases to better separate concerns.

found the hidden setting in vscode to change the formatted back to spotless!

io.substrait.plan.Plan plan = new ProtoPlanConverter().from(possible);
SubstraitToCalcite substraitToCalcite = new SubstraitToCalcite(extensions, typeFactory);
RelNode relRoot = substraitToCalcite.convert(plan.getRoots().get(0)).project(true);
System.out.println(SubstraitToSql.toSql(relRoot));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: avoid printing during test. I suggest assertNotNull instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also changed the other existing tests where this occurred.

@@ -487,25 +492,44 @@ public RexNode visit(Expression.Cast expr) throws RuntimeException {
typeConverter.toCalcite(typeFactory, expr.getType()), expr.input().accept(this), safeCast);
}

AtomicInteger correlIdCount = new AtomicInteger(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? It's never used.

if (outerref.isPresent()) {
if (segment instanceof FieldReference.StructField) {
FieldReference.StructField f = (FieldReference.StructField) segment;
var node = referenceRelList.get(outerref.get() - 1).get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to track or handle the fact the there might be multiple Filters that can add correlation variables? We only ever add to this list.

How would we know which ones came from which input?

This is one the issue I had in mind when I mentioned
https://github.com/substrait-io/substrait-java/pull/383/files#r2038429378

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly the Calcite docs don't really help with the scope of these variables. There is some concept of 'namespace' eg from this error .

All correlation variables should resolve to the same namespace. Prev ns=org.apache.calcite.sql.validate.IdentifierNamespace@d36c1c3, new ns=org.apache.calcite.sql.validate.IdentifierNamespace@96abc7

which came from

select
    c1.c_name,
    o1.o_orderstatus,
    o1.o_totalprice
from
    customer c1,
    orders o1
where
    o1.o_custkey = c1.c_custkey
    and o1.o_totalprice > (
        select
            avg(o_totalprice)
        from
            orders o2, customer c2
        where
            o2.o_totalprice < c1.c_acctbal
            and o2.o_totalprice > (
                select
                    avg(c3.c_acctbal)
                from
                    customer c3
                where
                    c3.c_custkey = o2.o_custkey
                    and c3.c_address = o1.o_clerk
            )
    );

change the last line to c3.c_address = o2.o_clerk and it's ok..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implications is that each relation and it's immediate subexpression is the same namespace.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of things I'm not sure about in this PR.

When processing the Filter

@Override
  public RelNode visit(Filter filter) throws RuntimeException {
    RelNode input = filter.getInput().accept(this);
    final Holder<RexCorrelVariable> v = Holder.empty();
    expressionRexConverter.addCorrelVariable(v);

    RelBuilder r1 = relBuilder.push(input).variable(v::set);
    RexNode filterCondition = filter.getCondition().accept(expressionRexConverter);
    RelNode node = r1.filter(List.of(v.get().id), filterCondition).build();
    return applyRemap(node, filter.getRemap());
  }

How does this work if there is more than 1 FieldReference to a variable outside the subquery? It's perfectly valid to construct a Substrait Filter like OUTER_REF_1.foo = 'a' AND OUTER_REF_2.bar = bar in Substrait, and I believe you can do something similar in SQL if you reference multiple fields from outside the subquery boundary.

Additionally, when constructing a Substrait subquery you could have multiple Filters. Your resolution logic uses

var node = referenceRelList.get(outerref.get() - 1).get();

but we only ever append to referenceRelList. If multiple Filters have outer references with outer step 1, the first one processed would point to referenceRelList[0] which would be correct, but the second Filter processed would also point to referenceRelList[0], which would be wrong.

What happens if you have multiple subquery boundaries and you have an outer reference more than 1 step out? For example if we have 2 subqueries and we only add 1 correlation variable to referenceRelList, then you would try to fetch referenceRelList[1], which would throw a NullPointerException.

I'm not confident that these changes are able to handle outer references more generally beyond the specific 01.sql test you've added.

What I would like to see is more testing, both from the SQL side but also from the Substrait side directly.

import org.junit.jupiter.params.provider.ValueSource;

/**
* Updated TPC-H test to convert SQL to Substrait and replay those plans back to SQL Validating that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean TPC-D here?

*/
@TestMethodOrder(OrderAnnotation.class)
@TestInstance(Lifecycle.PER_CLASS)
public class TestTpcdsQuery extends PlanTestBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.

*/
@TestMethodOrder(OrderAnnotation.class)
@TestInstance(Lifecycle.PER_CLASS)
public class TestTpchQuery extends PlanTestBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add these tests separately from your PR if possible, as it would be reasonable to have as many of these that already work running while we figure out the outer reference issues.

@mbwhite
Copy link
Contributor Author

mbwhite commented Apr 22, 2025

Hi @vbarua - thanks for the additional thoughts; would share the concern that there might be cases such that this change isn't general enough.

Though I do admit I've found it hard to create SQL queries that use these references!

based on the last comments - is it worth doing that separate PR updating the testing and come back to this fix?

@vbarua
Copy link
Member

vbarua commented Apr 26, 2025

Amusingly, I was able to craft an example

select
    c1.c_name,
    o1.o_orderstatus,
    o1.o_totalprice
from
    CUSTOMER c1,
    ORDERS o1
where
    o1.o_custkey = c1.c_custkey
    and o1.o_totalprice > (select avg(o_totalprice) from ORDERS o2 where o2.o_totalprice < c1.c_acctbal)
    and o1.o_totalprice < (select max(c_acctbal) from CUSTOMER c2 where c2.c_custkey = o1.o_custkey);

that is parseable and plannable in both MySql and Postgres (dbfiddle) but fails in Calcite with

All correlation variables should resolve to the same namespace.

which is a new one for me.

I've found it hard to create SQL queries that use these references!

It's definitely tricky, and it's probably easier to write tests directly as Substrait plans to check various things. Strictly speaking as well, the API you're modifying here is the Substrait -> Calcite conversion. The TPCH query that you were fixing just happens to generate a plan that uses the Substrait feature we're trying to test.

is it worth doing that separate PR updating the testing and come back to this fix?

If you want to get the tests in that's what I recommend. I think it makes sense generally to be able to run those tests and IMO that's a much more straightforward change generally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants