feat: Filter Pushdown Rule #140

Merged
merged 46 commits into from
Apr 9, 2024

jurplel
Member

@jurplel jurplel commented Mar 26, 2024

This PR brings a filter pushdown heuristic rule, built on @AveryQi115's hybrid scheme.

Filter Pushdown Rule

  • This series of rules matches on any filter, and pushes any part of the filter down below a node if possible.
  • They are registered in the HeuristicsRuleWrapper currently, but they work as cost-based rules as well.
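As a rough illustration of "pushing any part of the filter down below a node", the sketch below splits the conjuncts of a filter sitting on top of an inner join into those that reference only left columns, those that reference only right columns, and those that must stay above the join. The `Expr` type, `SplitFilter`, and `split_for_inner_join` are hypothetical simplifications for this example and are not optd's actual types or rule implementation.

```rust
// Hypothetical, simplified expression type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col(usize), // column index in the join's output schema
    Lit(i64),
    Eq(Box<Expr>, Box<Expr>),
}

// Collect every column index referenced by an expression.
fn column_refs(expr: &Expr, out: &mut Vec<usize>) {
    match expr {
        Expr::Col(i) => out.push(*i),
        Expr::Lit(_) => {}
        Expr::Eq(l, r) => {
            column_refs(l, out);
            column_refs(r, out);
        }
    }
}

// Result of splitting a conjunction around a join (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct SplitFilter {
    left: Vec<Expr>,  // conjuncts that reference only left-side columns
    right: Vec<Expr>, // conjuncts that reference only right-side columns
    keep: Vec<Expr>,  // conjuncts that must stay above the join
}

// Assumes an inner join whose output is the left columns followed by the
// right columns. Right-side conjuncts still carry join-output indices here
// and would need remapping before being placed below the join.
fn split_for_inner_join(conjuncts: Vec<Expr>, left_schema_size: usize) -> SplitFilter {
    let mut split = SplitFilter::default();
    for conjunct in conjuncts {
        let mut cols = Vec::new();
        column_refs(&conjunct, &mut cols);
        if cols.is_empty() {
            // Constant predicates are left in place in this sketch.
            split.keep.push(conjunct);
        } else if cols.iter().all(|&i| i < left_schema_size) {
            split.left.push(conjunct);
        } else if cols.iter().all(|&i| i >= left_schema_size) {
            split.right.push(conjunct);
        } else {
            split.keep.push(conjunct);
        }
    }
    split
}
```

With a left schema of two columns, a conjunct on column 0 goes left, a conjunct on column 3 goes right, and a conjunct comparing columns 1 and 2 stays above the join.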

Helper Functions

  • LogOpExpr::new_flattened_nested_logical creates a new LogOpExpr from an ExprList, and it flattens any nested LogOpExprs of the same LogOpType.
  • Expr::rewrite_column_refs recursively rewrites any ColumnExpr in an expression tree, using a provided rewrite_fn.
  • LogicalJoin::map_through_join takes the left and right schema sizes and maps a column index to the index it would have after being pushed down to the left or right side of a join.
  • LogicalProjection::compute_column_mapping creates a ColumnMapping object from a LogicalProjection.
    • The ColumnMapping object has a few methods; the most important is rewrite_condition, which takes an expr and rewrites it using the projection's mapping.
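The helpers above could look roughly like the following on a toy expression type. This is a minimal sketch under stated assumptions: the `Expr` enum, the `JoinSide` enum, the free-function form, and every signature here are hypothetical and only borrow the names from the list above; the real optd helpers are methods on its own types and may differ.

```rust
// Hypothetical, simplified expression type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col(usize),
    Lit(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Vec<Expr>),
}

// Gather children of nested ANDs into one flat list.
fn collect_and(children: Vec<Expr>, flat: &mut Vec<Expr>) {
    for child in children {
        match child {
            Expr::And(inner) => collect_and(inner, flat),
            other => flat.push(other),
        }
    }
}

// Sketch of the flattening idea, restricted to AND for brevity.
fn new_flattened_and(children: Vec<Expr>) -> Expr {
    let mut flat = Vec::new();
    collect_and(children, &mut flat);
    Expr::And(flat)
}

// Recursively rewrite every column reference. Returns None if the rewrite
// function rejects any column, meaning the expression cannot be moved.
fn rewrite_column_refs(expr: &Expr, rewrite_fn: &dyn Fn(usize) -> Option<usize>) -> Option<Expr> {
    Some(match expr {
        Expr::Col(i) => Expr::Col(rewrite_fn(*i)?),
        Expr::Lit(v) => Expr::Lit(*v),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(rewrite_column_refs(l, rewrite_fn)?),
            Box::new(rewrite_column_refs(r, rewrite_fn)?),
        ),
        Expr::And(children) => Expr::And(
            children
                .iter()
                .map(|c| rewrite_column_refs(c, rewrite_fn))
                .collect::<Option<Vec<_>>>()?,
        ),
    })
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinSide {
    Left,
    Right,
}

// Map an index in the join's output schema (left columns, then right
// columns) to the index it would have inside the chosen child.
fn map_through_join(index: usize, side: JoinSide, left_size: usize, right_size: usize) -> Option<usize> {
    match side {
        JoinSide::Left if index < left_size => Some(index),
        JoinSide::Right if index >= left_size && index < left_size + right_size => {
            Some(index - left_size)
        }
        _ => None,
    }
}
```

For a join with two left columns and three right columns, rewriting `#3 = 7` for the right side yields `#1 = 7`, while rewriting it for the left side yields `None`, which signals that the predicate cannot be pushed to that side.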

Testing Utilities

  • new_test_optimizer creates a new heuristic optimizer, which applies a given rule. It uses a TpchCatalog.
  • TpchCatalog is a catalog implementing a couple of tables from the TPC-H schema. It can be extended to have more as needed.
  • DummyCostModel implements a cost model that always returns zero cost. It is meant for constructing a cascades optimizer without a real cost model, and isn't used in this PR.
  • Using these test optimizer components, this pull request introduces a new testing scheme: each test runs a constructed query plan through an optimizer instead of relying on text-based SQL planner tests, which may be flaky. The new tests also exercise rules in isolation.
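The testing scheme described above can be pictured with a toy example: build a plan by hand, run it through an optimizer that applies a single rule, and compare the result with the expected plan. Everything in this sketch (`Plan`, `Rule`, `push_filter_below_sort`, `optimize`) is hypothetical and stands in for the real optd types, new_test_optimizer, and the filter pushdown rules.

```rust
// Hypothetical, simplified plan type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(&'static str),
    Sort(Box<Plan>),
    Filter(&'static str, Box<Plan>),
}

// A rule returns a rewritten plan when it matches the root, else None.
type Rule = fn(&Plan) -> Option<Plan>;

// Toy rule: Filter(Sort(x)) becomes Sort(Filter(x)).
fn push_filter_below_sort(plan: &Plan) -> Option<Plan> {
    match plan {
        Plan::Filter(cond, child) => match child.as_ref() {
            Plan::Sort(grandchild) => Some(Plan::Sort(Box::new(Plan::Filter(
                *cond,
                grandchild.clone(),
            )))),
            _ => None,
        },
        _ => None,
    }
}

// Toy top-down optimizer: apply the single rule at each node until it
// stops matching, then recurse into the children.
fn optimize(plan: &Plan, rule: Rule) -> Plan {
    let mut current = plan.clone();
    while let Some(next) = rule(&current) {
        current = next;
    }
    match current {
        Plan::Scan(table) => Plan::Scan(table),
        Plan::Sort(child) => Plan::Sort(Box::new(optimize(&child, rule))),
        Plan::Filter(cond, child) => Plan::Filter(cond, Box::new(optimize(&child, rule))),
    }
}
```

A test in this style constructs `Filter(Sort(Sort(Scan)))`, optimizes it with only `push_filter_below_sort` enabled, and asserts that the result equals `Sort(Sort(Filter(Scan)))`. No SQL text or plan printing is involved, so the assertion depends only on the rule under test.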

@yliang412
Member

I was looking at Umbra's plan for Q7. Would it be helpful to push down hints for the (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE') predicate? You still need to keep the join, but you could push down FRANCE or GERMANY to both sides and reduce the number of things to look at.

TPC-H Q7:

SELECT
    supp_nation,
    cust_nation,
    l_year,
    SUM(volume) AS revenue
FROM
    (
        SELECT
            n1.n_name AS supp_nation,
            n2.n_name AS cust_nation,
            EXTRACT(YEAR FROM l_shipdate) AS l_year,
            l_extendedprice * (1 - l_discount) AS volume
        FROM
            supplier,
            lineitem,
            orders,
            customer,
            nation n1,
            nation n2
        WHERE
            s_suppkey = l_suppkey
            AND o_orderkey = l_orderkey
            AND c_custkey = o_custkey
            AND s_nationkey = n1.n_nationkey
            AND c_nationkey = n2.n_nationkey
            AND (
                (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
                OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
            )
            AND l_shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
    ) AS shipping
GROUP BY
    supp_nation,
    cust_nation,
    l_year
ORDER BY
    supp_nation,
    cust_nation,
    l_year;
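To make the suggested hint concrete, here is a minimal sketch of how a single-column predicate could be derived from an OR of ANDs of `column = literal` atoms: a column that is constrained in every disjunct must take one of the values seen across the disjuncts. The `Atom` alias and `implied_in_lists` function are hypothetical, and the sketch does not claim where such a step would live in optd or in Umbra. The derived predicate is weaker than the original, so the original filter still has to be kept above the join.

```rust
use std::collections::{BTreeMap, BTreeSet};

// One `column = 'literal'` atom (hypothetical representation).
type Atom = (&'static str, &'static str);

// Input: a disjunction, where each disjunct is a conjunction of atoms.
// Output: for each column constrained in every disjunct, the set of values
// it may take. Each entry reads as `column IN (values)`.
fn implied_in_lists(
    disjuncts: &[Vec<Atom>],
) -> BTreeMap<&'static str, BTreeSet<&'static str>> {
    let mut result = BTreeMap::new();
    let first = match disjuncts.first() {
        Some(first) => first,
        None => return result,
    };
    // Only columns from the first disjunct can appear in all disjuncts.
    for &(col, _) in first {
        let mut values = BTreeSet::new();
        let mut in_every_disjunct = true;
        for disjunct in disjuncts {
            let mut found = false;
            for &(c, v) in disjunct {
                if c == col {
                    values.insert(v);
                    found = true;
                }
            }
            if !found {
                in_every_disjunct = false;
                break;
            }
        }
        if in_every_disjunct {
            result.insert(col, values);
        }
    }
    result
}
```

For the Q7 predicate this gives `n1.n_name IN ('FRANCE', 'GERMANY')` and `n2.n_name IN ('FRANCE', 'GERMANY')`, each of which references a single table and could therefore be pushed to its own side of the join.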

@jurplel
Member Author

jurplel commented Mar 27, 2024

This is a great example, but if you look at what Umbra is doing, this is actually a separate step from pushdown:
Unoptimized: [image]
Expression Simplification: [image]

If the filter were simplified to a conjunction first, then the filter pushdown implementation would be able to operate on the clauses independently. Dealing with an IN expression with two columns is a separate issue, but I'm not sure optd would simplify the expression like that.

@yliang412
Member

Yep, I agree that this seems to be relevant to expression optimizations.

Contributor

@Sweetsuro Sweetsuro left a comment

Left a couple comments that should probably be addressed

Member

@yliang412 yliang412 left a comment

Overall LGTM. Feel free to merge after addressing the comments (or do it in a separate PR)

optd-datafusion-repr/src/rules/filter_pushdown.rs (two review threads, outdated and resolved)
@jurplel
Member Author

jurplel commented Apr 9, 2024

@Sweetsuro all comments finally addressed. Approve and I will merge it!

Contributor

@Sweetsuro Sweetsuro left a comment

LGTM! Maybe update the PR description before merging

@jurplel jurplel merged commit 3b0e6b7 into main Apr 9, 2024
1 check passed
@jurplel jurplel deleted the bowad/filter_pushdown branch April 9, 2024 15:22