Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Physical Properties #123

Open
AveryQi115 opened this issue Mar 21, 2024 · 0 comments
Open

Physical Properties #123

AveryQi115 opened this issue Mar 21, 2024 · 0 comments
Assignees

Comments

@AveryQi115
Copy link
Contributor

AveryQi115 commented Mar 21, 2024

Concept

In cascades framework, physical properties have two features:

  • derived
  • required

Derived are the physical properties derived from children, for example, scan from a table whose column 1 is ordered ascend makes the scan node have the derived property as column 1 ordered ascend.

Required are the physical properties required by the query, for example, select * from t1 order by t1.v1 makes the root node having the required physical properties as v1 is required to be ordered.

Once derived properties cannot support all required properties, operator created by enforcer rule is inserted in the operator tree.

And where to insert the operator created by enforcer rule will depend on the calculated costs. (Determine select v1, v2 from t1 order by v1 is transferred to Sort->Projection->TableScan or Projection->Sort->TableScan, more complicated cases include join, filter, subqueries...)

Design

Derived properties are for expression base, which means multiple expressions in one group in optd can have different derived physical properties. (For example, sort-merge join provide the column in order, while hash join not, but they are logically equivalent.)

Required properties are for group base, imagining a query like select * from t1 order by v1. The root node group should have the required properties (ordering = v1).

Problems to be solved

A critical issue is that we should calculate the cost to determine where to put enforcer rule, for example, Select node can choose children which already have the required properties in their derived and the cost is the children expression cost, it can also choose children which not have the derived properties and insert an enforcer rule on top of that, the cost then becomes the enforcer operator + the children cost.

However, as the children representations are placeholder for group id in optd, we cannot directly pick best children (winner) with/without required physical properties. And as the derived properties are for expression based, we might need to traverse all the expressions in the child group.

This change of logic hugely interferes the optimizeGroup task(we need to provide an interface for required physical properties) and optimizeInput task(we need to calculate different winners for different physical properties requirement) and the main memo structure(we need to add physical properties storage for each expr/exprId and add infer derived physical properties when adding new expr to group).

References

CockroachDB's doc(https://github.com/cockroachdb/cockroach/blob/fe36467047fbbf88569fc03cff97fb4d728122f3/pkg/sql/opt/xform/optimizer.go#L319) provides a good example of how to develop physical properties and enforcer rule. It launches two optimizeGroup tasks, one with required properties, one without. The task without required properties need to add the cost of enforcer operator. And it then compares the cost of their result. For the children, it is separating the child group in different properties, so these plans are different: (select G2="ordering: y" G3) and (select G2 G3).

Columnbia's paper provides definition of the winner structure with physical properties. It states as

In Columbia, a winner is also used to store the temporary result of a search. While the costs of physical multi-expressions of a group are being calculated, the cheapest yet found expression is stored as a winner. During the optimization process, the winner is improved and finally the best (cheapest) plan is found. Sometimes, when no physical multi-expression can be found with the required physical property, we store the multi-expression pointer as NULL to indicate no winner for that physical property. Since no winner is also a solution for a search of this sub-problem, this information is memoized and will be useful in the future optimizing process. The following is the definition of data members in a WINNER class in Columbia.

image

Calcite is creating subGroups within groups. subGroups are expressions that logically equivalent and have same physical properties. It then optimizeSubGroups and finds winner within each subgroup. (https://github.com/apache/calcite/blob/4823cb7760913f236e7f0f2cb149325b55a3f124/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java#L84)

Implementation

Generally, my idea is to

  1. change the winner in groupInfo into an array of winner with its physical properties (derived properties).
  2. Adds context(required properties) to optimizeGroup, optimizeInput, OptimizeExpression.
  3. For group with required properties, call optimizeGroup twice to get two kinds of winner, winner with required properties and winner without required properties + cost of enforcer operator, and storing winner for the required properties for current group.
  4. for each newly added expressions, call infer_physical_properties to created derived physical properties.
@AveryQi115 AveryQi115 self-assigned this Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant