[WIP] Enable multi-partition Select operations containing basic aggregations
#17941
base: branch-25.04
Partial comments
self.sub_expr = sub_expr
self.children = children
question: what's the difference between a sub_expr and a child? Is the sub_expr just the thing this came from?
A FusedExpr node corresponds to multiple fused expression nodes that are effectively "owned" by the FusedExpr object.
For example, let's assume we have an expression like A = B + Sum(C + D). Since A contains a non-pointwise node, we can decompose it into two FusedExpr nodes:
F0 = B + F1
F1 = Sum(C + D)
The sub_expr attributes of F0 and F1 correspond to the Expr nodes that they "own":
F0.sub_expr = B + F1
F1.sub_expr = Sum(C + D)
The children of a FusedExpr node can only contain other FusedExpr nodes:
F0.children = (F1,)
F1.children = ()
The distinction between sub_expr and children is important during evaluation. When we evaluate F0, we must already know the result of evaluating its children (F1), but we don't need to know the result of any other nodes in F0.sub_expr.
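To make the sub_expr / children distinction concrete, here is a minimal, self-contained sketch of the A = B + Sum(C + D) example. The classes Col, Add, Sum, and Fused and the helper evaluate_fused are hypothetical toy stand-ins invented for illustration; they are not the Expr or FusedExpr classes from this PR, and plain Python lists stand in for columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    """Leaf node: a named input column (pointwise)."""
    name: str


@dataclass(frozen=True)
class Add:
    """Pointwise addition of two sub-expressions."""
    left: object
    right: object


@dataclass(frozen=True)
class Sum:
    """Non-pointwise aggregation: reduces a column to a scalar."""
    arg: object


@dataclass(frozen=True)
class Fused:
    """Toy stand-in for FusedExpr: owns sub_expr, depends on children."""
    name: str
    sub_expr: object
    children: tuple = ()


def _add(lhs, rhs):
    """Add two values, broadcasting a scalar against a list."""
    if not isinstance(lhs, list) and not isinstance(rhs, list):
        return lhs + rhs
    if not isinstance(lhs, list):
        lhs = [lhs] * len(rhs)
    if not isinstance(rhs, list):
        rhs = [rhs] * len(lhs)
    return [a + b for a, b in zip(lhs, rhs)]


def evaluate(node, columns, results):
    """Evaluate owned nodes; a nested Fused node is looked up, not recomputed."""
    if isinstance(node, Fused):
        return results[node.name]
    if isinstance(node, Col):
        return columns[node.name]
    if isinstance(node, Add):
        return _add(
            evaluate(node.left, columns, results),
            evaluate(node.right, columns, results),
        )
    if isinstance(node, Sum):
        return sum(evaluate(node.arg, columns, results))
    raise TypeError(f"unsupported node: {node!r}")


def evaluate_fused(fused, columns, results=None):
    """Evaluate all children first, then the sub_expr owned by this node."""
    results = {} if results is None else results
    for child in fused.children:
        if child.name not in results:
            evaluate_fused(child, columns, results)
    results[fused.name] = evaluate(fused.sub_expr, columns, results)
    return results[fused.name]


# A = B + Sum(C + D), decomposed as F0 = B + F1 and F1 = Sum(C + D)
F1 = Fused("F1", Sum(Add(Col("C"), Col("D"))))
F0 = Fused("F0", Add(Col("B"), F1), children=(F1,))

columns = {"B": [1, 2, 3], "C": [1, 1, 1], "D": [2, 2, 2]}
print(evaluate_fused(F0, columns))  # [10, 11, 12]
```

In this toy version, evaluating F0 only requires the finished result of F1; the Col and Add nodes inside F0.sub_expr are computed on the spot.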
    """
    expr_partition_counts: MutableMapping[Expr, int] = update or {}
    for expr in exprs:
        for node in list(traversal([expr]))[::-1]:
Are you doing this because you want a child-before-parent traversal?
Yes, we are transforming the Expr graph to comprise only FusedExpr nodes, and FusedExpr.children may only comprise other FusedExpr nodes. Therefore, we must transform the expression using a child-before-parent traversal.
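As a rough illustration of why reversing the traversal gives a child-before-parent order, the sketch below uses a hypothetical Node class and a simple pre-order traversal helper written for this example. It only mimics the list(traversal([expr]))[::-1] idiom from the diff and assumes a tree-shaped expression with no shared sub-expressions.

```python
class Node:
    """Toy expression node (hypothetical, for illustration only)."""

    def __init__(self, name, *children):
        self.name = name
        self.children = children


def traversal(roots):
    """Pre-order, depth-first traversal: a parent is yielded before its children."""
    stack = list(roots)
    while stack:
        node = stack.pop()
        yield node
        # Push children in reverse so the left-most child is visited first.
        stack.extend(reversed(node.children))


# A = B + Sum(C + D), written as a tree of toy nodes
a = Node("A", Node("B"), Node("Sum", Node("Add", Node("C"), Node("D"))))

pre_order = [n.name for n in traversal([a])]
child_first = [n.name for n in list(traversal([a]))[::-1]]

print(pre_order)    # ['A', 'B', 'Sum', 'Add', 'C', 'D']
print(child_first)  # ['D', 'C', 'Add', 'Sum', 'B', 'A']
```

Because every node in the reversed list appears after all of its descendants, a rewrite pass over child_first can assume each child has already been transformed by the time its parent is visited.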
            elif isinstance(node, Agg):
                # Assume all aggregations produce 1 partition
                expr_partition_counts[node] = 1
TODO: This is not right, I think, if we observe the Agg inside a groupby.
We should probably attach the execution context of an expression when constructing it (and remove the ExecutionContext argument from do_evaluate).
I think you may need to explain this one to me offline :)
def rename_agg(agg: Agg, new_name: str):
    """Modify the name of an aggregation expression."""
    return CachingVisitor(
        replace_sub_expr,
        state={"mapping": {agg: Agg(agg.dtype, new_name, agg.options, *agg.children)}},
    )(agg)
"Renaming" feels like the wrong thing, because the options for one agg might not apply to the options of another.
> the options for one agg might not apply to the options of another
Yeah, that's totally true. We definitely need to figure out how the options "plumbing" should work here. I was hoping the options would normally translate in a trivial way, but I don't have a great sense for the range of possibilities.
def replace_sub_expr(e: Expr, rec: ExprTransformer):
    """Replace a target expression node."""
    mapping = rec.state["mapping"]
    if e in mapping:
        return mapping[e]
    return reuse_if_unchanged(e, rec)
I think you could just call this replace, and have replace(expr, mapping) as the public function.
I'm following the CachingVisitor pattern used elsewhere. Don't I need to include the rec: ExprTransformer argument?
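One way to read the two comments together is that the recursive worker keeps its rec argument while a thin public wrapper hides the visitor plumbing. The sketch below illustrates that shape with hypothetical toy classes (Node, Visitor) that only imitate the pattern visible in the diff; it is not the cudf-polars CachingVisitor or reuse_if_unchanged, and it makes no claim about how the PR resolved this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy immutable, hashable expression node (hypothetical)."""
    name: str
    children: tuple = ()


class Visitor:
    """Toy memoizing visitor: calls fn(node, self) and caches per node."""

    def __init__(self, fn, state):
        self.fn = fn
        self.state = state
        self.cache = {}

    def __call__(self, node):
        if node not in self.cache:
            self.cache[node] = self.fn(node, self)
        return self.cache[node]


def _replace(e, rec):
    """Recursive worker: needs rec to read the mapping and to recurse."""
    mapping = rec.state["mapping"]
    if e in mapping:
        return mapping[e]
    new_children = tuple(rec(c) for c in e.children)
    # Reuse the original node when no child changed.
    if all(new is old for new, old in zip(new_children, e.children)):
        return e
    return Node(e.name, new_children)


def replace(expr, mapping):
    """Public entry point: callers pass only the expression and the mapping."""
    return Visitor(_replace, {"mapping": mapping})(expr)


b, c = Node("B"), Node("C")
old_agg = Node("sum", (c,))
root = Node("add", (b, old_agg))

new_root = replace(root, {old_agg: Node("mean", (c,))})
print(new_root.children[1].name)  # mean
```

With this split, the rec parameter stays on the worker, where the visitor needs it, and callers never construct the visitor themselves.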
Description
This is still a rough POC/WIP.
The overall goal is to enable us to decompose arbitrary Expr graphs containing one or more "non-pointwise" nodes. In order to achieve this, I propose that we add an experimental FusedExpr class (and related Expr-graph decomposition utilities). The general idea is that we can iteratively traverse an Expr-graph in reverse-topological order, and rewrite the graph until it is entirely composed of FusedExpr nodes. From there, it becomes relatively simple to build the task graph for each FusedExpr node independently.
Checklist