[WIP] Enable multi-partition `Select` operations containing basic aggregations #17941

Draft

rjzamora wants to merge 12 commits into rapidsai:branch-25.04 from rjzamora:complex-aggregations

+486 −15

Member

rjzamora commented Feb 6, 2025

Description

This is still a rough POC/WIP.

The overall goal is to enable us to decompose arbitrary Expr graphs containing one or more "non-pointwise" nodes. In order to achieve this, I propose that we add an experimental FusedExpr class (and related Expr-graph decomposition utilities). The general idea is that we can iteratively traverse an Expr-graph in reverse-topological order, and rewrite the graph until it is entirely composed of FusedExpr nodes. From there, it becomes relatively simple to build the task graph for each FusedExpr node independently.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora added 12 commits

February 6, 2025 07:58


          add basic aggregation support

523f0ef


          roll back change to literal.py

4b6f180


          make get_expr_partition_count more efficient

920c361


          make get_expr_partition_count more efficient

71bbe53


          fix copyright changes

a6b05a9


          use traversal

3092c58


          roll back unnecessary date change

c4ed2a6


          move fuse_expr_graph

5af267b


          cleanup

c13e916


          add mean support

663db89


          update some comments

062c322


          add todo comment

5f7d73c

rjzamora added feature request 2 - In Progress non-breaking cudf.polars labels

rjzamora self-assigned this

copy-pr-bot bot commented Feb 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot added the Python label

Member Author

rjzamora commented Feb 6, 2025

Note: The rewrite approach used in this PR is illustrated in the figure below.

wence- reviewed

Contributor

wence- left a comment

Partial comments

python/cudf_polars/cudf_polars/experimental/expressions.py

Comment on lines +55 to +56

    
    self.sub_expr = sub_expr
    self.children = children

Contributor

wence- Feb 7, 2025

question: what's the difference between a sub_expr and a child? Is the sub_expr just the thing this came from?

Member Author

rjzamora Feb 7, 2025

A FusedExpr node corresponds to multiple fused expression nodes that are effectively "owned" by the FusedExpr object.

For example, let's assume we have an expression like A = B + Sum(C + D). Since A contains a non-pointwise node, we can decompose it into two FusedExpr nodes:

F0 = B + F1
F1 = Sum(C + D).

The sub_expr attributes of F0 and F1 correspond to the Expr nodes that they "own":

F0.sub_expr = B + F1
F1.sub_expr = Sum(C + D)

The children of a FusedExpr node can only contain other FusedExpr nodes:

F0.children = (F1,)
F1.children = ()

The distinction between sub_expr and children is important during evaluation. When we evaluate F0, we must already know the result of evaluating its children (F1), but we don't need to know the result of any other nodes in F0.sub_expr.
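The decomposition described above can be sketched with toy classes. This is a minimal illustration only: `Node`, `Fused`, `_rewrite`, and `fuse` are hypothetical stand-ins, not the `Expr`/`FusedExpr` classes or utilities in this PR, and the `pointwise` flag is an assumed way to mark nodes such as `Sum`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy expression node (hypothetical stand-in for Expr)."""
    name: str
    children: tuple = ()
    pointwise: bool = True


@dataclass(frozen=True)
class Fused:
    """Toy stand-in for FusedExpr: owns sub_expr, depends only on Fused nodes."""
    sub_expr: Node
    children: tuple = ()


def _rewrite(node):
    """Return (new_node, fused_deps), wrapping each non-pointwise node in Fused."""
    new_children, deps = [], []
    for child in node.children:
        new_child, child_deps = _rewrite(child)
        new_children.append(new_child)
        deps.extend(child_deps)
    new_node = Node(node.name, tuple(new_children), node.pointwise)
    if not node.pointwise:
        # A non-pointwise node closes off its own fused group.
        fused = Fused(new_node, tuple(deps))
        return fused, [fused]
    return new_node, deps


def fuse(root):
    """Wrap the root so the whole graph is made of Fused nodes."""
    new_root, deps = _rewrite(root)
    return new_root if isinstance(new_root, Fused) else Fused(new_root, tuple(deps))


# A = B + Sum(C + D)
b, c, d = Node("B"), Node("C"), Node("D")
a = Node("add", (b, Node("sum", (Node("add", (c, d)),), pointwise=False)))
f0 = fuse(a)
(f1,) = f0.children  # F0.children = (F1,)
```

In this sketch `f0.sub_expr` is `B + F1`, `f1.sub_expr` is `Sum(C + D)`, and `f1.children` is empty, matching the F0/F1 example above.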

python/cudf_polars/cudf_polars/experimental/expressions.py

    
                  """

                  expr_partition_counts: MutableMapping[Expr, int] = update or {}
                  for expr in exprs:
                      for node in list(traversal([expr]))[::-1]:

Contributor

wence- Feb 7, 2025

Are you doing this because you want a child-before-parent traversal?

Member Author

rjzamora Feb 7, 2025

Yes, we are transforming the Expr graph so that it comprises only FusedExpr nodes, and FusedExpr.children may only contain other FusedExpr nodes. Therefore, we must transform the expression using a child-before-parent traversal.
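For illustration, here is a minimal sketch of why reversing a parent-first traversal yields a child-before-parent order. The `Node` class and `preorder` function are hypothetical toys, not the `traversal` utility used in the PR, and the ordering guarantee shown here is for a tree-shaped expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy expression node (hypothetical, not the cudf_polars Expr)."""
    name: str
    children: tuple = ()


def preorder(root):
    """Yield each node once, always before its children."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        for child in reversed(node.children):
            if child not in seen:
                seen.add(child)
                stack.append(child)


# A = B + Sum(C + D)
b, c, d = Node("B"), Node("C"), Node("D")
cd = Node("C+D", (c, d))
s = Node("Sum", (cd,))
a = Node("A", (b, s))

# Reversing the parent-first order visits every child before its parent.
child_first = list(preorder(a))[::-1]
print([n.name for n in child_first])  # ['D', 'C', 'C+D', 'Sum', 'B', 'A']
```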

python/cudf_polars/cudf_polars/experimental/expressions.py

Comment on lines +116 to +118

    
    elif isinstance(node, Agg):
        # Assume all aggregations produce 1 partition
        expr_partition_counts[node] = 1

Contributor

wence- Feb 7, 2025

TODO: This is not right, I think, if we observe the Agg inside a groupby.

We should probably attach the execution context of an expression when constructing it. (and remove the ExecutionContext argument from do_evaluate).

Member Author

rjzamora Feb 7, 2025

I think you may need to explain this one to me offline :)

python/cudf_polars/cudf_polars/experimental/expressions.py

Comment on lines +145 to +150

    
    def rename_agg(agg: Agg, new_name: str):
        """Modify the name of an aggregation expression."""
        return CachingVisitor(
            replace_sub_expr,
            state={"mapping": {agg: Agg(agg.dtype, new_name, agg.options, *agg.children)}},
        )(agg)

Contributor

wence- Feb 7, 2025

"Renaming" feels like the wrong thing, because the options for one agg might not apply to the options of another.

Member Author

rjzamora Feb 7, 2025

> the options for one agg might not apply to the options of another

Yeah, that's totally true. We definitely need to figure out how the options "plumbing" should work here. I was hoping the options would normally translate in a trivial way, but I don't have a great sense for the range of possibilities.

python/cudf_polars/cudf_polars/experimental/expressions.py

Comment on lines +137 to +142

    
    def replace_sub_expr(e: Expr, rec: ExprTransformer):
        """Replace a target expression node."""
        mapping = rec.state["mapping"]
        if e in mapping:
            return mapping[e]
        return reuse_if_unchanged(e, rec)

Contributor

wence- Feb 7, 2025

I think you could just call this replace, and have replace(expr, mapping) as the public function.

Member Author

rjzamora Feb 7, 2025

I'm following the CachingVisitor pattern used elsewhere. Don't I need to include the rec: ExprTransformer argument?
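One possible reading of the suggestion, sketched with toy classes: the inner function keeps its `rec` argument, and a public `replace(expr, mapping)` wrapper builds the visitor and hides it. `Node`, `ToyVisitor`, and `_replace` below are hypothetical stand-ins, not the cudf_polars `Expr`, `CachingVisitor`, or `reuse_if_unchanged`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy expression node (hypothetical, not the cudf_polars Expr)."""
    name: str
    children: tuple = ()


class ToyVisitor:
    """Toy memoising transformer: calls fn(node, self) and caches the result."""

    def __init__(self, fn, state):
        self.fn = fn
        self.state = state
        self.cache = {}

    def __call__(self, node):
        if node not in self.cache:
            self.cache[node] = self.fn(node, self)
        return self.cache[node]


def _replace(node, rec):
    """Inner transformer: still takes `rec`, as the visitor pattern requires."""
    mapping = rec.state["mapping"]
    if node in mapping:
        return mapping[node]
    new_children = tuple(rec(child) for child in node.children)
    # Reuse the original node when no child changed.
    if all(new is old for new, old in zip(new_children, node.children)):
        return node
    return Node(node.name, new_children)


def replace(expr, mapping):
    """Public entry point: callers pass only the expression and the mapping."""
    return ToyVisitor(_replace, {"mapping": mapping})(expr)


target = Node("sum", (Node("C"),))
expr = Node("add", (Node("B"), target))
new_expr = replace(expr, {target: Node("F1")})
```

Here `new_expr` is `add(B, F1)`, while `replace(expr, {})` returns `expr` itself because nothing changed.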

Labels

2 - In Progress cudf.polars feature request non-breaking Python