
fix(spark): remove internal functions MakeDecimal and UnscaledValue #386


Merged
1 commit merged into substrait-io:main on Apr 11, 2025

Conversation

andrew-coleman (Contributor)
These two functions are inserted by the Catalyst optimizer for queries that involve aggregation (sum & average) of decimal values.
Approximately 50% of the TPC-DS tests rely on these internal functions, which makes the resulting plans non-interchangeable with other query processors. This commit reverses that particular optimisation before conversion to Substrait, and removes MakeDecimal and UnscaledValue from the spark.yaml file.
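As a rough illustration (not the code from this PR), the reversal amounts to pattern-matching the optimized shape Alias(MakeDecimal(Sum(UnscaledValue(e)))) in the aggregate's result expressions and rebuilding the aggregate over the original decimal column. The helper name undoDecimalAggregates below is hypothetical, and only the Sum shape is shown; the Average shape would need an analogous branch:

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, MakeDecimal, NamedExpression, UnscaledValue}
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Sum}

// Hypothetical sketch: rewrite
//   Alias(MakeDecimal(Sum(UnscaledValue(e)), prec, scale), name)
// back to an aggregate over the original decimal column e, so the plan no
// longer references the two internal functions before conversion to Substrait.
def undoDecimalAggregates(ne: NamedExpression): NamedExpression = ne match {
  case alias @ Alias(md: MakeDecimal, _) =>
    md.child match {
      case ae: AggregateExpression =>
        ae.aggregateFunction match {
          case sum: Sum =>
            sum.child match {
              case uv: UnscaledValue =>
                // Rebuild Sum over the original child, dropping MakeDecimal/UnscaledValue
                val restoredSum = sum.withNewChildren(Seq(uv.child)).asInstanceOf[Sum]
                val restoredAgg = ae.copy(aggregateFunction = restoredSum)
                alias.withNewChildren(Seq(restoredAgg)).asInstanceOf[NamedExpression]
              case _ => ne
            }
          case _ => ne
        }
      case _ => ne
    }
  case other => other
}
```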

@@ -198,8 +221,7 @@ class ToSubstraitRel extends AbstractLogicalPlanVisitor with Logging {
   override def visitWindow(window: Window): relation.Rel = {
     val windowExpressions = window.windowExpressions.map {
       case w: WindowExpression => fromWindowCall(w, window.child.output)
-      case a: Alias if a.child.isInstanceOf[WindowExpression] =>
-        fromWindowCall(a.child.asInstanceOf[WindowExpression], window.child.output)
+      case Alias(w: WindowExpression, _) => fromWindowCall(w, window.child.output)
andrew-coleman (Contributor, Author):

Same condition, just more scala-like!

val actualResultExprs = agg.aggregateExpressions.map {
// eliminate the internal MakeDecomal and UnscaledValue functions by undoing the spark optimisation:
// https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2223
case Alias(expr, name) =>
Contributor:

hm, an alternative would be to disable the rule 🤔 that'd be a bit nicer maybe since there's no reason to optimize something just to unoptimize it, and also if someone wants to use substrait-spark for transferring optimized spark plans to other spark instance, they could then do that (I think we could still remove these functions from the spark.yml in this repo, but it'd be easier to then add them back). But I don't know if we have any good way of excluding rules, I guess it'd need to be like a readme thing saying "please exclude these rules before sending the plan over to substrait-spark"
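For reference, Spark has a config for excluding optimizer rules by class name. A minimal usage sketch, assuming the rule linked above is DecimalAggregates and that Spark treats it as excludable; `spark` is a SparkSession:

```scala
// Assumption: DecimalAggregates is the rule that introduces MakeDecimal/UnscaledValue,
// and it is not on Spark's non-excludable rule list.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.DecimalAggregates")
```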

andrew-coleman (Contributor, Author):

I agree that would be the cleanest solution (for us). But it would throw the problem back to the end user and make the API harder to use. Since we can handle the situation within this library, we don't have to bother our users with these nasty internal issues - it will "just work"! 😁

@@ -133,7 +133,30 @@ class ToSubstraitRel extends AbstractLogicalPlanVisitor with Logging {
    */
   override def visitAggregate(agg: Aggregate): relation.Rel = {
     val input = visit(agg.child)
-    val actualResultExprs = agg.aggregateExpressions
+    val actualResultExprs = agg.aggregateExpressions.map {
+      // eliminate the internal MakeDecomal and UnscaledValue functions by undoing the spark optimisation:
Contributor:
Suggested change:
-      // eliminate the internal MakeDecomal and UnscaledValue functions by undoing the spark optimisation:
+      // eliminate the internal MakeDecimal and UnscaledValue functions by undoing the spark optimisation:

vbarua (Member) left a comment:

Looks reasonable to me, though mostly deferring to @Blizzara's review.

vbarua merged commit 7a689e9 into substrait-io:main on Apr 11, 2025
12 checks passed
andrew-coleman deleted the remove_make_decimal branch on April 14, 2025 08:06