Skip to content

Fix: handle column name collisions when combining UNION logical inputs & nested Column expressions in maybe_fix_physical_column_name #16064

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

LiaCastaneda
Copy link
Contributor

@LiaCastaneda LiaCastaneda commented May 16, 2025

Which issue does this PR close?

Rationale for this change

This PR handles column names not matching expression names for 2 cases:

  • Physical planning is failing on the uppermost projection due to inaccurate schema name coercion while building the union physical node UnionExec (see issue for wider explanation)
  • There is also the case when Columns are inside other type of expression, if so the fix won't apply, this PR handles that case as well since its the same error, but from a different origin.

What changes are included in this PR?

A workaround fix in union_schema by keeping the field names of the first input + a integration test with a reproducer.

Are these changes tested?

yes, for the union cercion case, a test was added in substrait_consumer tests. For columns inside other types of expressions I added a unit test.

Are there any user-facing changes?

no

@github-actions github-actions bot added the substrait Changes to the substrait crate label May 16, 2025
@LiaCastaneda LiaCastaneda marked this pull request as ready for review May 16, 2025 13:47
@LiaCastaneda LiaCastaneda changed the title Fix union schema name coercion Handle union schema name coercion May 16, 2025
@LiaCastaneda LiaCastaneda force-pushed the lia/fix-union-schema-field-names branch from 6bf7e8e to 1e3e9cb Compare May 19, 2025 08:39
@github-actions github-actions bot added the core Core DataFusion crate label May 19, 2025
Comment on lines 2074 to 2075
expr.and_then(|e| {
e.transform_down(|node| {
Copy link
Contributor Author

@LiaCastaneda LiaCastaneda May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes Columns can be inside other type of expressions (so they are not on the "top level") , for example:

BinaryExpr {
    left: IsNotNull(
        Column(
            Column {
                relation: Some(
                    Bare {
                        table: "left",
                    },
                ),
                name: "people_column",
            },
        ),
    ),
    op: Or,
    right: IsNotNull(
        Column(
            Column {
                relation: Some(
                    Bare {
                        table: "left",
                    },
                ),
                name: "people_column:1",
            },
        ),
    ),
},

if so the current fix won't apply, this change handles those cases by using transform_down but logic remains the same

@LiaCastaneda LiaCastaneda force-pushed the lia/fix-union-schema-field-names branch 2 times, most recently from 7bdaa6a to 99d1177 Compare May 19, 2025 14:54
@LiaCastaneda LiaCastaneda force-pushed the lia/fix-union-schema-field-names branch from 99d1177 to f756008 Compare May 19, 2025 15:24
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label May 20, 2025
Copy link
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice! LGTM

@LiaCastaneda LiaCastaneda changed the title Handle union schema name coercion Fix: handle column name collisions when combining UNION logical inputs & nested Column expressions in maybe_fix_physical_column_name May 21, 2025
@LiaCastaneda
Copy link
Contributor Author

@berkaysynnada since you reviewed #15580 that added maybe_fix_physical_column_name, is there any chance you can review this one when you have time? It just adds handling for the edge case when we have columns inside other expressions and the coerces UnionExec schema names to take the left one, since for now it takes the name of the side that its nullable.

LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request May 22, 2025
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, thank you @LiaCastaneda

@berkaysynnada
Copy link
Contributor

thank you also @gabotechs, I merge this after the checks

@LiaCastaneda
Copy link
Contributor Author

thanks for the reviews @gabotechs & @berkaysynnada !

@berkaysynnada berkaysynnada merged commit e5f596b into apache:main May 22, 2025
27 checks passed
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request May 22, 2025
…s & nested Column expressions in maybe_fix_physical_column_name (apache#16064)

* Fix union schema name coercion

* Address renaming for columns that are not in the top level as well

* Add unit test

* Format

* Use insta tests properly

* Address review - comment + minor simplification change

---------

Co-authored-by: Berkay Şahin <[email protected]>
(cherry picked from commit e5f596b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-plan Changes to the physical-plan crate substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Input field name $f3 does not match with the projection expression ...
3 participants