Fix(optimizer)!: Preserve struct-column parentheses for RisingWave dialect #5376
…lect Added logic to preserve (possible) semantically significant parentheses in RisingWave dialect. Refactoring of logic in `simplify.simplify_parens` to use guard-clauses. BREAKING CHANGE: Added dialect as argument to `simplify_parens` function
Co-authored-by: Jo <[email protected]>
As per our discussion in Slack, I did some testing and dug around in both the docs and the source code of RisingWave, and decided to move my findings here for better archiving and visibility.
It appears that RisingWave is trying to align (partly) with Postgres in its syntax for accessing composite types, i.e. structs (docs here). Relevant issues from RisingWave are:
I am going to do a bit more work by adding more tests that run directly in RisingWave and then ensuring this PR covers those. I'll let you know if I run into any issues or get stuck and could use some help. Thanks again for the fast responses.
Do you mean there are other issues besides simplifying parentheses? Did you check out this comment by any chance?
Partially. It is still fundamentally down to parentheses having semantic meaning for struct types in RisingWave. However to support e.g. struct expansion and correct qualification of column(s) there is a need for more work. In essence I (ideally) want to be able to (for a column SELECT
(struct_col).*
FROM
t
) to the following:
i.e. we also need to handle these parentheses correctly in the
Edit: I am unsure if we have any other dependencies in terms of just adding the
Just to make sure: have you verified that there are issues with the qualification logic as it is today? Or was that next on your list, i.e., testing? I don't see anything that is obviously unsupported in your example, e.g., for this query:
SELECT (struct_col).* FROM t
I expect that we're already parsing
Glad to confirm and double test. I just tested on a clean install of v27.0.0 with the following snippet:

from sqlglot import parse_one
from sqlglot.optimizer import optimize
import sqlglot
sql = """
SELECT
(struct_col).*
FROM
t
"""
schema = {
"t" : {
"struct_col" : "struct<nested_int INT, nested_char VARCHAR>"
}
}
rw_ast = parse_one(sql,dialect='risingwave')
rw_opt = optimize(rw_ast,schema=schema,dialect='risingwave')
print(rw_opt)

gives the following result:

SELECT "t"."struct_col".* FROM "t" AS "t"

So the qualification sort of works, but there are still two issues that I am working to solve:
I did a lot more testing, and I think I managed to get it to a working stage that I am happy with for another review round.
I am aware of the comment on an old PR here about introducing dialect specifics into the optimizer, when some of the logic should be shareable between e.g. BigQuery and RisingWave with regard to expanding
Let me know if you have more feedback or need more test cases added, and I will get on it as soon as possible.
@MisterWheatley thanks for the update, I left another round of comments.
Co-authored-by: Jo <[email protected]>
…orrect level for RisingWave Updated logic for expanding (struct_col).* expressions in RisingWave to correctly handle the level of nesting. Moved struct expansion tests to tests/fixtures/qualify_columns.sql on behest of maintainers.
Hi again @georgesittas, I have tried to update the PR according to your comments and questions.
Your question regarding unpacking led me to uncover a problem with the previous way of doing it, which is why the code is now quite a bit different. At a high level, the new logic walks "up" the AST from the
The re-added casts/type annotations were partly to fix a few mypy errors when running
I moved all the
Again, thanks for the quick response times. I am looking forward to your feedback on the latest changes.
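For readers following along, here is a rough, standalone sketch of what expanding `(struct_col).*` into one projection per struct field could look like. It is not the code from this PR: `expand_struct_star` is a hypothetical helper, it only handles a flat `struct<name TYPE, ...>` type string like the one in the test schema above, and the quoted output shape is an assumption rather than confirmed optimizer output.

```python
from typing import List


def expand_struct_star(table: str, column: str, struct_type: str) -> List[str]:
    """Hypothetical helper: expand (column).* into one projection per field.

    Assumes a flat type string such as 'struct<a INT, b VARCHAR>';
    nested structs are deliberately not handled in this sketch.
    """
    # Take the text between the outermost angle brackets.
    inner = struct_type[struct_type.index("<") + 1 : struct_type.rindex(">")]
    # The first token of each comma-separated part is the field name.
    fields = [part.split()[0] for part in inner.split(",") if part.strip()]
    # Keep the parentheses around the struct column, since they are the
    # semantically significant part in RisingWave.
    return [f'("{table}"."{column}")."{name}" AS "{name}"' for name in fields]


projections = expand_struct_star(
    "t", "struct_col", "struct<nested_int INT, nested_char VARCHAR>"
)
print(",\n".join(projections))
```

The point of the sketch is only that every expanded field keeps the parenthesized struct column as its base, which is what separates this case from ordinary column qualification.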
Hey @MisterWheatley, thanks for addressing the comments, and sorry for the delayed response; I was OOO. The PR looks good to go; I will address the last couple of comments that I left after merging.
# find column definition to get data-type
dot_column = t.cast(exp.Column, expression.find(exp.Column))
I think it'd be safer if we did an instance check to ensure dot_column is indeed a Column in L612.
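To illustrate the concern: `typing.cast` only informs the type checker and returns its argument unchanged at runtime, so a failed lookup that returns `None` passes straight through. The sketch below uses a hypothetical `Column` stand-in class and helper names rather than sqlglot's own types.

```python
import typing as t


class Column:
    """Hypothetical stand-in for an AST column node (not sqlglot's exp.Column)."""

    def __init__(self, name: str) -> None:
        self.name = name


def find_column_unchecked(node: object) -> Column:
    # t.cast performs no runtime check; it returns its argument as-is.
    return t.cast(Column, node)


def find_column_checked(node: object) -> t.Optional[Column]:
    # An isinstance guard verifies the type before the node is used.
    return node if isinstance(node, Column) else None


print(find_column_unchecked(None))  # None slips through despite the annotation
print(find_column_checked(None))    # None, but the return type now says so
```

With the checked variant the caller is forced to handle the missing-column case explicitly instead of hitting an attribute error later.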
outer_paren = expression.this

for struct_field_def in t.cast(exp.DataType, current_struct).expressions:
    new_identifier = exp.Identifier(this=struct_field_def.name)
We should respect the identifier found in the type here, otherwise we may drop quotes when we shouldn't.
I assumed as much. Thanks for the follow-up and for taking it the last mile, very much appreciated.
Just a few minor changes on my end: 8f43c5f. Nice work!
This PR adds logic to the optimizer to preserve (possibly) semantically significant parentheses in the RisingWave dialect; see docs here.
A minimal example looks as follows. Assume you have a column
struct_col (STRUCT<nested_col INT>)
and wish to access the nested column to bring it up to a top-level column. The select statement in RW would be (with quoting):
The added logic checks whether the dialect is RisingWave and whether the given parentheses follow the pattern
(<exp>).<identifier>
in which case the parentheses are preserved as semantically significant.
In addition, this PR contains a minor refactor of the complicated conditional to use two guard clauses instead, for better clarity.
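As a rough illustration of the guard-clause style described above, the sketch below checks the `(<exp>).<identifier>` shape on a tiny hand-rolled AST. The `Node` class and `keep_parens` function are hypothetical and are not the code in `simplify.simplify_parens`; the specific guard conditions are also an assumption about how such a check might be split.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False, repr=False)
class Node:
    """Hypothetical minimal AST node; not sqlglot's Expression class."""

    kind: str  # e.g. "paren", "dot", "column"
    child: Optional["Node"] = None
    parent: Optional["Node"] = None


def keep_parens(paren: Node, dialect: str) -> bool:
    # Guard 1: in this sketch only RisingWave treats the parentheses as significant.
    if dialect != "risingwave":
        return False
    # Guard 2: only the (<exp>).<identifier> shape, i.e. a paren directly under a dot.
    if paren.parent is None or paren.parent.kind != "dot":
        return False
    return True


# Build (struct_col).nested_col as dot -> paren -> column.
column = Node("column")
paren = Node("paren", child=column)
dot = Node("dot", child=paren)
paren.parent = dot

print(keep_parens(paren, "risingwave"))  # True
print(keep_parens(paren, "postgres"))    # False
```

Each guard returns early on a single condition, which tends to read more easily than one compound boolean expression.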
Edit: Relevant Slack discussion: https://tobiko-data.slack.com/archives/C0448SFS3PF/p1751957408688319