Skip to content

Fix(optimizer)!: Preserve struct-column parentheses for RisingWave dialect #5376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

MisterWheatley
Copy link
Contributor

@MisterWheatley MisterWheatley commented Jul 8, 2025

This PR adds logic to the optimizer to preserve (possibly) semantically significant parentheses in the RisingWave dialect, see docs here

A minimal example looks like follows: Assume you have a column struct_col (STRUCT<nested_col INT>) and wish to access the nested column to bring it up to a top-level column. The select statement in RW would be (with quoting):

SELECT
  ("table"."struct_col")."nested_col" AS top_level_col
FROM
  "table"

The logic added checks if the dialect is RisingWave and the given parentheses follow the pattern (<exp>).<identifier> to preserve semantically significant parentheses.

In addition this PR contains a minor refactor of the complicated conditional to instead use 2 guard clauses for better clarity.

Edit: Relevant Slack discussion: https://tobiko-data.slack.com/archives/C0448SFS3PF/p1751957408688319

…lect

Added logic to preserve (possible) semantically significant parentheses in RisingWave dialect.
Refactoring of logic in `simplify.simplify_parens` to use guard-clauses.

BREAKING CHANGE: Added dialect as argument to `simplify_parens` function
@MisterWheatley
Copy link
Contributor Author

Hi @georgesittas

As per our discussion in Slack, I did some testing and dug around in both the docs and source code of RisingWave and decided to move my findings here for better archiving/visibility

It appears that RisingWave are trying to align (partly) with Postgres in accessing Composite Types, docs here in terms of syntax for accessing structs.
As far as I am able to determine, RisingWave follows a BigQuery like syntax with the exception that you need a set of parentheses around each nested level.

Relevant issues from RisingWave are:

I am going to do a bit more work by adding more tests that work directly in RisingWave and then ensuring this PR covers those.

I'll let you know if I run into any issues and/or get stuck and could use some help. Thanks again for the fast responses.

@MisterWheatley MisterWheatley marked this pull request as draft July 8, 2025 11:26
@georgesittas
Copy link
Collaborator

Do you mean there's other issues besides simplifying parentheses? Did you check out this comment by any chance?

@MisterWheatley
Copy link
Contributor Author

MisterWheatley commented Jul 8, 2025

Do you mean there's other issues besides simplifying parentheses? Did you check out this comment by any chance?

Partially. It is still fundamentally down to parentheses having semantic meaning for struct types in RisingWave.

However to support e.g. struct expansion and correct qualification of column(s) there is a need for more work.

In essence I (ideally) want to be able to (for a column struct_col STRUCT<int_col INT, vchar_col VARCHAR>) to be able to do the following optimization

SELECT
  (struct_col).*
FROM
  t
)

to the following:

SELECT
  ("t"."struct_col")."int_col" AS "int_col",
  ("t"."struct_col")."vchar_col" AS "vchar_col"
FROM
  "t" as "t"

i.e. we also need to handle these parentheses correctly in the qualify_columns logic somewhat equivalent to BigQuery struct expansion.

Edit: I am unsure if we have any other dependencies in terms of just adding the exp.Dot to the list of exclusions as you suggested above. I will get back to you with regards to that, but I am wary that it would have unintended consequences somewhere in e.g. BigQuery

@georgesittas
Copy link
Collaborator

georgesittas commented Jul 8, 2025

Just to make sure– have you verified that there are issues with the qualification logic as it is today? Or was that next on your list, i.e., testing? I don't see anything that is obviously unsupported in your example, e.g., for this query:

SELECT (struct_col).* FROM t

I expect that we're already parsing struct_col as a Column, so its resolution should just work today. The rest of the identifiers in the dot path aren't Column instances, so they should also be left untouched, leading to the form you suggested.

@MisterWheatley
Copy link
Contributor Author

MisterWheatley commented Jul 9, 2025

Just to make sure– have you verified that there are issues with the qualification logic as it is today? Or was that next on your list, i.e., testing? I don't see anything that is obviously unsupported in your example, e.g., for this query:

SELECT (struct_col).* FROM t

I expect that we're already parsing struct_col as a Column, so its resolution should just work today. The rest of the identifiers in the dot path aren't Column instances, so they should also be left untouched, leading to the form you suggested.

Glad to confirm and double test. I just tested on a clean install of v27.0.0 with the following snippet:

from sqlglot import parse_one
from sqlglot.optimizer import optimize
import sqlglot
sql = """
SELECT
  (struct_col).*
FROM
  t
"""

schema = {
    "t" : {
        "struct_col" : "struct<nested_int INT, nested_char VARCHAR>"
    }
}

rw_ast = parse_one(sql,dialect='risingwave')
rw_opt = optimize(rw_ast,schema=schema,dialect='risingwave')
print(rw_opt)

gives the following result

SELECT "t"."struct_col".* FROM "t" AS "t"

So the qualification sort of works, but there are two issues still that I am working to solve:

  1. Preserve the parentheses as they have semantic meaning here.
  2. Do star expansion equivalent to what is currently done in BigQuery. If you run my toy example with dialect = "bigquery" it gives: SELECT "t"."struct_col"."nested_int" AS "nested_int", "t"."struct_col"."nested_char" AS "nested_char" FROM "t" AS "t"

@MisterWheatley
Copy link
Contributor Author

MisterWheatley commented Jul 9, 2025

Hi @georgesittas

I did a lot more testing, but I think I managed to get it to a working stage that I am happy with for another review round.

I am aware of the comment on an old PR here about introducing dialect specifics into the optimizer, when some of the logic should be shareable between e.g. BigQuery and RisingWave in regards to expanding struct.* columns. I deliberately didn't attempt to do a refactor, since that would have increased the complexity and I wanted to keep it as simple as possible

Let me know if you have more feedback and/or need more test cases added and I will get on it asap

@MisterWheatley MisterWheatley marked this pull request as ready for review July 9, 2025 11:21
Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MisterWheatley thanks for the update– left another round of comments.

MisterWheatley and others added 2 commits July 10, 2025 08:31
…orrect level for RisingWave

Updated logic for expanding (struct_col).* expressions in RisingWave to correctly handle the level of nesting.

Moved struct expansion tests to tests/fixtures/qualify_columns.sql on behest of maintainers.
@MisterWheatley
Copy link
Contributor Author

MisterWheatley commented Jul 10, 2025

Hi again @georgesittas

I have tried to update according to your comments and questions.

Based on your question regarding unpacking, I uncovered a problem with the previous way of doing it, hence why the code is now quite a bit different. High-level the new logic walks "up" the AST from the exp.Column and down the exp.DataType.kind simultaneously to check if they are in alignment to be expanded between AST and STRUCT definition

The re-added casts/type annotations were partly to fix a few mypy errors when running make check. If this is still incorrect, let me know and I can try and refactor to not use casts. I am not an expert in mypy, so I would appreciate a bit of guidance or a pointer to some relevant docs in that case.

I moved all the struct.* tests into the fixture file as you suggested to keep them together, including the existing BigQuery ones.

Again, thanks for the quick response times and looking forward to your feedback on the latest changes

Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @MisterWheatley, thanks for addressing comments and sorry for the delayed response– I was OOO. The PR looks good to go; I will address the last couple of comments that I left after merging.

Comment on lines +609 to +610
# find column definition to get data-type
dot_column = t.cast(exp.Column, expression.find(exp.Column))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be safer if we did an instance check to ensure dot_column is indeed a Column in L612.

outer_paren = expression.this

for struct_field_def in t.cast(exp.DataType, current_struct).expressions:
new_identifier = exp.Identifier(this=struct_field_def.name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should respect the identifier found in the type here, otherwise we may drop quotes when we shouldn't.

@georgesittas georgesittas merged commit 9e8d3ab into tobymao:main Jul 15, 2025
6 checks passed
@MisterWheatley
Copy link
Contributor Author

Hey @MisterWheatley, thanks for addressing comments and sorry for the delayed response– I was OOO. The PR looks good to go; I will address the last couple of comments that I left after merging.

I assumed as much. Thanks for the follow up and taking it the last mile, very much appreciated.

@georgesittas
Copy link
Collaborator

Just a few minor changes on my end: 8f43c5f. Nice work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants