Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enhance VirtualTable to have expression as value #711

Merged
merged 4 commits into from
Oct 7, 2024

Conversation

anshuldata
Copy link
Contributor

@anshuldata anshuldata commented Sep 26, 2024

  • This enables expressing Virtual table having values as expression
  • Example query: select * from (values (1+2, 'Hello'||'World'))
  • Mark existing literal field "values" as deprecated

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@anshuldata anshuldata force-pushed the anshul/VTExpressionTable branch 2 times, most recently from 14eb28c to a0fc6ab Compare September 26, 2024 09:06
@anshuldata anshuldata changed the title feat: Add VirtualExpressionTable definition and use it in ReadRel feat: add VirtualExpressionTable definition and use it in ReadRel Sep 26, 2024
@jacques-n
Copy link
Contributor

Patch makes sense. We should update the description message in this PR to also explain the deprecation.

@@ -91,6 +92,11 @@ message ReadRel {
repeated Expression.Literal.Struct values = 1;
Copy link

@acruise acruise Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than deprecate VirtualTable and create VirtualExpressionTable, would we consider deprecating the values field and adding a new expressions field?

edit: I'm not a fan of Protobuf quirks becoming load-bearing, but this seems to indicate a field can be replaced with a oneof while maintaining binary compatibility. 😅

Copy link
Member

@vbarua vbarua Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: consider this retracted

Expanding on this, I'm almost inclined to have both? Like

 message VirtualTable {
    repeated Expression.Literal.Struct values = 1;
    repeated Expression expression = 2;
}

which emits first the values and then expressions. Makes the common case really easy to produce (literals only).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this. It creates this confusing situation where you have something like

literal1, literal2, expression1, literal3

and it gets stored as

values: [literal1, literal2]
expression: [expresion1, literal3]


> Rather than deprecate VirtualTable and create VirtualExpressionTable, would we consider deprecating the values field and adding a new expressions field?

I don't see how that is better (or clearer for someone new to the format).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we have a protobuf check set up so if we can't use the oneof it will let us know. I do know that it is less permissive than I'm used to:

https://github.com/substrait-io/substrait/blob/main/.github/workflows/pr.yml#L35

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protobuf oneof can't have repeated fields link
A oneof cannot be repeated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't add oneof as mentioned above so both fields can contain values
I too share concern with @jacques-n having both fields where initial literals are in Expression_Literal and rest in Expression (assuming there is one expression) doesn't seem intuitive .

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on "VirtualTable with expressions field" vs. "VirtualExpressionTable as new relation" but if I had to pick I would go with @acruise 's suggestion (just add a new field, not a new relation).

Comment on lines 62 to 69
A virtual expression table is a table whose contents are embedded in the plan itself. The table data
is encoded as records consisting of expression values.

| Property | Description | Required |
| -------- | ----------- | -------- |
| Data | Required | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an important property of a virtual (expression) table is that it doesn't contain any field references. This may not be obvious to new readers to Substrait. In other words, they may look at a virtual table and think "this is weird, it looks like a projection? Maybe this is some kind of select relation?"

Maybe something like...

A virtual expression table is a table whose contents are embedded in the plan itself.  Each column is an expression that can be resolved without referencing any input data.  These expressions will not contain field references but may contain function calls, literals, and other expression nodes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to this, I assume that these expression shouldn't contain other relations (i.e. subqueries)?

Copy link
Contributor

@jacques-n jacques-n Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have prohibitions against either (field references and subqueries).

That being said, a valid field reference in this context is impossible because there are zero fields to reference. So having field[47] here is just as illegal as having a readrel that has two columns and trying to do field[47].

Copy link
Contributor

@jacques-n jacques-n Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a fun aside, in Snowflake the following is legal.

select * from (values (
  (select 1 where (
    select true)
  ), 
  'Hello'||'World')
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with nested queries having a field reference is that the resulting structure is no longer scalar. For systems without virtual table support I have had to write tables to disk and then register named tables. For those systems I would have to reject field references.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have prohibitions against either (field references and subqueries).

That's fine. I don't know that we need to go so far as prohibit things. My goal is just to make it as obvious as possible what the message is used for. I'd be find with wording like...

A virtual expression table is a table whose contents are embedded in the plan itself.
Each column is an expression that can be resolved without referencing any input data.
For example, a literal, a function call involving literals, or any other expression that does
not require input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be find with wording like
Fixed doc to indicate virtual expression table are not dependent on any input

@jacques-n
Copy link
Contributor

jacques-n commented Sep 27, 2024

I'm not a fan of Protobuf quirks becoming load-bearing, but this seems to indicate a field can be replaced with a oneof while maintaining binary compatibility. 😅

I believe that only works for optional/required fields. I believe repeated can be in a oneof because of how they are serialized.

This enables expressing Virtual table having values as expression
Also marked VirtualTable as deprecated
@anshuldata anshuldata force-pushed the anshul/VTExpressionTable branch from a0fc6ab to d6f4a9a Compare September 30, 2024 05:58
@anshuldata anshuldata force-pushed the anshul/VTExpressionTable branch from d6f4a9a to b2e3d6b Compare September 30, 2024 05:59
@anshuldata
Copy link
Contributor Author

We should update the description message in this PR to also explain the deprecation.
Updated PR description to indicate VirutalTable is deprecated

@jacques-n jacques-n requested a review from westonpace October 1, 2024 05:00
@jacques-n
Copy link
Contributor

I feel like the main outstanding question is whether we create new table type versus adding field to existing and deprecating old fields. @westonpace, @vbarua and @EpsilonPrime, thoughts? It's subjective but I'm definitely a fan of keeping the two table types separate so nobody can make a mistake and set both fields.

@westonpace
Copy link
Member

My opinion hasn't changed. I'll approve either approach. If I need to take a side I still prefer one relation. We have, several times now, deprecated fields and not worried too much about users setting both values.

I think confusion about "why are there two relations that seem to do the same thing" is going to be more confusing than "why are there two fields, one which is marked deprecated, and one that isn't".

westonpace
westonpace previously approved these changes Oct 2, 2024
@jacques-n
Copy link
Contributor

I'm fine with @westonpace's preferred change. @anshuldata can you update to single message with two sets of repeated fields?

@anshuldata anshuldata changed the title feat: add VirtualExpressionTable definition and use it in ReadRel feat: Enahnce VirtualTable to take expression value Oct 3, 2024
@anshuldata
Copy link
Contributor Author

anshuldata commented Oct 3, 2024

If I need to take a side I still prefer one relation.

Fixed as suggested. Basically to add another field in VirtualTable instead of another Relation type

@anshuldata anshuldata changed the title feat: Enahnce VirtualTable to take expression value feat: enahnce VirtualTable to take expression value Oct 3, 2024
@anshuldata anshuldata changed the title feat: enahnce VirtualTable to take expression value feat: enahnce VirtualTable to have expression as value Oct 3, 2024
@vbarua vbarua changed the title feat: enahnce VirtualTable to have expression as value feat: enhance VirtualTable to have expression as value Oct 3, 2024
vbarua
vbarua previously approved these changes Oct 3, 2024
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the constraints of proto compatibility, this seems like as good as it's going to get. Thanks for working on this @anshuldata.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor comments as well which I don't consider blocking.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
@vbarua vbarua requested a review from westonpace October 3, 2024 21:05
@jacques-n jacques-n merged commit 954bcbc into substrait-io:main Oct 7, 2024
13 checks passed
@akoshchiy
Copy link

@anshuldata Hi! Trying to adapt the changes in datafusion. It's not clear how to handle multiple rows in values, eq SELECT * FROM VALUES (1+1, 2), (2+2, 3).
Should it be wrapped into Expression.Nested.List or Expression.Nested.Struct, or, maybe, there is another preferred way to handle it?

@tokoko
Copy link
Contributor

tokoko commented Oct 27, 2024

@akoshchiy expressions is a repeated field. You can set it to a list of Expression messages with Expression.Literal.Struct Expression.Nested.Struct set within (one Expression per row). It will be basically the same as the current behavior with deprecated values field with the only difference being that you need to wrap Expression.Literal.Struct messages in Expression messages now.

upd: my bad, never mind. didn't notice the pluses. I'm guessing it should be Expression.Nested.Struct messages within.

@tokoko
Copy link
Contributor

tokoko commented Oct 27, 2024

I think this is a really good question actually. with the values field, I'm assuming a single Literal.Struct value would be a single row. If that approach stays the same with the new expressions field, I think it only ever makes sense for the Expression to hold either Literal.Struct or Nested.Struct, it should never be a scalar as it's supposed to represent a row, not an individual value.

Wouldn't it make have made more sense in the first place for the new field to have been repeated Expression.Nested.Struct type instead of repeated Expression? Nested.Struct can contain all literals as well, after all.

@anshuldata
Copy link
Contributor Author

anshuldata commented Oct 28, 2024

Wouldn't it make have made more sense in the first place for the new field to have been repeated Expression.Nested.Struct type instead of repeated Expression

Yes, agree. I missed this in my original change. I will raise a PR to fix it
CC: @akoshchiy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants