-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add proto converters and visitor, both incomplete atm #4
base: main
Are you sure you want to change the base?
Conversation
This change adds some more logic to the substrait client. 1) it adds logic to map proto messages to domain objects for relations and expressions. 2) it adds a high level visitor control flow. This commit also introduces a lineage extractor application to drive development and testing of the control flow. For the testing it leverages pre-constructed substrait plans.
Hi @jcamachor, @westonpace. |
cc: @ranaalotaibiMS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ashvina ! Overall code structure looks good to me. I left some comments, including on how imo the work could be split slightly differently moving forward to make commits more self-contained for the proto translation.
@@ -99,7 +97,7 @@ public void Visit(Sort sort) | |||
Fallback(sort); | |||
} | |||
|
|||
public void Fallback(Rel _) | |||
protected void Fallback(object _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't Fallback
still accept a Rel
object rather than generic object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it type independent since Fallback
needs to accept Rel
, or Expression
, and in future other types. I dont think adding one fallback method for such types adds any value. WDYT?
src/Substrait.Core/ProtoConverter.cs
Outdated
{ | ||
var input = ToRel(project.Input); | ||
var inputs = ImmutableList.Create(input); | ||
return new Join() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a Project
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Thanks.
private Rel CreateAgg(AggregateRel agg) | ||
{ | ||
var input = ToRel(agg.Input); | ||
var inputs = ImmutableList.Create(input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there are important parts missing in the translation, e.g., grouping columns or aggregates. Same comment for other operators below (e.g., expressions for join, filter, or project, different type of read rels, etc.)
/// <summary> | ||
/// Utility to convert Substrait protobuf messages to/from charp domain-objects. | ||
/// </summary> | ||
public class ProtoConverter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible recommendation for the ProtoConverter
to split work in multiple small PRs while at the same time making committed code self-contained is to divide the work across operator boundaries, e.g., take scan operator and translate all its existing contents from proto into internal representation, including unit tests to validate the translation using a few examples (maybe consisting of a single op).
Said that, it might still be fine to push this PR since we are in very early stages of the library development (I'd like to know what @westonpace thinks too) but maybe add some comments in the files with a TODO/warning note so there is a clear understanding that translation has some significant gaps (as it is already pointed out in the PR title)? Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a comment, or even throwing a NotImplementedException
is probably fine. But I think whatever feels natural is fine. This seems like a rather arbitrary swath of functionality in this PR but, on the other hand, it is enough to start sparking some initial feedback.
I've found that is quite quick to get to 80% coverage and then the last 20% ends up being quite tricky. For example, user defined types and extension relations will probably require more thought. Going operator by operator can be challenging because, for example, a project relation needs an expression which could be anything (and, as mentioned, user defined type expressions might come quite a bit later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not implying that conversion should be complete (maybe I did not express that clearly enough); as you mention, that final 20% is usually hard and would slow down the whole development. I was thinking more about making translation less arbitrary and support boundaries a bit more clear, which you also referred to. But it's not a very strong opinion, overall I'm happy that we continue making progress one way or another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jcamachor, I see value in your suggestion, particularly from tracking and reviews. Anyone who checks out this commit would know that what features will work. I started it like that with just inputs
to start with. But I can pay more attention going forward. If you think this commit creates confusion, I can fix it later next week.
@westonpace, as you said my intention is to gather early feedback and let others jump in if feasible.
Thanks both !
As a follow-up to the comment above, consider creating issues for the Proto translation of each of the operators and link them here. That way it is clear the work that is remaining and we can also parallelize some of the work on implementing that operator translation on top of the foundation you are creating. |
{ | ||
/// <summary> | ||
/// Utility to convert Substrait protobuf messages to/from charp domain-objects. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please more details on the motivation to add a Substrait.Relation.Rel
mappings for the protobuf objects? I see some of it in the previous PR, trying to learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One very useful feature is that Rel
has a property Inputs
:
public IList<Rel> Inputs { get; init; } = new List<Rel>();
This can be used, for example, to do a depth-first traversal of the relations tree. However, in the protobuf objects, there is no such property. For example, a join relation has left
and right
properties and a filter relation has an input
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
Is the intention for C# users to use these objects to interface with substrait instead of the protofbuf objects when it is fully ready?
public void TestMethod1() | ||
{ | ||
Plan plan; | ||
using var planPb = File.OpenRead(@"TestData\tpcds\Q02.pb"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how was the .pb
file generated? Can you please add instructions somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the .pb
was generated using substrait-java:isthmus
(here). Its landing page has the instructions for generating Substrait
plans using calcite
. I hope this helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the ping. Looks like good work. I've added a few thoughts / questions.
/// </summary> | ||
abstract public class RelExpression | ||
{ | ||
public virtual void Accept(SubstraitRelVisitor visitor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be abstract? Also, is SubstraitRelVisitor
intended to visit both relations and expressions? Or should this be SubstraitExpressionVisitor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on separation of visitor for relations and expressions (Java implementation can serve as a model).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, separation would be more meaningful. If it is ok with you both, I'd like to fix this in a subsequent PR.
/// <summary> | ||
/// Base type for all relational expressions, <see cref="Protobuf.Expression"/> | ||
/// </summary> | ||
abstract public class RelExpression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why RelExpression
and not just Expression
? An expression is not an instance or a specialization of a Rel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry for the confusion. I called it so to mean relational expression
. I didn't use Expression
or Rex
because it was hurting code readability. Like my earlier comment, I can fix it later next week or in a subsequent PR if this one is merged. Thanks !
{ | ||
/// <summary> | ||
/// Utility to convert Substrait protobuf messages to/from charp domain-objects. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One very useful feature is that Rel
has a property Inputs
:
public IList<Rel> Inputs { get; init; } = new List<Rel>();
This can be used, for example, to do a depth-first traversal of the relations tree. However, in the protobuf objects, there is no such property. For example, a join relation has left
and right
properties and a filter relation has an input
property.
/// <summary> | ||
/// Utility to convert Substrait protobuf messages to/from charp domain-objects. | ||
/// </summary> | ||
public class ProtoConverter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a comment, or even throwing a NotImplementedException
is probably fine. But I think whatever feels natural is fine. This seems like a rather arbitrary swath of functionality in this PR but, on the other hand, it is enough to start sparking some initial feedback.
I've found that is quite quick to get to 80% coverage and then the last 20% ends up being quite tricky. For example, user defined types and extension relations will probably require more thought. Going operator by operator can be challenging because, for example, a project relation needs an expression which could be anything (and, as mentioned, user defined type expressions might come quite a bit later).
/// Create domain object specific to relational expression type from protobuf message | ||
/// </summary> | ||
/// <param name="protoRex">Relational expression encoded as protobuf message</param> | ||
/// <returns><see cref="Rel"/> corresponding to message's expression type</returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns a RelExpression
, not a Rel
.
{ | ||
var left = ToRel(join.Left); | ||
var right = ToRel(join.Right); | ||
var _ = ToExpression(join.Expression); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a touch arbitrary (e.g. why grab join expression and not the join filter? why not grab a filter's filter expression?) but there's probably no reason to get hung up on it.
public class LineageExtractorVisitorTest | ||
{ | ||
[TestMethod] | ||
public void TestMethod1() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: rename
using Substrait.Relation; | ||
using ProtoRel = Substrait.Protobuf.Rel; | ||
|
||
namespace Substrait.Lineage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I'm maybe a little confused here. What is the intent of this project? Isn't lineage an implicit property of Rel
? I would expect this project would be extending Rel
somehow to add lineage information to it (or extracting lineage information into some kind of project-specific classes).
It seems a bit odd that "grabbing lineage for one relation" is a Substrait.Core
task but "grabbing lineage for an entire relation tree" is a Substrait.Lineage
task. I would think either both are Substrait.Lineage
or both are Substrait.Core
(with a minor preference for the latter)
/// A concrete implementation of <see cref="SubstraitRelVisitor"/> for processing Substrait relations and extracting | ||
/// lineage information. | ||
/// </summary> | ||
public class LineageExtractorRelVisitor : SubstraitRelVisitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this class really needed? I would think the normal behavior for a system would be to do something like:
public void OnMessage(byte[] message)
{
ProtoRel protoMessage = DecodeMessage(message);
Rel root = new ProtoConverter().ToRel(protoMessage);
HandleMessage(root);
}
private void HandleMessage(Rel rel)
{
// Do some stuff
LogQuery(rel);
// Do some other stuff
}
private void LogQuery(Rel rel)
{
new LoggingVisitor(logger).Visit(rel);
}
In other words, I would expect to convert at the boundaries, then visiting the converted plan as needed. Instead this appears to convert each time I need to visit something.
Co-authored-by: Weston Pace <[email protected]>
Fetch = fetchRel | ||
}; | ||
|
||
fetchRel.Input = rel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails with stack overflow due to this.
[TestClass()] | ||
public class ProtoConverterTests | ||
{ | ||
[TestMethod()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ()
not needed.
public void TestMethod1() | ||
{ | ||
Plan plan; | ||
using var planPb = File.OpenRead(@"TestData\tpcds\Q02.pb"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider Path.Combine
so that test works irrespective of os directory separator.
This PR adds the first test and refactors, I can take #3 once this PR is done or if a smaller PR with refactoring and a placeholder is added. |
This change adds some more logic to the substrait client. 1) it adds logic to map proto messages to domain objects for relations and expressions. 2) it adds a high level visitor control flow. This commit also introduces a lineage extractor application to drive development and testing of the control flow. For the testing it leverages pre-constructed substrait plans.