diff --git a/site/docs/expressions/embedded_functions.md b/site/docs/expressions/embedded_functions.md index babbcde59..d0fa4d835 100644 --- a/site/docs/expressions/embedded_functions.md +++ b/site/docs/expressions/embedded_functions.md @@ -19,10 +19,9 @@ The binary representation of an embedded function is: ```proto %%% proto.message.Expression.EmbeddedFunction %%% ``` + === "Human Readable Representation" - n/a -=== "Example" - n/a + As the bytes are opaque to Substrait there is no equivalent human readable form. ## Function Details @@ -49,16 +48,9 @@ There are many types of possible stored functions. For each, Substrait works to -## Discussion Points - -* What are the common embedded function formats? -* How do we expose the data for a function? -* How do we express batching capabilities? -* How do we ensure/declare containerization? - - - - - - +???+ question "Discussion Points" + * What are the common embedded function formats? + * How do we expose the data for a function? + * How do we express batching capabilities? + * How do we ensure/declare containerization? diff --git a/site/docs/expressions/extended_expression.md b/site/docs/expressions/extended_expression.md index ca8360b2d..e7fb18e1d 100644 --- a/site/docs/expressions/extended_expression.md +++ b/site/docs/expressions/extended_expression.md @@ -4,6 +4,12 @@ Extended Expression messages are provided for expression-level protocols as an a Since Extended Expression will be used seperately from the Plan rel representation, it will need to include basic fields like Version. +=== "ExtendedExpression Message" + + ```proto +%%% proto.message.ExtendedExpression %%% + ``` + ## Input and output data schema Similar to `base_schema` defined in [ReadRel](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L58), the input data schema describes the name/type/nullibilty and layout info of input data for the target expression evalutation. 
It also has a field `name` to define the name of the output data. diff --git a/site/docs/expressions/field_references.md b/site/docs/expressions/field_references.md index 630fa07ea..f64050470 100644 --- a/site/docs/expressions/field_references.md +++ b/site/docs/expressions/field_references.md @@ -107,7 +107,7 @@ struct: - i64 ``` -Given this schema, you could declare a mask in pseudocode, such as: +Given this schema, you could declare a mask of fields to include in pseudocode, such as: ``` 0:[0,1:[..5:[0,2]]],2,3 @@ -144,11 +144,7 @@ By default, when only a single field is selected from a struct, that struct is r -## Discussion Points - -* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.) - - - +???+ question "Discussion Points" + * Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.) diff --git a/site/docs/expressions/subqueries.md b/site/docs/expressions/subqueries.md index 53e1813e5..c71fee0dc 100644 --- a/site/docs/expressions/subqueries.md +++ b/site/docs/expressions/subqueries.md @@ -65,3 +65,11 @@ WHERE x < ANY(SELECT y from t2) | Comparison operation | The kind of comparison operation to use | Yes | | Expression | Left-hand side expression to check | Yes | | Subquery | Subquery to check | Yes | + + + +=== "Protobuf Representation" + + ```proto +%%% proto.message.Expression.Subquery %%% + ``` diff --git a/site/docs/expressions/user_defined_functions.md b/site/docs/expressions/user_defined_functions.md index c5c23031e..fbd05f258 100644 --- a/site/docs/expressions/user_defined_functions.md +++ b/site/docs/expressions/user_defined_functions.md @@ -1,3 +1,32 @@ # User-Defined Functions -Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). 
In fact, the functions defined by Substrait use the same mechanism. The extension files for them can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions). +Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions). + +Here's an example function that doubles its input: + +!!! info inline end "Implementation Note" + This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error). + +``` yaml +%YAML 1.2 +--- +scalar_functions: + - + name: "double" + description: "Double the value" + impls: + - args: + - name: x + value: fp32 + options: + on_domain_error: + values: [ NAN, ERROR ] + return: fp32 + - args: + - name: x + value: i32 + options: + on_domain_error: + values: [ NAN, ERROR ] + return: i32 +``` diff --git a/site/docs/relations/basics.md b/site/docs/relations/basics.md index 8c265d191..41e86f425 100644 --- a/site/docs/relations/basics.md +++ b/site/docs/relations/basics.md @@ -70,9 +70,8 @@ A guarantee that data output from this operation is provided with a sort order. -## Discussion Points +???+ question "Discussion Points" -* Do we try to make read definition types more extensible à la function signatures? Is that necessary if we have a custom relational operator? -* How do we express decomposed types. For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object. 
-* We currently include a "generic properties" property on read type. Do we want this dumping ground? + * Should [read definition types](/relations/logical_relations/#read-definition-types) be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators? + * How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object. diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index 8f689fb22..410802f3f 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -42,7 +42,11 @@ A filter expression must be interpreted against the direct schema before the pro ### Read Definition Types -Read definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly. +???+ info inline end "Adding new Read Definition Types" + + If you have a read definition that's not covered here, see the [process for adding new read definition types](/spec/extending). + +Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification. #### Virtual Table @@ -394,7 +398,11 @@ The write operator is an operator that consumes one output and writes it to stor ### Write Definition Types -Write definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly. +???+ info inline end "Adding new Write Definition Types" + + If you have a write definition that's not covered here, see the [process for adding new write definition types](/spec/extending). + +Write definition types are built by the community and added to the specification. 
=== "WriteRel Message" @@ -420,7 +428,7 @@ Write definition types are built by the community and added to the specification | Format | Enumeration of available formats. Only current option is PARQUET. | Required | -## DDL Operator +## DDL (Data Definition Language) Operator The operator that defines modifications of a database schema (CREATE/DROP/ALTER for TABLE and VIEWS). @@ -449,6 +457,6 @@ The operator that defines modifications of a database schema (CREATE/DROP/ALTER %%% proto.algebra.DdlRel %%% ``` -## Discussion Points +???+ question "Discussion Points" -* How to handle correlated operations? + * How should correlated operations be handled? diff --git a/site/docs/relations/physical_relations.md b/site/docs/relations/physical_relations.md index 74f5cd09c..b7482b693 100644 --- a/site/docs/relations/physical_relations.md +++ b/site/docs/relations/physical_relations.md @@ -27,7 +27,7 @@ The hash equijoin join operator will build a hash table out of the right input b | Join Type | One of the join types defined in the Join operator. | Required | -## NLJ Operator +## NLJ (Nested Loop Join) Operator The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. Will also include non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements. diff --git a/site/docs/serialization/binary_serialization.md b/site/docs/serialization/binary_serialization.md index c14ca36e5..31ab91c68 100644 --- a/site/docs/serialization/binary_serialization.md +++ b/site/docs/serialization/binary_serialization.md @@ -5,7 +5,7 @@ Substrait can be serialized into a [protobuf](https://developers.google.com/prot ## Plan -The main top-level object used to communicate a Substrait plan using protobuf is a Plan message. 
The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees. +The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see [ExtendedExpression](/expressions/extended_expression/) for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees. === "Plan Message" diff --git a/site/docs/spec/_config b/site/docs/spec/_config index dad9cfd1c..77dde9a2b 100644 --- a/site/docs/spec/_config +++ b/site/docs/spec/_config @@ -2,3 +2,4 @@ arrange: - versioning.md - specification.md - technology_principles.md + - extending.md diff --git a/site/docs/spec/extending.md b/site/docs/spec/extending.md new file mode 100644 index 000000000..41874df0b --- /dev/null +++ b/site/docs/spec/extending.md @@ -0,0 +1,49 @@ +# Extending + +Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are: + +* Substrait Mailing List +* Substrait Slack +* Community Meeting + +## Minor changes + +Simple changes like typos and bug fixes do not require as much effort. [File an issue](https://github.com/substrait-io/substrait/issues) or [send a PR](https://github.com/substrait-io/substrait/pulls) and we can discuss it there. + +## Complex changes + +For complex features it is useful to discuss the change first. It will be useful to gather some background information to help get everyone on the same page. + +### Outline the issue + +#### Language + +Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. 
However, Substrait is used by people who come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussions should endeavor to use Substrait terminology wherever possible. + +#### Motivation + +What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? If it is more of an internal relation then how does it map to existing logical relations? How is it different from other existing relations? Why do we need this? + +#### Examples + +Provide example input and output for the relation. Show example plans. Try to motivate your examples, as best as possible, with something that looks like a real-world problem. These will go a long way towards helping others understand the purpose of a relation. + +#### Alternatives + +Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently? + +### Survey existing implementation + +It's unlikely that this is the first time that this has been done. Figuring out + +### Prototype the feature + +Novel approaches should be implemented as an extension first. + +### Substrait design principles + +Substrait is designed around interoperability so a feature only used by a single system may not be accepted. But don't despair! Substrait has a highly developed extension system for this express purpose. + +### You don't have to do it alone + +If you are hoping to add a feature and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own. 
diff --git a/site/docs/tools/third_party_tools.md b/site/docs/tools/third_party_tools.md index 5ddeab935..9ae040209 100644 --- a/site/docs/tools/third_party_tools.md +++ b/site/docs/tools/third_party_tools.md @@ -3,4 +3,10 @@ ## Substrait-tools The [substrait-tools](https://pypi.org/project/substrait-tools/) python package provides a command line interface for producing/consuming substrait plans by leveraging the APIs -from different producers and consumers. \ No newline at end of file +from different producers and consumers. + +## Substrait Fiddle +[Substrait Fiddle](https://substrait-fiddle.com) is an online tool to share, debug, and prototype Substrait plans. + +The [Substrait Fiddle Source](https://github.com/voltrondata/substrait-fiddle) is available allowing it to be run in any environment. + diff --git a/site/docs/types/type_parsing.md b/site/docs/types/type_parsing.md index 396215e60..87bd3639e 100644 --- a/site/docs/types/type_parsing.md +++ b/site/docs/types/type_parsing.md @@ -10,26 +10,37 @@ The components of this expression are: | Component | Description | Required | | ---------------------- | ------------------------------------------------------------ | ------------------------------------- | -| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent). | | -| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a type name is appended with a question mark. | Optional, defaults to non-nullable | +| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent) although lowercase is preferred. | | +| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). 
| Optional, defaults to non-nullable | | Variation | When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. | Optional, defaults to [0] | | Parameters | Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type as declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. | Required where parameters are defined | ### Grammars -It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR [impl pending] grammar to ease consumption and production of types. +It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an [ANTLR grammar](https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/parser/grammar/SubstraitPlanParser.g4#L108) to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.) ### Structs & Named Structs -Structs are unique from other types because they have an arbitrary number of parameters. The parameters can also include one or two subproperties. Struct parsing is thus declared in the following two ways: +Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. 
Struct parsing is declared in the following two ways: -``` -# Struct -struct?[variation] +=== "YAML" -# Named Struct -nstruct?[variation] -``` + ``` + # Struct + struct?[variation] + + # Named Struct + nstruct?[variation] + ``` + +=== "Text Format Examples" + + ``` + // Struct + struct? + + // Named structs are not yet supported in the text format. + ``` In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping. diff --git a/site/docs/types/type_system.md b/site/docs/types/type_system.md index 56362d127..eef7ff507 100644 --- a/site/docs/types/type_system.md +++ b/site/docs/types/type_system.md @@ -13,4 +13,5 @@ Substrait types fundamentally consist of four components: Refer to [Type Parsing](type_parsing.md) for a description of the syntax used to describe types. -Note that Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md). +!!! note "Note" + Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md).
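
Review note on the `type_parsing.md` hunk: the component table describes a type expression as a case-insensitive name, an optional `?` placed directly after the name, and an optional bracketed integer variation that defaults to `[0]`. A minimal Python sketch of that prefix might look like the following. It is only an illustration: `ParsedType`, `parse_simple_type`, and the regular expression are hypothetical names that are not part of Substrait, the allowed characters for a type name are an assumption, and parameters (including struct fields) are deliberately not handled.

```python
import re
from dataclasses import dataclass

# Assumed shape of the non-parameterized part of a type expression:
# a name, an optional "?" right after it, then an optional "[variation]".
# The character set allowed in a name is a guess made for this sketch.
_SIMPLE_TYPE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)(\?)?(?:\[(\d+)\])?\s*$")


@dataclass(frozen=True)
class ParsedType:
    name: str       # lowercased, since case is not significant
    nullable: bool  # True when "?" follows the name
    variation: int  # 0 when no "[n]" suffix is present


def parse_simple_type(text: str) -> ParsedType:
    """Parse name, nullability and variation; reject anything else."""
    match = _SIMPLE_TYPE.match(text)
    if match is None:
        raise ValueError(f"not a simple type expression: {text!r}")
    name, question, variation = match.groups()
    return ParsedType(name.lower(), question is not None, int(variation or 0))
```

Lowercasing the name follows the table's remark that lowercase is preferred. Anything beyond this prefix is better served by the ANTLR grammar that the hunk links to.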