From 0ad01696759dcfdd848f5e065f0c704602bde93c Mon Sep 17 00:00:00 2001 From: David Sisson Date: Thu, 13 Jul 2023 13:42:32 -0700 Subject: [PATCH 1/6] feat: clarify map behavior * Map keys must be unique. * Map keys must not be NULL. * The map key type may be nullable. This is based on the current restrictions found in the wild. For example, DuckDB specifically calls out that keys must be unique in its implementation and Spark specifically calls out that keys may not be NULL. This makes sense as map keys often need to be hashable for performance reasons. If for some reason we need a map type that supports repeated keys it may be better to create an alternate type (or a clarifying parameter/option). --- site/docs/expressions/field_references.md | 2 +- site/docs/types/type_classes.md | 2 +- site/docs/types/type_parsing.md | 20 ++++++++++++++++++++ site/mkdocs.yml | 3 ++- 4 files changed, 24 insertions(+), 3 deletions(-) diff --git a/site/docs/expressions/field_references.md b/site/docs/expressions/field_references.md index 3e7f411b3..9c608c8cc 100644 --- a/site/docs/expressions/field_references.md +++ b/site/docs/expressions/field_references.md @@ -7,7 +7,7 @@ In Substrait, all fields are dealt with on a positional basis. Field names are o | Struct Field | Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. | struct | Type of field referenced | | Array Value | Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). | list | type of list | | Array Slice | Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. 
| list | Same type as original list | -| Map Key | A map value that is matched exactly against available map keys and returned. [TBD, can multiple matches occur?] | map | Value type of map | +| Map Key | A map value that is matched exactly against available map keys and returned. | map | Value type of map | | Map KeyExpression | A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] | map | List of map value type | | Masked Complex Expression | An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. | any | any | diff --git a/site/docs/types/type_classes.md b/site/docs/types/type_classes.md index bcad59dd4..c4edcc019 100644 --- a/site/docs/types/type_classes.md +++ b/site/docs/types/type_classes.md @@ -40,7 +40,7 @@ Compound type classes are type classes that need to be configured by means of a | STRUCT<T1,...,Tn> | A list of types in a defined order. | `repeated Literal`, types matching T1..Tn | NSTRUCT<N:T1,...,N:Tn> | **Pseudo-type**: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait's core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a | LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T -| MAP<K, V> | An unordered list of type K keys with type V values. 
| `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V +| MAP<K, V> | An unordered list of type K keys with type V values. Keys may not be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V ## User-Defined Types diff --git a/site/docs/types/type_parsing.md b/site/docs/types/type_parsing.md index 396215e60..e97673319 100644 --- a/site/docs/types/type_parsing.md +++ b/site/docs/types/type_parsing.md @@ -34,3 +34,23 @@ nstruct?[variation] In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping. Note, in core Substrait algebra, fields are unnamed and references are always based on zero-index ordinal positions. However, data inputs must declare name-to-ordinal mappings and outputs must declare ordinal-to-name mappings. As such, Substrait also provides a named struct which is a pseudo-type that is useful for human consumption. Outside these places, most structs in a Substrait plan are structs, not named-structs. The two cannot be used interchangeably. + +### Other Complex Types + +Similar to structs, maps and lists can also have a type as one of their parameters. Type references may be recursive. The key for a map is typically a simple type but it is not required. 
+ + +=== "YAML" + + ``` + list?> + map + ``` + +=== "Text Format Examples" + + ``` + list?> + list> + map>> + ``` diff --git a/site/mkdocs.yml b/site/mkdocs.yml index a4e025e79..3d0c2f199 100644 --- a/site/mkdocs.yml +++ b/site/mkdocs.yml @@ -94,7 +94,8 @@ markdown_extensions: - name: mermaid class: mermaid format: !!python/name:pymdownx.superfences.fence_code_format - - pymdownx.tabbed + - pymdownx.tabbed: + alternate_style: true - pymdownx.tasklist: custom_checkbox: true - pymdownx.tilde From c140468c86507b1a30011d3eec77286d710c0cbe Mon Sep 17 00:00:00 2001 From: David Sisson Date: Thu, 13 Jul 2023 13:13:13 -0700 Subject: [PATCH 2/6] docs: elucidate various Also adds to Substrait Fiddle the list of third party tools and updates some pages with admonitions. --- site/docs/expressions/embedded_functions.md | 22 +++++-------- site/docs/expressions/extended_expression.md | 6 ++++ site/docs/expressions/field_references.md | 10 ++---- site/docs/expressions/subqueries.md | 8 +++++ .../expressions/user_defined_functions.md | 31 ++++++++++++++++++- site/docs/relations/basics.md | 8 ++--- site/docs/relations/logical_relations.md | 18 ++++++++--- site/docs/relations/physical_relations.md | 2 +- .../serialization/binary_serialization.md | 2 +- site/docs/tools/third_party_tools.md | 8 ++++- site/docs/types/type_parsing.md | 30 ++++++++++++------ site/docs/types/type_system.md | 3 +- 12 files changed, 102 insertions(+), 46 deletions(-) diff --git a/site/docs/expressions/embedded_functions.md b/site/docs/expressions/embedded_functions.md index babbcde59..d0fa4d835 100644 --- a/site/docs/expressions/embedded_functions.md +++ b/site/docs/expressions/embedded_functions.md @@ -19,10 +19,9 @@ The binary representation of an embedded function is: ```proto %%% proto.message.Expression.EmbeddedFunction %%% ``` + === "Human Readable Representation" - n/a -=== "Example" - n/a + As the bytes are opaque to Substrait there is no equivalent human readable form. 
## Function Details @@ -49,16 +48,9 @@ There are many types of possible stored functions. For each, Substrait works to -## Discussion Points - -* What are the common embedded function formats? -* How do we expose the data for a function? -* How do we express batching capabilities? -* How do we ensure/declare containerization? - - - - - - +???+ question "Discussion Points" + * What are the common embedded function formats? + * How do we expose the data for a function? + * How do we express batching capabilities? + * How do we ensure/declare containerization? diff --git a/site/docs/expressions/extended_expression.md b/site/docs/expressions/extended_expression.md index ca8360b2d..e7fb18e1d 100644 --- a/site/docs/expressions/extended_expression.md +++ b/site/docs/expressions/extended_expression.md @@ -4,6 +4,12 @@ Extended Expression messages are provided for expression-level protocols as an a Since Extended Expression will be used seperately from the Plan rel representation, it will need to include basic fields like Version. +=== "ExtendedExpression Message" + + ```proto +%%% proto.message.ExtendedExpression %%% + ``` + ## Input and output data schema Similar to `base_schema` defined in [ReadRel](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L58), the input data schema describes the name/type/nullibilty and layout info of input data for the target expression evalutation. It also has a field `name` to define the name of the output data. 
diff --git a/site/docs/expressions/field_references.md b/site/docs/expressions/field_references.md index 9c608c8cc..492b67f9a 100644 --- a/site/docs/expressions/field_references.md +++ b/site/docs/expressions/field_references.md @@ -40,7 +40,7 @@ struct: - i64 ``` -Given this schema, you could declare a mask in pseudocode, such as: +Given this schema, you could declare a mask of fields to include in pseudocode, such as: ``` 0:[0,1:[..5:[0,2]]],2,3 @@ -77,11 +77,7 @@ By default, when only a single field is selected from a struct, that struct is r -## Discussion Points - -* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.) - - - +???+ question "Discussion Points" + * Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.) diff --git a/site/docs/expressions/subqueries.md b/site/docs/expressions/subqueries.md index 53e1813e5..c71fee0dc 100644 --- a/site/docs/expressions/subqueries.md +++ b/site/docs/expressions/subqueries.md @@ -65,3 +65,11 @@ WHERE x < ANY(SELECT y from t2) | Comparison operation | The kind of comparison operation to use | Yes | | Expression | Left-hand side expression to check | Yes | | Subquery | Subquery to check | Yes | + + + +=== "Protobuf Representation" + + ```proto +%%% proto.message.Expression.Subquery %%% + ``` diff --git a/site/docs/expressions/user_defined_functions.md b/site/docs/expressions/user_defined_functions.md index c5c23031e..b0c1ce27a 100644 --- a/site/docs/expressions/user_defined_functions.md +++ b/site/docs/expressions/user_defined_functions.md @@ -1,3 +1,32 @@ # User-Defined Functions -Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). In fact, the functions defined by Substrait use the same mechanism. 
The extension files for them can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions). +Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions). + +Here's an example function that doubles its input: + +!!! info inline end "Implementation Note" + This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). + +``` yaml +%YAML 1.2 +--- +scalar_functions: + - + name: "double" + description: "Double the value" + impls: + - args: + - name: x + value: fp32 + options: + on_domain_error: + values: [ NAN, ERROR ] + return: fp32 + - args: + - name: x + value: i32 + options: + on_domain_error: + values: [ NAN, ERROR ] + return: i32 +``` diff --git a/site/docs/relations/basics.md b/site/docs/relations/basics.md index 8c265d191..61ac52979 100644 --- a/site/docs/relations/basics.md +++ b/site/docs/relations/basics.md @@ -70,9 +70,9 @@ A guarantee that data output from this operation is provided with a sort order. -## Discussion Points +???+ question "Discussion Points" -* Do we try to make read definition types more extensible à la function signatures? Is that necessary if we have a custom relational operator? -* How do we express decomposed types. For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object. -* We currently include a "generic properties" property on read type. Do we want this dumping ground? 
+ * Should [read definition types](/relations/logical_relations/#read-definition-types) more extensible in the way that function signatures are? Are extensible read definition types necessary if we have custom relational operators? + * How are decomposed types expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object. + * We currently include a "generic properties" property on read type. Do we want this dumping ground? diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index 3bf5e4056..7105a7a18 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -42,7 +42,11 @@ A filter expression must be interpreted against the direct schema before the pro ### Read Definition Types -Read definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly. +???+ info inline end "Adding new Read Definition Types" + + If you have a read definition that's not covered here, see the [process for adding new read definition types](/governance/#substrait-voting-process). + +Read definition types are built by the community and added to the specification. #### Virtual Table @@ -393,7 +397,11 @@ The write operator is an operator that consumes one output and writes it to stor ### Write Definition Types -Write definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly. +???+ info inline end "Adding new Write Definition Types" + + If you have a read definition that's not covered here, see the [process for adding new write definition types](/governance/#substrait-voting-process). + +Write definition types are built by the community and added to the specification. 
=== "WriteRel Message" @@ -419,7 +427,7 @@ Write definition types are built by the community and added to the specification | Format | Enumeration of available formats. Only current option is PARQUET. | Required | -## DDL Operator +## DDL (Data Definition Language) Operator The operator that defines modifications of a database schema (CREATE/DROP/ALTER for TABLE and VIEWS). @@ -448,6 +456,6 @@ The operator that defines modifications of a database schema (CREATE/DROP/ALTER %%% proto.algebra.DdlRel %%% ``` -## Discussion Points +???+ "Discussion Points" -* How to handle correlated operations? + * How should correlated operations be handled? diff --git a/site/docs/relations/physical_relations.md b/site/docs/relations/physical_relations.md index c3d8da102..7e96b71b4 100644 --- a/site/docs/relations/physical_relations.md +++ b/site/docs/relations/physical_relations.md @@ -27,7 +27,7 @@ The hash equijoin join operator will build a hash table out of the right input b | Join Type | One of the join types defined in the Join operator. | Required | -## NLJ Operator +## NLJ (Nested Loop Join) Operator The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. Will also include non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements. diff --git a/site/docs/serialization/binary_serialization.md b/site/docs/serialization/binary_serialization.md index 1321c2276..9dd64525d 100644 --- a/site/docs/serialization/binary_serialization.md +++ b/site/docs/serialization/binary_serialization.md @@ -5,7 +5,7 @@ Substrait can be serialized into a [protobuf](https://developers.google.com/prot ## Plan -The main top-level object used to communicate a Substrait plan using protobuf is a Plan message. 
The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees. +The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see the [ExtendedExpression](/expressions/extended_expression/) for the other top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees. === "Plan Message" diff --git a/site/docs/tools/third_party_tools.md b/site/docs/tools/third_party_tools.md index 5ddeab935..4d302a833 100644 --- a/site/docs/tools/third_party_tools.md +++ b/site/docs/tools/third_party_tools.md @@ -3,4 +3,10 @@ ## Substrait-tools The [substrait-tools](https://pypi.org/project/substrait-tools/) python package provides a command line interface for producing/consuming substrait plans by leveraging the APIs -from different producers and consumers. \ No newline at end of file +from different producers and consumers.# Substrait Fiddle + +## Substrait Fiddle +[Substrait Fiddle](https://substrait-fiddle.com) is an online tool to share, debug, and prototype Substrait plans. + +The [Substrait Fiddle Source](https://github.com/voltrondata/substrait-fiddle) is available allowing it to be run in any environment. + diff --git a/site/docs/types/type_parsing.md b/site/docs/types/type_parsing.md index e97673319..e204f8472 100644 --- a/site/docs/types/type_parsing.md +++ b/site/docs/types/type_parsing.md @@ -10,26 +10,36 @@ The components of this expression are: | Component | Description | Required | | ---------------------- | ------------------------------------------------------------ | ------------------------------------- | -| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent). | | -| Nullability indicator | A type is either non-nullable or nullable. 
To express nullability, a type name is appended with a question mark. | Optional, defaults to non-nullable | +| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent) although lowercase is preferred. | | +| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). | Optional, defaults to non-nullable | | Variation | When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. | Optional, defaults to [0] | | Parameters | Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type as declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. | Required where parameters are defined | ### Grammars -It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR [impl pending] grammar to ease consumption and production of types. +It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an [ANTLR grammar](https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/parser/grammar/SubstraitPlanParser.g4#L108) to ease consumption and production of types. ### Structs & Named Structs -Structs are unique from other types because they have an arbitrary number of parameters. 
The parameters can also include one or two subproperties. Struct parsing is thus declared in the following two ways: +Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways: -``` -# Struct -struct?[variation] +=== "YAML" -# Named Struct -nstruct?[variation] -``` + ``` + # Struct + struct?[variation] + + # Named Struct + nstruct?[variation] + ``` + +=== "Text Format Examples" + + ``` + struct? + + # named structs are not yet supported in the text format + ``` In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping. diff --git a/site/docs/types/type_system.md b/site/docs/types/type_system.md index 56362d127..eef7ff507 100644 --- a/site/docs/types/type_system.md +++ b/site/docs/types/type_system.md @@ -13,4 +13,5 @@ Substrait types fundamentally consist of four components: Refer to [Type Parsing](type_parsing.md) for a description of the syntax used to describe types. -Note that Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md). +!!! note "Note" + Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md). From 118e16bb69b9e92893ffb94bd6b016b5be8d8b92 Mon Sep 17 00:00:00 2001 From: David Sisson Date: Thu, 27 Jul 2023 16:16:56 -0700 Subject: [PATCH 3/6] Some review notes. 
--- site/docs/expressions/user_defined_functions.md | 2 +- site/docs/relations/basics.md | 5 ++--- site/docs/relations/logical_relations.md | 6 +++--- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/site/docs/expressions/user_defined_functions.md b/site/docs/expressions/user_defined_functions.md index b0c1ce27a..fbd05f258 100644 --- a/site/docs/expressions/user_defined_functions.md +++ b/site/docs/expressions/user_defined_functions.md @@ -5,7 +5,7 @@ Substrait supports the creation of custom functions using [simple extensions](.. Here's an example function that doubles its input: !!! info inline end "Implementation Note" - This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). + This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error). ``` yaml %YAML 1.2 diff --git a/site/docs/relations/basics.md b/site/docs/relations/basics.md index 61ac52979..41e86f425 100644 --- a/site/docs/relations/basics.md +++ b/site/docs/relations/basics.md @@ -72,7 +72,6 @@ A guarantee that data output from this operation is provided with a sort order. ???+ question "Discussion Points" - * Should [read definition types](/relations/logical_relations/#read-definition-types) more extensible in the way that function signatures are? Are extensible read definition types necessary if we have custom relational operators? - * How are decomposed types expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object. - * We currently include a "generic properties" property on read type. Do we want this dumping ground? 
+ * Should [read definition types](/relations/logical_relations/#read-definition-types) be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators? + * How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object. diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index 7105a7a18..3f66ef6b5 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -46,7 +46,7 @@ A filter expression must be interpreted against the direct schema before the pro If you have a read definition that's not covered here, see the [process for adding new read definition types](/governance/#substrait-voting-process). -Read definition types are built by the community and added to the specification. +Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification. #### Virtual Table @@ -399,7 +399,7 @@ The write operator is an operator that consumes one output and writes it to stor ???+ info inline end "Adding new Write Definition Types" - If you have a read definition that's not covered here, see the [process for adding new write definition types](/governance/#substrait-voting-process). + If you have a write definition that's not covered here, see the [process for adding new write definition types](/governance/#substrait-voting-process). Write definition types are built by the community and added to the specification. @@ -456,6 +456,6 @@ The operator that defines modifications of a database schema (CREATE/DROP/ALTER %%% proto.algebra.DdlRel %%% ``` -???+ "Discussion Points" +???+ question "Discussion Points" * How should correlated operations be handled? 
From 3af2f7ae33dd1eb58880fa643db64e1bd2ea1796 Mon Sep 17 00:00:00 2001 From: David Sisson Date: Thu, 27 Jul 2023 22:20:28 -0700 Subject: [PATCH 4/6] Working on the review notes. --- site/docs/relations/logical_relations.md | 4 +- .../serialization/binary_serialization.md | 2 +- site/docs/spec/_config | 1 + site/docs/spec/extending.md | 78 +++++++++++++++++++ site/docs/tools/third_party_tools.md | 2 +- site/docs/types/type_parsing.md | 5 +- 6 files changed, 86 insertions(+), 6 deletions(-) create mode 100644 site/docs/spec/extending.md diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index 3f66ef6b5..e2edf6a6a 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -44,7 +44,7 @@ A filter expression must be interpreted against the direct schema before the pro ???+ info inline end "Adding new Read Definition Types" - If you have a read definition that's not covered here, see the [process for adding new read definition types](/governance/#substrait-voting-process). + If you have a read definition that's not covered here, see the [process for adding new read definition types](/spec/extending). Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification. @@ -399,7 +399,7 @@ The write operator is an operator that consumes one output and writes it to stor ???+ info inline end "Adding new Write Definition Types" - If you have a write definition that's not covered here, see the [process for adding new write definition types](/governance/#substrait-voting-process). + If you have a write definition that's not covered here, see the [process for adding new write definition types](/spec/extending). Write definition types are built by the community and added to the specification. 
diff --git a/site/docs/serialization/binary_serialization.md b/site/docs/serialization/binary_serialization.md index 9dd64525d..fff053c54 100644 --- a/site/docs/serialization/binary_serialization.md +++ b/site/docs/serialization/binary_serialization.md @@ -5,7 +5,7 @@ Substrait can be serialized into a [protobuf](https://developers.google.com/prot ## Plan -The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see the [ExtendedExpression](/expressions/extended_expression/) for the other top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees. +The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see the [ExtendedExpression](/expressions/extended_expression/) for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees. === "Plan Message" diff --git a/site/docs/spec/_config b/site/docs/spec/_config index dad9cfd1c..77dde9a2b 100644 --- a/site/docs/spec/_config +++ b/site/docs/spec/_config @@ -2,3 +2,4 @@ arrange: - versioning.md - specification.md - technology_principles.md + - extending.md diff --git a/site/docs/spec/extending.md b/site/docs/spec/extending.md new file mode 100644 index 000000000..2d5d48d13 --- /dev/null +++ b/site/docs/spec/extending.md @@ -0,0 +1,78 @@ +# Extending + +Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are: + +* Substrait Mailing List +* Substrait Slack +* Community Meeting + +## Minor changes + +Simple changes like typos and bug fixes do not require as much effort. File an issue or send a PR and we can discuss it there. 
+ +## Complex changes + +For complex features it is useful to discuss the change first. It will be useful to gather some background information to help get everyone on the same page. + +### Outline the issue + +Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible. + +### Survey existing implementation + +It's unlikely that this is the first time that this has been done. Figuring out + +### Prototype the feature + +Novel approaches should be implemented as an extension first. + +### Substrait design principles + +Substrait is designed around interoperability so a feature only used by a single system may not be accepted. But don't despair! Substrait has a highly developed extension system for this express purpose. + + + + + + + + + +https://github.com/substrait-io/substrait/issues + + +https://github.com/substrait-io/substrait/pulls + + + + + +# Mailing list discussion + +Additions to the spec can start with a PR but should be accompanied by a discussion on the mailing list before the PR advances too far. This allows for a broader audience. Also, the spec itself aims to be very terse and so it does not leave room for things like "examples", "motivation", and so on. So either all of that information needs to go into the PR description or (more commonly) it is just ignored. I think it is easier for this information to go in a Google Doc (or some other shareable doc) and be linked into a mailing list discussion so that others can take a look. + +# Survey of existing implementations + +We should not be changing the spec for the sake of a single implementation. 
We have designed many extension points for this very purpose. If a feature only makes sense in one implementation then it should be an extension. + +This means that anyone adding a relation should do an investigation and / or survey of other engines out there to figure out how they work. This is perhaps the most challenging item in this criteria and the one that I feel has been left out of every single PR proposed so far. + +# Everything should be worded in Substrait Terminology + +Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible. + +# Motivation + +What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? If it is more of an internal relation then how does it map to existing logical relations? How is it different than other existing relations? Why do we need this? + +# Examples + +Provide example input and output for the relation. Show example plans. Try and motivate your examples, as best as possible, with something that looks like a real world problem. These will go a long ways towards helping others understand the purpose of a relation. + +# Alternatives + +Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently? + +# You don't have to do it alone + +If you are hoping to add a new relation and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own. 
However, it is something that I think the mailing list is a good fit for. diff --git a/site/docs/tools/third_party_tools.md b/site/docs/tools/third_party_tools.md index 4d302a833..9ae040209 100644 --- a/site/docs/tools/third_party_tools.md +++ b/site/docs/tools/third_party_tools.md @@ -3,7 +3,7 @@ ## Substrait-tools The [substrait-tools](https://pypi.org/project/substrait-tools/) python package provides a command line interface for producing/consuming substrait plans by leveraging the APIs -from different producers and consumers.# Substrait Fiddle +from different producers and consumers. ## Substrait Fiddle [Substrait Fiddle](https://substrait-fiddle.com) is an online tool to share, debug, and prototype Substrait plans. diff --git a/site/docs/types/type_parsing.md b/site/docs/types/type_parsing.md index e204f8472..2a926d37d 100644 --- a/site/docs/types/type_parsing.md +++ b/site/docs/types/type_parsing.md @@ -17,7 +17,7 @@ The components of this expression are: ### Grammars -It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an [ANTLR grammar](https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/parser/grammar/SubstraitPlanParser.g4#L108) to ease consumption and production of types. +It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an [ANTLR grammar](https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/parser/grammar/SubstraitPlanParser.g4#L108) to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.) ### Structs & Named Structs @@ -36,9 +36,10 @@ Structs are unique from other types because they have an arbitrary number of par === "Text Format Examples" ``` + // Struct struct? 
- # named structs are not yet supported in the text format + // Named structs are not yet supported in the text format. ``` In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping. From 7b15c66ab2bb669fc29f2e23d76bdedd3f161968 Mon Sep 17 00:00:00 2001 From: David Sisson Date: Thu, 27 Jul 2023 23:38:29 -0700 Subject: [PATCH 5/6] Added the extending page. --- site/docs/spec/extending.md | 63 ++++++++++--------------------------- 1 file changed, 17 insertions(+), 46 deletions(-) diff --git a/site/docs/spec/extending.md b/site/docs/spec/extending.md index 2d5d48d13..41874df0b 100644 --- a/site/docs/spec/extending.md +++ b/site/docs/spec/extending.md @@ -8,7 +8,7 @@ Substrait is a community project and requires consensus about new additions to t ## Minor changes -Simple changes like typos and bug fixes do not require as much effort. File an issue or send a PR and we can discuss it there. +Simple changes like typos and bug fixes do not require as much effort. [File an issue](https://github.com/substrait-io/substrait/issues) or [send a PR](https://github.com/substrait-io/substrait/pulls) and we can discuss it there. ## Complex changes @@ -16,63 +16,34 @@ For complex features it is useful to discuss the change first. It will be usefu ### Outline the issue -Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. 
As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible. - -### Survey existing implementation - -It's unlikely that this is the first time that this has been done. Figuring out - -### Prototype the feature - -Novel approaches should be implemented as an extension first. - -### Substrait design principles - -Substrait is designed around interoperability so a feature only used by a single system may not be accepted. But don't dispair! Substrait has a highly developed extension system for this express purpose. - - - - - - - - +#### Language -https://github.com/substrait-io/substrait/issues - - -https://github.com/substrait-io/substrait/pulls - - - - - -# Mailing list discussion +Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible. -Additions to the spec can start with a PR but should be accompanied by a discussion on the mailing list before the PR advanced too far. This allows for a broader audience. Also, the spec itself aims to be very terse and so it does not leave room for things like "examples", "motivation", and so on. So either all of that information needs to go into the PR description or (more commonly) it is just ignored. I think it is easier for this information to go in a google doc (or some other shareable doc) and linked into a mailing list discussion so that others can take a look. +#### Motivation -# Survey of existing implementations +What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? 
If it is more of an internal relation then how does it map to existing logical relations? How is it different from other existing relations? Why do we need this? -We should not be changing the spec for the sake of a single implementation. We have designed many extension points for this very purpose. If a feature only makes sense in one implementation then it should be an extension. +#### Examples -This means that anyone adding a relation should do an investigation and / or survey of other engines out there to figure out how they work. This is perhaps the most challenging item in this criteria and the one that I feel has been left out of every single PR proposed so far. +Provide example input and output for the relation. Show example plans. Try to motivate your examples, as best as possible, with something that looks like a real-world problem. These will go a long way toward helping others understand the purpose of a relation. -# Everything should be worded in Substrait Terminology +#### Alternatives -Every engine has its own terminology. Every Spark user probably knows what an "attribute" is. Velox users will know what a "RowVector" means. Etc. However, Substrait is used by people that come from a variety of backgrounds and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible. +Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently? -# Motivation +### Survey existing implementation -What problems does this relation solve? If it is a more logical relation then how does it allow users to express new capabilities? If it is more of an internal relation then how does it map to existing logical relations? How is it different than other existing relations? Why do we need this? +It's unlikely that this is the first time that this has been done.
Figuring out -# Examples +### Prototype the feature -Provide example input and output for the relation. Show example plans. Try and motivate your examples, as best as possible, with something that looks like a real world problem. These will go a long ways towards helping others understand the purpose of a relation. +Novel approaches should be implemented as an extension first. -# Alternatives +### Substrait design principles -Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently? +Substrait is designed around interoperability, so a feature only used by a single system may not be accepted. But don't despair! Substrait has a highly developed extension system for this express purpose. -# You don't have to do it alone +### You don't have to do it alone -If you are hoping to add a new relation and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own. However, it is something that I think the mailing list is a good fit for. +If you are hoping to add a feature and these criteria seem intimidating, then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own. From 1dcce2524acc26497a43d895868435bca3f6036f Mon Sep 17 00:00:00 2001 From: David Sisson Date: Thu, 13 Jul 2023 13:42:32 -0700 Subject: [PATCH 6/6] Revert "feat: clarify map behavior" This reverts commit 0ad01696759dcfdd848f5e065f0c704602bde93c.
--- site/docs/expressions/field_references.md | 2 +- site/docs/types/type_classes.md | 2 +- site/docs/types/type_parsing.md | 20 -------------------- site/mkdocs.yml | 3 +-- 4 files changed, 3 insertions(+), 24 deletions(-) diff --git a/site/docs/expressions/field_references.md b/site/docs/expressions/field_references.md index 492b67f9a..e11346101 100644 --- a/site/docs/expressions/field_references.md +++ b/site/docs/expressions/field_references.md @@ -7,7 +7,7 @@ In Substrait, all fields are dealt with on a positional basis. Field names are o | Struct Field | Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. | struct | Type of field referenced | | Array Value | Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). | list | type of list | | Array Slice | Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. | list | Same type as original list | -| Map Key | A map value that is matched exactly against available map keys and returned. | map | Value type of map | +| Map Key | A map value that is matched exactly against available map keys and returned. [TBD, can multiple matches occur?] | map | Value type of map | | Map KeyExpression | A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] 
| map | List of map value type | | Masked Complex Expression | An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. | any | any | diff --git a/site/docs/types/type_classes.md b/site/docs/types/type_classes.md index c4edcc019..bcad59dd4 100644 --- a/site/docs/types/type_classes.md +++ b/site/docs/types/type_classes.md @@ -40,7 +40,7 @@ Compound type classes are type classes that need to be configured by means of a | STRUCT<T1,...,Tn> | A list of types in a defined order. | `repeated Literal`, types matching T1..Tn | NSTRUCT<N:T1,...,N:Tn> | **Pseudo-type**: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait's core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a | LIST<T> | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T -| MAP<K, V> | An unordered list of type K keys with type V values. Keys may not be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V +| MAP<K, V> | An unordered list of type K keys with type V values. 
| `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V ## User-Defined Types diff --git a/site/docs/types/type_parsing.md b/site/docs/types/type_parsing.md index 2a926d37d..87bd3639e 100644 --- a/site/docs/types/type_parsing.md +++ b/site/docs/types/type_parsing.md @@ -45,23 +45,3 @@ Structs are unique from other types because they have an arbitrary number of par In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping. Note, in core Substrait algebra, fields are unnamed and references are always based on zero-index ordinal positions. However, data inputs must declare name-to-ordinal mappings and outputs must declare ordinal-to-name mappings. As such, Substrait also provides a named struct which is a pseudo-type that is useful for human consumption. Outside these places, most structs in a Substrait plan are structs, not named-structs. The two cannot be used interchangeably. - -### Other Complex Types - -Similar to structs, maps and lists can also have a type as one of their parameters. Type references may be recursive. The key for a map is typically a simple type but it is not required. - - -=== "YAML" - - ``` - list?> - map - ``` - -=== "Text Format Examples" - - ``` - list?> - list> - map>> - ``` diff --git a/site/mkdocs.yml b/site/mkdocs.yml index 3d0c2f199..a4e025e79 100644 --- a/site/mkdocs.yml +++ b/site/mkdocs.yml @@ -94,8 +94,7 @@ markdown_extensions: - name: mermaid class: mermaid format: !!python/name:pymdownx.superfences.fence_code_format - - pymdownx.tabbed: - alternate_style: true + - pymdownx.tabbed - pymdownx.tasklist: custom_checkbox: true - pymdownx.tilde