
docs: elucidate obscure discussion points #523

Merged (11 commits) on Nov 22, 2023
22 changes: 7 additions & 15 deletions site/docs/expressions/embedded_functions.md
Original file line number Diff line number Diff line change
@@ -19,10 +19,9 @@ The binary representation of an embedded function is:
```proto
%%% proto.message.Expression.EmbeddedFunction %%%
```

=== "Human Readable Representation"
n/a
=== "Example"
n/a
As the bytes are opaque to Substrait there is no equivalent human readable form.


## Function Details
@@ -49,16 +48,9 @@ There are many types of possible stored functions. For each, Substrait works to



## Discussion Points

* What are the common embedded function formats?
* How do we expose the data for a function?
* How do we express batching capabilities?
* How do we ensure/declare containerization?






???+ question "Discussion Points"

* What are the common embedded function formats?
* How do we expose the data for a function?
* How do we express batching capabilities?
* How do we ensure/declare containerization?
6 changes: 6 additions & 0 deletions site/docs/expressions/extended_expression.md
@@ -4,6 +4,12 @@ Extended Expression messages are provided for expression-level protocols as an a

Since Extended Expression will be used separately from the Plan rel representation, it will need to include basic fields like Version.

=== "ExtendedExpression Message"

```proto
%%% proto.message.ExtendedExpression %%%
```

## Input and output data schema

Similar to `base_schema` defined in [ReadRel](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L58), the input data schema describes the name/type/nullability and layout info of input data for the target expression evaluation. It also has a field `name` to define the name of the output data.
12 changes: 4 additions & 8 deletions site/docs/expressions/field_references.md
@@ -7,7 +7,7 @@ In Substrait, all fields are dealt with on a positional basis. Field names are o
| Struct Field | Ordinal position. Zero-based. Only legal within the range of possible fields within a struct. Selecting an ordinal outside the applicable field range results in an invalid plan. | struct | Type of field referenced |
| Array Value | Array offset. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Negative and positive overflows return null values (no wrapping). | list | type of list |
| Array Slice | Array offset and element count. Zero-based. Negative numbers can be used to describe an offset relative to the end of the array. For example, -1 means the last element in an array. Position does not wrap, nor does length. | list | Same type as original list |
| Map Key | A map value that is matched exactly against available map keys and returned. [TBD, can multiple matches occur?] | map | Value type of map |
| Map Key | A map value that is matched exactly against available map keys and returned. | map | Value type of map |
| Map KeyExpression | A wildcard string that is matched against a simplified form of regular expressions. Requires the key type of the map to be a character type. [Format detail needed, intention to include basic regex concepts such as greedy/non-greedy.] | map | List of map value type |
| Masked Complex Expression | An expression that provides a mask over a schema declaring which portions of the schema should be presented. This allows a user to select a portion of a complex object but mask certain subsections of that same object. | any | any |
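
The Array Value row above (zero-based offsets, negative offsets relative to the end, nulls on overflow) can be sketched in Python. This is only an illustration under stated assumptions: `array_value` is a hypothetical helper name, not part of any Substrait library, and Python's `None` stands in for a null value.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def array_value(values: Sequence[T], offset: int) -> Optional[T]:
    """Return the element at a zero-based offset (hypothetical helper).

    Negative offsets count from the end of the list, so -1 is the last element.
    """
    index = offset if offset >= 0 else len(values) + offset
    # Positive and negative overflows yield null (None) instead of wrapping.
    if index < 0 or index >= len(values):
        return None
    return values[index]
```

Note that Python's own negative indexing raises `IndexError` on overflow, which is why the bounds check is explicit here.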

@@ -40,7 +40,7 @@ struct:
- i64
```

Given this schema, you could declare a mask in pseudocode, such as:
Given this schema, you could declare a mask of fields to include in pseudocode, such as:

```
0:[0,1:[..5:[0,2]]],2,3
@@ -77,11 +77,7 @@ By default, when only a single field is selected from a struct, that struct is r



## Discussion Points

* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)



???+ question "Discussion Points"

* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)

8 changes: 8 additions & 0 deletions site/docs/expressions/subqueries.md
@@ -65,3 +65,11 @@ WHERE x < ANY(SELECT y from t2)
| Comparison operation | The kind of comparison operation to use | Yes |
| Expression | Left-hand side expression to check | Yes |
| Subquery | Subquery to check | Yes |



=== "Protobuf Representation"

```proto
%%% proto.message.Expression.Subquery %%%
```
31 changes: 30 additions & 1 deletion site/docs/expressions/user_defined_functions.md
@@ -1,3 +1,32 @@
# User-Defined Functions

Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). In fact, the functions defined by Substrait use the same mechanism. The extension files for them can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions).
Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions).

Here's an example function that doubles its input:

!!! info inline end "Implementation Note"
    This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings).

``` yaml
%YAML 1.2
---
scalar_functions:
  -
    name: "double"
    description: "Double the value"
    impls:
      - args:
          - name: x
            value: fp32
        options:
          on_domain_error:
            values: [ NAN, ERROR ]
        return: fp32
      - args:
          - name: x
            value: i32
        options:
          on_domain_error:
            values: [ NAN, ERROR ]
        return: i32
```
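
To illustrate how the two `impls` above differ only by argument type, here is a hedged Python sketch that mirrors the YAML as a plain dictionary and picks the matching implementation. `DOUBLE_FUNCTION` and `resolve_return_type` are hypothetical names used for illustration; they are not part of any Substrait library, and a real consumer would do considerably more than this.

```python
from typing import Dict, List

# Hand-written mirror of the YAML extension above (options omitted for brevity).
DOUBLE_FUNCTION: Dict = {
    "name": "double",
    "description": "Double the value",
    "impls": [
        {"args": [{"name": "x", "value": "fp32"}], "return": "fp32"},
        {"args": [{"name": "x", "value": "i32"}], "return": "i32"},
    ],
}


def resolve_return_type(function: Dict, arg_types: List[str]) -> str:
    """Return the declared return type of the impl whose argument types match exactly."""
    for impl in function["impls"]:
        declared = [arg["value"] for arg in impl["args"]]
        if declared == list(arg_types):
            return impl["return"]
    # No implicit coercion is attempted: an unmatched signature is an error.
    raise TypeError(f"{function['name']} has no impl for {arg_types}")
```

With this sketch, `resolve_return_type(DOUBLE_FUNCTION, ["i32"])` selects the second implementation, while an `i64` argument is rejected because only `fp32` and `i32` are declared.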
8 changes: 4 additions & 4 deletions site/docs/relations/basics.md
@@ -70,9 +70,9 @@ A guarantee that data output from this operation is provided with a sort order.



## Discussion Points
???+ question "Discussion Points"

* Do we try to make read definition types more extensible à la function signatures? Is that necessary if we have a custom relational operator?
* How do we express decomposed types. For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.
* We currently include a "generic properties" property on read type. Do we want this dumping ground?
* Should [read definition types](/relations/logical_relations/#read-definition-types) be more extensible in the way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
* How are decomposed types expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.
* We currently include a "generic properties" property on read type. Do we want this dumping ground?

18 changes: 13 additions & 5 deletions site/docs/relations/logical_relations.md
@@ -42,7 +42,11 @@ A filter expression must be interpreted against the direct schema before the pro

### Read Definition Types

Read definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly.
???+ info inline end "Adding new Read Definition Types"

If you have a read definition that's not covered here, see the [process for adding new read definition types](/governance/#substrait-voting-process).
Review comment (Member):

Hmm... I'm not sure the voting process is the exact same thing as "the process for adding new read definition types".

For example, if someone asked me what that process was, I would recommend they first implement the read in at least one engine using an extension type. Then they bring it up for discussion with the community.

Reply (Member Author):

I'm adding a whole new section under the specification section called "Extending" where we can discuss these topics. You may find some of the language there familiar.


Read definition types are built by the community and added to the specification.

#### Virtual Table

@@ -393,7 +397,11 @@ The write operator is an operator that consumes one output and writes it to stor

### Write Definition Types

Write definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly.
???+ info inline end "Adding new Write Definition Types"

If you have a write definition that's not covered here, see the [process for adding new write definition types](/governance/#substrait-voting-process).

Write definition types are built by the community and added to the specification.


=== "WriteRel Message"
@@ -419,7 +427,7 @@ Write definition types are built by the community and added to the specification
| Format | Enumeration of available formats. Only current option is PARQUET. | Required |


## DDL Operator
## DDL (Data Definition Language) Operator

The operator that defines modifications of a database schema (CREATE/DROP/ALTER for TABLE and VIEWS).

@@ -448,6 +456,6 @@ The operator that defines modifications of a database schema (CREATE/DROP/ALTER
%%% proto.algebra.DdlRel %%%
```

## Discussion Points
???+ "Discussion Points"
Review comment (Member):

These discussion points are blue where they seem to be green elsewhere. Is that the +question thing?

Reply (Member Author):

Yep, the question indicates the treatment of this section.

* How to handle correlated operations?
* How should correlated operations be handled?
2 changes: 1 addition & 1 deletion site/docs/relations/physical_relations.md
@@ -27,7 +27,7 @@ The hash equijoin join operator will build a hash table out of the right input b
| Join Type | One of the join types defined in the Join operator. | Required |


## NLJ Operator
## NLJ (Nested Loop Join) Operator

The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. It also includes non-matching rows in the OUTER, LEFT and RIGHT operations per the join type requirements.
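
The nested loop join described above can be sketched in a few lines of Python. This is a minimal illustration, not an engine implementation: `nested_loop_join` is a hypothetical function, rows are arbitrary Python values, `None` stands in for the null-padded side of a non-matching row, and the join type strings are assumed labels taken from the paragraph above.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

JoinedRow = Tuple[Optional[Any], Optional[Any]]


def nested_loop_join(
    left: Iterable[Any],
    right: Iterable[Any],
    predicate: Callable[[Any, Any], bool],
    join_type: str = "INNER",
) -> List[JoinedRow]:
    """Join two inputs by testing the predicate on every (left, right) pair."""
    right_rows = list(right)  # hold the entire right input in memory
    right_matched = [False] * len(right_rows)
    output: List[JoinedRow] = []

    for left_row in left:
        matched = False
        for i, right_row in enumerate(right_rows):
            if predicate(left_row, right_row):
                output.append((left_row, right_row))
                matched = True
                right_matched[i] = True
        # LEFT and OUTER joins keep left rows that matched nothing.
        if not matched and join_type in ("LEFT", "OUTER"):
            output.append((left_row, None))

    # RIGHT and OUTER joins keep right rows that were never matched.
    if join_type in ("RIGHT", "OUTER"):
        for i, right_row in enumerate(right_rows):
            if not right_matched[i]:
                output.append((None, right_row))
    return output
```

The cost is proportional to the product of the two input sizes, which is why the right input is materialized once and reused for every left row.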

2 changes: 1 addition & 1 deletion site/docs/serialization/binary_serialization.md
@@ -5,7 +5,7 @@ Substrait can be serialized into a [protobuf](https://developers.google.com/prot

## Plan

The main top-level object used to communicate a Substrait plan using protobuf is a Plan message. The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.
The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see the [ExtendedExpression](/expressions/extended_expression/) for the other top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.

=== "Plan Message"

8 changes: 7 additions & 1 deletion site/docs/tools/third_party_tools.md
@@ -3,4 +3,10 @@
## Substrait-tools
The [substrait-tools](https://pypi.org/project/substrait-tools/) python package provides
a command line interface for producing/consuming substrait plans by leveraging the APIs
from different producers and consumers.
from different producers and consumers.

## Substrait Fiddle
[Substrait Fiddle](https://substrait-fiddle.com) is an online tool to share, debug, and prototype Substrait plans.

The [Substrait Fiddle Source](https://github.com/voltrondata/substrait-fiddle) is available allowing it to be run in any environment.

2 changes: 1 addition & 1 deletion site/docs/types/type_classes.md
@@ -40,7 +40,7 @@ Compound type classes are type classes that need to be configured by means of a
| STRUCT&lt;T1,...,Tn&gt; | A list of types in a defined order. | `repeated Literal`, types matching T1..Tn
| NSTRUCT&lt;N:T1,...,N:Tn&gt; | **Pseudo-type**: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait's core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a
| LIST&lt;T&gt; | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may not be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
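
The two MAP constraints added above (keys may not be repeated, keys may not be null) can be checked with a short Python sketch. `validate_map_keys` is a hypothetical helper, `None` stands in for null, and the sketch assumes keys are hashable Python values.

```python
from typing import Any, Hashable, Iterable, Tuple


def validate_map_keys(pairs: Iterable[Tuple[Hashable, Any]]) -> None:
    """Raise ValueError if a map literal has a null key or a repeated key."""
    seen = set()
    for key, _value in pairs:
        # The key type may be nullable, but an actual key may not be null.
        if key is None:
            raise ValueError("map keys may not be null")
        if key in seen:
            raise ValueError(f"duplicate map key: {key!r}")
        seen.add(key)
```

Values are not inspected here, since the constraints in the table apply to keys only.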

## User-Defined Types

50 changes: 40 additions & 10 deletions site/docs/types/type_parsing.md
@@ -10,27 +10,57 @@ The components of this expression are:

| Component | Description | Required |
| ---------------------- | ------------------------------------------------------------ | ------------------------------------- |
| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent). | |
| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a type name is appended with a question mark. | Optional, defaults to non-nullable |
| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent) although lowercase is preferred. | |
| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). | Optional, defaults to non-nullable |
| Variation | When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. | Optional, defaults to [0] |
| Parameters | Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type as declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. | Required where parameters are defined |
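
As a rough illustration of the four components in the table (name, nullability indicator, variation, parameters), the following Python sketch parses a type string of the form `name?[variation]<param, ...>`. It is not the project's ANTLR grammar: `ParsedType` and `parse_type` are hypothetical names, named structs are not handled, and normalizing the name to lowercase is an assumption based on the stated preference for lowercase.

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ParsedType:
    name: str
    nullable: bool = False
    variation: int = 0
    parameters: List[Union[int, "ParsedType"]] = field(default_factory=list)


# name, optional "?", optional "[variation]"
_HEAD = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)(\?)?(?:\[(\d+)\])?\s*")


def _split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside angle brackets."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_type(text: str) -> ParsedType:
    match = _HEAD.match(text)
    if match is None:
        raise ValueError(f"not a type expression: {text!r}")
    name, nullable, variation = match.groups()
    rest = text[match.end():].strip()
    parameters: List[Union[int, ParsedType]] = []
    if rest:
        if not (rest.startswith("<") and rest.endswith(">")):
            raise ValueError(f"malformed parameters: {rest!r}")
        for part in _split_top_level(rest[1:-1]):
            # Integer parameters stay integers; anything else is a nested type.
            parameters.append(int(part) if part.isdigit() else parse_type(part))
    # Type names are case-insensitive, so normalize to lowercase (assumption).
    return ParsedType(name.lower(), nullable is not None, int(variation or 0), parameters)
```

Because parameters are parsed recursively, nested declarations such as `map<i32?, list<string>>` yield a tree of `ParsedType` values, and unbalanced brackets are rejected with a `ValueError`.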

### Grammars

It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR [impl pending] grammar to ease consumption and production of types.
It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an [ANTLR grammar](https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/parser/grammar/SubstraitPlanParser.g4#L108) to ease consumption and production of types.

### Structs & Named Structs

Structs are unique from other types because they have an arbitrary number of parameters. The parameters can also include one or two subproperties. Struct parsing is thus declared in the following two ways:
Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways:

```
# Struct
struct?[variation]<type0, type1,..., typeN>
=== "YAML"

# Named Struct
nstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>
```
```
# Struct
struct?[variation]<type0, type1,..., typeN>

# Named Struct
nstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>
```

=== "Text Format Examples"

```
struct?<string, i8, i32?, timestamp_tz>

# named structs are not yet supported in the text format
```

In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping.

Note, in core Substrait algebra, fields are unnamed and references are always based on zero-index ordinal positions. However, data inputs must declare name-to-ordinal mappings and outputs must declare ordinal-to-name mappings. As such, Substrait also provides a named struct which is a pseudo-type that is useful for human consumption. Outside these places, most structs in a Substrait plan are structs, not named-structs. The two cannot be used interchangeably.
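
The relationship between names and ordinals described above can be sketched as two small Python helpers. Both `build_name_to_ordinal` and `label_output` are hypothetical names for illustration, and rejecting duplicate input names is an assumption of this sketch rather than a rule stated in the text.

```python
from typing import Dict, List


def build_name_to_ordinal(names: List[str]) -> Dict[str, int]:
    """Map each input field name to its zero-based ordinal position."""
    mapping: Dict[str, int] = {}
    for ordinal, name in enumerate(names):
        if name in mapping:
            # Assumption: input names are unique within a struct.
            raise ValueError(f"duplicate field name: {name!r}")
        mapping[name] = ordinal
    return mapping


def label_output(names: List[str], row: tuple) -> Dict[str, object]:
    """Attach output names to a positional row (ordinal-to-name mapping)."""
    if len(names) != len(row):
        raise ValueError("one name is required per output field")
    return dict(zip(names, row))
```

Inside the plan itself only the ordinals would be used; the names matter at the boundaries, where data enters or leaves.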

### Other Complex Types

Similar to structs, maps and lists can also have a type as one of their parameters. Type references may be recursive. The key for a map is typically a simple type but it is not required.


=== "YAML"

```
list?<type>
map<type0, type1>
```

=== "Text Format Examples"

```
list?<list<string>>
list<struct<string, i32>>
map<i32?, list<map<i32, string?>>>
```
3 changes: 2 additions & 1 deletion site/docs/types/type_system.md
@@ -13,4 +13,5 @@ Substrait types fundamentally consist of four components:

Refer to [Type Parsing](type_parsing.md) for a description of the syntax used to describe types.

Note that Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md).
!!! note "Note"
Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md).
3 changes: 2 additions & 1 deletion site/mkdocs.yml
@@ -94,7 +94,8 @@ markdown_extensions:
- name: mermaid
class: mermaid
format: !!python/name:pymdownx.superfences.fence_code_format
- pymdownx.tabbed
- pymdownx.tabbed:
alternate_style: true
- pymdownx.tasklist:
custom_checkbox: true
- pymdownx.tilde