docs: elucidate obscure discussion points #523

Merged
merged 11 commits into from
Nov 22, 2023
22 changes: 7 additions & 15 deletions site/docs/expressions/embedded_functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,9 @@ The binary representation of an embedded function is:
```proto
%%% proto.message.Expression.EmbeddedFunction %%%
```

=== "Human Readable Representation"
n/a
=== "Example"
n/a
As the bytes are opaque to Substrait there is no equivalent human readable form.


## Function Details
Expand All @@ -49,16 +48,9 @@ There are many types of possible stored functions. For each, Substrait works to



## Discussion Points

* What are the common embedded function formats?
* How do we expose the data for a function?
* How do we express batching capabilities?
* How do we ensure/declare containerization?






???+ question "Discussion Points"

* What are the common embedded function formats?
* How do we expose the data for a function?
* How do we express batching capabilities?
* How do we ensure/declare containerization?
6 changes: 6 additions & 0 deletions site/docs/expressions/extended_expression.md
Expand Up @@ -4,6 +4,12 @@ Extended Expression messages are provided for expression-level protocols as an a

Since Extended Expression will be used separately from the Plan rel representation, it will need to include basic fields like Version.

=== "ExtendedExpression Message"

```proto
%%% proto.message.ExtendedExpression %%%
```

## Input and output data schema

Similar to `base_schema` defined in [ReadRel](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L58), the input data schema describes the name/type/nullability and layout info of input data for the target expression evaluation. It also has a field `name` to define the name of the output data.
10 changes: 3 additions & 7 deletions site/docs/expressions/field_references.md
Expand Up @@ -107,7 +107,7 @@ struct:
- i64
```

Given this schema, you could declare a mask in pseudocode, such as:
Given this schema, you could declare a mask of fields to include in pseudocode, such as:

```
0:[0,1:[..5:[0,2]]],2,3
Expand Down Expand Up @@ -144,11 +144,7 @@ By default, when only a single field is selected from a struct, that struct is r



## Discussion Points

* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)



???+ question "Discussion Points"

* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)

8 changes: 8 additions & 0 deletions site/docs/expressions/subqueries.md
Expand Up @@ -65,3 +65,11 @@ WHERE x < ANY(SELECT y from t2)
| Comparison operation | The kind of comparison operation to use | Yes |
| Expression | Left-hand side expression to check | Yes |
| Subquery | Subquery to check | Yes |



=== "Protobuf Representation"

```proto
%%% proto.message.Expression.Subquery %%%
```
31 changes: 30 additions & 1 deletion site/docs/expressions/user_defined_functions.md
@@ -1,3 +1,32 @@
# User-Defined Functions

Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). In fact, the functions defined by Substrait use the same mechanism. The extension files for them can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions).
Substrait supports the creation of custom functions using [simple extensions](../extensions/index.md#simple-extensions), using the facilities described in [scalar functions](scalar_functions.md). The functions defined by Substrait use the same mechanism. The extension files for standard functions can be found [here](https://github.com/substrait-io/substrait/tree/main/extensions).

Here's an example function that doubles its input:

!!! info inline end "Implementation Note"
This implementation is only defined on 32-bit floats and integers but could be defined on all numbers (and even lists and strings). The user of the implementation can specify what happens when the resulting value falls outside of the valid range for a 32-bit float (either return NAN or raise an error).

``` yaml
%YAML 1.2
---
scalar_functions:
  -
    name: "double"
    description: "Double the value"
    impls:
      - args:
          - name: x
            value: fp32
        options:
          on_domain_error:
            values: [ NAN, ERROR ]
        return: fp32
      - args:
          - name: x
            value: i32
        options:
          on_domain_error:
            values: [ NAN, ERROR ]
        return: i32
```
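
To illustrate how a consumer might honor the `on_domain_error` option for the `fp32` implementation, here is a minimal Python sketch. The names `to_fp32` and `double_fp32` are hypothetical and not part of Substrait or any Substrait library; the sketch assumes a "domain error" means the doubled value no longer fits in a 32-bit float, as the implementation note describes.

```python
import math
import struct


def to_fp32(value: float) -> float:
    """Round a Python float to fp32 (hypothetical helper).

    struct.pack raises OverflowError when a finite value is too large
    to be represented as a 32-bit float.
    """
    return struct.unpack("<f", struct.pack("<f", value))[0]


def double_fp32(x: float, on_domain_error: str = "ERROR") -> float:
    """Double x, applying the on_domain_error option on fp32 overflow."""
    if on_domain_error not in ("NAN", "ERROR"):
        raise ValueError(f"unknown option value: {on_domain_error}")
    try:
        return to_fp32(2.0 * x)
    except OverflowError as exc:
        if on_domain_error == "NAN":
            return math.nan
        raise ValueError("result is outside the fp32 range") from exc


print(double_fp32(1.5))
print(double_fp32(3e38, on_domain_error="NAN"))
```

The `i32` implementation would follow the same pattern with an integer range check in place of the float conversion.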
7 changes: 3 additions & 4 deletions site/docs/relations/basics.md
Expand Up @@ -70,9 +70,8 @@ A guarantee that data output from this operation is provided with a sort order.



## Discussion Points
???+ question "Discussion Points"

* Do we try to make read definition types more extensible à la function signatures? Is that necessary if we have a custom relational operator?
* How do we express decomposed types. For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.
* We currently include a "generic properties" property on read type. Do we want this dumping ground?
* Should [read definition types](/relations/logical_relations/#read-definition-types) be more extensible in the same way that function signatures are? Are extensible read definition types necessary if we have custom relational operators?
* How are decomposed reads expressed? For example, the Iceberg type above is for early logical planning. Once we do some operations, it may produce a list of Iceberg file reads. This is likely a secondary type of object.

18 changes: 13 additions & 5 deletions site/docs/relations/logical_relations.md
Expand Up @@ -42,7 +42,11 @@ A filter expression must be interpreted against the direct schema before the pro

### Read Definition Types

Read definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly.
???+ info inline end "Adding new Read Definition Types"

If you have a read definition that's not covered here, see the [process for adding new read definition types](/spec/extending).

Read definition types (like the rest of the features in Substrait) are built by the community and added to the specification.

#### Virtual Table

Expand Down Expand Up @@ -394,7 +398,11 @@ The write operator is an operator that consumes one output and writes it to stor

### Write Definition Types

Write definition types are built by the community and added to the specification. This is a portion of specification that is expected to grow rapidly.
???+ info inline end "Adding new Write Definition Types"

If you have a write definition that's not covered here, see the [process for adding new write definition types](/spec/extending).

Write definition types are built by the community and added to the specification.


=== "WriteRel Message"
Expand All @@ -420,7 +428,7 @@ Write definition types are built by the community and added to the specification
| Format | Enumeration of available formats. Only current option is PARQUET. | Required |


## DDL Operator
## DDL (Data Definition Language) Operator

The operator that defines modifications of a database schema (CREATE/DROP/ALTER for TABLE and VIEWS).

Expand Down Expand Up @@ -449,6 +457,6 @@ The operator that defines modifications of a database schema (CREATE/DROP/ALTER
%%% proto.algebra.DdlRel %%%
```

## Discussion Points
???+ question "Discussion Points"

* How to handle correlated operations?
* How should correlated operations be handled?
2 changes: 1 addition & 1 deletion site/docs/relations/physical_relations.md
Expand Up @@ -27,7 +27,7 @@ The hash equijoin join operator will build a hash table out of the right input b
| Join Type | One of the join types defined in the Join operator. | Required |


## NLJ Operator
## NLJ (Nested Loop Join) Operator

The nested loop join operator does a join by holding the entire right input and then iterating over it using the left input, evaluating the join expression on the Cartesian product of all rows, only outputting rows where the expression is true. It also includes non-matching rows in the OUTER, LEFT and RIGHT operations, per the join type requirements.
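
The behavior described above can be sketched in Python as follows. This is an illustrative sketch, not a Substrait implementation: the function name `nested_loop_join`, the string join type names, and the choice to emit `(left_row, right_row)` pairs with `None` standing in for a missing side are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

Row = tuple
Pair = Tuple[Optional[Row], Optional[Row]]


def nested_loop_join(
    left: Iterable[Row],
    right: Iterable[Row],
    predicate: Callable[[Row, Row], bool],
    join_type: str = "INNER",
) -> Iterator[Pair]:
    """Yield (left_row, right_row) pairs; None marks a non-matching side."""
    if join_type not in {"INNER", "LEFT", "RIGHT", "OUTER"}:
        raise ValueError(f"unsupported join type: {join_type}")

    right_rows = list(right)  # hold the entire right input
    right_matched = [False] * len(right_rows)

    for left_row in left:
        matched = False
        # Evaluate the join expression against every right row.
        for i, right_row in enumerate(right_rows):
            if predicate(left_row, right_row):
                matched = True
                right_matched[i] = True
                yield (left_row, right_row)
        if not matched and join_type in {"LEFT", "OUTER"}:
            yield (left_row, None)

    # Right rows that never matched are emitted last for RIGHT/OUTER.
    if join_type in {"RIGHT", "OUTER"}:
        for right_row, was_matched in zip(right_rows, right_matched):
            if not was_matched:
                yield (None, right_row)


left_input = [(1,), (2,), (3,)]
right_input = [(2,), (3,), (4,)]
same_key = lambda l, r: l[0] == r[0]
for pair in nested_loop_join(left_input, right_input, same_key, "OUTER"):
    print(pair)
```

Because the predicate is an arbitrary function of both rows, this approach works for non-equality join expressions, at the cost of evaluating every left/right combination.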

2 changes: 1 addition & 1 deletion site/docs/serialization/binary_serialization.md
Expand Up @@ -5,7 +5,7 @@ Substrait can be serialized into a [protobuf](https://developers.google.com/prot

## Plan

The main top-level object used to communicate a Substrait plan using protobuf is a Plan message. The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.
The main top-level object used to communicate a Substrait plan using protobuf is a Plan message (see [ExtendedExpression](/expressions/extended_expression/) for an alternative top-level object). The plan message is composed of a set of data structures that minimize repetition in the serialization along with one (or more) Relation trees.

=== "Plan Message"

1 change: 1 addition & 0 deletions site/docs/spec/_config
Expand Up @@ -2,3 +2,4 @@ arrange:
- versioning.md
- specification.md
- technology_principles.md
- extending.md
49 changes: 49 additions & 0 deletions site/docs/spec/extending.md
@@ -0,0 +1,49 @@
# Extending

Substrait is a community project and requires consensus about new additions to the specification in order to maintain consistency. The best way to get consensus is to discuss ideas. The main ways to communicate are:

* Substrait Mailing List
* Substrait Slack
* Community Meeting

## Minor changes

Simple changes like typos and bug fixes do not require as much effort. [File an issue](https://github.com/substrait-io/substrait/issues) or [send a PR](https://github.com/substrait-io/substrait/pulls) and we can discuss it there.

## Complex changes

For complex features, it is useful to discuss the change first. Gathering some background information beforehand helps get everyone on the same page.

### Outline the issue

#### Language

Every engine has its own terminology. Every Spark user probably knows what an "attribute" is, and Velox users will know what a "RowVector" means. However, Substrait is used by people who come from a variety of backgrounds, and you should generally assume that its users do not know anything about your own implementation. As a result, all PRs and discussion should endeavor to use Substrait terminology wherever possible.

#### Motivation

What problems does this relation solve? If it is a more logical relation, how does it allow users to express new capabilities? If it is more of an internal relation, how does it map to existing logical relations? How is it different from other existing relations? Why do we need this?

#### Examples

Provide example input and output for the relation. Show example plans. Try to motivate your examples, as best as possible, with something that looks like a real-world problem. These will go a long way towards helping others understand the purpose of a relation.

#### Alternatives

Discuss what alternatives are out there. Are there other ways to achieve similar results? Do some systems handle this problem differently?

### Survey existing implementation

It's unlikely that this is the first time that this has been done. Figuring out

### Prototype the feature

Novel approaches should be implemented as an extension first.

### Substrait design principles

Substrait is designed around interoperability, so a feature only used by a single system may not be accepted. But don't despair! Substrait has a highly developed extension system for this express purpose.

### You don't have to do it alone

If you are hoping to add a feature and these criteria seem intimidating then feel free to start a mailing list discussion before you have all the information and ask for help. Investigating other implementations, in particular, is something that can be quite difficult to do on your own.
8 changes: 7 additions & 1 deletion site/docs/tools/third_party_tools.md
Expand Up @@ -3,4 +3,10 @@
## Substrait-tools
The [substrait-tools](https://pypi.org/project/substrait-tools/) python package provides
a command line interface for producing/consuming substrait plans by leveraging the APIs
from different producers and consumers.
from different producers and consumers.

## Substrait Fiddle
[Substrait Fiddle](https://substrait-fiddle.com) is an online tool to share, debug, and prototype Substrait plans.

The [Substrait Fiddle source](https://github.com/voltrondata/substrait-fiddle) is available, so it can be run in any environment.

31 changes: 21 additions & 10 deletions site/docs/types/type_parsing.md
Expand Up @@ -10,26 +10,37 @@ The components of this expression are:

| Component | Description | Required |
| ---------------------- | ------------------------------------------------------------ | ------------------------------------- |
| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent). | |
| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a type name is appended with a question mark. | Optional, defaults to non-nullable |
| Name | Each type has a name. A type is expressed by providing a name. This name can be expressed in arbitrary case (e.g. `varchar` and `vArChAr` are equivalent) although lowercase is preferred. | |
| Nullability indicator | A type is either non-nullable or nullable. To express nullability, a question mark is added after the type name (before any parameters). | Optional, defaults to non-nullable |
| Variation | When expressing a type, a user can define the type based on a type variation. Some systems use type variations to describe different underlying representations of the same data type. This is expressed as a bracketed integer such as [2]. | Optional, defaults to [0] |
| Parameters | Compound types may have one or more configurable properties. The two main types of properties are integer and type properties. The parameters for each type correspond to a list of known properties associated with a type as declared in the order defined in the type specification. For compound types (types that contain types), the data type syntax will include nested type declarations. The one exception is structs, which are further outlined below. | Required where parameters are defined |

### Grammars

It is relatively easy in most languages to produce simple parser & emitters for the type syntax. To make that easier, Substrait also includes an ANTLR [impl pending] grammar to ease consumption and production of types.
It is relatively easy in most languages to produce simple parsers and emitters for the type syntax. To make that easier, Substrait also includes an [ANTLR grammar](https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/parser/grammar/SubstraitPlanParser.g4#L108) to ease consumption and production of types. (The grammar also supports an entire language for representing plans as text.)

### Structs & Named Structs

Structs are unique from other types because they have an arbitrary number of parameters. The parameters can also include one or two subproperties. Struct parsing is thus declared in the following two ways:
Structs are unique from other types because they have an arbitrary number of parameters. The parameters are recursive and may include their own subproperties. Struct parsing is declared in the following two ways:

```
# Struct
struct?[variation]<type0, type1,..., typeN>
=== "YAML"

# Named Struct
nstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>
```
```
# Struct
struct?[variation]<type0, type1,..., typeN>

# Named Struct
nstruct?[variation]<name0:type0, name1:type1,..., nameN:typeN>
```

=== "Text Format Examples"

```
// Struct
struct?<string, i8, i32?, timestamp_tz>

// Named structs are not yet supported in the text format.
```

In the normal (non-named) form, struct declares a set of types that are fields within that struct. In the named struct form, the parameters are formed by tuples of names + types, delineated by a colon. Names that are composed only of numbers and letters can be left unquoted. For other characters, names should be quoted with double quotes and use backslash for double-quote escaping.
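
As a rough illustration of how simple such a parser can be, the following Python sketch handles the `name?[variation]<parameters>` form for integer and type parameters. It is not the ANTLR grammar referenced above: `ParsedType`, `tokenize`, and `parse_type` are hypothetical names, and named structs and quoted names are deliberately left out.

```python
import re
from dataclasses import dataclass, field

# Identifiers, integers, or one of the punctuation characters ? [ ] < > ,
_TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|\d+|[?\[\]<>,])")


@dataclass
class ParsedType:
    name: str
    nullable: bool = False
    variation: int = 0
    parameters: list = field(default_factory=list)


def tokenize(text: str) -> list:
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse(tokens: list, i: int):
    if not (tokens[i][0].isalpha() or tokens[i][0] == "_"):
        raise ValueError(f"expected a type name, got {tokens[i]!r}")
    result = ParsedType(name=tokens[i].lower())  # names are case-insensitive
    i += 1
    if i < len(tokens) and tokens[i] == "?":  # nullability indicator
        result.nullable = True
        i += 1
    if i < len(tokens) and tokens[i] == "[":  # variation, e.g. [2]
        result.variation = int(tokens[i + 1])
        if tokens[i + 2] != "]":
            raise ValueError("expected ']' after variation")
        i += 3
    if i < len(tokens) and tokens[i] == "<":  # integer or nested type parameters
        i += 1
        while True:
            if tokens[i].isdigit():
                result.parameters.append(int(tokens[i]))
                i += 1
            else:
                nested, i = _parse(tokens, i)
                result.parameters.append(nested)
            if tokens[i] == ",":
                i += 1
            elif tokens[i] == ">":
                i += 1
                break
            else:
                raise ValueError(f"expected ',' or '>', got {tokens[i]!r}")
    return result, i


def parse_type(text: str) -> ParsedType:
    tokens = tokenize(text)
    try:
        result, end = _parse(tokens, 0)
    except IndexError:
        raise ValueError("unexpected end of type expression") from None
    if end != len(tokens):
        raise ValueError(f"unexpected trailing token {tokens[end]!r}")
    return result


print(parse_type("struct?<string, i8, i32?, timestamp_tz>"))
```

Supporting `nstruct` would mean accepting `name:type` pairs (and quoted names) inside the parameter list, which this sketch does not attempt.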

3 changes: 2 additions & 1 deletion site/docs/types/type_system.md
Expand Up @@ -13,4 +13,5 @@ Substrait types fundamentally consist of four components:

Refer to [Type Parsing](type_parsing.md) for a description of the syntax used to describe types.

Note that Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md).
!!! note "Note"
Substrait employs a strict type system without any coercion rules. All changes in types must be made explicit via [cast expressions](../expressions/specialized_record_expressions.md).