From dc2ff4aaedcef76dd8f1c3d7a978c7546224733c Mon Sep 17 00:00:00 2001 From: Weston Pace Date: Fri, 4 Oct 2024 04:56:02 -0700 Subject: [PATCH] docs: add two frequently asked questions to the FAQ (#719) These two questions come up often on Slack / other channels and I felt it was time to update the FAQ. --------- Co-authored-by: Bryce Mecum --- site/docs/faq.md | 43 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 42 insertions(+), 1 deletion(-) diff --git a/site/docs/faq.md b/site/docs/faq.md index 453d76ebc..c6859c67a 100644 --- a/site/docs/faq.md +++ b/site/docs/faq.md @@ -1,9 +1,50 @@ --- title: FAQ --- -# Frequently Asked Question + +# Frequently Asked Questions ## What is the purpose of the post-join filter field on Join relations? + The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join. See the example [here](https://facebookincubator.github.io/velox/develop/joins.html#hash-join-implementation) that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join. + +## Why does the project relation keep existing columns? + +In several relational algebra systems ([DuckDB](https://duckdb.org/), [Velox](https://velox-lib.io/), [Apache Spark](https://spark.apache.org/), [Apache DataFusion](https://datafusion.apache.org/), etc.) the project relation is used both +to add new columns and remove existing columns. It is defined by a list of expressions and there is one output +column for each expression. + +In Substrait, the project relation is only used to add new columns. Any relation can remove columns by using the +`emit` property in `RelCommon`. This is because it is very common for optimized plans to discard columns once they +are no longer needed and this can happen anywhere in a plan. If this discard required a project relation then +optimized plans would be cluttered with project relations that only remove columns. + +As a result, Substrait's project relation is a little different. It is also defined by a list of expressions. +However, the output columns are a combination of the input columns and one column for each of the expressions. + +## Where are field names represented? + +Some relational algebra systems, such as Spark, give names to the output fields of a relation. For example, in +PySpark I might run `df.withColumn("num_chars", length("text")).filter("num_chars > 10")`. This creates a +project relation, which calculates a new field named `num_chars`. This field is then referenced in the filter +relation. Spark's logical plan maps closely to this and includes both the expression (`length("text")`) and the +name of the output field (`num_chars`) in its project relation. + +Substrait does not name intermediate fields in a plan. This is because these field names have no effect on +the computation that must be performed. In addition, it opens the door to name-based references, which Substrait +also does not support, because these can be a source of errors and confusion. One of the goals of Substrait is +to make it very easy for consumers to understand plans. All references in Substrait are done with ordinals. + +In order to allow plans that do use named fields to round-trip through Substrait there is a hint that can be +used to add field names to a plan. This hint is called `output_names` and is located in `RelCommon`. Consumers +should not rely on this hint being present in a plan but, if present, it can be used to provide field names to +intermediate relations in a plan for round-trip or debugging purposes. + +There are a few places where Substrait DOES define field names: + +- Read relations have field names in the base schema. This is because it is quite common for reads to do a + name-based lookup to determine the columns that need to be read from source files. +- The root relation has field names. This is because the root relation is the final output of the plan and + it is useful to have names for the fields in the final output.