Inconsistent project behaviour #262

Open
thisisnic opened this issue Mar 14, 2023 · 1 comment

Comments

@thisisnic
Contributor

I'm getting inconsistent behaviour depending on whether I supply all fields in a substrait_project() or not (and differing behaviour between Arrow and DuckDB) to the point where it's unclear what the expected behaviour should be.

library(dplyr)
library(substrait)

# y/x/z 
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       y     x     z
#>   <int> <int> <dbl>
#> 1     4     1     2
#> 2     5     2     3
#> 3     6     3     4

# x/y/z
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(y, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     4     2
#> 2     2     5     3
#> 3     3     6     4

# x/y/z
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, y, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     4     2
#> 2     2     5     3
#> 3     3     6     4

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(z = x + 1) %>%
  collect()
#> Error: Invalid: Invalid emit case
#> /home/nic2/arrow/cpp/src/arrow/engine/substrait/serde.cc:157  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(x, z = x + 1) %>%
  collect()
#> Error: Binder Error: Positional reference 3 out of range (total 2 columns)

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(y, z = x + 1) %>%
  collect()
#> Error: Binder Error: Positional reference 3 out of range (total 2 columns)

# success
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(x, y, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     4     2
#> 2     2     5     3
#> 3     3     6     4

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(z = x + 1) %>%
  collect()
#> Error: Binder Error: Positional reference 2 out of range (total 1 columns)

@paleolimbot
Contributor

The behaviour of substrait_project() is definitely strange here! In general, substrait_project() always appends columns (rather than replacing them), but I forget the details, and the emit we generate does look wrong:

library(dplyr, warn.conflicts = FALSE)
library(substrait, warn.conflicts = FALSE)

projected <- tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, z = x + 1)

# Seems correct: we append two fields to the output
projected$rel$project$expressions
#> [[1]]
#> message of type 'substrait.Expression' with 1 field set
#> selection {
#>   direct_reference {
#>     struct_field {
#>     }
#>   }
#>   root_reference {
#>   }
#> }
#> 
#> [[2]]
#> message of type 'substrait.Expression' with 1 field set
#> scalar_function {
#>   function_reference: 2
#>   output_type {
#>     fp64 {
#>       nullability: NULLABILITY_NULLABLE
#>     }
#>   }
#>   arguments {
#>     value {
#>       selection {
#>         direct_reference {
#>           struct_field {
#>           }
#>         }
#>         root_reference {
#>         }
#>       }
#>     }
#>   }
#>   arguments {
#>     value {
#>       literal {
#>         fp64: 1
#>       }
#>     }
#>   }
#>   options {
#>     name: "overflow"
#>     preference: "SILENT"
#>   }
#> }

# Incorrect: the emit should be 0, 1, 2, 3
projected$rel$project$common$emit
#> message of type 'substrait.RelCommon.Emit' with 1 field set
#> output_mapping: 1
#> output_mapping: 2
#> output_mapping: 3

# Incorrect: the names should be x, y, x, z
projected$schema$names
#> [1] "y" "x" "z"
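
To make the emit mismatch concrete, here is a minimal base R sketch of how an emit mapping picks columns. It assumes the project relation's direct output is the input fields followed by the projected expressions, and that `output_mapping` holds zero-based indices into that output. `apply_emit()` is a hypothetical helper written for illustration; it is not part of the substrait package.

```r
# Hypothetical helper (not a substrait function): apply an emit
# output_mapping to the direct output of a project relation.
apply_emit <- function(input_names, expression_names, output_mapping) {
  # Direct output: every input field, followed by every projected expression
  direct_output <- c(input_names, expression_names)
  # output_mapping is zero-based, R vectors are one-based
  direct_output[output_mapping + 1L]
}

input_names <- c("x", "y")
expression_names <- c("x", "z") # from substrait_project(x, z = x + 1)

# The mapping currently generated (1, 2, 3) drops the first input field
apply_emit(input_names, expression_names, c(1L, 2L, 3L))
#> [1] "y" "x" "z"

# The mapping expected under append semantics (0, 1, 2, 3)
apply_emit(input_names, expression_names, c(0L, 1L, 2L, 3L))
#> [1] "x" "y" "x" "z"
```

Under those assumptions, the 1, 2, 3 mapping reproduces the y/x/z column order seen in the first Arrow example.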

In the meantime, I think maybe you could use substrait_select()?

library(dplyr)
library(substrait)

# x/z
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_select(x, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 2
#>       x     z
#>   <int> <dbl>
#> 1     1     2
#> 2     2     3
#> 3     3     4

Created on 2023-03-14 with reprex v2.0.2
