[R] summarise() returns incorrect results when there are >1 fields to return (Arrow) #252

Open
Tracked by #205
thisisnic opened this issue Feb 27, 2023 · 0 comments
Labels: arrow, bug, dplyr-summarise

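summarise() returns incorrect results with the Arrow compiler, but not with DuckDB, when it returns more than one field. In the ungrouped example below, max_hp comes back with the value of mean_hp. In the grouped example, the values land under the wrong column names: am holds the means and max_hp holds the am values.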

library(substrait)
library(dplyr)

mtcars %>%
  summarise(mean_hp = mean(hp, na.rm = TRUE), max_hp = max(hp, na.rm = TRUE))
#>    mean_hp max_hp
#> 1 146.6875    335

mtcars %>%
  duckdb_substrait_compiler() %>%
  summarise(mean_hp = mean(hp, na.rm = TRUE), max_hp = max(hp, na.rm = TRUE)) %>%
  collect()
#> # A tibble: 1 × 2
#>   mean_hp max_hp
#>     <dbl>  <dbl>
#> 1    147.    335

mtcars %>%
  arrow_substrait_compiler() %>%
  summarise(mean_hp = mean(hp, na.rm = TRUE), max_hp = max(hp, na.rm = TRUE)) %>%
  collect()
#> # A tibble: 1 × 2
#>   mean_hp max_hp
#>     <dbl>  <dbl>
#> 1    147.   147.

library(substrait)
#> 
#> Attaching package: 'substrait'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mtcars %>%
  group_by(am) %>%
  summarise(mean_hp = mean(hp, na.rm = TRUE), max_hp = max(hp, na.rm = TRUE))
#> # A tibble: 2 × 3
#>      am mean_hp max_hp
#>   <dbl>   <dbl>  <dbl>
#> 1     0    160.    245
#> 2     1    127.    335

mtcars %>%
  duckdb_substrait_compiler() %>%
  group_by(am) %>%
  summarise(mean_hp = mean(hp, na.rm = TRUE), max_hp = max(hp, na.rm = TRUE)) %>%
  collect()
#> # A tibble: 2 × 3
#>      am mean_hp max_hp
#>   <dbl>   <dbl>  <dbl>
#> 1     1    127.    335
#> 2     0    160.    245

mtcars %>%
  arrow_substrait_compiler() %>%
  group_by(am) %>%
  summarise(mean_hp = mean(hp, na.rm = TRUE), max_hp = max(hp, na.rm = TRUE)) %>%
  collect()
#> # A tibble: 2 × 3
#>      am mean_hp max_hp
#>   <dbl>   <dbl>  <dbl>
#> 1  127.    127.      1
#> 2  160.    160.      0

Created on 2023-02-27 with reprex v2.0.2
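
One way to pin down which columns are wrong is to compare each backend's result against reference values computed in base R. The following is a minimal sketch under that assumption, not part of substrait: `mismatched_columns()`, `expected_ungrouped` and `expected_grouped` are hypothetical names introduced here for illustration.

```r
# Reference values computed with base R only (no dplyr, no substrait).
expected_ungrouped <- data.frame(
  mean_hp = mean(mtcars$hp, na.rm = TRUE),
  max_hp  = max(mtcars$hp, na.rm = TRUE)
)

# Grouped reference: one row per value of `am`, in ascending order of `am`.
expected_grouped <- data.frame(
  am      = sort(unique(mtcars$am)),
  mean_hp = as.numeric(tapply(mtcars$hp, mtcars$am, mean, na.rm = TRUE)),
  max_hp  = as.numeric(tapply(mtcars$hp, mtcars$am, max, na.rm = TRUE))
)

# Hypothetical helper: returns the names of the columns in `result` whose
# values differ from `expected`. If `key` is given, both data frames are
# sorted by that column first, because backends may return groups in any order.
mismatched_columns <- function(result, expected, key = NULL) {
  result <- as.data.frame(result)
  if (!is.null(key)) {
    result   <- result[order(result[[key]]), , drop = FALSE]
    expected <- expected[order(expected[[key]]), , drop = FALSE]
  }
  cols <- names(expected)
  differs <- vapply(cols, function(col) {
    !isTRUE(all.equal(as.numeric(result[[col]]), as.numeric(expected[[col]])))
  }, logical(1))
  cols[differs]
}
```

Passing the collected output of each pipeline above to `mismatched_columns()`, with `key = "am"` for the grouped case, should return an empty character vector for a correct backend. Judging by the printed output, the ungrouped Arrow result would be expected to flag `max_hp`.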
