You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, if distinct is used with .keep_all = TRUE, ROW_NUMBER is used. Which is normal dbplyr behavior for generics. But in my daily uses, I found DISTINCT ON is much faster than filtering a window function of a subquery.
library(duckdb)
#> Loading required package: DBI
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
conn1 <- DBI::dbConnect(duckdb())
duckdb_register(conn1, "iris", iris)
tbl_iris <- tbl(conn1, "iris")
# Current behavior
tbl_iris |>
distinct(
Petal.Width,
Species,
.keep_all = TRUE
) |>
show_query()
#> <SQL>
#> SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", Species
#> FROM (
#> SELECT
#> iris.*,
#> ROW_NUMBER() OVER (PARTITION BY "Petal.Width", Species ORDER BY "Sepal.Length") AS col01
#> FROM iris
#> ) q01
#> WHERE (col01 = 1)
dbDisconnect(conn1)
rm(conn1, tbl_iris)
Currently, if
distinct
is used with.keep_all = TRUE
,ROW_NUMBER
is used. Which is normaldbplyr
behavior for generics. But in my daily uses, I foundDISTINCT ON
is much faster than filtering a window function of a subquery.Created on 2024-09-18 with reprex v2.1.1
The text was updated successfully, but these errors were encountered: