grepl binding for DuckDB doesn't work with regex included #267

Open
thisisnic opened this issue Apr 5, 2023 · 1 comment

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(substrait)
#> 
#> Attaching package: 'substrait'
#> The following object is masked from 'package:stats':
#> 
#>     filter

tbl <- tibble::tibble(x = c("apple pie", "pork pie", "pork chop"))

# dplyr - all good without regex
tbl %>% 
  filter(grepl("pie", x))
#> # A tibble: 2 × 1
#>   x        
#>   <chr>    
#> 1 apple pie
#> 2 pork pie

# substrait - all good without regex
tbl %>% 
  duckdb_substrait_compiler() %>%
  filter(grepl("pie", x)) %>%
  collect()
#> # A tibble: 2 × 1
#>   x        
#>   <chr>    
#> 1 apple pie
#> 2 pork pie

# dplyr - all good with regex
tbl %>% 
  filter(grepl("pie$", x))
#> # A tibble: 2 × 1
#>   x        
#>   <chr>    
#> 1 apple pie
#> 2 pork pie

# substrait - doesn't work with regex
tbl %>% 
  duckdb_substrait_compiler() %>%
  filter(grepl("pie$", x)) %>%
  collect()
#> # A tibble: 0 × 1
#> # … with 1 variable: x <chr>

Created on 2023-04-05 with reprex v2.0.2

thisisnic commented Apr 5, 2023

The Substrait spec doesn't state whether regex can or cannot be used in the functions, so this is something we want to raise there. I suspect that the contains() function, which has been bound to R's grepl() here, doesn't translate to something that allows regex. The Substrait string function spec has functions like count_substring and regexp_count_substring, which implies separate versions of a function depending on whether regex is allowed. There is currently no regexp_contains. We should probably open an issue on the Substrait repo and, in the meantime, use a workaround (e.g. could we perhaps bind regexp_count_substring(x) > 0 to grepl()?).
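
To illustrate the suspected semantics, here is a minimal base-R sketch. It is not substrait or DuckDB code, and the helper names (`literal_contains`, `regex_count`, `regex_contains`) are hypothetical. It assumes that contains() behaves like a literal substring test, which would explain the empty result above, and it emulates the proposed "regex match count > 0" workaround with `gregexpr()`.

```r
x <- c("apple pie", "pork pie", "pork chop")

# Literal substring test: roughly what a non-regex contains() is assumed to do.
# With fixed = TRUE, "$" is an ordinary character rather than an end anchor.
literal_contains <- function(pattern, x) {
  grepl(pattern, x, fixed = TRUE)
}

# Number of regex matches per element. gregexpr() returns -1 when there is
# no match, so only positive match positions are counted.
regex_count <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

# Stand-in for the proposed workaround: "count of regex matches > 0".
regex_contains <- function(pattern, x) {
  regex_count(pattern, x) > 0
}

literal_contains("pie", x)   # TRUE TRUE FALSE
literal_contains("pie$", x)  # FALSE FALSE FALSE, matching the zero-row result
regex_contains("pie$", x)    # TRUE TRUE FALSE, same as grepl("pie$", x)
```

If the assumption holds, binding grepl() to a count-based regex function would restore dplyr's behaviour for patterns like "pie$", although it may do more work than a dedicated regexp_contains would.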
