forked from tidyverse/design
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcs-stringr.qmd
58 lines (43 loc) · 2.1 KB
/
cs-stringr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
# Case study: stringr
```{r}
#| include = FALSE
source("common.R")
```
```{=html}
<!--
https://github.com/wch/r-source/blob/trunk/src/main/grep.c#L891-L1151 -->
```
## Pattern engines
`grepl()`, has three arguments that take either `FALSE` or `TRUE`: `ignore.case`, `perl`, `fixed`, which might suggest that there are 2 \^ 8 = 16 possible options.
However, a number of combinations are not allowed:
```{r}
x <- grepl("a", letters, fixed = TRUE, ignore.case = TRUE)
x <- grepl("a", letters, fixed = TRUE, perl = TRUE)
```
The other problem is that `ignore.case` only works with two of the three engines: POSIX and perl.
This is hard to remedy without creating a completely new matching engine for fixed case, which is particularly hard because different languages have different rules for case.
stringr takes a different approach, encoding the engine as an attribute of the pattern:
```{r}
library(stringr)
x <- str_detect(letters, "a")
# short for:
x <- str_detect(letters, regex("a"))
# Which is where you supply additional arguments
x <- str_detect(letters, regex("a", ignore_case = TRUE))
```
This has the advantage that each engine can take different arguments, but has the disadvantage that it's harder to know that these options exist because you have to travel at least another level in the documentation.
An alternative approach would be to have a separate engine argument:
```{r}
#| eval = FALSE
x <- str_detect(letters, "a", engine = regex())
x <- str_detect(letters, "a", engine = fixed())
x <- str_detect(letters, "a", engine = coll())
```
This approach is a bit more discoverable (because there's clearly another argument that affects the pattern), but it's slightly less general, because of the `boundary()` engine, which doesn't match patterns but boundaries:
```{r}
#| eval = FALSE
x <- str_detect(letters, boundary("word"))
# Seems confusing: now you can omit the pattern argument?
x <- str_detect(letters, engine = boundary("word"))
```
It would also mean that you had an argument `engine`, that affected how another argument, `pattern`, was interpreted, so it would repeat the problem in a slightly different form.