forked from tidyverse/design
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcs-rep.qmd
164 lines (120 loc) · 5.09 KB
/
cs-rep.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
depends_on:
- defs-required
- args-independence
---
# Case study: `rep()` {#sec-cs-rep}
```{r}
#| include = FALSE
source("common.R")
```
## What does `rep()` do?
`rep()` is an extremely useful base R function that repeats a vector `x` in various ways.
It has three details arguments: `times`, `each`, and `length.out`[^cs-rep-1] that interact in complicated ways.
Let's explore the basics first:
[^cs-rep-1]: Note that the function specification is `rep(x, ...)`, and `times`, `each`, and `length.out` do not appear explicitly.
You have to read the documentation to discover these arguments.
```{r}
x <- c(1, 2, 4)
rep(x, times = 3)
rep(x, length.out = 10)
```
`times` and `length.out` replicate the vector in the same way, but `length.out` allows you to specify a non-integer number of replications.
If you specify both, `length.out` wins.
```{r}
rep(x, times = 3, length.out = 10)
```
The `each` argument repeats individual components of the vector rather than the whole vector:
```{r}
rep(x, each = 3)
```
And you can combine that with `times`:
```{r}
rep(x, each = 3, times = 2)
```
If you supply a vector to `times` it works a similar way to `each`, repeating each component the specified number of times:
```{r}
rep(x, times = x)
```
## What makes this function hard to understand?
- There's a complicated dependency between `times`, `length.out`, and `each`.
`times` and `length.out` both control the same underlying variable in different ways, and you can not set them simultaneously.
`times` and `each` are mostly independent, but if you specify a vector for `times` you can't use each.
```{r}
#| error = TRUE
rep(1:3, times = c(2, 2, 2), each = 2)
```
- I think using `times` with a vector is confusing because it switches from replicating the whole vector to replicating individual values of the vector, like `each` usually does.
```{r}
rep(1:3, each = 2)
rep(1:3, times = 2)
rep(1:3, times = c(2, 2, 2))
```
I think these two problems have the same underlying cause: `rep()` is trying to do too much in a single function.
I think we can make things simpler by turning `rep()` into two functions: one that replicates the full vector, and one that replicates each element of the vector.
## How might we improve the situation?
Two create two new functions, we need to first come up with names: I like `rep_each()` and `rep_full()`.
`rep_each()` was a fairly easy name to come up with.
`rep_full()` was a little harder and took a few iterations: I like that `full` has the same number of letters as `each`, which makes the two functions look like they belong together.
Next, we need to think about their arguments.
Both will have a single data argument: `x`, the vector to replicate.
`rep_each()` has a single details argument which specifies the number of times to replicate each element.
`rep_time()` has two mutually exclusive details arguments, the number of times to repeat the whole vector, or the desired length of the output.
What should we call the arguments?
We've already captured the different replication strategies (each vs. full) in the function name, so I think the argument that specifies the number of times to replicate can be the same, and `times` seems reasonable.
For the second argument to `rep_full()`, I draw inspiration from `rep()` which uses `length.out`.
I think it's obvious that the argument controls the output, so `length` is adequate.
```{r}
rep_each <- function(x, times) {
times <- rep(times, length.out = length(x))
rep(x, times = times)
}
rep_full <- function(x, times, length) {
rlang::check_exclusive(times, length)
if (!missing(length)) {
rep(x, length.out = length)
} else {
rep(x, length.out = times * base::length(x))
}
}
```
(Note the downside of using `length` as the argument name: we have to call [`base::length()`](https://rdrr.io/r/base/length.html) to avoid evaluating the missing `length` when times is supplied.)
```{r}
x <- c(1, 2, 4)
rep_each(x, times = 2)
rep_full(x, times = 2)
rep_each(x, times = x)
rep_full(x, length = 5)
```
One downside of this approach is if you want to both replicate each component *and* the entire vector, you have to use two function calls, which is much more verbose than the `rep()` equivalent.
However, I don't think this is a terribly common use case, and so I think a longer call is more readable.
## Dealing with bad inputs
The implementations above work well for correct inputs, but will also work without error for a number of incorrect inputs:
```{r}
rep_full(1:3, 1:3)
```
Need to think about the types
```{r}
rep_each <- function(x, times) {
times <- vctrs::vec_cast(times, integer())
times <- vctrs::vec_recycle(times, vctrs::vec_size(x), x_arg = "times")
rep.int(x, times)
}
rep_full <- function(x, times, length) {
rlang::check_exclusive(times, length)
if (!missing(length)) {
rlang:::check_number_whole(length)
rep(x, length.out = length)
} else {
rlang:::check_number_decimal(times)
rep(x, length.out = times * base::length(x))
}
}
```
```{r}
#| error = TRUE
rep_each(1:3, 1:2)
rep_each(1:3, "x")
rep_full(1:3, "x")
rep_full(1:3, c(1, 2))
```