-
Notifications
You must be signed in to change notification settings - Fork 53
/
inputs-explicit.qmd
112 lines (81 loc) · 5.96 KB
/
inputs-explicit.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
# Make inputs explicit {#sec-inputs-explicit}
```{r}
#| include = FALSE
source("common.R")
```
## What's the problem?
A function is easier to understand if its output depends only on its inputs (i.e. its arguments).
If a function returns different results with the same inputs, then some inputs must be implicit, typically because the function relies on an option or some locale setting.
Implicit inputs are not always bad, as some functions like `Sys.time()`, `read.csv()`, and the random number generators, fundamentally depend on them.
But they should be used as sparingly as possible, and never when not related to the core purpose of the function.
Explicit arguments make code easier to understand because you can see what will affect the outputs just by reading the code; you don't need to run it.
Implicit arguments can lead to code that returns different results on different computers, and the differences are usually hard to track down.
## What are some examples?
One common source of hidden arguments is the use of global options:
- Historically, the worst offender was the `stringsAsFactors` option which changed how a number of functions[^inputs-explicit-1] treated character vectors.
This option was part of a multi-year procedure to move R away toward character vectors and away from vectors.
You can learn more in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) by Roger Peng and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.
- `lm()`'s handling of missing values depends on the global option of `na.action`.
The default is `na.omit` which drops the missing values prior to fitting the model (which is inconvenient because then the results of `predict()` don't line up with the input data.
[^inputs-explicit-1]: Such as `data.frame()`, `as.data.frame()`, and `read.csv()`
Another common source of subtle bugs is relying on the system **locale**, i.e. the country and language specific settings controlled by your operating system.
Relying on the system locale is always done with the best of intentions (you want your code to respect the user's preferences) but can lead to subtle differences when the same code is run by different people.
Here are a few examples:
- `strptime()` relies on the names of weekdays and months in the current locale.
That means `strptime("1 Jan 2020", "%d %b %Y")` will work on computers with an English locale, and fail elsewhere.
- `as.POSIXct()` depends on the current timezone.
The following code returns different underlying times when run on different computers:
```{r}
as.POSIXct("2020-01-01 09:00")
```
- `toupper()` and `tolower()` depend on the current locale.
It is fairly uncommon for this to cause problems because most languages either use their own character set, or use the same rules for capitalisation as English.
However, this behaviour did cause a bug in ggplot2 because internally it takes `geom = "identity"` and turns it into `GeomIdentity` to find the object that actually does computation.
In Turkish, however, the upper case version of i is İ, and `Geomİdentity` does not exist.
This meant that for some time ggplot2 did not work on Turkish computers.
- `sort()` and `order()` rely on the lexicographic order (i.e. how different alphabets sort their letters) defined by the current locale.
`lm()` automatically converts character vectors to factors with `factor()`, which uses `order()`, which means that it's possible for the coefficients to vary[^inputs-explicit-2] if your code is run in a different country!
[^inputs-explicit-2]: Predictions and other diagnostics won't be affected, but you're likely to be surprised that your coefficients are different.
## How can I remediate the problem?
At some level, implicit inputs are easy to avoid when creating new functions: just don't use the locale or global options!
But it's easy for such problems to creep in indirectly, when you call a function not knowing that it has hidden inputs.
The best way to prevent that is to consult the list of common offenders provided above.
### Make an option explicit
If you want depend on an option or locale, make sure it's an explicit argument.
Such arguments generally should not affect computation (@sec-def-user), just side-effects like printed output or status messages.
If they do affect results, follow @sec-def-inform to make sure the user knows what's happening.
For example, lets take `as.POSIXct()` which basically looks something like this:
```{r}
as.POSIXct <- function(x, tz = "") {
base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
```
The `tz` argument is present, but it's not obvious that `""` means the current time zone.
Let's first make that explicit:
```{r}
as.POSIXct <- function(x, tz = Sys.timezone()) {
base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
```
Since this is an important default whose value can change, we also print it out if the user hasn't explicitly set it:
```{r}
as.POSIXct <- function(x, tz = Sys.timezone()) {
if (missing(tz)) {
message("Using `tz = \"", tz, "\"`")
}
base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
```
Since most people don't like lots of random output this provides a subtle incentive to supply the timezone:
```{r}
as.POSIXct("2020-01-01 09:00", tz = "America/Chicago")
```
### Temporarily adjust global state
If you're calling a function with implicit arguments and those implicit arguments are causing problems with your code, you can always work around them by temporarily changing the global state which it uses.
The easiest way to do so is to use the [withr](https://withr.r-lib.org) package, which provides a variety of tools to change temporarily change global state.
## See also
- @sec-def-user and @sec-def-inform: how to make an option as explicit as possible.
- @sec-spooky-action: where a function changes global state in a surprising way.