diff --git a/README.md b/README.md index 609398b..5881303 100755 --- a/README.md +++ b/README.md @@ -5,13 +5,12 @@ A complement to . ## Structure Title should be a command. -Keep it as short as possible. +Keep it as short as possible, and frame it positively (which you should do, not what you shouldn't do). Sections: -- What's the pattern/problem? - Brief description and motivation. - Include related problems as bulleted list. +- What's the pattern? + Brief description and why it's important. - What are some examples? Bulleted list of existing functions. @@ -19,12 +18,12 @@ Sections: Show results of code where useful. Goal is to include enough variety that everyone recognises at least one function, and can look up the docs for the details of the others. -- Why is it bad/useful? - More detailed motivation. - - What are the exceptions? - How to avoid/remediate/use it? Detailed explanation (with example) of how to prevent the problem, fix the problem, and/or use the pattern. +- See also. + Include related problems as bulleted list. + Case studies are useful for functions that need more explanation, have multiple problems, or need greater discussion of different trade-offs. diff --git a/_quarto.yml b/_quarto.yml index eb0547f..f22b12f 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -33,6 +33,7 @@ book: - dots-position.Rmd - args-independence.Rmd - cs-setNames.Rmd + - cs-stringr.Rmd - part: Default values chapters: diff --git a/args-data-details.Rmd b/args-data-details.Rmd index 49bf372..16a9505 100644 --- a/args-data-details.Rmd +++ b/args-data-details.Rmd @@ -17,10 +17,7 @@ However, there are a few general principles: - Optional arguments (i.e. arguments with a default) are the least important, and should come last. - If the function uses `…`, the optional arguments should come after `…`; see @sec-dots-position for more details. -## Why is it important? - This convention makes it easy to understand the structure of a function at a glance: the more important an argument is, the earlier you see it. - When the output is very strongly tied to an input, putting first also ensures that you function works well with the pipe, leading to code that focuses on the transformations rather than the object being transformed. I believe that ensuring the first argument is always the object being transformed is helps make stringr and purrr functions easier to learn than their base equivalents. diff --git a/args-hidden.Rmd b/args-hidden.Rmd index 0c319fe..3961790 100644 --- a/args-hidden.Rmd +++ b/args-hidden.Rmd @@ -1,4 +1,4 @@ -# Avoid hidden arguments {#sec-args-hidden} +# Make inputs explicit {#sec-args-hidden} ```{r} #| include = FALSE @@ -7,34 +7,38 @@ source("common.R") ## What's the problem? -Functions are easier to understand if the outputs depends only on the values of the input. -If a function unexpectedly returns different results with the same inputs, then we say it has **hidden arguments**. -Hidden arguments make code harder to reason about, because to correctly predict the output you also need to know some other state. +A function is easier to understand if it's output depends only on its arguments. +If a function returns different results with the same inputs, then it some inputs must be implicit. +Typically this is because the function relies on an option or environment variable. -It's hard to precisely characterise surprise, but it's an important part of this pattern as their are functions whose primary purpose is to return different values even with the same inputs: +Implicit inputs are not always bad, as there are some functions that rely on them in order to their job, like `Sys.time()`, `read.csv()`, and the random number generators. +But these difference are not the key part of the function's job then it's going to be surprising. -- `Sys.time()` depends on the system time, but it's not a surprise: getting the current time is to the whole point of the function! - -- `read.csv(path)` depends not on the value of `path` but the contents of the file at that location. - Reading from the file system necessarily implies that the results depend on the contents of the file, not its path, so this is not a surprise. - -- Random number generators like `runif()` peek at the value of the special global variable `.Random.seed`. - This is a little surprising, but if they didn't have some global state every call to `runif()` would return the same value. - -Related to hidden arguments (where a function depends on global state in a surprising way) is spooky action (@sec-spooky-action), where a function changes global state in a surprising way. +Explicit arguments make code easier to understand because all the inputs are explicit in the code; you don't need to run it in order to predict what it will do. +Functions with implicit inputs can also return different results on different computers, and the differences will be hard to track down because you have to look in many different places. ## What are some examples? One common source of hidden arguments is the use of global options. These can be useful to control display but, as discussed in @sec-def-user, should not affect computation: -- The result of `data.frame(x = "a")$x` depends on the value of the global `stringsAsFactors` option: if it's `FALSE` (the default) you get a character vector; if it's `TRUE`, you get a factor. +- Historically, the worst offender was the `stringsAsFactors` option which changed how a number of functions[^args-hidden-1] treated character vectors. + This option was part of a multiyear procedure to move R away toward charcter vectors and away from vectors. + You can learn more in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) by Roger Peng and [*stringsAsFactors = \*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. - `lm()`'s handling of missing values depends on the global option of `na.action`. The default is `na.omit` which drops the missing values prior to fitting the model (which is inconvenient because then the results of `predict()` don't line up with the input data. `modelr::na.warn()` provides an approach more in line with other base behaviours: it drops missing values with a warning.) -Another common source of hidden inputs is the system locale: +[^args-hidden-1]: Such as `data.frame()`, `as.data.frame()`, and `read.csv()` + +Allowing the system locale to affect the result of a function is a subtle source of bugs when sharing code between people who work in different countries. +These defaults on rarely cause problems because most languages that share the same writing system share (most of) the same collation rules. +The main exceptions tend to be European languages which have varying rules for modified letters, e.g. in Norwegian, å comes at the end of the alphabet. +However, when they do cause problems they will take a long time to track down: you're unlikely to expect that the coefficients of a linear model are different[^args-hidden-2] because your code is run in a different country! + +[^args-hidden-2]: You'll get different coefficients for a categorical predictor if the ordering means that a different levels comes first in the alphabet. + The predictions and other diagnostics won't be affected, but you're likely to be surprised that your coefficients are different. - `strptime()` relies on the names of weekdays and months in the current locale. That means `strptime("1 Jan 2020", "%d %b %Y")` will work on computers with an English locale, and fail elsewhere. @@ -57,23 +61,6 @@ Another common source of hidden inputs is the system locale: `factor()` uses `order()`, so the results from factor depend implicitly on the current locale. (This is not an imaginary problem as this [SO question](https://stackoverflow.com/questions/39339489) attests). -## Why is it important? - -Hidden arguments are bad because they make it much harder to predict the output of a function. -Historically, the worst offender was the `stringsAsFactors` option which changed how a number of functions[^args-hidden-1] treated character vectors. -This option was part of a multiyear procedure to move R away toward charcter vectors and away from vectors. -You can learn more in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) by Roger Peng and [*stringsAsFactors = \*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. - -[^args-hidden-1]: Such as `data.frame()`, `as.data.frame()`, and `read.csv()` - -Allowing the system locale to affect the result of a function is a subtle source of bugs when sharing code between people who work in different countries. -These defaults on rarely cause problems because most languages that share the same writing system share (most of) the same collation rules. -The main exceptions tend to be European languages which have varying rules for modified letters, e.g. in Norwegian, å comes at the end of the alphabet. -However, when they do cause problems they will take a long time to track down: you're unlikely to expect that the coefficients of a linear model are different[^args-hidden-2] because your code is run in a different country! - -[^args-hidden-2]: You'll get different coefficients for a categorical predictor if the ordering means that a different levels comes first in the alphabet. - The predictions and other diagnostics won't be affected, but you're likely to be surprised that your coefficients are different. - ## How can I remediate the problem? Generally, hidden arguments are easy to avoid when creating new functions: simply refrain from using environment variables (like the locale) or global options (like `stringsAsFactors`). @@ -112,3 +99,7 @@ as.POSIXct <- function(x, tz = Sys.timezone()) { } as.POSIXct("2020-01-01 09:00") ``` + +### See also + +- In hidden arguments, a function depends on global state in a surprising way; in spooky action (@sec-spooky-action), a function changes global state in a surprising way. diff --git a/args-independence.Rmd b/args-independence.Rmd index fdc7aac..c4e53de 100644 --- a/args-independence.Rmd +++ b/args-independence.Rmd @@ -1,4 +1,4 @@ -# Avoid dependencies between arguments {#sec-args-independence} +# Keep arguments indepedent {#sec-args-independence} ```{r} #| include = FALSE @@ -8,7 +8,20 @@ source("common.R") ## What's the problem? Avoid creating dependencies between details arguments so that only certain combinations are permitted. -Dependencies between arguments makes functions harder to use because you have to remember how arguments interact, and you when reading a call, you need to read multiple arguments before interpreting one. + +Dependencies between arguments makes functions harder: + +- It suggests that there are many more viable code paths than there really are and all those (unnecessary) possibilities still occupy head space. + You have to learn and the remember the set of allowed combinations, rather than them being implied by the structure of the function. + +- If one argument affects the operation of another argument, you have to read the entire function call before you can successfully understand it. + +- It increases implementation complexity. + Interdependence of arguments suggests complex implementation paths which are harder to analyse and test. + +- It makes documentation harder to write. + You have to use extra words to explain exactly how combinations of arguments work together, and it's not obvious where those words should go. + If there's an interaction between `arg_a` and `arg_b` do you document with `arg_a`, with `arg_b`, or with both? ## What are some examples? @@ -51,20 +64,6 @@ See @sec-args-mutually-exclusive) and @sec-args-compound for a two exceptions wh (Other examples for you to explore: `rate` and `scale` in `rgamma()`, `na.rm` and `use` in `var()`) -## Why is this important? - -Having complicated interdependencies between arguments has major downsides: - -- It suggests that there are many more viable code paths than there really are and all those (unnecessary) possibilities still occupy head space. - You have to learn and the remember the set of allowed combinations, rather than them being implied by the structure of the function. - -- It increases implementation complexity. - Interdependence of arguments suggests complex implementation paths which are harder to analyse and test. - -- It makes documentation harder to write. - You have to use extra words to explain exactly how combinations of arguments work together, and it's not obvious where those words should go. - If there's an interaction between `arg_a` and `arg_b` do you document with `arg_a`, with `arg_b`, or with both? - ## How do I remediate? Often these problems arise because the scope of a function grows over time. @@ -113,63 +112,3 @@ That has three advantages: - There's no way to supply both `n` and `prop`. - The `ties.method` argument would only appear in `fct_lump_n()` and `_prop()`, not `_smallest()`. - -### Case study: `grepl()` vs `stringr::str_detect()` - -```{=html} - -``` -`grepl()`, has three arguments that take either `FALSE` or `TRUE`: `ignore.case`, `perl`, `fixed`, which might suggest that there are 2 \^ 3 = 16 possible invocations. -However, a number of combinations are not allowed: - -```{r} -x <- grepl("a", letters, fixed = TRUE, ignore.case = TRUE) -x <- grepl("a", letters, fixed = TRUE, perl = TRUE) -``` - -Part of this problem could be resolved by making it more clear that this function provides three engines for matching text: - -- POSIX 1003.2 extended regular expressions (the default), -- Perl-style regular expressions (`perl = TRUE`) -- Fixed matching (`fixed = TRUE`). - -One way to make this more clear would be to use @sec-def-enum and create a new argument called something like `engine = c("POSIX", "perl", "fixed")`. - -The other problem is that `ignore.case` only works with two of the three engines: POSIX and perl. -This is hard to remedy without creating a completely new matching engine for fixed case, which is particularly hard because different languages have different rules for case. - -stringr takes a different approach, encoding the engine as an attribute of the pattern: - -```{r} -library(stringr) - -x <- str_detect(letters, "a") -# short for: -x <- str_detect(letters, regex("a")) - -# Which is where you supply additional arguments -x <- str_detect(letters, regex("a", ignore_case = TRUE)) -``` - -This has the advantage that each engine can take different arguments, but has the disadvantage that it's harder to know that these options exist because you have to travel at least another level in the documentation. - -An alternative approach would be to have a separate engine argument: - -```{r} -#| eval = FALSE -x <- str_detect(letters, "a", engine = regex()) -x <- str_detect(letters, "a", engine = fixed()) -x <- str_detect(letters, "a", engine = coll()) -``` - -This approach is a bit more discoverable (because there's clearly another argument that affects the pattern), but it's slightly less general, because of the `boundary()` engine, which doesn't match patterns but boundaries: - -```{r} -#| eval = FALSE -x <- str_detect(letters, boundary("word")) -# Seems confusing: now you can omit the pattern argument? -x <- str_detect(letters, engine = boundary("word")) -``` - -It would also mean that you had an argument `engine`, that affected how another argument, `pattern`, was interpreted, so it would repeat the problem in a slightly different form. diff --git a/cs-stringr.Rmd b/cs-stringr.Rmd new file mode 100644 index 0000000..a0a4c1e --- /dev/null +++ b/cs-stringr.Rmd @@ -0,0 +1,64 @@ +# Case study: stringr + +```{r} +#| include = FALSE +source("common.R") +``` + +```{=html} + +``` +`grepl()`, has three arguments that take either `FALSE` or `TRUE`: `ignore.case`, `perl`, `fixed`, which might suggest that there are 2 \^ 3 = 16 possible invocations. +However, a number of combinations are not allowed: + +```{r} +x <- grepl("a", letters, fixed = TRUE, ignore.case = TRUE) +x <- grepl("a", letters, fixed = TRUE, perl = TRUE) +``` + +Part of this problem could be resolved by making it more clear that this function provides three engines for matching text: + +- POSIX 1003.2 extended regular expressions (the default), +- Perl-style regular expressions (`perl = TRUE`) +- Fixed matching (`fixed = TRUE`). + +One way to make this more clear would be to use @sec-def-enum and create a new argument called something like `engine = c("POSIX", "perl", "fixed")`. + +The other problem is that `ignore.case` only works with two of the three engines: POSIX and perl. +This is hard to remedy without creating a completely new matching engine for fixed case, which is particularly hard because different languages have different rules for case. + +stringr takes a different approach, encoding the engine as an attribute of the pattern: + +```{r} +library(stringr) + +x <- str_detect(letters, "a") +# short for: +x <- str_detect(letters, regex("a")) + +# Which is where you supply additional arguments +x <- str_detect(letters, regex("a", ignore_case = TRUE)) +``` + +This has the advantage that each engine can take different arguments, but has the disadvantage that it's harder to know that these options exist because you have to travel at least another level in the documentation. + +An alternative approach would be to have a separate engine argument: + +```{r} +#| eval = FALSE +x <- str_detect(letters, "a", engine = regex()) +x <- str_detect(letters, "a", engine = fixed()) +x <- str_detect(letters, "a", engine = coll()) +``` + +This approach is a bit more discoverable (because there's clearly another argument that affects the pattern), but it's slightly less general, because of the `boundary()` engine, which doesn't match patterns but boundaries: + +```{r} +#| eval = FALSE +x <- str_detect(letters, boundary("word")) +# Seems confusing: now you can omit the pattern argument? +x <- str_detect(letters, engine = boundary("word")) +``` + +It would also mean that you had an argument `engine`, that affected how another argument, `pattern`, was interpreted, so it would repeat the problem in a slightly different form. diff --git a/def-required.Rmd b/def-required.Rmd index 79ed25a..9d50fde 100644 --- a/def-required.Rmd +++ b/def-required.Rmd @@ -9,13 +9,8 @@ source("common.R") An argument should have a default if and only if it's optional. In other words: the absence of a default implies than an argument is required; the presence of a default implies that an argument is optional. - -See @sec-def-short if an argument has a default, but it requires a complex calculation. - There are two exceptions: when a pair of arguments are mutually exclusive (i.e. you must supply one or the other but not both), or when you can supply one complex object instead of two or more simple objects. -## Why is it important? - This simple convention ensures that you can tell which arguments are optional and which arguments are required from a glance at the function signature. Otherwise you need to rely on the user carefully reading the documentation. diff --git a/def-short.Rmd b/def-short.Rmd index 543128c..5a777f1 100644 --- a/def-short.Rmd +++ b/def-short.Rmd @@ -23,6 +23,8 @@ funs[args_max > 50] %>% discard(~ grepl("as.data.frame", .x$name, fixed = TRUE)) Default values should be short and sweet. This makes the function specification easier to scan. +This pattern is important because it ensures that when scanning the function definition you can understand quickly how the defaults will be computed. + ## What are some examples? The following examples, drawn from base R, illustrate some functions that don't follow this pattern: @@ -45,8 +47,6 @@ The following examples, drawn from base R, illustrate some functions that don't ) ``` -## Why is it important? - ## How do I use it? There are three approaches: @@ -88,7 +88,7 @@ title.hjust <- guide$title.hjust %||% As you can see, `%||%` is particularly well suited to arguments where the default value is found through a cascading system of fallbacks. -Don't use `%||%` for more complex examples. +Don't use `%||%` for more complex examples where the individual clauses can't fit on their own line. For example in `reshape()` I would set `split = NULL` and then write: ```{r} @@ -102,8 +102,6 @@ if (is.null(split)) { } ``` -(I would probably also switch on `is.null(sep)` too, to make it more clear that there is special behaviour.) - ### Helper function For more complicated cases, you'll probably want to pull the code that computes the default out into a separate function, and in many cases you'll want to export (and document) the function. @@ -125,9 +123,6 @@ str(purrr::done()) str(rlang::zap()) ``` -Or is this the wrong way around? -Should the default always be `NULL` and we have a special value to use when you actually want a `NULL`? - Take `purrr::reduce()`: it has an optional details argument called `init`. When supplied, it serves as the initial value for the computation. But any value (including `NULL`) can a valid value. @@ -147,25 +142,14 @@ sample.int <- function(n, prob = NULL, useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07) ) { - if (useHash) { - .Internal(sample2(n, size)) - } else { - .Internal(sample(n, size, replace, prob)) - } + + ... } # AFTER -sample.int <- function(n, - size = n, - replace = FALSE, - prob = NULL, - useHash = NULL) { - useHash <- useHash %||% !replace && is.null(prob) && size <= n/2 && n > 1e+07 +sample.int <- function(n, size = n, replace = FALSE, prob = NULL, useHash = NULL) { + useHash <- useHash %||% (!replace && is.null(prob) && size <= n/2 && n > 1e+07) - if (useHash) { - .Internal(sample2(n, size)) - } else { - .Internal(sample(n, size, replace, prob)) - } + ... } ``` diff --git a/dots-position.Rmd b/dots-position.Rmd index 0bc492e..601aae4 100644 --- a/dots-position.Rmd +++ b/dots-position.Rmd @@ -8,11 +8,8 @@ source("common.R") ## What's the pattern? If you use `…` in a function, put it after the required arguments and before the optional arguments. - (See @sec-dots-data if `…` requires at least one argument because it's used to combine an arbitrary number of objects.) -## Why is it important? - There are three primary advantages: - It forces the user of your function to fully name optional arguments, because arguments that come after `...` are never matched by position or partially by name.