Polishing scannable definitions
hadley committed Jul 26, 2023
1 parent 9acdeb8 commit 8ac437c
Showing 10 changed files with 227 additions and 219 deletions.
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -49,6 +49,8 @@ book:
- dots-prefix.qmd
- dots-inspect.qmd
- cs-mapply-pmap.qmd
- mutually-exclusive-arguments.qmd
- compound-arguments.qmd

- part: Outputs
chapters:
2 changes: 1 addition & 1 deletion arguments-independent.qmd
@@ -60,7 +60,7 @@ Dependencies between arguments makes functions harder:
- In `readr::locale()` there's a complex dependency between `decimal_mark` and `grouping_mark` because they can't be the same value, and Europe and the US use different standards.
See @sec-args-mutually-exclusive and @sec-args-compound for two exceptions where the dependency is via specific patterns of missing arguments.
See @sec-mutually-exclusive and @sec-compound-arguments for two exceptions where the dependency is via specific patterns of missing arguments.
(Other examples for you to explore: `rate` and `scale` in `rgamma()`, `na.rm` and `use` in `var()`)
69 changes: 69 additions & 0 deletions compound-arguments.qmd
@@ -0,0 +1,69 @@
# Compound arguments {#sec-compound-arguments}

```{r}
#| include = FALSE
source("common.R")
```

## What's the pattern?

A related, if less generally useful, form is to allow the user to supply either a single complex argument or several smaller arguments.

## What are some examples?

- `rgb(cbind(r, g, b))` is equivalent to `rgb(r, g, b)` (See @sec-cs-rgb for more details).

- `options(list(a = 1, b = 2))` is equivalent to `options(a = 1, b = 2)`.

- `stringr::str_sub(x, cbind(start, end))` is equivalent to `str_sub(x, start, end)`.

The most compelling reason to provide this sort of interface is when another function might return a complex output that you want to use as an input.
For example, it seems reasonable that you should be able to feed the output of `str_locate()` directly into `str_sub()`:

```{r}
library(stringr)
x <- c("aaaaab", "aaab", "ccccb")
loc <- str_locate(x, "a+b")
str_sub(x, loc)
```

But equally, it would be a bit weird to have to provide a matrix when subsetting with known positions:

```{r}
str_sub("Hadley", cbind(2, 4))
```

So `str_sub()` allows either individual vectors supplied to `start` and `end`, or a two-column matrix supplied to `start`.

## How do I use the pattern?

To implement this pattern in your own functions, branch on the type of the first of the compound arguments (here `start`):

```{r}
# Accepts either `start` and `end` vectors, or a two-column matrix in `start`
str_sub <- function(string, start, end) {
  if (is.matrix(start)) {
    # A matrix carries both positions, so `end` must not be supplied too
    if (!missing(end)) {
      stop("`end` must be missing when `start` is a matrix", call. = FALSE)
    }
    if (ncol(start) != 2) {
      stop("Matrix `start` must have exactly two columns", call. = FALSE)
    }
    stringi::stri_sub(string, from = start[, 1], to = start[, 2])
  } else {
    stringi::stri_sub(string, from = start, to = end)
  }
}
```

And make it clear in the documentation:

```{r}
#' @param start,end Integer vectors giving the `start` (default: first)
#'   and `end` (default: last) positions, inclusive. Alternatively, you
#'   can pass a two-column matrix to `start`, i.e. `str_sub(x, start, end)`
#'   is equivalent to `str_sub(x, cbind(start, end))`.
```

(If you look at `stringr::str_sub()` you'll notice that `start` and `end` do have defaults; I think this is a mistake because `start` and `end` are important enough that the user should always be forced to supply them.)
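
The same branching can be sketched with base R only, which avoids the stringi dependency. This is an illustration under stated assumptions, not how stringr is implemented: `substr2()` is a hypothetical name, and it relies on base `substring()` for the extraction.

```r
# Hypothetical helper (not part of stringr): accepts either `start` and `end`
# vectors, or a two-column matrix supplied as `start`.
substr2 <- function(x, start, end) {
  if (is.matrix(start)) {
    if (!missing(end)) {
      stop("`end` must be missing when `start` is a matrix", call. = FALSE)
    }
    if (ncol(start) != 2) {
      stop("Matrix `start` must have exactly two columns", call. = FALSE)
    }
    # Unpack the compound argument into its two components
    end <- start[, 2]
    start <- start[, 1]
  }
  substring(x, first = start, last = end)
}

# Both calling styles give the same result
substr2("Hadley", 2, 4)
substr2("Hadley", cbind(2, 4))
```

Unpacking the matrix at the top means the rest of the function only ever deals with the simple form, so the compound form costs one extra branch.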
4 changes: 1 addition & 3 deletions def-magical.qmd
@@ -105,7 +105,7 @@ This problem is generally easy to avoid for new functions:
When `x` takes the value `y` from its default, it's evaluated inside the function, yielding `1`.
When `y` is supplied explicitly, it is evaluated in the caller environment, yielding `2`.
- Don't use `missing()`[^def-magical-1].
- Don't use `missing()`\[\^def-magical-1\].
```{r}
f2 <- function(x = 1) {
@@ -124,8 +124,6 @@ This problem is generally easy to avoid for new functions:
In packages, it's easy to use a non-exported function without thinking about it.
This function is available to you, the package author, but not the user of the package, which makes it harder for them to understand how a package works.
[^def-magical-1]: The only exceptions are described in @sec-args-mutually-exclusive and @sec-args-compound.
## How do I remediate the problem?
If you have made a mistake in an older function you can remediate it by using a `NULL` default, as described in @sec-defaults-short-and-sweet.
46 changes: 15 additions & 31 deletions dots-after-required.qmd
@@ -8,20 +8,13 @@ source("common.R")
## What's the pattern?

If you use `...` in a function, put it after the required arguments and before the optional arguments.
(See @sec-dots-data if `...` requires at least one argument because it's used to combine an arbitrary number of objects.)

There are three primary advantages:
This has two positive impacts:

- It forces the user of your function to fully name optional arguments, because arguments that come after `...` are never matched by position or partially by name.
Using full names for details arguments is good practice, because we believe it makes code easier to read.
- It forces the user of your function to fully name optional arguments, because arguments that come after `...` are never matched by position or by partial name.
We believe that using full names for optional arguments is good practice because it makes code easier to read.

- You can easily add new optional arguments or change the order of existing arguments without affecting existing calls.
This makes it easy to extend your function with new capabilities, because you don't need to worry about breaking existing code.

- When coupled with "inspect the dots" (@sec-dots-inspect), or "dot prefix" (@sec-dots-prefix) it minimises the chances that misspelled arguments names will silently go astray.

The payoff of this pattern is not huge: it protects against a fairly unusual failure mode.
However, since the failure mode is silent (and hence easy to miss) and the pattern is very easy to apply, I think it's still worth it.
- This in turn means that you can easily add new optional arguments or change the order of existing arguments without affecting existing code.
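
The matching rule behind these points can be seen in a minimal sketch. `f()` and its `verbose` argument are hypothetical, invented only to illustrate how R matches arguments that come after `...`:

```r
# Hypothetical function: `verbose` comes after `...`, so it can only be
# set by its full name.
f <- function(x, ..., verbose = FALSE) {
  list(n_dots = length(list(...)), verbose = verbose)
}

f(1, TRUE)            # TRUE is captured by `...`; `verbose` keeps its default
f(1, verb = TRUE)     # a partial name is also captured by `...`
f(1, verbose = TRUE)  # only the full name reaches `verbose`
```

Note that the first two calls do not error: the stray values are silently absorbed by `...`, which is why this pattern is usually paired with a check on the dots.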

## What are some examples?

@@ -34,13 +27,13 @@ mean(x, , TRUE)
mean(x, n = TRUE, t = 0.1)
```

Not only does allow confusing code[^dots-position-1], it also makes it hard to later change the order of these arguments, or introduce new arguments that might be more important.
Not only does this allow for confusing code[^dots-after-required-1], it also makes it hard to later change the order of these arguments, or introduce new arguments that might be more important.

[^dots-position-1]: As much as we recommend people don't write code like this, you know someone will!
[^dots-after-required-1]: As much as we recommend people don't write code like this, you know someone will!

If `mean()` instead placed `...` before `trim` and `na.rm`, like `mean2()`[^dots-position-2] below, then you must fully name each argument:
If `mean()` instead placed `...` before `trim` and `na.rm`, like `mean2()`[^dots-after-required-2] below, then you must fully name each argument:

[^dots-position-2]: Note that I moved `na.rm` in front of `trim` because I believe `na.rm` is the more important argument: it's used vastly more often than `trim`, and I'm following @sec-important-args-first.
[^dots-after-required-2]: Note that I moved `na.rm` in front of `trim` because I believe `na.rm` is the more important argument: it's used vastly more often than `trim`, and I'm following @sec-important-args-first.

```{r}
mean2 <- function(x, ..., na.rm = FALSE, trim = 0) {
@@ -53,22 +46,8 @@ mean2(x, na.rm = TRUE, trim = 0.1)

## How do I remediate past mistakes?

If you've already published a function where you've put `...` in the wrong place, it's easy to fix.
You'll need to use a function from the rlang package to check that `...` is as expected (e.g. from @sec-dots-data or @sec-dots-inspect).
Since using the full names for details arguments is good practice, making this change will typically affect little existing code, but it is an interface change so should be advertised prominently.

```{r}
#| error = TRUE
old_interface <- function(x, data1 = 1, data2 = 2, ...) {
}
old_interface(1, 2)
new_interface <- function(x, ..., data1 = 1, data2 = 2) {
rlang::check_dots_used()
}
new_interface(1, 2)
```
It's straightforward to fix a function where you've put `...` in the wrong place: you just need to change the argument order and use `rlang::check_dots_used()` to check that no arguments are lost (learn more in @sec-dots-inspect).
This is a breaking change, but it tends to affect relatively little code because most people do fully name optional arguments.

We can use this approach to make a safer version of `mean()`:

@@ -83,3 +62,8 @@ mean3(x, , TRUE)
mean3(x, n = TRUE, t = 0.1)
```

## See also

- @sec-dots-data: if `...` is a required argument because it's used to combine an arbitrary number of objects in a data structure.
- @sec-dots-inspect: to ensure that arguments to `...` never go silently missing.
41 changes: 19 additions & 22 deletions important-args-first.qmd
@@ -8,50 +8,47 @@ source("common.R")
## What's the pattern?

In a function call, the most important arguments should come first.
How can you tell what arguments are most important?
That's largely a judgement call that you'll need to make based on your beliefs about which arguments will be used most commonly.
However, there are a few general principles:
As a general rule, the most important arguments will be the ones that are used most often, but that's often hard to tell until your function has existed in the wild for a while.
Fortunately, there are a few rules of thumb that can help:

- If the output is a transformation of an input (e.g. `log()`, `stringr::str_replace()`, `dplyr::left_join()`) that's the most important argument.
- If the output is a transformation of an input (e.g. `log()`, `stringr::str_replace()`, `dplyr::left_join()`) then that argument is the most important.
- Other arguments that determine the type or shape of the output are typically very important.
- Optional arguments (i.e. arguments with a default) are the least important, and should come last.
- If the function uses `...`, the optional arguments should come after `...`; see @sec-dots-after-required for more details.

This convention makes it easy to understand the structure of a function at a glance: the more important an argument is, the earlier you see it.
When the output is very strongly tied to an input, putting first also ensures that you function works well with the pipe, leading to code that focuses on the transformations rather than the object being transformed.
I believe that ensuring the first argument is always the object being transformed is helps make stringr and purrr functions easier to learn than their base equivalents.
This convention makes it easy to understand the structure of a function at a glance: the more important an argument is, the earlier you'll see it.
When the output is very strongly tied to an input, putting that argument first also ensures that your function works well with the pipe, leading to code that focuses on the transformations rather than the object being transformed.
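
As a rough illustration of the pipe point, the sketch below wraps base `gsub()` in a data-first function. `replace_all()` is a hypothetical helper, and the example assumes R 4.1 or later for the native pipe `|>`:

```r
# Hypothetical data-first wrapper around base gsub(), which takes
# `pattern` first and the character vector third.
replace_all <- function(string, pattern, replacement) {
  gsub(pattern, replacement, string, fixed = TRUE)
}

x <- c("a-b", "c-d")

# Pattern-first: the object being transformed is buried inside nested calls
nested <- toupper(gsub("-", "_", x, fixed = TRUE))

# Data-first: each step reads left to right as a transformation of `x`
piped <- x |> replace_all("-", "_") |> toupper()

identical(nested, piped)
```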

## What are some examples?

The vast majority of functions get this right, so we'll pick on a few examples which I think get it wrong:

- I think the arguments to base R string functions (`grepl()`, `gsub()`, etc) are in the wrong order because they consistently make the regular expression (`pattern`) the first argument, rather than the character vector being manipulated.
I think the character vector is more important because it's the argument the fundamentally determines the size of the output.
- I think the arguments to base R string functions (`grepl()`, `gsub()`, etc) are in the wrong order because they consistently make the regular expression (`pattern`) the first argument, rather than the character vector being manipulated (`x`).

- The first two arguments to `lm()` are `formula` and `data`.
I'd argue that `data` should be the first argument; even though it doesn't affect the shape of the output (which is always an lm S3 object), it affects the shape of many important functions like `predict()`.
I'd argue that `data` should be the first argument; while it doesn't affect the shape of the output, which is always an lm S3 object, it does affect the shape of the output of many important functions like `predict()`.
However, the designers of `lm()` wanted `data` to be optional, so you could still fit models even if you hadn't collected the individual variables into a data frame.
Because `formula` is required and `data` is not, `formula` must come first.
Because `formula` is required and `data` is not, this means that `formula` had to come first.

- The first two arguments to `ggplot()` are `data` and `mapping`.
Both data and mapping are required for every plot, so why make `data` first?
I picked this ordering because in most plots there's one dataset shared across all layers and only the mapping changes.

It's worth noting the layer functions, like `geom_point()`, flip the order of these arguments, because in an individual layer you're more likely to specify `mapping` than `data`, and in many cases if you do specify `data` you'll want `mapping` as well.
This makes these the argument order inconsistent with `ggplot()`, but I think time has shown it to be a reasonable design decision.
On the other hand, the layer functions, like `geom_point()`, flip the order of these arguments because in an individual layer you're more likely to specify `mapping` than `data`, and in many cases if you do specify `data` you'll want `mapping` as well.
This makes the argument order inconsistent with `ggplot()`, but overall supports the most common use cases.

- ggplot2 functions work by creating some object that's then added on to a plot object, so the plot, which is arguably the most important argument, is not used at all.
ggplot2 works this way in part because it was invented before the pipe was discovered, and the best way I came up to write plots from left to right was to rely on `+` (so-called operator overloading).
- ggplot2 functions work by creating an object that is then added on to a plot, so the plot, which is really the most important argument, is not obvious at all.
ggplot2 works this way in part because it was written before the pipe was discovered, and the best way I came up with to define plots from left to right was to rely on `+` (so-called operator overloading).
As an interesting historical fact, ggplot (the precursor to ggplot2) actually works great with the pipe, and a couple of years ago I brought it back to life as [ggplot1](https://github.com/hadley/ggplot1).

## How do I remediate past mistakes?

Generally, it is not possible to remediate an existing exported function with this problem.
Typically, you will need to perform major surgery on the function arguments, and this will convey different conventions about which arguments should be named.
This implies that you should deprecate the entire existing function and replace it with a new alternative.
Generally, it is not possible to change the order of the first few arguments because it will break existing code (since these are the arguments that are most likely to be used unnamed).
This means that the only real solution is to deprecate the entire function and replace it with a new one.
Because this is invasive to the user, it's best to do sparingly: if the mistake is minor, you're better off waiting until you've collected other problems before fixing it.

For example, take `tidyr::gather()`.
It has a number of problems with its design that made them hard to use.
Relevant to this chapter is that the argument order is wrong, because you almost always want to specify which variables to gather, which is the fourth argument, not the second (after the `data`).
It has a number of problems with its design, including the argument order, that make it harder to use.
Because it wasn't possible to easily fix this mistake, we accumulated other `gather()` problems for several years before fixing them all at once in `pivot_longer()`.
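
One way the deprecate-and-replace approach might look, using only base R, is sketched below. The function names are hypothetical, and a real package would more likely use a dedicated deprecation helper than base `.Deprecated()`:

```r
# New function (hypothetical name): the vector being searched comes first
detect_pattern <- function(x, pattern) {
  grepl(pattern, x)
}

# Old function (hypothetical name) kept as a thin wrapper so existing code
# keeps working, but now warns and points users at the replacement
match_pattern <- function(pattern, x) {
  .Deprecated("detect_pattern")
  detect_pattern(x, pattern)
}

# Existing calls still return the same result, with a deprecation warning
suppressWarnings(match_pattern("a", c("apple", "kiwi")))
```

Keeping the old function as a wrapper gives users time to migrate, which matters because the old and new argument orders cannot coexist in a single function.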

## See also

- @sec-dots-after-required: If the function uses `...`, it should come in between the required and optional arguments.