
Commit

Polishing last 3 chapters in scannable spec part
hadley committed Aug 2, 2023
1 parent a43391a commit a4c4e6a
Showing 7 changed files with 99 additions and 91 deletions.
2 changes: 1 addition & 1 deletion _quarto.yml
@@ -26,7 +26,7 @@ book:
- call-data-details.qmd
- function-names.qmd

- part: Scannable definitions
- part: Scannable specs
chapters:
- inputs-explicit.qmd
- important-args-first.qmd
46 changes: 27 additions & 19 deletions argument-clutter.qmd

## What's the problem?

If you have a large number of optional arguments that control the fine details of the operation of a function, it might be worth moving them into a separate "options" argument.
If you have a large number of optional arguments that control the fine details of the operation of a function, it might be worth lumping them all together into a separate "options" argument created by a helper function.

Having a large number of less important options makes it harder to see the the most important options.
Having a large number of less important arguments makes it harder to see the most important.
By moving rarely used and less important arguments to a secondary function, you can more easily draw attention to what is most important.

## What are some examples?

- A number of base R modelling functions like `loess()`, `glm()`, and `nls()` have `control` arguments that are paired with a function like `loess.control()`, `glm.control()`, and `nls.control()`.
These allow you to modify some rarely defaults, including the number of iterations, and the stopping criteria, and some debugging options.
These allow you to modify rarely used defaults, including the number of iterations, the stopping criteria, and some debugging options.

`optim()` uses a less formal version of this structure --- while it has a `control` argument, it doesn't have a matching `optim.control()` helper.
Instead, you supply a named list with components described in `?optim`.
A helper function is more convenient than a named list because it checks the argument names for free and gives nicer autocomplete to the user.

- `readr::read_csv()` and friends take a `locale` argument with value created by `readr::locale()`.
- `readr::read_delim()` and friends take a `locale` argument which is paired with the `readr::locale()` helper.
This object bundles together a bunch of options related to parsing numbers, dates, and times that vary from country to country.

- `readr::locale()` itself has a `date_names` argument that's paired with `readr::date_names()` and `readr::date_names_lang()` helpers.
You typically use it by supplying a two letter locale name (which `date_names_lang()` uses to look up common languages), but if your language isn't supported you can use `readr::date_names()` to individually supply full and abbreviate month and day of week names.
You typically use the argument by supplying a two letter locale (which `date_names_lang()` uses to look up common languages), but if your language isn't supported you can use `readr::date_names()` to individually supply full and abbreviated month and day of week names.

- On the other hand, `readr::read_delim()` also has a lot of options that control the specifics of parsing that you don't usually need (e.g. `escape_backslash`, `escape_double`, `quoted_na`, `comment`, `trim_ws)`.
These make the function specifications very long.
- On the other hand, `readr::read_delim()` also has a lot of options that control the specifics of file parsing that you typically only need in special situations (e.g. `escape_backslash`, `escape_double`, `quoted_na`, `comment`, `trim_ws`).
These make the function specifications very long.

- Somewhat related is the engine specification used by stringr, e.g. `regex(".", multiline = TRUE)`, as discussed in @sec-pattern-engine.
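To make the first example concrete, here's `glm()`'s real `control` argument in use — the helper bundles several convergence details into a single object:

```{r}
# Bundle the convergence details into one object instead of three
# separate glm() arguments.
ctrl <- glm.control(epsilon = 1e-10, maxit = 50, trace = FALSE)
fit <- glm(mpg ~ wt, data = mtcars, family = gaussian(), control = ctrl)
coef(fit)
```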

## How do I use this pattern?

The simple implement is just to create an object that returns a list:
The simplest implementation is just to write a helper function that returns a list:

```{r}
my_fun_opts <- function(opt1 = 1, opt2 = 2) {
  list(
    opt1 = opt1,
    opt2 = opt2
  )
}
```

Just this alone is nice because you can document the individual arguments, and auto-complete will remind the user what these less important options include.
This alone is nice because you can document the individual arguments, you get name checking for free, and auto-complete will remind the user what these less important options include.

It's good practice to add a class to this list so you can give more informative errors if the user supplies the wrong value:
An optional extra is to add a unique class to the list:

```{r}
my_fun_opts <- function(opt1 = 1, opt2 = 2) {
  structure(
    list(opt1 = opt1, opt2 = opt2),
    class = "mypackage_my_fun_opts"
  )
}
is_my_fun_opts <- function(x) {
inherits(x, "mypackage_my_fun_opts")
}
```

This then allows you to create more informative error messages:

```{r}
#| error: true
my_fun <- function(..., opts = my_fun_opts()) {
if (!is_my_fun_opts(opts)) {
if (!inherits(opts, "mypackage_my_fun_opts")) {
cli::cli_abort("{.arg opts} must be created by {.fun my_fun_opts}.")
}
}
my_fun(opts = 1)
```

If you perform this check in many places, you should consider pulling out the repeated code into a `check_my_fun_opts()` function.
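Such a helper might look like this — a sketch, with `check_my_fun_opts()` being the hypothetical name suggested above (the book's `cli::cli_abort()` message would work equally well in place of `stop()`):

```{r}
# Hypothetical helper that centralises the options check.
check_my_fun_opts <- function(opts) {
  if (!inherits(opts, "mypackage_my_fun_opts")) {
    stop("`opts` must be created by `my_fun_opts()`.", call. = FALSE)
  }
  invisible(opts)
}
```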

## How do I remediate past mistakes?

Typically you notice this problem after you have created too many options so you'll need to carefully remediate by introducing a new options function and paired argument, and then deprecating the old arguments.
Typically you notice this problem only after you have created too many options, so you'll need to carefully remediate by introducing a new options argument with a paired helper function, then deprecating the old arguments.
For example, if your existing function looks like this:

```{r}
my_fun <- function(x, y, opt1 = 1, opt2 = 2) {
}
```

Then you'll need to create a `my_fun_opts()` as above, add an `opts` argument that uses that as the default.
This is a breaking change, so if you want to provide a gradual on-ramp to the new API, for a while you can accept both the individual options and the options object, deprecating the individual options:
Then you'll need to create a `my_fun_opts()` as above and add an `opts` argument that uses that as the default.
This is a breaking change, so if you want to provide a gradual on-ramp to the new API you should keep the existing arguments but deprecate them:

```{r}
my_fun <- function(x, y, opts = my_fun_opts(), opt1 = deprecated(), opt2 = deprecated()) {
}
```
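Filled in, the deprecation shim might look something like this — a sketch using the lifecycle package (the argument handling is simplified, and `my_fun_opts()` is stubbed in so the example is self-contained):

```{r}
my_fun_opts <- function(opt1 = 1, opt2 = 2) {
  structure(list(opt1 = opt1, opt2 = opt2), class = "mypackage_my_fun_opts")
}

my_fun <- function(x, y, opts = my_fun_opts(),
                   opt1 = lifecycle::deprecated(),
                   opt2 = lifecycle::deprecated()) {
  if (lifecycle::is_present(opt1)) {
    lifecycle::deprecate_warn("1.0.0", "my_fun(opt1)", "my_fun(opts)")
    opts$opt1 <- opt1
  }
  if (lifecycle::is_present(opt2)) {
    lifecycle::deprecate_warn("1.0.0", "my_fun(opt2)", "my_fun(opts)")
    opts$opt2 <- opt2
  }
  opts
}
```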

Then in a future release you can remove the old arguments.
Then you can remove the old arguments in a future release.

## See also

2 changes: 1 addition & 1 deletion def-inform.qmd
The technique you use to generate the code will vary from function to function.
`rlang::expr_text()` is useful here because it automatically creates the code you'd use to build the character vector.

To avoid creating a magical default (@sec-def-magical), either export and document the function, or use the technique of @sec-arg-short-null:
To avoid creating a magical default (@sec-def-magical), either export and document the function, or use one of the techniques in @sec-defaults-short-and-sweet:

```{r}
left_join <- function(x, y, by = NULL) {
}
```
13 changes: 0 additions & 13 deletions def-magical.qmd
This problem is generally easy to avoid for new functions.
If you have made a mistake in an older function, you can remediate it by using a `NULL` default, as described in @sec-defaults-short-and-sweet.
If the problem is caused by an unexported function, you can also choose to document and export it.
Remediating this problem shouldn't break existing code, because it expands the function interface: all previous code will continue to work, and the function will also work if the argument is passed `NULL` (which probably didn't work previously).
For functions like `data.frame()` where `NULL` is already a permissible value, you'll need to use a sentinel object, as described in @sec-args-default-sentinel.
```{r}
sentinel <- function() structure(list(), class = "sentinel")
is_sentinel <- function(x) inherits(x, "sentinel")
data.frame_better <- function(..., row.names = sentinel()) {
if (is_sentinel(row.names)) {
# old default behaviour
}
}
```
99 changes: 58 additions & 41 deletions defaults-short-and-sweet.qmd
```{r}
#| eval = FALSE,
#| include = FALSE
source("fun_def.R")
pkg_funs("base") |> keep(\(f) some(f$formals, is.null))
pkg_funs("base") |> keep(\(f) some(f$formals, \(arg) is_call(arg) && !is_call(arg, c("c", "getOption", "if", "topenv", "parent.frame"))))
funs <- c(pkg_funs("base"), pkg_funs("stats"))
arg_length <- function(x) map_int(x$formals, ~ nchar(expr_text(.x)))
args <- map(funs, arg_length)
args_max <- map_dbl(args, ~ if (length(.x) == 0) 0 else max(.x))
funs[args_max > 50] %>% discard(~ grepl("as.data.frame", .x$name, fixed = TRUE))
```
## What's the pattern?

Default values should be short and sweet.
Avoid large or complex calculations in the default values, instead using `NULL` or helper function to where more calculation is needed.

Avoid large or complex calculations in the default values, instead using `NULL` or a helper function when the default requires non-trivial computation.
This keeps the function specification focussed on the big picture (i.e. what are the arguments and are they required or not) rather than the details of the defaults.

## What are some examples?

The following examples, drawn from base R, illustrate some functions that don't follow this pattern:
It's common for functions to use `NULL` to mean that the argument is optional, but the computation of the default is non-trivial:

- The default `labels` in `cut()` produces labels of the form `[a, b)`.
- The default `pattern` in `dir()` means match all files.
- The default `by` in `dplyr::left_join()` means join using the common variables between the two data frames (the so-called natural join).
- The default `mapping` in `ggplot2::geom_point()` (and friends) means use the mapping from the overall plot.

In other cases, we encapsulate default values into a function:

- readr functions use a family of functions including `readr::show_progress()`, `readr::should_show_col_types()` and `readr::should_show_lazy()` that make it easier for users to override various defaults.

It's also worth looking at a couple of counterexamples that come from base R:

- The default value for `by` in `seq()` is `((to - from)/(length.out - 1))`.

- `reshape()` has a very long default argument: the `split` argument is one of two possible lists depending on the value of the `sep` argument:

```{r}
reshape <- function(
  ...,  # other arguments elided
  sep = ".",
  split = if (sep == "") {
    list(regexp = "[A-Za-z][0-9]", include = TRUE)
  } else {
    list(regexp = sep, include = FALSE, fixed = TRUE)
  }
) {}
```
- `sample.int()` uses a complicated rule to determine whether or not to use a faster hash based method that's only applicable in some circumstances: `useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07))`
- `exists()`, which figures out if a variable exists in a given environment, uses a complex default to determine which environment to look in if not specifically provided: `envir = (if (missing(frame)) as.environment(where) else sys.frame(frame))`.
(NB: `?exists` cheats and hides the long default in the documentation.)
- `sample.int()` uses a complicated rule to determine whether or not to use a faster hash-based method that's only applicable in some circumstances: `useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07)`.
## How do I use it?
So what should you do if a default requires some complex calculation?
Our recommended approach is to use a default of `NULL` and then the actual default inside the function.
This is the most common pattern and should be instantly recognisable to users.
If this approach doesn't work for your case, there are three alternatives which each come with drawbacks:
- If the default is used in multiple places, compute it with an exported function that becomes the default.
- If `NULL` is a meaningful input to your function use a "sentinel" object instead.
- Finally, it's possible to not set a default value (making it look required) and use `missing()`.
These techniques are discussed in the following sections.
We have two recommended approaches: using `NULL` or creating a helper function.
I'll also show you two other alternatives that we don't generally recommend, but that you'll see in a handful of places in the tidyverse and that can be useful in limited circumstances.
### `NULL` default {#sec-arg-short-null}
### `NULL` default
- Using `NULL` signals that the argument is optional (@sec-required-no-defaults) but the default requires some calculation.
The most common approach to handling complex defaults to use `NULL`.
For example, if we were to use this approach in`sample.int()`, it might look something like this:
The simplest, and most common, way to indicate that an argument is optional but has a complex default is to use `NULL` as the default.
Then, in the body of the function, you perform the actual calculation only if the argument is `NULL`.
For example, if we were to use this approach in `sample.int()`, it might look something like this:
```{r}
sample.int <- function (n, size = n, replace = FALSE, prob = NULL, useHash = NULL) {
}
```
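Filled in, the body might look something like this — a sketch in which `my_sample_int()` is a hypothetical wrapper around the real `sample.int()`, not base R's actual implementation:

```{r}
my_sample_int <- function(n, size = n, replace = FALSE, prob = NULL, useHash = NULL) {
  if (is.null(useHash)) {
    # the old complex default, now computed where it's easier to read
    useHash <- !replace && is.null(prob) && size <= n / 2 && n > 1e7
  }
  sample.int(n, size, replace = replace, prob = prob, useHash = useHash)
}
```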

This pattern is made more elegant with the infix `%||%` operator which you use either by importing it from rlang or copying and pasting it in to your `utils.R`:
This pattern is made more elegant with the infix `%||%` operator, which you can use by importing it from rlang or copying and pasting it into your `utils.R`:

```{r}
`%||%` <- function(x, y) if (is.null(x)) y else x
```
A good example of this pattern is `readr::show_progress()`: it's used in every `read_` function in readr to determine whether or not a progress bar should be shown.
Because it has a relatively complex explanation, it's nice to be able to document it in its own file, rather than cluttering up file reading functions with incidental details.
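A minimal sketch of this approach, with hypothetical names (readr's real `show_progress()` checks more conditions than this):

```{r}
# Exported helper that computes the default; users can read its docs,
# call it directly, or override the option it consults.
show_progress <- function() {
  interactive() && !isTRUE(getOption("mypkg.no_progress"))
}

read_stuff <- function(path, progress = show_progress()) {
  if (progress) message("Reading ", path, " ...")
  # ... actual reading would happen here ...
  invisible(path)
}
```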

### Sentinel value {#sec-args-default-sentinel}
### Alternatives

If the above techniques don't work for your case there are two other alternatives that we don't generally recommend but can be useful in limited situations.

::: {.callout-note collapse="true"}
#### Sentinel value {#sec-args-default-sentinel}

Sometimes you'd like to use the `NULL` approach defined above, but `NULL` already has a specific meaning that you want to preserve.
For example, this comes up in ggplot2 scales functions which allow you to set the `name` of the scale which is displayed on the axis or legend.
The default value should just preserve whatever existing label is present so that if you're providing a scale to customise (e.g.) the breaks or labels, you don't need to re-type the scale name.
However, `NULL` is also a meaningful value because it means eliminate the scale label altogether[^defaults-short-and-sweet-1].
For that reason the default value for `name` is `ggplot2::waiver()`, a ggplot2-specific convention that means "inherit from the existing value".

[^defaults-short-and-sweet-1]: Unlike `name = ""` which doesn't show the label, but preserves the space where it would appear (sometimes useful for aligning multiple plots), `name = NULL` also eliminates the space normally allocated for the label.

If you look at `ggplot2::waiver()` you'll see it's just a very lightweight S3 class[^defaults-short-and-sweet-2]:

[^defaults-short-and-sweet-2]: If I was to write this code today I'd use `ggplot2_waiver` as the class name.

```{r}
ggplot2::waiver
```

And then ggplot2 also provides the internal `is.waive()`[^defaults-short-and-sweet-3] function, which allows you to work with it in the same way we might work with a `NULL`:

[^defaults-short-and-sweet-3]: If I wrote this code today, I'd call it `is_waiver()`.

```{r}
is.waive <- function(x) {
inherits(x, "waiver")
}
```
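Put together, a scale-like function might use the sentinel as follows — a simplified sketch with hypothetical names, not ggplot2's actual internals:

```{r}
waiver <- function() structure(list(), class = "waiver")
is.waive <- function(x) inherits(x, "waiver")

set_name <- function(name = waiver(), current = "existing label") {
  if (is.waive(name)) {
    current  # waiver: keep whatever label is already there
  } else {
    name     # including NULL, which removes the label entirely
  }
}
```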

### No default
The primary downside of this technique is that it requires substantial infrastructure to set up, so it's only really worth it for very important functions or if you're going to use it in multiple places.
:::

There's one final technique that comes with substantial caveats: the use of `missing()`.
The major drawback to this technique is that it makes it look like an argument is required (in direct conflict with @sec-required-no-defaults).
It works something like this:
[^defaults-short-and-sweet-1]: Unlike `name = ""` which doesn't show the label, but preserves the space where it would appear (sometimes useful for aligning multiple plots), `name = NULL` also eliminates the space normally allocated for the label.

[^defaults-short-and-sweet-2]: If I was to write this code today I'd use `ggplot2_waiver` as the class name.

[^defaults-short-and-sweet-3]: If I wrote this code today, I'd call it `is_waiver()`.

::: {.callout-note collapse="true"}
#### No default

The final alternative is to condition on the absence of an argument using `missing()`. It works something like this:

```{r}
reshape <- function(..., sep = ".", split) {
}
```

```{r}
reduce(letters[0], paste)
reduce(letters[0], paste, .init = "")
```
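The same idea in a self-contained sketch (`describe()` is a hypothetical function; the point is that the default depends on the call itself, so it can't be written as a simple default value):

```{r}
describe <- function(x, label) {
  if (missing(label)) {
    # default depends on how the function was called
    label <- deparse(substitute(x))
  }
  paste0(label, ": ", length(x), " values")
}
```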

`NULL` is a potentially valid option for `.init`, so we can't use the most common approach.
And this is for a relatively rarely used function, so it didn't seem to be worth the effort to create a sentinel for this specific value.
And `.init` is "semi" required so this seemed to be the least worst solution to the problem.
Why use this approach?
`NULL` is a potentially valid option for `.init`, so we can't use that approach.
And we only need it for a single function that's not terribly important, so creating a sentinel didn't seem worth it.
`.init` is "semi" required so this seemed to be the least worst solution to the problem.

The major drawback to this technique is that it makes it look like an argument is required (in direct conflict with @sec-required-no-defaults).
:::

## How do I remediate existing problems?

If you have a function with a long default, you can remediate it with any of the approaches above to.
It is not a breaking change as long as it change the default value, so make sure you have a test for the default operation of the function before embarking on this change.
If you have a function with a long default, you can remediate it with any of the approaches above.
It won't be a breaking change unless you accidentally change the computation of the default, so make sure you have a test for that before you begin.
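For example, before refactoring you might pin down the current behaviour with a quick characterisation test (hypothetical function and default):

```{r}
# Hypothetical function with a long inline default, about to be refactored.
my_fun <- function(x, useHash = (!is.null(x) && length(x) > 2)) {
  useHash
}
# Tests that pin down the current default behaviour:
stopifnot(
  identical(my_fun(1:10), TRUE),
  identical(my_fun(NULL), FALSE)
)
```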

## See also

- See @sec-argument-clutter for a technique to simplify your function spec if it's long because it has many less important optional arguments.
