Skip to content

Commit

Permalink
Add descriptors to data-details
Browse files Browse the repository at this point in the history
And start reworking analyses.

Fixes tidyverse#68
  • Loading branch information
hadley committed May 11, 2019
1 parent d3f129e commit 3cb535a
Show file tree
Hide file tree
Showing 9 changed files with 113 additions and 33 deletions.
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,8 @@ Sections:

* What are some examples? Bulleted list of existing functions. Can be both
positive and negative examples. Show results of code where useful.
Goal is to include enough variety that everyone recognises at least one
function, and can look up the docs for the details of the others.

* Why is it bad/useful? More detailed motivation.

Expand Down
117 changes: 86 additions & 31 deletions args-data-details.Rmd
Original file line number Diff line number Diff line change
@@ -1,60 +1,115 @@
# Data, then details {#args-data-details}
# Data, descriptors, details {#args-data-details}

```{r, include = FALSE}
source("common.R")
```

(If you're not familiar with the terms "data argument" and "details argument", read Chapter \@ref(call-data-details) before continuing.)

## What's the pattern?

When creating a function, put data arguments before details arguments. Arguments with defaults should always come after arguments without defaults (i.e. required arguments).
Function arguments should always come in the same order: data arguments, then descriptors, then details.

* __Data__ arguments provide the core data. They are required, and are usually
vectors and often determine the type and size of the output. Data arguments
are often called `data`, `x`, or `y`.

* __Descriptor__ arguments describe essential details of the operation, and
are usually required.

* __Details__ arguments control the details of the function. These arguments
should have defaults, so they become optional, and their values are
typically scalars (e.g. `na.rm = TRUE`, `n = 10`, `prop = 0.1`).

A standard argument order makes it easier to understand a function at a glance, and this order implies that required arguments always come before optional arguments.

Related patterns:

* It's possible for the data argument to be `...` (i.e. there's an arbitrary
number of data arguments). For example, `paste()` allows you to supply any
number of strings to `...`, and the details of the concatenation are
controlled by `sep` and `collapse`. This pattern is best using sparingly,
and is described in more detail in Chapter \@ref(dots-data).
* `...` can play the role of the data argument (i.e. when there are an
arbitrary number of inputs), as in `paste()`. This pattern is best using
sparingly, and is described in more detail in Chapter \@ref(dots-data).

* `...` can also be used to capture details arguments and pass them on to
other functions. See Chapters \@ref(dots-position) and \@ref(dots-inspect) to
how to use `...` as safely as possible in this situation.
other functions. See Chapters \@ref(dots-position) and \@ref(dots-inspect)
to how to use `...` as safely as possible in this situation.

* If the descriptor has a deafult value, I think you should tell the use about
it, as in Chapter \@ref(args-explain).

## What are some examples?

* `grepl(pattern, x)` returns a logical vector the same length as `x`
containing `TRUE` wherever `pattern` is matched. Here, `x` is the data
argument (it's the character vector of strings to compare), and `pattern`
is the data argument.
* `mean()` has one data argument (`x`) and two details (`trim` and `na.rm`).

* The mathematical (`+`, `-`, `*`, `/`, ...) and comparison (`<`, `>`, `==`,
...) operators have two data arguments.

* `ifelse()` has three data arguments (`test`, `yes`, `no`).

* `merge()` has two data arguments (`x`, `y`), one descriptor (`by`),
and a number of details (`all`, `no.dups`, `sort`, ...).

* `grepl()`, which returns a logical vector the same length as `x`
containing `TRUE` whenever it matches a regular expression `pattern`.
It has one data argument (`x`), one descrptor (`pattern`), and a number
of details (`fixed`, `perl`, `ignore.case`, ...).

* `rnorm()` has no data arguments and three descriptors (`n`, `mean`, `sd`).
`mean` and `sd` default to 0 and 1 respectively, which makes them feel
more like details. I'd argue that they shouldn't have defaults to make it
more clear that they're descriptors. This would have the side-effect of
making `rnrorm()` more consistent with the other RNGs.

In `rt(n, df, ncp)`, however, I think `ncp` should default to `0` to make
it clear that the non-centrality parameter is detail of the t-distribution,
not a core part.

* `stringr::str_sub()` has three data arguments (`string`, `start` and
`end`). You might wonder what makes `start` and `end` data arguments, and
I admit it took me a while to figure this out too, but I think the
crucial factor is that you can give a single `string` and multiple
`start`/`end` positions:

```{r}
stringr::str_sub("Hello", 1:5, -1)
```
If I was to write `str_sub()` today, I'd call the first argument `x` (not
`string`), and I wouldn't give `start` and `end` default arguments to that
they'd be required.
* In `lm()`, I think it's fairly obvious that the primary data argument is
`data` and one of the key details is the model specification in the
`formula` argument. This is partly a historic accident, because having
`lm()` work with data frames (rather than individual vectors in the
current environment), did become important until later.
* `ggplot2::ggplot()` has one data argument (`data`) and one descriptor
(`mapping`).
* `replicate()` takes the number of replicates as the first argument,
whereas that seems more like a detail of the computation.
* `lm()` has one data argument (`data`), one descriptor (`formula`), and
many details (`weights`, `na.action`, `method`, ...). Unfortunately
`formula` comes before `data`. This is a historical accident, because
putting all model variables into a data frame is a relatively recent
innovation in the long life cycle of `lm()`.
```{r}
replicate(3, rnorm(4))
replicate(rnorm(4), n = 3)
```
* `purrr::map()` has one data argument (`.x`) and one descriptor (`.f`).
`purrr::map2()` has two data arguments (`.x`, `.y`) and one descrptor
(`.f`).
* `mapply()` takes `FUN` as the first argument. This is different to all the
other `apply()` functions which take the data as the first argument.
See Chapter \@ref(cs-mapply-pmap) for more details.
* `mapply()` has any number of data arguments (...), one descriptor (`FUN`),
and a number of details (`SIMPLIFY`, `USE.NAMES`, ...). The descriptor
comes before the data arguments.
## Why is it important?
Convention makes it easier to understand structure of function at a glance - know that the most important arguments come first.
This convention makes it easy to understand the structure of a function at a glance, as when reading the arguments of a function, you know that they are roughly ordered in terms of importance.
Pipe is designed to work with this convention.
As in Chapter \@ref(call-data-details), the distinction affects how you call functions: you should always name details arguments, and leave data and descriptors to be specified by position.
Many families of functions are transformations: the output is the same type and size as the input. This includes dplyr (data frames), stringr (character vectors), and the map family (vectors) (and seen in a certain light ggplot2 also works the same way). Consistently placing the primary data first in the argument list makes it easy to generate pipelines of transformations.
## How do I remediate the problem?
Data arguments are always required, and details arguments are always optional. Descriptors are usually required; if a function has multiple descriptor arguments and only some are required, they should come before the optional descriptors. The line between descriptor and details is often thin. And it's easy to make mistakes: giving a descriptor a default value often makes the function easier to use, and the cost of making the typically call less expressive.
This categorisation is somewhat of a false trichotomy: there is really a continuous gradient of importance, where you want the arguments to be roughly in order of importance (although sometimes other concerns, like organising related arguments together, will dominate). The most important arguments are obviously those that you _must_ supply, and the least important are optional. Descriptors are somewhere in the middle. Despite being an approximation, I think these three terms are useful.
To avoid the problem, ensure you carefully analyse the arguments to your function and make sure the most important come first.
Having an argument without a default after an argument with a default is a warning sign.
If you have an exported function with this problem, there's no way to fix it. You'll need to deprecate the function and come up with a new replacement without the problem.
For example, `setNames()` has an unusual definition:
Expand Down
8 changes: 8 additions & 0 deletions args-magical-defaults.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,14 @@ If a function behaves differently when the default value is suppled explicitly,
Vectorize(rep.int, vectorize.args = arg.names)
```
* In `rbeta()`, the default value of `ncp` is 0, but if you explicitly supply
it the function uses a different algorithm:
```{r}
rbeta
```
## What are the exceptions?
It's ok to use this behaviour when you want the default value of one argument to be the same as another. For example, take `rlang::set_names()`, which allows you to create a named vector from two inputs:
Expand Down
3 changes: 3 additions & 0 deletions args-required.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,9 @@ When reading a function, it's important to be able to tell at a glance which arg
sample(4)
```
* `rt()` (draw random numbers from the t-distribution) looks like it
requires the `ncp` parameter but it doesn't.
* `lm()` does not have defaults for `formula`, `data`, `subset`, `weights`,
`na.action`, or `offset`. Only `formula` is actually required, but even
its absence fails to generate a clear error message:
Expand Down
2 changes: 1 addition & 1 deletion call-data-details.Rmd
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Name data arguments {#call-data-details}
# Name details arguments {#call-data-details}

```{r, include = FALSE}
source("common.R")
Expand Down
5 changes: 5 additions & 0 deletions dots-data.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,13 @@ f
fct_relevel(f, c("b", "a"))
# can be shortened to:
fct_relevel(f, "b", "a")
```

* `mapply()`



## Why is it important?

In general, I think it is best to avoid using `...` for this purpose because it has a relatively small benefit, only reducing typing by three letters `c()`, but has a number of costs:
Expand Down
2 changes: 2 additions & 0 deletions dots-inspect.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@
source("common.R")
```

<!-- call dots-details-s3 ? -->

## What's the pattern?

Whenever you use `...` in an S3 generic to allow methods to add custom arguments, you should inspect the dots to make sure that every argument is used. You can also use this same approach when passing `...` to an overly permissive function.
Expand Down
2 changes: 2 additions & 0 deletions dots-position.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@
source("common.R")
```

<!-- Rename to details in dots? dots-details -->

## What's the pattern?

When you use `...` in a function, where should you put it? It's obvious that the data arguments must come first. But which should come next, the dots or the details? This patter tells you to place `...` between the data arguments (the required arguments that supply the key "data" to the function) and the details arguments (optional additional arguments that control the finer details of the function).
Expand Down
5 changes: 4 additions & 1 deletion fun_def.R
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
library(purrr)
library(rlang)

# Should return more complex function object with print method
# TODO:
# * way to print function with body
# * way to highlight arguments

pkg_funs <- function(pkg) {
env <- pkg_env(pkg)
funs <- keep(as.list(env, sorted = TRUE), is_closure)
Expand Down

0 comments on commit 3cb535a

Please sign in to comment.