Add descriptors to data-details

And start reworking analyses. Fixes tidyverse#68
soodoku · May 11, 2019 · 3cb535a · 3cb535a
1 parent d3f129e
commit 3cb535a
Show file tree

Hide file tree

Showing 9 changed files with 113 additions and 33 deletions.
diff --git a/README.md b/README.md
@@ -15,6 +15,8 @@ Sections:
 
 * What are some examples? Bulleted list of existing functions. Can be both 
   positive and negative examples. Show results of code where useful.
+  Goal is to include enough variety that everyone recognises at least one 
+  function, and can look up the docs for the details of the others.
 
 * Why is it bad/useful? More detailed motivation.
 

diff --git a/args-data-details.Rmd b/args-data-details.Rmd
@@ -1,60 +1,115 @@
-# Data, then details {#args-data-details}
+# Data, descriptors, details {#args-data-details}
 
 ```{r, include = FALSE}
 source("common.R")
 ```
 
-(If you're not familiar with the terms "data argument" and "details argument", read Chapter \@ref(call-data-details) before continuing.)
-
 ## What's the pattern?
 
-When creating a function, put data arguments before details arguments. Arguments with defaults should always come after arguments without defaults (i.e. required arguments).
+Function arguments should always come in the same order: data arguments, then descriptors, then details.
+
+* __Data__ arguments provide the core data. They are required, and are usually
+  vectors and often determine the type and size of the output. Data arguments
+  are often called `data`, `x`, or `y`. 
+
+* __Descriptor__ arguments describe essential details of the operation, and 
+  are usually required.
+
+* __Details__ arguments control the details of the function. These arguments
+  should have defaults, so they become optional, and their values are 
+  typically scalars (e.g. `na.rm = TRUE`, `n = 10`, `prop = 0.1`).
+
+A standard argument order makes it easier to understand a function at a glance, and this order implies that required arguments always come before optional arguments.
 
 Related patterns:
 
-* It's possible for the data argument to be `...` (i.e. there's an arbitrary 
-  number of data arguments). For example, `paste()` allows you to supply any 
-  number of strings to `...`, and the details of the concatenation are 
-  controlled by `sep` and `collapse`. This pattern is best using sparingly, 
-  and is described in more detail in Chapter \@ref(dots-data). 
+* `...` can play the role of the data argument (i.e. when there are an 
+  arbitrary number of inputs), as in `paste()`. This pattern is best using
+  sparingly, and is described in more detail in Chapter \@ref(dots-data). 
 
 * `...` can also be used to capture details arguments and pass them on to 
-  other functions. See Chapters \@ref(dots-position) and \@ref(dots-inspect) to 
-  how to use `...` as safely as possible in this situation.
+  other functions. See Chapters \@ref(dots-position) and \@ref(dots-inspect) 
+  to how to use `...` as safely as possible in this situation.
+
+* If the descriptor has a deafult value, I think you should tell the use about 
+  it, as in Chapter \@ref(args-explain).
 
 ## What are some examples?
 
-*   `grepl(pattern, x)` returns a logical vector the same length as `x` 
-    containing `TRUE` wherever `pattern` is matched. Here, `x` is the data 
-    argument (it's the character vector of strings to compare), and `pattern` 
-    is the data argument. 
+*   `mean()` has one data argument (`x`) and two details (`trim` and `na.rm`).
+
+*   The mathematical (`+`, `-`, `*`, `/`, ...) and comparison (`<`, `>`, `==`,
+    ...) operators have two data arguments.
+
+*   `ifelse()` has three data arguments (`test`, `yes`, `no`).
+
+*   `merge()` has two data arguments (`x`, `y`), one descriptor (`by`), 
+    and a number of details (`all`, `no.dups`, `sort`, ...).
+
+*   `grepl()`, which returns a logical vector the same length as `x` 
+    containing `TRUE` whenever it matches a regular expression `pattern`. 
+    It has one data argument (`x`), one descrptor (`pattern`), and a number 
+    of details (`fixed`, `perl`, `ignore.case`, ...).
+
+*   `rnorm()` has no data arguments and three descriptors (`n`, `mean`, `sd`).
+    `mean` and `sd` default to 0 and 1 respectively, which makes them feel 
+    more like details. I'd argue that they shouldn't have defaults to make it 
+    more clear that they're descriptors. This would have the side-effect of 
+    making `rnrorm()` more consistent with the other RNGs.
+
+    In `rt(n, df, ncp)`, however, I think `ncp` should default to `0` to make 
+    it clear that the non-centrality parameter is detail of the t-distribution,
+    not a core part.
+
+*   `stringr::str_sub()` has three data arguments (`string`, `start` and 
+    `end`). You might wonder what makes `start` and `end` data arguments, and
+    I admit it took me a while to figure this out too, but I think the
+    crucial factor is that you can give a single `string` and multiple 
+    `start`/`end` positions:
+
+    ```{r}
+    stringr::str_sub("Hello", 1:5, -1)
+    ```
+    
+    If I was to write `str_sub()` today, I'd call the first argument `x` (not
+    `string`), and I wouldn't give `start` and `end` default arguments to that 
+    they'd be required.
 
-*   In `lm()`, I think it's fairly obvious that the primary data argument is 
-    `data` and one of the key details is the model specification in the 
-    `formula` argument. This is partly a historic accident, because having 
-    `lm()` work with data frames (rather than individual vectors in the 
-    current environment), did become important until later. 
+*   `ggplot2::ggplot()` has one data argument (`data`) and one descriptor 
+    (`mapping`). 
 
-*   `replicate()` takes the number of replicates as the first argument, 
-    whereas that seems more like a detail of the computation.
+*   `lm()` has one data argument (`data`), one descriptor (`formula`), and
+    many details (`weights`, `na.action`, `method`, ...). Unfortunately 
+    `formula` comes before `data`. This is a historical accident, because 
+    putting all model variables into a data frame is a relatively recent
+    innovation in the long life cycle of `lm()`.
 
-    ```{r}
-    replicate(3, rnorm(4))
-    replicate(rnorm(4), n = 3)
-    ```
+*   `purrr::map()` has one data argument (`.x`) and one descriptor (`.f`).
+    `purrr::map2()` has two data arguments (`.x`, `.y`) and one descrptor 
+     (`.f`).
 
-*   `mapply()` takes `FUN` as the first argument. This is different to all the 
-    other `apply()` functions which take the data as the first argument.
-    See Chapter \@ref(cs-mapply-pmap) for more details.
+*   `mapply()` has any number of data arguments (...), one descriptor (`FUN`),
+    and a number of details (`SIMPLIFY`, `USE.NAMES`, ...). The descriptor
+    comes before the data arguments.
 
 ## Why is it important?
 
-Convention makes it easier to understand structure of function at a glance - know that the most important arguments come first.
+This convention makes it easy to understand the structure of a function at a glance, as when reading the arguments of a function, you know that they are roughly ordered in terms of importance. 
 
-Pipe is designed to work with this convention.
+As in Chapter \@ref(call-data-details), the distinction affects how you call functions: you should always name details arguments, and leave data and descriptors to be specified by position.
+
+Many families of functions are transformations: the output is the same type and size as the input. This includes dplyr (data frames), stringr (character vectors), and the map family (vectors) (and seen in a certain light ggplot2 also works the same way). Consistently placing the primary data first in the argument list makes it easy to generate pipelines of transformations.
 
 ## How do I remediate the problem?
 
+Data arguments are always required, and details arguments are always optional. Descriptors are usually required; if a function has multiple descriptor arguments and only some are required, they should come before the optional descriptors. The line between descriptor and details is often thin. And it's easy to make mistakes: giving a descriptor a default value often makes the function easier to use, and the cost of making the typically call less expressive.
+
+This categorisation is somewhat of a false trichotomy: there is really a continuous gradient of importance, where you want the arguments to be roughly in order of importance (although sometimes other concerns, like organising related arguments together, will dominate). The most important arguments are obviously those that you _must_ supply, and the least important are optional. Descriptors are somewhere in the middle. Despite being an approximation, I think these three terms are useful.
+
+To avoid the problem, ensure you carefully analyse the arguments to your function and make sure the most important come first. 
+
+Having an argument without a default after an argument with a default is a warning sign.
+
 If you have an exported function with this problem, there's no way to fix it. You'll need to deprecate the function and come up with a new replacement without the problem.
 
 For example, `setNames()` has an unusual definition:

diff --git a/args-magical-defaults.Rmd b/args-magical-defaults.Rmd
@@ -48,6 +48,14 @@ If a function behaves differently when the default value is suppled explicitly,
     Vectorize(rep.int, vectorize.args = arg.names)
     ```
 
+*   In `rbeta()`, the default value of `ncp` is 0, but if you explicitly supply
+    it the function uses a different algorithm:
+    
+    ```{r}
+    rbeta
+    ```
+    
+
 ## What are the exceptions?
 
 It's ok to use this behaviour when you want the default value of one argument to be the same as another. For example, take `rlang::set_names()`, which allows you to create a named vector from two inputs:

diff --git a/args-required.Rmd b/args-required.Rmd
@@ -21,6 +21,9 @@ When reading a function, it's important to be able to tell at a glance which arg
     sample(4)
     ```
 
+*   `rt()` (draw random numbers from the t-distribution) looks like it 
+    requires the `ncp` parameter but it doesn't.
+
 *   `lm()` does not have defaults for `formula`, `data`, `subset`, `weights`, 
     `na.action`, or `offset`. Only `formula` is actually required, but even 
     its absence fails to generate a clear error message:

diff --git a/call-data-details.Rmd b/call-data-details.Rmd
@@ -1,4 +1,4 @@
-# Name data arguments {#call-data-details}
+# Name details arguments {#call-data-details}
 
 ```{r, include = FALSE}
 source("common.R")

diff --git a/dots-data.Rmd b/dots-data.Rmd
@@ -21,8 +21,13 @@ f
 fct_relevel(f, c("b", "a"))
 # can be shortened to:
 fct_relevel(f, "b", "a")
+
 ```
 
+*  `mapply()`
+
+
+
 ## Why is it important?
 
 In general, I think it is best to avoid using `...` for this purpose because it has a relatively small benefit, only reducing typing by three letters `c()`, but has a number of costs:

diff --git a/dots-inspect.Rmd b/dots-inspect.Rmd
@@ -4,6 +4,8 @@
 source("common.R")
 ```
 
+<!-- call dots-details-s3 ? -->
+
 ## What's the pattern?
 
 Whenever you use `...` in an S3 generic to allow methods to add custom arguments, you should inspect the dots to make sure that every argument is used. You can also use this same approach when passing `...` to an overly permissive function.

diff --git a/dots-position.Rmd b/dots-position.Rmd
@@ -4,6 +4,8 @@
 source("common.R")
 ```
 
+<!-- Rename to details in dots? dots-details -->
+
 ## What's the pattern?
 
 When you use `...` in a function, where should you put it? It's obvious that the data arguments must come first. But which should come next, the dots or the details? This patter tells you to place `...` between the data arguments (the required arguments that supply the key "data" to the function) and the details arguments (optional additional arguments that control the finer details of the function). 

diff --git a/fun_def.R b/fun_def.R
@@ -1,7 +1,10 @@
 library(purrr)
 library(rlang)
 
-# Should return more complex function object with print method
+# TODO:
+# * way to print function with body
+# * way to highlight arguments
+
 pkg_funs <- function(pkg) {
   env <- pkg_env(pkg)
   funs <- keep(as.list(env, sorted = TRUE), is_closure)