Noodling on independent args

* Not actually about scannable definitions, so move later * Pull out case where one argument changes the meanining of another
soodoku · Jul 27, 2023 · c5e8730 · c5e8730
1 parent 8ac437c
commit c5e8730
Show file tree

Hide file tree

Showing 3 changed files with 74 additions and 70 deletions.
diff --git a/_quarto.yml b/_quarto.yml
@@ -29,14 +29,11 @@ book:
     - part: Scannable definitions
       chapters:
       - important-args-first.qmd
+      - required-no-defaults.qmd
       - inputs-explicit.qmd
       - dots-after-required.qmd
-      - required-no-defaults.qmd
-      - arguments-independent.qmd
-      - enumerate-options.qmd
       - defaults-short-and-sweet.qmd
-      - cs-setNames.qmd
-      - cs-stringr.qmd
+      - enumerate-options.qmd
 
     - part: Function arguments
       chapters: 
@@ -49,8 +46,11 @@ book:
       - dots-prefix.qmd
       - dots-inspect.qmd
       - cs-mapply-pmap.qmd
+      - arguments-independent.qmd
       - mutually-exclusive-arguments.qmd
       - compound-arguments.qmd
+      - cs-setNames.qmd
+      - cs-stringr.qmd
 
     - part: Outputs
       chapters: 

diff --git a/arguments-independent.qmd b/arguments-independent.qmd
@@ -7,67 +7,50 @@ source("common.R")
 
 ## What's the problem?
 
-Avoid creating dependencies between details arguments so that only certain combinations are permitted.
+Avoid complex patterns of dependencies between arguments so that only certain combinations are permitted.
 
 Dependencies between arguments makes functions harder:
 
--   It suggests that there are many more viable code paths than there really are and all those (unnecessary) possibilities still occupy head space.
+-   It can suggest that there are many more viable combination of inputs than actually exist and those unnecessary possibilities still occupy head space.
     You have to learn and the remember the set of allowed combinations, rather than them being implied by the structure of the function.
 
--   If one argument affects the operation of another argument, you have to read the entire function call before you can successfully understand it.
-
--   It increases implementation complexity.
-    Interdependence of arguments suggests complex implementation paths which are harder to analyse and test.
+-   Interdependence of arguments suggests complex implementation paths which are harder to analyse and test.
 
 -   It makes documentation harder to write.
     You have to use extra words to explain exactly how combinations of arguments work together, and it's not obvious where those words should go.
     If there's an interaction between `arg_a` and `arg_b` do you document with `arg_a`, with `arg_b`, or with both?
 
 ## What are some examples?
 
--   In `rep()` you can supply both `times` and `each` *unless* `times` is a vector :
+-   `forcats::fct_lump()` decides which algorithm to use based on a combination of the `n` and `prop` arguments.
 
-    ```{r}
-    #| error = TRUE
-    rep(1:3, times = 2, each = 3)
-    rep(1:3, times = 1:3, each = 2)
-    ```
+-   In `ggplot2::geom_histogram()`, you can specify the histogram breaks in three ways: as a number of `bins`, as the width of each bin (`binwidth`, plus `center` or `boundary`), or the exact `breaks`.
+    You can only pick one of the three options, which is hard to convey in the documentation.
+    There's also an implied precedence so that if more than one option is supplied, one will silently win.
 
-    Learn more in Chapter @sec-cs-rep.
+-   In `readr::locale()` there's a complex dependency between `decimal_mark` and `grouping_mark` because they can't be the same value, and Europe and the US Europe use different standards.
 
 -   `grepl()` has `perl`, `fixed`, and `ignore.case` arguments which can either be `TRUE` or `FALSE`.
-    But `fixed = TRUE` overrides `perl = TRUE`, and `ignore.case = TRUE` only works if `fixed = FALSE`.
-    Both `fixed` and `perl` change how another argument, `pattern`, is interpreted.
+    If these arguments were independent that would imply 2\^3 = 8 possible combinations.
+    But `fixed = TRUE` overrides `perl = TRUE`, and `ignore.case = TRUE` only works if `fixed = FALSE` so there are only 5 valid combinations.
 
--   In `library()` the `character.only` argument changes how `package` is interpreted:
+-   In `rep()` you can supply both `times` and `each` *unless* `times` is a vector :
 
     ```{r}
-    #| eval = FALSE
-    ggplot2 <- "dplyr"
-
-    # Loads ggplot2
-    library(ggplot2)
-
-    # Loads dplyr
-    library(ggplot2, character.only = TRUE)
+    #| error = TRUE
+    rep(1:3, times = 2, each = 3)
+    rep(1:3, times = 1:3, each = 2)
     ```
 
--   `forcats::fct_lump()` decides which algorithm to use based on a combination of the `n` and `prop` arguments.
-
--   In `ggplot2::geom_histogram()`, you can specify the histogram breaks in three ways: as a number of `bins`, as the width of each bin (`binwidth`, plus `center` or `boundary`), or the exact `breaks`.
-    You can only pick one of the three options, which is hard to convey in the documentation.
-    There's also an implied precedence so that if more than one option is supplied, one will silently win.
-
--   In `readr::locale()` there's a complex dependency between `decimal_mark` and `grouping_mark` because they can't be the same value, and Europe and the US Europe use different standards.
-
-See @sec-mutually-exclusive and @sec-compound-arguments for a two exceptions where the dependency is via specific patterns of missing arguments.
+    Learn more in @sec-cs-rep.
 
-(Other examples for you to explore: `rate` and `scale` in `rgamma()`, `na.rm` and `use` in `var()`)
+(Other examples for you to explore: `na.rm` and `use` in `var()`.
+Why does this arise?)
 
-## How do I remediate?
+## How do I remediate past mistakes?
 
 Often these problems arise because the scope of a function grows over time.
-When the function was initially designed, the scope was small, and it grew incrementally over time.
+The scope of a function was small when it was initially designed but it has grown incrementally over time.
 At no point did it seem worth the additional effort to refactor to a new design, but now you have a large complex function.
 This makes the problem hard to avoid.
 
@@ -78,37 +61,64 @@ There are two common outcomes which are illustrated in the case studies below:
 
 -   Encapsulating related details arguments into a single object.
 
-See also larger case study in @sec-cs-rep where this problem is tangled up with other problems.
-
 If these changes to the interface occur to exported functions in a package, you'll need to consider how to preserve the interface with deprecation warnings.
 For important functions, it is worth generating an message that includes new code to copy and paste.
 
-### Case study: `fct_lump()` {#sec-cs-fct-lump}
+### Splitting into multiple functions {#sec-cs-fct-lump}
+
+The goal of `fct_lump()` is to combine infrequent factor levels into a common "other" level, which is useful for displays where you want to concentrate on the most common values but still account for every observation.
+When I first wrote `fct_lump()`, it implemented a single strategy.
+But over time people asked for more and more variations, which I kept adding to `fct_lump()`.
+This lead to a function that picks from one of three different strategies depending which of the `n` and `prop` arguments you supply:
 
-There are many different ways to decide how to lump uncommon factor levels together, and initially we attempted to encode these through arguments to `fct_lump()`.
-However, over time as the number of arguments increased, it gets harder and harder to tell what the options are.
-Currently there are three behaviours:
+-   If `n` and `prop` are missing, it will merge together the least frequent levels, ensuring that `other` is still the smallest level.
+    This case ignores the `ties.method` argument, adding another dependency between arguments.
 
--   Both `n` and `prop` missing - merge together the least frequent levels, ensuring that `other` is still the smallest level.
-    (For this case, the `ties.method` argument is ignored.)
+-   If a positive `n` is supplied, it preserves the `n` most common values; if a negative `n` is supplied it preserves the `n` least common values.
 
--   Only `n` supplied: if positive, preserves `n` most common values.
+-   If a positive `prop` is supplied, lumps values which do not appear at least `prop` of the time.
+    Negative `prop` lumps values that do not appear at most `-prop` of the time.
 
--   Only `prop` supplied: if positive, preserves
+Overall, this become very hard to explain in the documentation, so in forcats 0.5.0 we split `fct_lump()` into three separate functions: `fct_lump_prop()`, `fct_lump_n()`, and `fct_lump_lowfreq()`.
+This allows the function name to hint at the purpose, prevents you from supplying both `n` and `prop` through the design of the functions, and only has the `ties.method` argument where it makes sense.
 
--   Both `n` and `prop` supplied: due to a bug in the code, this is treated the same way as both `n` and `prop` missing!
-    (But it really should be an error)
+## Using an enumeration
 
-Would be better to break into three functions:
+One problem with the `grepl()` interface is that the `fixed` and `perl` arguments are actually used to pick from one of three engines for matching text:
 
--   `fct_lump_n(f, n)`
--   `fct_lump_prop(f, prop)`
--   `fct_lump_smallest(f)`
+-   The default is POSIX 1003.2 extended regular expressions.
+-   `perl = TRUE` uses Perl-style regular expressions.
+-   `fixed = TRUE` uses fixed matching.
 
-That has three advantages:
+This makes it more clear why using `perl = TRUE` and `fixed = TRUE` doesn't make sense: you're trying to pick two conflicting engines.
 
--   The name of function helps remind you of the purpose.
+An alternative interface that makes this choice more clear would be to use @sec-enumerate-options and create a new argument called something like `engine = c("POSIX", "perl", "fixed")`.
+This also has the nice feature of making it easier to extend in the future.
 
--   There's no way to supply both `n` and `prop`.
+This is more appealing than creating a separate function for each engine because there are many other functions in the same family as `grepl()`.
+If we created `grepl_fixed()`, we'd also need `gsub_fixed()`, `regexp_fixed()` etc.
 
--   The `ties.method` argument would only appear in `fct_lump_n()` and `_prop()`, not `_smallest()`.
+## Creating a details object
+
+Generating the bins for a histogram is a surprisingly complex topic.
+`stat_bin()`, which powers `geom_histogram()`, has a total of 5 arguments that control where the bins are placed: `binwidth`, `bins`[^arguments-independent-1], `boundary`, `breaks,` and `closed`. They have a complex set of interdependencies, which have a choose your own adventure feel.
+Firstly you can select been `breaks`, `bins`, and `binwidths`. Then if you pick `bins` or `binwidths`, you can also optionally selected `center` or `boundary`. `binwidth` can also be a function, in which case it's called individually on each layer.
+If we're going to clean up these arguments, it would also be nice to consider how you might supply a custom breaks for each layer (this would make it easier to implement an equal area histogram, which currently requires an custom stat, as in <https://github.com/eliocamp/ggpercentogram/>).
+
+[^arguments-independent-1]: `center` is also a little problematic as an argument name, because UK English would prefer `centre`.
+    It's probably ok here since this it's a very rarely used argument, but `middle` would be a reasonable alternative that doesn't have the same problem
+
+One way to resolve this tension would be to use a single argument that takes objects created by a helper functions, e.g.:
+
+-   `bins = bin_width(width, center, boundary)`
+-   `bins = bin_number(n, center, boundary)`
+-   `bins = bin_breaks(breaks)`
+
+(where `center` and `boundary` would be mutually exclusive, @sec-mutually-exclusive)
+
+This is a bit verbose for the most common case where you just want to set the width of the bins.
+You could automatically wrap a bare number in `bin_width()`, but `bins = 10` seems more likely to imply
+
+## See also
+
+See @sec-mutually-exclusive and @sec-compound-arguments for a two exceptions where the dependency is via specific patterns of missing arguments.
diff --git a/cs-stringr.qmd b/cs-stringr.qmd
@@ -9,22 +9,16 @@ source("common.R")
 <!-- 
 https://github.com/wch/r-source/blob/trunk/src/main/grep.c#L891-L1151 -->
 ```
-`grepl()`, has three arguments that take either `FALSE` or `TRUE`: `ignore.case`, `perl`, `fixed`, which might suggest that there are 2 \^ 3 = 16 possible invocations.
+## Pattern engines
+
+`grepl()`, has three arguments that take either `FALSE` or `TRUE`: `ignore.case`, `perl`, `fixed`, which might suggest that there are 2 \^ 8 = 16 possible options.
 However, a number of combinations are not allowed:
 
 ```{r}
 x <- grepl("a", letters, fixed = TRUE, ignore.case = TRUE)
 x <- grepl("a", letters, fixed = TRUE, perl = TRUE)
 ```
 
-Part of this problem could be resolved by making it more clear that this function provides three engines for matching text:
-
--   POSIX 1003.2 extended regular expressions (the default),
--   Perl-style regular expressions (`perl = TRUE`)
--   Fixed matching (`fixed = TRUE`).
-
-One way to make this more clear would be to use @sec-enumerate-options and create a new argument called something like `engine = c("POSIX", "perl", "fixed")`.
-
 The other problem is that `ignore.case` only works with two of the three engines: POSIX and perl.
 This is hard to remedy without creating a completely new matching engine for fixed case, which is particularly hard because different languages have different rules for case.