arguments-independent.qmd

# Keep arguments independent {#sec-arguments-independent}

```{r}
#| include = FALSE
source("common.R")
```

## What's the problem?

Avoid complex patterns of dependencies between arguments so that only certain combinations are permitted.

Dependencies between arguments makes functions harder:

-   It can suggest that there are many more viable combination of inputs than actually exist and those unnecessary possibilities still occupy head space.
    You have to learn and the remember the set of allowed combinations, rather than them being implied by the structure of the function.

-   Interdependence of arguments suggests complex implementation paths which are harder to analyse and test.

-   It makes documentation harder to write.
    You have to use extra words to explain exactly how combinations of arguments work together, and it's not obvious where those words should go.
    If there's an interaction between `arg_a` and `arg_b` do you document with `arg_a`, with `arg_b`, or with both?

## What are some examples?

-   `forcats::fct_lump()` decides which algorithm to use based on a combination of the `n` and `prop` arguments.

-   In `ggplot2::geom_histogram()`, you can specify the histogram breaks in three ways: as a number of `bins`, as the width of each bin (`binwidth`, plus `center` or `boundary`), or the exact `breaks`.
    You can only pick one of the three options, which is hard to convey in the documentation.
    There's also an implied precedence so that if more than one option is supplied, one will silently win.

-   In `readr::locale()` there's a complex dependency between `decimal_mark` and `grouping_mark` because they can't be the same value, and Europe and the US Europe use different standards.

-   `grepl()` has `perl`, `fixed`, and `ignore.case` arguments which can either be `TRUE` or `FALSE`.
    If these arguments were independent that would imply 2\^3 = 8 possible combinations.
    But `fixed = TRUE` overrides `perl = TRUE`, and `ignore.case = TRUE` only works if `fixed = FALSE` so there are only 5 valid combinations.

-   In `rep()` you can supply both `times` and `each` *unless* `times` is a vector :

    ```{r}
    #| error = TRUE
    rep(1:3, times = 2, each = 3)
    rep(1:3, times = 1:3, each = 2)
    ```

    Learn more in @sec-cs-rep.

(Other examples for you to explore: `na.rm` and `use` in `var()`.
Why does this arise?)

## How do I remediate past mistakes?

Often these problems arise because the scope of a function grows over time.
The scope of a function was small when it was initially designed but it has grown incrementally over time.
At no point did it seem worth the additional effort to refactor to a new design, but now you have a large complex function.
This makes the problem hard to avoid.

To remediate the problem, you'll need to think holistically and reconsider the complete interface.
There are two common outcomes which are illustrated in the case studies below:

-   Splitting the function into multiple functions that each do one thing.

-   Encapsulating related details arguments into a single object.

If these changes to the interface occur to exported functions in a package, you'll need to consider how to preserve the interface with deprecation warnings.
For important functions, it is worth generating an message that includes new code to copy and paste.

### Splitting into multiple functions {#sec-cs-fct-lump}

The goal of `fct_lump()` is to combine infrequent factor levels into a common "other" level, which is useful for displays where you want to concentrate on the most common values but still account for every observation.
When I first wrote `fct_lump()`, it implemented a single strategy.
But over time people asked for more and more variations, which I kept adding to `fct_lump()`.
This lead to a function that picks from one of three different strategies depending which of the `n` and `prop` arguments you supply:

-   If `n` and `prop` are missing, it will merge together the least frequent levels, ensuring that `other` is still the smallest level.
    This case ignores the `ties.method` argument, adding another dependency between arguments.

-   If a positive `n` is supplied, it preserves the `n` most common values; if a negative `n` is supplied it preserves the `n` least common values.

-   If a positive `prop` is supplied, lumps values which do not appear at least `prop` of the time.
    Negative `prop` lumps values that do not appear at most `-prop` of the time.

Overall, this become very hard to explain in the documentation, so in forcats 0.5.0 we split `fct_lump()` into three separate functions: `fct_lump_prop()`, `fct_lump_n()`, and `fct_lump_lowfreq()`.
This allows the function name to hint at the purpose, prevents you from supplying both `n` and `prop` through the design of the functions, and only has the `ties.method` argument where it makes sense.

## Using an enumeration

One problem with the `grepl()` interface is that the `fixed` and `perl` arguments are actually used to pick from one of three engines for matching text:

-   The default is POSIX 1003.2 extended regular expressions.
-   `perl = TRUE` uses Perl-style regular expressions.
-   `fixed = TRUE` uses fixed matching.

This makes it more clear why using `perl = TRUE` and `fixed = TRUE` doesn't make sense: you're trying to pick two conflicting engines.

An alternative interface that makes this choice more clear would be to use @sec-enumerate-options and create a new argument called something like `engine = c("POSIX", "perl", "fixed")`.
This also has the nice feature of making it easier to extend in the future.

This is more appealing than creating a separate function for each engine because there are many other functions in the same family as `grepl()`.
If we created `grepl_fixed()`, we'd also need `gsub_fixed()`, `regexp_fixed()` etc.

## Creating a details object

Generating the bins for a histogram is a surprisingly complex topic.
`stat_bin()`, which powers `geom_histogram()`, has a total of 5 arguments that control where the bins are placed: `binwidth`, `bins`[^arguments-independent-1], `boundary`, `breaks,` and `closed`. They have a complex set of interdependencies, which have a choose your own adventure feel.
Firstly you can select been `breaks`, `bins`, and `binwidths`. Then if you pick `bins` or `binwidths`, you can also optionally selected `center` or `boundary`. `binwidth` can also be a function, in which case it's called individually on each layer.
If we're going to clean up these arguments, it would also be nice to consider how you might supply a custom breaks for each layer (this would make it easier to implement an equal area histogram, which currently requires an custom stat, as in <https://github.com/eliocamp/ggpercentogram/>).

[^arguments-independent-1]: `center` is also a little problematic as an argument name, because UK English would prefer `centre`.
    It's probably ok here since this it's a very rarely used argument, but `middle` would be a reasonable alternative that doesn't have the same problem

One way to resolve this tension would be to use a single argument that takes objects created by a helper functions, e.g.:

-   `bins = bin_width(width, center, boundary)`
-   `bins = bin_number(n, center, boundary)`
-   `bins = bin_breaks(breaks)`

(where `center` and `boundary` would be mutually exclusive, @sec-mutually-exclusive)

This is a bit verbose for the most common case where you just want to set the width of the bins.
You could automatically wrap a bare number in `bin_width()`, but `bins = 10` seems more likely to imply

## See also

See @sec-mutually-exclusive and @sec-compound-arguments for a two exceptions where the dependency is via specific patterns of missing arguments.