Mortuary: geom_histogram() binning strategy #173

Open
hadley opened this issue Oct 24, 2023 · 0 comments

Sometimes the strategy will be tangled in with many other arguments, or there might be multiple strategies used simultaneously.
In these situations you want to avoid creating a combinatorial explosion of functions, and instead might want to use a strategy object.

For example, generating the bins for a histogram is a surprisingly complex topic.
ggplot2::stat_bin(), which powers ggplot2::geom_histogram(), has a total of six arguments that control where the bins are placed:

  • You can supply either binwidth or bins to specify either the width or the number of evenly spaced bins. Alternatively, you can supply breaks to specify the exact bin locations yourself (which allows you to create unevenly sized bins [1]).
  • If you use binwidth or bins, you're specifying the size of the bins, but not where they start. So additionally you can use either boundary or center [2] to specify the location of a side (boundary) or the middle (center) of a bin [3]. boundary and center are mutually exclusive; you can only specify one (see @sec-mutually-exclusive for more).
  • Regardless of how you specify the locations of the bins, you need to choose whether a bin from a to b is [a, b) or (a, b], which is the job of the closed argument.
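
To make the last point concrete, here is a small illustration of the two closure choices. It uses base R's cut() and its right argument as an analogy; it is not how stat_bin() is implemented, and the data values are made up for the example.

```r
# Three points that sit exactly on the bin edges
x <- c(0, 5, 10)
breaks <- c(0, 5, 10)

# Bins closed on the left: [a, b)
left <- cut(x, breaks, right = FALSE)
# Bins closed on the right: (a, b]
right <- cut(x, breaks, right = TRUE)

as.character(left)   # 5 falls in [5,10); 10 falls outside every bin
as.character(right)  # 5 falls in (0,5]; 0 falls outside every bin
```

Points that land exactly on an edge move from one bin to its neighbour when the closure changes, which is why the choice has to be made no matter how the breaks were specified.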

One way to resolve this problem would be to encapsulate the three basic strategies into three functions:

  • bin_width(width, center, boundary, closed)
  • bin_number(bins, center, boundary, closed)
  • bin_breaks(breaks, closed)

That immediately makes the relationship between the arguments and the strategies clearer.

Note that these functions create "strategies"; i.e. they don't take the data needed to actually perform the operation: none of these functions takes the range of the data.
This makes these functions function factories, which is a relatively complex technique.

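bin_width <- function(width, center = NULL, boundary = NULL, closed = c("left", "right")) {
  # center and boundary default to NULL so that forcing them below does not
  # fail when only one of the pair is supplied
  closed <- match.arg(closed)
  # Force the arguments now, because the returned function runs later:
  # https://adv-r.hadley.nz/function-factories.html#forcing-evaluation
  list(width, center, boundary, closed)
  
  function(range) {
    # compute the bins for the given range of the data
  }
}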
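
One way the inner function might be filled in is sketched below. This is an assumption-laden illustration, not the algorithm used by stat_bin(): the default boundary of 0, the returned list with breaks and closed, and the handling of edges are all choices made for the example.

```r
# Hypothetical completion of bin_width(); not ggplot2's implementation.
bin_width <- function(width, center = NULL, boundary = NULL,
                      closed = c("left", "right")) {
  closed <- match.arg(closed)
  if (!is.null(center) && !is.null(boundary)) {
    stop("Supply only one of `center` and `boundary`.", call. = FALSE)
  }
  # A bin centered at `center` has an edge at `center - width / 2`
  if (!is.null(center)) {
    boundary <- center - width / 2
  } else if (is.null(boundary)) {
    boundary <- 0  # assumed default for this sketch
  }
  force(width)  # boundary and closed were already evaluated above

  function(range) {
    # Move to the bin edge at or just below the minimum of the data
    start <- boundary + floor((range[1] - boundary) / width) * width
    # Enough bins to reach the maximum (at least one)
    n <- max(ceiling((range[2] - start) / width), 1)
    # Simplification: a maximum that lands exactly on the last edge is not
    # given an extra bin when closed = "left"
    list(breaks = start + width * seq(0, n), closed = closed)
  }
}

by_boundary <- bin_width(10, boundary = 0)
by_center <- bin_width(10, center = 0)

by_boundary(c(3, 27))$breaks
by_center(c(3, 27))$breaks
```

The strategy is created once, without any data, and only later applied to a range; the same strategy object can be reused for several data sets.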

As in @sec-argument-clutter, you may want to give these functions custom classes so that the function that uses them can provide better error messages if the user supplies the wrong type of object.
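
A minimal sketch of that idea follows. The class name bin_strategy and the helpers new_bin_strategy() and compute_bins() are hypothetical names invented for this example, and bin_breaks() is given a simple body only so that the sketch runs on its own.

```r
# Hypothetical constructor: tag a strategy function with a class
new_bin_strategy <- function(fun) {
  structure(fun, class = c("bin_strategy", "function"))
}

bin_breaks <- function(breaks, closed = c("left", "right")) {
  closed <- match.arg(closed)
  breaks <- sort(breaks)  # also forces evaluation of `breaks`
  new_bin_strategy(function(range) list(breaks = breaks, closed = closed))
}

# Hypothetical consumer: checks the class before using the strategy
compute_bins <- function(x, strategy) {
  if (!inherits(strategy, "bin_strategy")) {
    stop(
      "`strategy` must be created by bin_width(), bin_number(), or bin_breaks().",
      call. = FALSE
    )
  }
  strategy(range(x))
}

bins <- compute_bins(c(1, 5, 9), bin_breaks(c(10, 0, 5)))
bins$breaks
```

With the class in place, passing a bare number such as compute_bins(x, 10) fails immediately with a message that names the helper functions, rather than failing later with an obscure error.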

Alternatively, you might want to just check that the input is a function with the correct formals; that allows the user to supply their own strategy function.
It's probably something that few people will take advantage of, but it's a nice escape hatch.
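
A check along those lines might look like the sketch below. The name check_bin_strategy() is hypothetical, and requiring exactly one argument called range is an assumption made for the example; a real interface might be more permissive.

```r
# Hypothetical check: accept any function whose only argument is `range`
check_bin_strategy <- function(strategy) {
  if (!is.function(strategy)) {
    stop("`strategy` must be a function.", call. = FALSE)
  }
  args <- names(formals(strategy))
  if (!identical(args, "range")) {
    stop("`strategy` must have a single argument called `range`.", call. = FALSE)
  }
  invisible(strategy)
}

# A user-supplied strategy: let pretty() pick round break points
my_strategy <- function(range) pretty(range, n = 10)

check_bin_strategy(my_strategy)   # passes silently
ok <- is.function(check_bin_strategy(my_strategy))
bad <- tryCatch(check_bin_strategy(function(x) x), error = function(e) e)
```

Checking the formals is looser than checking a class: it cannot tell whether the function returns sensible bins, only that it can be called the way the caller expects.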

Footnotes

  1. One nice application of this principle is to create a histogram where each bin contains (approximately) the same number of points, as implemented in https://github.com/eliocamp/ggpercentogram/.

  2. center is also a little problematic as an argument name, because UK English would prefer centre.
    It's probably ok here since it's a very rarely used argument, but middle would be a good alternative that doesn't have the same US/UK problem.
    Alternatively the pair could be endpoint and midpoint, which perhaps suggests a tighter pairing than center and boundary.

  3. It can be any bin; stat_bin() will automatically adjust all the other bins.
