POC bin breaks derived from scale breaks #6174
I'm a bit wary of adding this kind of link between scales and stats tbh. It makes the code much harder to reason about. Is there anything in this you couldn't achieve with binned scale and |
From the linked issue:
|
Thanks, yes it's exactly that. A typical use case I'm working with is epidemic curves that are annotated with contextual events. To pick a random example online, something like this (but sometimes the date ranges involved are shorter, so snapping to the bin centre is a more significant change): The main data is a histogram with date bins that are either epidemiological weeks (i.e. weeks with the boundary fixed to a specific day of the week depending on local conventions) or calendar months/years. Annotations refer to events on a specific day or are time series that might be binned differently. A binned scale would force all layers to use the same bins. The |
Ah, I see... Sorry for driving by with a half-informed suggestion 🙈 |
I'm still very much against geoms/stats being able to control aspects of the scaling. This kind of flow of control could lead to incompatibility between layers etc (and make code harder to reason about) |
The change I proposed is the opposite: scales influence stats. I get that it stretches the abstractions a bit unnaturally, though. The status quo solution is to separately tell the scale and the stat to use the same bin/break width and offset. It's certainly doable when you understand the issue, but it tends to be brittle. So I suppose a compromise could be tweaks to the way breaks and bins are specified (especially with date scales) so that there's a more discoverable way to control them with the same syntax? I've seen a lot of cases where people worked out how to synchronise bin ( |
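For readers following along, here is a minimal sketch of the status quo workaround described in the comment above: one break vector is built by hand and passed to both the stat and the scale. The specific dates and the name `shared_breaks` are illustrative assumptions, not taken from the thread.

```r
library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

# One vector of fortnightly boundaries (illustrative dates), reused in both
# places so that bin edges and axis breaks cannot drift apart.
shared_breaks <- seq(as.Date("2023-12-04"), as.Date("2024-01-29"), by = "2 weeks")

p <- ggplot(df, aes(date)) +
  geom_histogram(breaks = shared_breaks) +  # bin edges for the stat
  scale_x_date(breaks = shared_breaks)      # the same values as axis breaks

# Inspect the computed bins without drawing the plot.
bins <- ggplot_build(p)$data[[1]]
```

The brittleness mentioned above comes from having to choose the start and end of `shared_breaks` by hand so that they cover the data.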
My reading skills are obviously impaired this morning 🙈 |
I've found a different way to get the result I'm after without needing any changes to the scale internals. Essentially, rather than the scales going out of their way to avoid updating breaks after the first time, the breaks function is memoised by wrapping it along with an environment. I previously thought that interior mutability in a function wouldn't be possible.
library(ggplot2)
#>
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:base':
#>
#> is.element
set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))
StatBin2 <- ggproto(
  "StatBin2", StatBin,
  compute_panel = function(self, data, scales, breaks = NULL, ...) {
    breaks <- breaks %||% scales$x$get_transformation()$inverse(scales$x$get_breaks())
    ggproto_parent(StatBin, self)$compute_panel(data, scales, breaks = breaks, ...)
  }
)
p <- ggplot(df, aes(date)) + geom_histogram(stat = StatBin2)
p + scale_x_date(breaks = scales::breaks_width("2 week"))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.
(this bit as before, with breaks undesirably recomputed after scales are expanded)
breaks_cached <- function(breaks) {
  ggplot2::ggproto(
    "BreaksCached", NULL,
    fn = breaks,
    cached = NULL,
    get_breaks = function(self, limits) {
      if (is.null(self$cached)) self$cached <- self$fn(limits)
      self$cached
    }
  )$get_breaks
}
p + scale_x_date(breaks = breaks_cached(scales::breaks_width("2 week")))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2024-11-06 with reprex v2.1.1
It's feasible for an extension to handle everything by providing its own version of each binning stat (which is a minor hassle to substitute in for the normal stats) along with Alternatively there could be a less invasive ggplot implementation with
breaks_cached <- function(f) {
  fn <- function(...) {
    if (is.null(.cache.env$cached)) {
      .cache.env$cached <- .cache.env$inner(...)
    }
    .cache.env$cached
  }
  e <- new.env()
  e$inner <- f
  e$cached <- NULL
  rlang::fn_env(fn)$.cache.env <- e
  fn
}
Does this approach feel better for ggplot? |
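To see the memoisation on its own, here is a minimal base R sketch of the same idea using a plain closure and `<<-`. The names `memoise_once` and `breaks_fn` are hypothetical and the numeric limits stand in for date limits; this is not the code proposed for ggplot2.

```r
# Wrap a breaks function so that it is evaluated only on its first call.
memoise_once <- function(f) {
  cached <- NULL
  function(...) {
    # `<<-` updates `cached` in the enclosing environment, giving the
    # returned function its own mutable state.
    if (is.null(cached)) cached <<- f(...)
    cached
  }
}

# Stand-in for a breaks function: a break every 14 units across the limits.
breaks_fn <- memoise_once(function(limits) seq(limits[1], limits[2], by = 14))

first  <- breaks_fn(c(0, 28))    # computed from the initial limits
second <- breaks_fn(c(-10, 50))  # expanded limits are ignored; cache is reused
```

Both calls return the breaks computed from the first set of limits, which is the behaviour needed to keep bins and breaks aligned after the scale is expanded.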
I agree with Thomas that anything that alters the state of the scale is undesirable. While caching breaks technically alters state, I feel there is not really a way ggplot2 could enforce this, even if it wanted to, because a breaks function is entirely up to users to specify. For this reason, I think it may be feasible to give |
This is a proof of concept of the minimal changes necessary to fix #6159. If you're willing to consider this approach I'll finish it off with documentation, tests, and the outstanding TODOs below.
The first part is essentially the same as the extension discussed in the issue: i.e. the `follow.scale` param on `stat_bin` causes it to inherit bins from the scale. As noted, that only works if the scale doesn't get new breaks during the final retraining, i.e. provide fixed breaks, or disable scale expansion and hope other layers don't cause issues. In this example the bins don't align with the final breaks because the scale expands after the binning, causing the breaks to move.
(TODO: add `follow.scale` to the other binning stats. Suppress the default binning warning when `follow.scale = TRUE`. Add a value like `follow.scales = "minor"` to allow inheriting major and minor breaks?)
The fix is to tell the scale that we want the breaks to be "frozen" before the stats are computed. Subsequent retraining is free to change the limits, which affects which breaks are shown, but once the breaks are frozen it acts as though they had been passed in as an explicit breaks vector.
(TODO: Add a param to the continuous scale constructor and `scale_{x,y}_{continuous,date,datetime}`. Maybe come up with a better name than freezing, like `breaks_computation = c("auto", "before_stat")`)
It also looks reasonable when there are multiple facets:
Adding a distant data point and setting the scales to free, we can see that the binning is done independently for different facets:
Created on 2024-11-01 with reprex v2.1.1
This seems probably desirable behaviour since we did explicitly request free scales here. Changing it to make the binning consistent across panels would also be a bit complicated because I think the facets clone the scales before the first time breaks are computed.
To make the combination of settings more discoverable, it's probably reasonable to add a warning when using `follow.scale` with a scale that doesn't have `freeze_breaks = TRUE`.
Please let me know if I've overlooked some way that these changes will cause problems with other parts of ggplot!
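As a rough illustration of the "frozen breaks" semantics described in this PR description, the following base R sketch models a toy scale that can be retrained after its breaks are frozen. Everything here (`new_toy_scale`, `train`, `freeze`, `get_breaks`) is a hypothetical stand-in and does not reflect the actual ggplot2 scale internals.

```r
# Toy scale: limits grow as data is trained, and breaks can be frozen once.
new_toy_scale <- function(breaks_fn) {
  state <- new.env()
  state$limits <- NULL
  state$frozen <- NULL
  list(
    # Retraining only ever widens the limits.
    train = function(x) state$limits <- range(c(state$limits, x)),
    # Freezing stores the breaks implied by the current limits.
    freeze = function() state$frozen <- breaks_fn(state$limits),
    # Frozen breaks behave like an explicit vector: later limit changes only
    # affect which of them fall inside the limits.
    get_breaks = function() {
      b <- if (is.null(state$frozen)) breaks_fn(state$limits) else state$frozen
      b[b >= state$limits[1] & b <= state$limits[2]]
    }
  )
}

every_seven <- function(limits) seq(limits[1], limits[2], by = 7)

frozen_scale <- new_toy_scale(every_seven)
frozen_scale$train(c(3, 24))
frozen_scale$freeze()           # breaks fixed before the "stat" runs
frozen_scale$train(c(0, 30))    # later expansion of the limits
frozen_breaks <- frozen_scale$get_breaks()

auto_scale <- new_toy_scale(every_seven)
auto_scale$train(c(3, 24))
auto_scale$train(c(0, 30))
auto_breaks <- auto_scale$get_breaks()  # recomputed from the expanded limits
```

In this toy model the frozen scale keeps the breaks that the bins were built from, while the unfrozen one shifts its breaks after expansion, which is the misalignment the PR aims to avoid.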