Skip to content

add fun_avg to ppc_avg functions #349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

add fun_avg to ppc_avg functions #349

wants to merge 8 commits into from

Conversation

tjmahr
Copy link
Collaborator

@tjmahr tjmahr commented May 13, 2025

Following #348, allow user to set the averaging function so that, e.g., user can choose median for heavy-tailed distributions.

l <- jsonlite::read_json(
  "https://gist.githubusercontent.com/tjmahr/3eba4f51e4b122b19417718b56423a6c/raw/31711cacef59eff69ed95eb51e2ae844eaad01b7/draws.json", 
  simplifyVector = TRUE
)

# Defaults to mean
bayesplot::ppc_error_scatter_avg_vs_x(l$y, l$yrep, l$x) +
  ggplot2::expand_limits(x = c(-10, 10))

bayesplot::ppc_error_scatter_avg_vs_x(l$y, l$yrep, l$x, fun_avg = median) + 
  ggplot2::expand_limits(x = c(-10, 10))

Created on 2025-05-13 with reprex v2.1.1

@codecov-commenter
Copy link

codecov-commenter commented May 13, 2025

Codecov Report

Attention: Patch coverage is 97.26027% with 2 lines in your changes missing coverage. Please review.

Project coverage is 98.58%. Comparing base (23e00b5) to head (4293eb8).

Files with missing lines Patch % Lines
R/ppc-errors.R 97.14% 1 Missing ⚠️
R/ppc-scatterplots.R 93.75% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #349      +/-   ##
==========================================
- Coverage   98.60%   98.58%   -0.03%     
==========================================
  Files          35       35              
  Lines        5539     5589      +50     
==========================================
+ Hits         5462     5510      +48     
- Misses         77       79       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@TeemuSailynoja TeemuSailynoja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fast reaction. This looks like a clean implementation to address the listed issue. My only comment is about the documentation, which could mention the expected format of fun_avg.

R/ppc-errors.R Outdated
Comment on lines 13 to 14
#' @param fun_avg Function to apply to compute the posterior average.
#' Defaults to `"mean"`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ppc_stat functions, we have a similar argument, by the name stat, which does a very similar job:

#' @param stat A single function or a string naming a function, except for the
#' 2D plot which requires a vector of exactly two names or functions. In all
#' cases the function(s) should take a vector input and return a scalar
#' statistic. If specified as a string (or strings) then the legend will
#' display the function name(s). If specified as a function (or functions)
#' then generic naming is used in the legend.

We could align this doc to read, for example:

#' @param fun_avg A function or a string naming a function for computing the
#' posterior average. In both cases, the function should take a vector input and
#' return a scalar statistic. If specified as a string, then the legend will
#' display the function name. If specified as a function
#' then generic naming is used in the legend.
#' Defaults to "mean".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Average y - y_rep axis label is not affected. I didn't want to make yrep_avg_label() and error_avg_label() depend on fun_avg or change the default "Average y - y_rep" labels.

It does affect the $rep_label in ppc_scatter_avg_data(y, yrep) when fun_avg is a string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, i'm just noticing Aki's comment. I'll switch to stat.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tjmahr. The code looks good. And I agree with changing to stat

Comment on lines 14 to 15
#' @param fun_avg Function to apply to compute the posterior average.
#' Defaults to `"mean"`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

@avehtari
Copy link
Member

I was also thinking stat instead of fun_avg.

It would be good to change the y-axis label to show the stat which was used, and also add parentheses. For example:
"median(y - y_rep)" or "y - median(y_rep)"

@kruschke
Copy link

kruschke commented May 14, 2025

Following up on @avehtari , also would be good to change the x-axis label to show the x variable name instead of just generic "$x$".

@tjmahr
Copy link
Collaborator Author

tjmahr commented May 14, 2025

  1. I have changed fun_avg to have the name stat.
  2. The stat name is shown on the axis label.
  3. I have made the stat argument handle functions given by strings, function objects, anonymous function objects, and quoted functions names.

Because the function name is used for both labeling and invoking a function name, I added as_tagged_function() that keeps track of the expression provided by the user.


library(bayesplot)
#> This is bayesplot version 1.12.0.9000
#> - Online documentation and vignettes at mc-stan.org/bayesplot
#> - bayesplot theme set to bayesplot::theme_default()
#>    * Does _not_ affect other ggplot2 plots
#>    * See ?bayesplot_theme_set for details on theme setting

l <- jsonlite::read_json(
  "https://gist.githubusercontent.com/tjmahr/3eba4f51e4b122b19417718b56423a6c/raw/31711cacef59eff69ed95eb51e2ae844eaad01b7/draws.json", 
  simplifyVector = TRUE
)
y <- l$y 
yrep <- l$yrep 
x <- l$x

stat is inlined into the axis label

ppc_scatter_avg(y, yrep, stat = "mean")

stat can be a function

ppc_scatter_avg(y, yrep, stat = median)

or a symbol

ppc_scatter_avg(y, yrep, stat = quote(median))

stat can be a primitive function

ppc_error_scatter_avg_vs_x(y, yrep, x, stat = min)

stat can be an anonymous function, but it gets a generic label

ppc_scatter_avg(y, yrep, stat = function(x) quantile(x, .1))

Created on 2025-05-14 with reprex v2.1.1

@TeemuSailynoja
Copy link
Collaborator

Following up on @avehtari , also would be good to change the x-axis label to show the x variable name instead of just generic "$x$".

If I'm not wrong, this is more challenging, as when passing x = data_frame$col_name, the information of the name of the column is lost, and only a numeric vector is passed to x.

@tjmahr
Copy link
Collaborator Author

tjmahr commented May 15, 2025

If I'm not wrong, this is more challenging, as when passing x = data_frame$col_name, the information of the name of the column is lost, and only a numeric vector is passed to x.

No, you can capture that. (plot() does that.) It's really easy in a single function setup.

library(rlang)
library(ggplot2)
f <- function(x, y) {
  qx <- enquo(x)
  qy <- enquo(y)
  data <- data.frame(x = x, y = y)
  ggplot(data) + 
    aes(x, y) + 
    geom_point() + 
    labs(
      x = as_label(quo_get_expr(qx)),
      y = as_label(quo_get_expr(qy))
    )
}

data <- data.frame(
  y = bayesplot::example_y_data(),
  x = bayesplot::example_x_data()
)
f(data$x, data$y)

Created on 2025-05-15 with reprex v2.1.1

The hard part for us is that when you call f(x = my_x, y = my_y) which internally calls g(x, y), then g() loses access to the user expression unless you take extra care along the way. (Or use quosures. I'm not sure yet.) We do that when we modify data and call some internal pp_() function to render the plot.

My update yesterday kept track of the expression for stat along the way but it was pretty tricky.

Edit: I figured out how to handle the inner function problem by using quosures/tunnelling.

@tjmahr tjmahr marked this pull request as draft May 15, 2025 19:23
@tjmahr
Copy link
Collaborator Author

tjmahr commented May 15, 2025

  1. Label anonymous functions with "stat"
  2. Support ~ style anonymous functions
  3. Simplify as_tagged_function() by using quosures

library(bayesplot)
#> This is bayesplot version 1.12.0.9000
#> - Online documentation and vignettes at mc-stan.org/bayesplot
#> - bayesplot theme set to bayesplot::theme_default()
#>    * Does _not_ affect other ggplot2 plots
#>    * See ?bayesplot_theme_set for details on theme setting

l <- jsonlite::read_json(
  "https://gist.githubusercontent.com/tjmahr/3eba4f51e4b122b19417718b56423a6c/raw/31711cacef59eff69ed95eb51e2ae844eaad01b7/draws.json", 
  simplifyVector = TRUE
)
y <- l$y 
yrep <- l$yrep 
x <- l$x

Function name

ppc_scatter_avg(y, yrep, stat = "mean")

ppc_error_scatter_avg(y, yrep, stat = "mean")

Function object

ppc_scatter_avg(y, yrep, stat = median)

ppc_error_scatter_avg(y, yrep, stat = median)

Primitive function

ppc_scatter_avg(y, yrep, x, stat = min)

ppc_error_scatter_avg_vs_x(y, yrep, x, stat = min)

Anonymous function

ppc_scatter_avg(y, yrep, x, stat = function(x) quantile(x, .1))

ppc_error_scatter_avg_vs_x(y, yrep, x, stat = function(x) quantile(x, .1))

Anonymous function (formulas)

ppc_scatter_avg(y, yrep, x, stat = ~ quantile(.x, .1))

ppc_error_scatter_avg_vs_x(y, yrep, x, stat = ~ quantile(.x, .1))

Created on 2025-05-15 with reprex v2.1.1

@tjmahr
Copy link
Collaborator Author

tjmahr commented May 16, 2025

This patch should be ready for review.

The expression for x now appears in the axis label.

library(bayesplot)
#> This is bayesplot version 1.12.0.9000
#> - Online documentation and vignettes at mc-stan.org/bayesplot
#> - bayesplot theme set to bayesplot::theme_default()
#>    * Does _not_ affect other ggplot2 plots
#>    * See ?bayesplot_theme_set for details on theme setting
l <- jsonlite::read_json(
  "https://gist.githubusercontent.com/tjmahr/3eba4f51e4b122b19417718b56423a6c/raw/31711cacef59eff69ed95eb51e2ae844eaad01b7/draws.json", 
  simplifyVector = TRUE
)

ppc_error_scatter_avg_vs_x(l$y, l$yrep, l$x, stat = "sd")

Created on 2025-05-16 with reprex v2.1.1

@tjmahr
Copy link
Collaborator Author

tjmahr commented May 19, 2025

@kruschke:

But the related thread regarding residuals #349 seems to suggest that residuals are computed as s t a t ( y − y r e p ) , with s t a t ( y r e p ) not separately, explicitly computed. Hmmm...?

Sorry if I'm late to the party, but shouldn't the residual label be $y - stat(y_{rep})$, not $stat(y - y_{rep})$? Isn't $y - stat(y_{rep})$ being computed internally, not $stat(y - y_{rep})$?

stat(y - y_rep) is what is computed internally.

@jgabry
Copy link
Member

jgabry commented May 20, 2025

@tjmahr sorry for the delay in reviewing this. Been a busy few days. Will try to get to it soon!

Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! I made a few minor comments/questions. Aside from those my only other question is does it makes sense to also use this new approach with the stat argument to ppc_stat()? You could do something similar to what you've done with the axis labels here but for the legend for ppc_stat(). What do you think?

@@ -26,7 +26,7 @@ URL: https://mc-stan.org/bayesplot/
BugReports: https://github.com/stan-dev/bayesplot/issues/
SystemRequirements: pandoc (>= 1.12.3), pandoc-citeproc
Depends:
R (>= 3.1.0)
R (>= 4.1.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point it's been out long enough that bumping the required R version is fine.

#' containing the expression to represent the function name and an
#' `"is_anonymous_function"` attribute to flag if the expression is a call to
#' `function()`.
as_tagged_function <- function(f = NULL, fallback = "func") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function works as described, just one question: how come sometimes you call rlang functions with rlang::foo() and other times just foo()? I think either way works since we have @import rlang in bayesplot-package.R, just curious if your choices were intentional.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Probably an artifact of me using typing rlang::[TAB] to use autocomplete to find the right rlang function.

#' draws (rows).
#' the points `(x,y) = (average(yrep[, n]), y[n])`, where each `yrep[, n]` is
#' a vector of length equal to the number of posterior draws and `average()`
#' is summary statistic. Unlike for `ppc_scatter()`, for `ppc_scatter_avg()`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' is summary statistic. Unlike for `ppc_scatter()`, for `ppc_scatter_avg()`
#' is a summary statistic. Unlike for `ppc_scatter()`, for `ppc_scatter_avg()`

Comment on lines +209 to +211
levels(data$rep_label) <- yrep_avg_label(stat) |>
as.expression() |>
as.character()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me to use the base R pipe, but we do import %>% from dplyr so could also use that.

@jgabry
Copy link
Member

jgabry commented May 22, 2025

@tjmahr Also see @avehtari's comment here #350 (comment), which supports @kruschke's suggestion.

When we were always using the mean it didn't matter if we were computing y - stat(y_rep) or stat(y - y_rep). But if we're going to allow arbitrary functions then the two quantities won't necessarily be the same and users will likely expect us to be computing y - stat(y_rep), i.e. y minus a point prediction.

Then there's the separate question of what to plot on x and y axes for ppc_error_scatter_avg. @kruschke is right that standard residual plots would put y - stat(y_rep) on the y-axis and stat(y_rep) on the x-axis. If we switch to doing that do you think that's too big of a change? That is, if we do that should we change the name of the function and deprecate the old version? Some other option?

@tjmahr
Copy link
Collaborator Author

tjmahr commented May 22, 2025

But if we're going to allow arbitrary functions then the two quantities won't necessarily be the same and users will likely expect us to be computing y - stat(y_rep), i.e. y minus a point prediction.

My thinking was stat(y - y_rep) is an average error, and y - stat(y_rep) is the error of the average prediction, so the former better matches the function names and how I think about ppc functions.

I don't oppose the change, and updating the axis label will make it unambiguous what's being computed.

@jgabry
Copy link
Member

jgabry commented May 22, 2025

Yeah I have mixed feelings about this. Both feel intuitive to me but represent slightly different things.

A couple of options (there are probably others):

  1. Change ppc_error_scatter_avg to do y - stat(y_rep) on the y-axis and stat(y_rep) on the x-axis. This is what @kruschke was expecting and @avehtari agreed with, so it seems like many users would probably expect this.

  2. Keep ppc_error_scatter_avg as you have it and add a new one that does y - stat(y_rep) on the y-axis and stat(y_rep) on the x-axis. We could call it ppc_residual() or something like that. I think with the right documentation having both would probably be ok.

Do you have a preference? Or a 3rd option?

@avehtari
Copy link
Member

I think it would be better to make a new function, to not break anyone's current plots. I was also thinking of name something like ppc_residual(). @TeemuSailynoja has also new binned residual plot with PAVA, which could then be ppc_residual_binned() and we would not change the current ppc_error_binned()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ppc_error_scatter_avg_vs_x, has unstable residuals when the noise distribution has heavy tails; needs median not mean?
6 participants