Add KDE bandwidth selectors using biased or unbiased cross-validation #2384
base: main
Conversation
Remaining pylint errors seem to be in code not touched in this PR.
I re-made the plots in #2384 (comment) using the explicit algorithm in the reference (i.e. without histogram binning) to see if the results were very different, and for all but log-normal, the bandwidth from the explicit algorithm is indistinguishable from the one implemented here. In that case, explicit UCV looks like BCV.
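For reference, here is a minimal sketch (my own, not the PR's code) of that explicit, unbinned unbiased-CV criterion with a Gaussian kernel, i.e. minimizing LSCV(h) = ∫ f̂_h² − (2/n) Σ_i f̂_{h,−i}(x_i) directly over the samples; the robust scale and search bounds are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar


def ucv_objective(h, x):
    """Textbook least-squares (unbiased) CV score for a Gaussian-kernel KDE."""
    n = x.size
    diff = (x[:, None] - x[None, :]) / h  # pairwise scaled differences
    # integral of fhat**2 for a Gaussian kernel: (1 / (n**2 h)) * sum_ij (K*K)(diff),
    # where K*K is the N(0, 2) density
    int_fhat_sq = np.exp(-0.25 * diff**2).sum() / (2 * np.sqrt(np.pi) * n**2 * h)
    # leave-one-out term: (2 / n) * sum_i fhat_{-i}(x_i)
    kern = np.exp(-0.5 * diff**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(kern, 0.0)  # drop the j == i contributions
    loo = 2 * kern.sum() / (n * (n - 1) * h)
    return int_fhat_sq - loo


def ucv_bandwidth(x):
    x = np.asarray(x, dtype=float)
    # heuristic search interval around a robust scale estimate (illustrative choice)
    scale = min(x.std(ddof=1), (np.quantile(x, 0.75) - np.quantile(x, 0.25)) / 1.34)
    res = minimize_scalar(
        ucv_objective, bounds=(0.01 * scale, 2 * scale), args=(x,), method="bounded"
    )
    return res.x


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])  # bimodal draws
print(ucv_bandwidth(x))
```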
It seems UCV and BCV can be quite noisy compared to "default" except when the distribution is normal-like..., so their main use case could be multimodal densities? Did you compare against …?
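One quick way to probe that noisiness claim (again just a sketch of my own, reusing `ucv_bandwidth` from the snippet above) is to compare the spread of the selected bandwidth over repeated standard-normal draws against a normal-reference rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(1)
ucv_bws, rot_bws = [], []
for _ in range(50):
    x = rng.normal(size=500)
    ucv_bws.append(ucv_bandwidth(x))                       # from the sketch above
    rot_bws.append(1.06 * x.std(ddof=1) * x.size ** -0.2)  # normal-reference rule of thumb
print("UCV spread:", np.std(ucv_bws), "rule-of-thumb spread:", np.std(rot_bws))
```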
```diff
 }

-def _get_bw(x, bw, grid_counts=None, x_std=None, x_range=None):
+def _get_bw(x, bw, grid_counts=None, bin_width=None, x_std=None, x_range=None):  # pylint: disable=too-many-positional-arguments
```
maybe we can disable "too-many..." globally. It's popping up in many places.
I suspect it's more that the UCV and BCV criteria don't use any boundary correction. When the distribution is bounded with high density near the bounds, for large bandwidths the mean integrated squared error (MISE) is very high near the bound, while smaller bandwidths allow a sharper transition from low density on the wrong side of the bound to high density on the right side. This explains why the bw that minimizes this objective produces smooth KDEs for normal, beta, t, and GMM, while for exponential and uniform it looks undersmoothed (log-normal looking weird is due to histogram binning, not the objective itself). I've worked out the equivalent objective for boundary correction using reflection. Reflection is not technically equivalent to what we do, but I suspect it would produce similar results. I'm working out efficient evaluation of the objective now.
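For concreteness, here is a rough brute-force sketch (my own, not the efficient evaluation mentioned above) of how reflection about a single lower bound `a` could enter the CV objective, evaluating LSCV(h) = ∫ₐ^∞ f̂_h² dx − (2/n) Σ_i f̂_{h,−i}(x_i) with the reflected estimator:

```python
import numpy as np


def reflected_kde(grid, data, h, a):
    """Gaussian KDE with reflection about the lower bound a, evaluated at grid."""
    m = data.size
    u = (grid[:, None] - data[None, :]) / h
    v = (grid[:, None] - (2 * a - data)[None, :]) / h  # kernels centred on the mirror images
    return (np.exp(-0.5 * u**2) + np.exp(-0.5 * v**2)).sum(axis=1) / (m * h * np.sqrt(2 * np.pi))


def ucv_objective_reflected(h, x, a):
    n = x.size
    grid = np.linspace(a, x.max() + 5 * h, 2048)
    dx = grid[1] - grid[0]
    int_sq = (reflected_kde(grid, x, h, a) ** 2).sum() * dx  # Riemann sum for the squared-density integral
    loo = np.array(  # leave-one-out density at each observation
        [reflected_kde(x[i : i + 1], np.delete(x, i), h, a)[0] for i in range(n)]
    )
    return int_sq - 2 * loo.mean()
```

Minimizing this over h (e.g. with scipy.optimize.minimize_scalar as in the earlier sketch) should give a boundary-aware analogue of the UCV bandwidth, at O(n²) cost per evaluation.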
Not directly, but I will. I also located the source for how the R stats module selects the UCV- and BCV-based bandwidths, and it seems to use an approach identical to this PR's, except that the lower bound for the optimization is set to …
I did some informal comparisons with …
Sounds good to me. We may want to add a note to the docstring with suggestions about when to switch to isj. I will also add a note here: https://github.com/arviz-devs/Exploratory-Analysis-of-Bayesian-Models (unless someone wants to do it first).
Description
As discussed on Slack, the existing bandwidth selection methods for kde oversmooth for draws from multimodal distributions with well-separated modes. This PR adds new bw options "ucv" and "bcv", which use unbiased or biased leave-one-out cross-validation (LOO-CV) to select the bandwidth.

Example
Checklist
- Follows official PR format?
- Changes are listed in the relevant section of the changelog?
📚 Documentation preview 📚: https://arviz--2384.org.readthedocs.build/en/2384/