Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to multimodal HDI #28

Merged
merged 25 commits into from
Oct 26, 2024
Merged

Improvements to multimodal HDI #28

merged 25 commits into from
Oct 26, 2024

Conversation

sethaxen
Copy link
Member

@sethaxen sethaxen commented Oct 11, 2024

This PR implements the improvements to HDI suggested in arviz-devs/arviz#2394, with a few differences:

  • Supporting passing an array of HDI probabilities is left for a future PR.
  • Circular support for continuous multimodal HDI is included.
  • When more intervals than max_modes are computed, now the max_modes highest probability intervals are returned instead of just the ones that are lowest on the real line.

📚 Documentation preview 📚: https://arviz-stats--28.org.readthedocs.build/en/28/

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the improvements, thanks!

Completely unrelated but it would also be interesting to know how hard it was to navigate the codebase in the current stats of docs tending to zero. Both so we can prioritize which docs and tests to write first and maybe rethink some design choices.

@sethaxen
Copy link
Member Author

Oh one thing I left out but will add is the ability to restrict the bounds to points actually in the sample. There are 2 ways to do this, via trimming or interpolation. As I shared on Slack, experiments show that interpolation produces better estimates than trimming and for moderate to large sample sizes (n>O(100)) is better than KDE-based estimates. The latter makes me wonder if this should be the default, but I hesitate mostly because I haven't seen it discussed in the litrature.

I'm not certain what keyword to use to allow the user to configure this behavior. BTW, what are the names mode='nearest' and mode='agg_nearest' meant to convey?

@OriolAbril
Copy link
Member

I'm not certain what keyword to use to allow the user to configure this behavior. BTW, what are the names mode='nearest' and mode='agg_nearest' meant to convey?

IIRC, there were two implementations of hdi in regular arviz. One that used the raw samples, looking at how many k samples need to be included to cover ci_prob then generate all intervals between a sample and k samples ahead. That is nearest. The other used kde/hdi first then very similar approach but with kde/hist results, so closer to what the multimodal is doing. That is agg_nearest.

@sethaxen
Copy link
Member Author

IIRC, there were two implementations of hdi in regular arviz. One that used the raw samples, looking at how many k samples need to be included to cover ci_prob then generate all intervals between a sample and k samples ahead. That is nearest.

As far as I can tell from reading the regular arviz code, the two methods that are supported are 1) the same as "nearest" here and 2) the same as "multimodal" before this PR. "agg_nearest" does not seem to be supported.

I was more wondering why these names were chosen. e.g. what is "nearest" near to? The original draws? If so then I think "contiguous" is a more accurate name, since that's really the constraint applied by this method.

The other used kde/hdi first then very similar approach but with kde/hist results, so closer to what the multimodal is doing. That is agg_nearest.

agg_nearest seems to be not so similar to nearest at all. It will only be similar when multimodal would return a single interval. Otherwise, it also includes all the intervals between the HDI intervals and can contain much more probability than the requested "hdi_prob". Personally, I don't see a use for agg_nearest; I see it as being similar to reaching for a CDF estimated from a KDE instead of an ECDF. I think the smoothing would in general only make the estimate worse. If something similar to nearest that forces contiguity but uses "bin centers" is really what's desired, a JIT-compiled sliding window approach would be quite efficient:

import numpy as np
import numba

@numba.jit
def hdi_contiguous_weighted(bins, bin_probs, prob):
    n = len(bins)
    is_discrete = bins.dtype.kind != 'f'

    cum_probs = np.cumsum(bin_probs)
    bins_diff = np.diff(bins)

    i_lower = 0
    i_upper = np.searchsorted(cum_probs, prob, side="left")
    interval_width = bins[i_upper] - bins[i_lower] + is_discrete
    min_interval_width = interval_width
    interval_prob = cum_probs[i_upper]
    interval = np.array([i_lower, i_upper])
    while i_upper < n - 1:
        # increase lower bound until interval is invalid
        while interval_prob >= prob and i_lower <= i_upper:
            if interval_width < min_interval_width:
                interval[:] = (i_lower, i_upper)
                min_interval_width = interval_width
            interval_prob -= bin_probs[i_lower]
            interval_width -= bins_diff[i_lower]
            i_lower += 1

        # increase upper bound until interval is valid again
        while interval_prob < prob and i_upper < n - 1:
            interval_width += bins_diff[i_upper]
            i_upper += 1
            interval_prob += bin_probs[i_upper]

    return bins[interval]

@OriolAbril
Copy link
Member

Let's remove agg_nearest then, not sure where I took it from if it is not in current ArviZ. I didn't think too much about the names either. To rename nearest numpy also has a closest_observation method for np.quantile which is probably better than nearest, I like contiguous too.

This was referenced Oct 16, 2024
@OriolAbril
Copy link
Member

I think the only thing left for merging is the api and behaviour for bins in multimodal hdi for discrete data

@sethaxen
Copy link
Member Author

I've added multimodal_nearest method, which returns multimodal HDIs where the bounds come from the sample. I don't think it's the best name, as naively I would assume this means "compute the multimodal HDI using a KDE and then snap the bounds to the nearest sample points," which is not quite what this does. What this does is compute the densities at the sample points and use that to rank the points to find the HDI bounds.

@sethaxen
Copy link
Member Author

I think multimodal_sample might be a better name than multimodal_nearest, as it's not so much returning the nearest point in the bounds but rather selecting the bounds from the sample.

@aloctavodia
Copy link
Contributor

multimodal_sample sounds good to me.

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed method and opened an issue for the unimodal version so we can merge

@sethaxen
Copy link
Member Author

Thanks, @OriolAbril!

@sethaxen sethaxen merged commit c5a4f5f into main Oct 26, 2024
4 checks passed
@sethaxen sethaxen deleted the hdi_improvements branch October 26, 2024 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants