Improvements to multimodal HDI #28

sethaxen · 2024-10-11T12:58:44Z

This PR implements the improvements to HDI suggested in arviz-devs/arviz#2394, with a few differences:

Support for passing an array of HDI probabilities is left for a future PR.
Circular support for continuous multimodal HDI is included.
When more intervals than max_modes are computed, the max_modes highest-probability intervals are now returned, instead of just the ones that are lowest on the real line.
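
To illustrate the last point, here is a minimal sketch in plain Python. It is not the code in this PR: the helper name `keep_highest_prob_intervals` and the example numbers are hypothetical, and it assumes the intervals are already sorted along the real line with one probability per interval.

```python
def keep_highest_prob_intervals(intervals, interval_probs, max_modes):
    """Keep the `max_modes` intervals that contain the most probability.

    Hypothetical helper for illustration only. Assumes `intervals` is sorted
    along the real line and `interval_probs[i]` is the mass of `intervals[i]`.
    """
    # rank interval indices from most to least probable
    ranked = sorted(range(len(intervals)), key=lambda i: interval_probs[i], reverse=True)
    # keep the top `max_modes` and restore left-to-right ordering
    kept = sorted(ranked[:max_modes])
    return [intervals[i] for i in kept]


intervals = [(-3.0, -2.0), (-0.5, 0.5), (2.0, 3.5)]
interval_probs = [0.05, 0.5, 0.35]
selected = keep_highest_prob_intervals(intervals, interval_probs, max_modes=2)
```

With these made-up numbers, keeping the two lowest intervals on the real line would include the low-mass interval (-3.0, -2.0), whereas ranking by probability keeps (-0.5, 0.5) and (2.0, 3.5).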

📚 Documentation preview 📚: https://arviz-stats--28.org.readthedocs.build/en/28/


OriolAbril

I love the improvements, thanks!

Completely unrelated, but it would also be interesting to know how hard it was to navigate the codebase in its current state, with docs tending to zero. That would help us prioritize which docs and tests to write first, and maybe rethink some design choices.


sethaxen · 2024-10-12T22:40:45Z

Oh, one thing I left out but will add is the ability to restrict the bounds to points actually in the sample. There are 2 ways to do this: trimming or interpolation. As I shared on Slack, experiments show that interpolation produces better estimates than trimming and, for moderate to large sample sizes (n>O(100)), is better than KDE-based estimates. The latter makes me wonder if this should be the default, but I hesitate mostly because I haven't seen it discussed in the literature.

I'm not certain what keyword to use to allow the user to configure this behavior. BTW, what are the names mode='nearest' and mode='agg_nearest' meant to convey?

OriolAbril · 2024-10-15T13:34:16Z

> I'm not certain what keyword to use to allow the user to configure this behavior. BTW, what are the names mode='nearest' and mode='agg_nearest' meant to convey?

IIRC, there were two implementations of hdi in regular arviz. One used the raw samples: it works out how many samples k need to be included to cover ci_prob, then generates all intervals between a sample and the one k samples ahead. That is nearest. The other first computed a kde/hist and then took a very similar approach but on the kde/hist results, so it is closer to what the multimodal method is doing. That is agg_nearest.
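
A rough sketch of the raw-sample approach described above, in plain Python. This is not the arviz implementation: the name `hdi_nearest_sketch` is hypothetical, taking k = floor(ci_prob * n) is an assumed convention, and the final step of picking the narrowest candidate is the usual way to finish the search rather than something stated in this thread.

```python
import math


def hdi_nearest_sketch(draws, ci_prob):
    """Narrowest interval between a sorted draw and the draw k positions ahead.

    Hypothetical sketch; k = floor(ci_prob * n) is an assumed convention.
    """
    x = sorted(draws)
    n = len(x)
    # number of positions to look ahead so the window covers about ci_prob
    k = min(math.floor(ci_prob * n), n - 1)
    best_lower, best_upper = x[0], x[k]
    # slide the window across the sorted draws and keep the narrowest one
    for i in range(1, n - k):
        if x[i + k] - x[i] < best_upper - best_lower:
            best_lower, best_upper = x[i], x[i + k]
    return best_lower, best_upper


draws = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 9.0, 10.0]
interval = hdi_nearest_sketch(draws, ci_prob=0.5)
```

Every candidate interval spans a contiguous run of sorted draws, so the result is always a single interval whose bounds are sample points.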

sethaxen · 2024-10-15T16:47:09Z

> IIRC, there were two implementations of hdi in regular arviz. One used the raw samples: it works out how many samples k need to be included to cover ci_prob, then generates all intervals between a sample and the one k samples ahead. That is nearest.

As far as I can tell from reading the regular arviz code, the two methods that are supported are 1) the same as "nearest" here and 2) the same as "multimodal" before this PR. "agg_nearest" does not seem to be supported.

I was more wondering why these names were chosen. e.g. what is "nearest" near to? The original draws? If so then I think "contiguous" is a more accurate name, since that's really the constraint applied by this method.

> The other first computed a kde/hist and then took a very similar approach but on the kde/hist results, so it is closer to what the multimodal method is doing. That is agg_nearest.

agg_nearest does not seem that similar to nearest at all. It will only be similar when multimodal would return a single interval. Otherwise, it also includes all the regions between the HDI intervals and can contain much more probability than the requested "hdi_prob". Personally, I don't see a use for agg_nearest; I see it as being similar to reaching for a CDF estimated from a KDE instead of an ECDF. I think the smoothing would in general only make the estimate worse. If something similar to nearest that forces contiguity but uses "bin centers" is really what's desired, a JIT-compiled sliding window approach would be quite efficient:

import numpy as np
import numba

@numba.jit
def hdi_contiguous_weighted(bins, bin_probs, prob):
    # assumes bins are sorted, bin_probs sum to 1, and 0 < prob <= 1
    n = len(bins)
    # integer-valued bins each cover a width of 1, which is added to the interval width
    is_discrete = bins.dtype.kind != 'f'

    cum_probs = np.cumsum(bin_probs)
    bins_diff = np.diff(bins)

    i_lower = 0
    # clamp so that round-off in cum_probs cannot push the index past the last bin
    i_upper = min(np.searchsorted(cum_probs, prob, side="left"), n - 1)
    interval_width = bins[i_upper] - bins[i_lower] + is_discrete
    min_interval_width = interval_width
    interval_prob = cum_probs[i_upper]
    interval = np.array([i_lower, i_upper])
    while True:
        # increase lower bound until interval is invalid
        while interval_prob >= prob and i_lower <= i_upper:
            if interval_width < min_interval_width:
                interval[0] = i_lower
                interval[1] = i_upper
                min_interval_width = interval_width
            interval_prob -= bin_probs[i_lower]
            # bins_diff has n - 1 entries, so skip the update at the last bin
            if i_lower < n - 1:
                interval_width -= bins_diff[i_lower]
            i_lower += 1

        # stop only after the window ending at the last bin has been shrunk too
        if i_upper == n - 1:
            break

        # increase upper bound until interval is valid again
        while interval_prob < prob and i_upper < n - 1:
            interval_width += bins_diff[i_upper]
            i_upper += 1
            interval_prob += bin_probs[i_upper]

    return bins[interval]
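
For checking a sliding-window implementation like the one above on small inputs, a brute-force reference can be handy. The following plain-Python sketch is an assumption-laden illustration, not code from this PR: the name `hdi_contiguous_bruteforce` is hypothetical, it ignores the extra unit width used for discrete bins, and it simply tries every contiguous window.

```python
def hdi_contiguous_bruteforce(bins, bin_probs, prob):
    """Narrowest contiguous run of bins whose total probability is >= prob.

    Hypothetical O(n^2) reference for testing; assumes `bins` is sorted.
    """
    best = None
    n = len(bins)
    for lo in range(n):
        total = 0.0
        for hi in range(lo, n):
            total += bin_probs[hi]
            if total >= prob:
                width = bins[hi] - bins[lo]
                if best is None or width < best[0]:
                    best = (width, bins[lo], bins[hi])
                # extending further right from this `lo` can only widen the window
                break
    if best is None:
        raise ValueError("no contiguous window reaches the requested probability")
    return best[1], best[2]


bins = [0.0, 1.0, 2.0, 3.0]
bin_probs = [0.125, 0.125, 0.375, 0.375]
result = hdi_contiguous_bruteforce(bins, bin_probs, prob=0.75)
```

The example puts most of the mass in the last two bins, so the narrowest valid window ends at the final bin; that case is a useful edge to include when comparing against a faster implementation.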

OriolAbril · 2024-10-15T17:03:37Z

Let's remove agg_nearest then; I'm not sure where I took it from if it is not in current ArviZ. I didn't think too much about the names either. As for renaming nearest: numpy has a closest_observation method for np.quantile, which is probably a better name than nearest. I like contiguous too.

OriolAbril · 2024-10-16T10:57:48Z

I think the only thing left before merging is the API and behaviour for bins in multimodal hdi for discrete data.

sethaxen · 2024-10-17T11:19:41Z

I've added a multimodal_nearest method, which returns multimodal HDIs where the bounds come from the sample. I don't think it's the best name, as naively I would assume this means "compute the multimodal HDI using a KDE and then snap the bounds to the nearest sample points," which is not quite what this does. What this does is compute the densities at the sample points and use them to rank the points to find the HDI bounds.
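
A rough plain-Python sketch of that idea, under loud assumptions: it is not the code in this PR, the function names are hypothetical, it uses a hand-rolled Gaussian KDE with a fixed bandwidth supplied by the caller, and it keeps the ceil(prob * n) highest-density draws and splits them into contiguous runs.

```python
import math


def gaussian_kde_at(points, draws, bandwidth):
    """Gaussian KDE of `draws` evaluated at `points` (hypothetical helper)."""
    norm = 1.0 / (len(draws) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((p - d) / bandwidth) ** 2) for d in draws)
        for p in points
    ]


def hdi_from_point_densities(draws, prob, bandwidth):
    """Multimodal HDI whose bounds are sample points (illustrative sketch)."""
    x = sorted(draws)
    n = len(x)
    dens = gaussian_kde_at(x, x, bandwidth)
    # keep the highest-density draws until at least `prob` of the sample is covered
    n_keep = max(1, math.ceil(prob * n))
    ranked = sorted(range(n), key=lambda i: dens[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    # split the kept draws into runs of adjacent sorted positions
    intervals = []
    start = prev = kept[0]
    for i in kept[1:]:
        if i != prev + 1:
            intervals.append((x[start], x[prev]))
            start = i
        prev = i
    intervals.append((x[start], x[prev]))
    return intervals


draws = [-2.1, -2.0, -1.9, -2.05, 3.0, 3.1, 2.9, 0.5]
intervals = hdi_from_point_densities(draws, prob=0.875, bandwidth=0.2)
```

In this toy example the isolated draw at 0.5 has the lowest estimated density, so it is dropped and the two clusters come back as separate intervals bounded by actual draws.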

sethaxen · 2024-10-18T10:53:37Z

I think multimodal_sample might be a better name than multimodal_nearest, as it's not so much returning the nearest point in the bounds but rather selecting the bounds from the sample.

aloctavodia · 2024-10-18T12:55:58Z

multimodal_sample sounds good to me.

OriolAbril

Renamed method and opened an issue for the unimodal version so we can merge

sethaxen · 2024-10-26T19:55:25Z

Thanks, @OriolAbril!

sethaxen added 9 commits October 10, 2024 12:58

Fix typo

8dec502

Refactor multimodal HDI code

1d4a395

More modular functions and vectorization

Ensure interval contains >=hdi_prob

963bb24

For integer/bool HDI, default to bin width of 1

d2f3996

Split continuous and discrete multimodal HDI

0fe0d49

Default to ISJ bandwidth for multimodal HDI

c1642bc

Return highest probability modes

febca74

Fix bugs in circular KDE

5e4c51c

Support circular continuous multimodal HDI

dc162d6

OriolAbril reviewed Oct 11, 2024


This was referenced Oct 16, 2024

Ensure KDE is normalized #30

Merged

Remove HDI method agg_nearest #31

Merged

sethaxen added 13 commits October 16, 2024 12:59

Merge branch 'main' into hdi_improvements

567d4b2

Assume input probabilities sum to 1

ea75142

Merge lines

48439ac

Scale KDE density to bin probabilities

ff32878

Use bins returned by _histogram

4ecce3c

Avoid duplication of HDI defaults

b6e7bc1

Fix and test passing bins to discrete multimodal

8c45c45

Simplify HDI nearest code

7b60afe

Fix circular standardization

ef41ebc

Correctly compute bin centers

871b3fb

Fix pylint issues

9605b3d

Move interval splitting to own function

cc996a9

Use circular standardization

cd76b29

sethaxen added 2 commits October 17, 2024 13:06

Add method for computing HDI from point densities

2e162e0

Add multimodal_nearest HDI method

ff42224

OriolAbril mentioned this pull request Oct 25, 2024

Better name for default unimodal HDI method #36

Open

rename and add check for warning in tests

4739a01

OriolAbril approved these changes Oct 25, 2024


sethaxen merged commit c5a4f5f into main Oct 26, 2024
4 checks passed

sethaxen deleted the hdi_improvements branch October 26, 2024 19:55
