Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-scale data before performing bivariate KDE #96

Open
farr opened this issue Apr 30, 2021 · 0 comments
Open

Re-scale data before performing bivariate KDE #96

farr opened this issue Apr 30, 2021 · 0 comments

Comments

@farr
Copy link

farr commented Apr 30, 2021

If you have data that are highly correlated or have unusual shapes, the current code in BivariateKDE does not give a very good representation of the true distribution because the bandwidths and kernel densities are estimated separately in the x and y dimensions. For example, with this data set, the current code applied to this data:

using KernelDensity
using StatsPlots

x = randn(1024)
y = x .+ 0.1 .* randn(1024)
plot(kde((x,y)))

produces

image

A better approach is to estimate the bandwidth of the KDE as a 2x2 covariance matrix (this is, for example, what SciPy does); but a naive implementation of such a "full rank" bandwidth makes it much harder to employ the FFT-to-a-grid trick used in KernelDensity. An equivalent approach that allows to keep the FFT trick is to first de-correlate the input data by performing a rotation and scale so that is covariance is the identity, then estimate using the current KDE code, then de-rotate onto the original grid. This code performs the manipulations:

function rescaled_bivariate_kde((x, y))
    z = hcat(x, y)

    mu = mean(z, dims=1)
    Sigma = cov(z, dims=1)
    L = cholesky(Sigma).L
    detL = prod(diag(L))

    zr = L \ (z .- mu)'

    kr = kde((zr[1,:], zr[2,:]))
    k = kde((x,y))
    ikr = InterpKDE(kr)

    for i in 1:size(k.x, 1)
        for j in 1:size(k.y, 1)
            zr = L \ (permutedims([k.x[i], k.y[j]]) .- mu)'
            p = pdf(ikr, zr[1,1], zr[2,1])
            k.density[i,j] = p / detL
        end
    end

    k
end

The result on the above data set is

plot(rescaled_bivariate_kde((x,y)))

image

which is a much better representation of the actual density. Note: I have checked that the proper Jacobian factors are included:

k = rescaled_bivariate_kde((x,y))
sum(k.density .* step(k.x) .* step(k.y)) # 1.0 to good accuracy

I propose that rescaled_bivariate_kde replace the existing kde constructor for bivariate data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant