You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If you have data that are highly correlated or have unusual shapes, the current code in BivariateKDE does not give a very good representation of the true distribution because the bandwidths and kernel densities are estimated separately in the x and y dimensions. For example, with this data set, the current code applied to this data:
using KernelDensity
using StatsPlots
x =randn(1024)
y = x .+0.1.*randn(1024)
plot(kde((x,y)))
produces
A better approach is to estimate the bandwidth of the KDE as a 2x2 covariance matrix (this is, for example, what SciPy does); but a naive implementation of such a "full rank" bandwidth makes it much harder to employ the FFT-to-a-grid trick used in KernelDensity. An equivalent approach that allows to keep the FFT trick is to first de-correlate the input data by performing a rotation and scale so that is covariance is the identity, then estimate using the current KDE code, then de-rotate onto the original grid. This code performs the manipulations:
functionrescaled_bivariate_kde((x, y))
z =hcat(x, y)
mu =mean(z, dims=1)
Sigma =cov(z, dims=1)
L =cholesky(Sigma).L
detL =prod(diag(L))
zr = L \ (z .- mu)'
kr =kde((zr[1,:], zr[2,:]))
k =kde((x,y))
ikr =InterpKDE(kr)
for i in1:size(k.x, 1)
for j in1:size(k.y, 1)
zr = L \ (permutedims([k.x[i], k.y[j]]) .- mu)'
p =pdf(ikr, zr[1,1], zr[2,1])
k.density[i,j] = p / detL
endend
k
end
The result on the above data set is
plot(rescaled_bivariate_kde((x,y)))
which is a much better representation of the actual density. Note: I have checked that the proper Jacobian factors are included:
k =rescaled_bivariate_kde((x,y))
sum(k.density .*step(k.x) .*step(k.y)) # 1.0 to good accuracy
I propose that rescaled_bivariate_kde replace the existing kde constructor for bivariate data.
The text was updated successfully, but these errors were encountered:
If you have data that are highly correlated or have unusual shapes, the current code in BivariateKDE does not give a very good representation of the true distribution because the bandwidths and kernel densities are estimated separately in the
x
andy
dimensions. For example, with this data set, the current code applied to this data:produces
A better approach is to estimate the bandwidth of the KDE as a 2x2 covariance matrix (this is, for example, what SciPy does); but a naive implementation of such a "full rank" bandwidth makes it much harder to employ the FFT-to-a-grid trick used in KernelDensity. An equivalent approach that allows to keep the FFT trick is to first de-correlate the input data by performing a rotation and scale so that is covariance is the identity, then estimate using the current KDE code, then de-rotate onto the original grid. This code performs the manipulations:
The result on the above data set is
which is a much better representation of the actual density. Note: I have checked that the proper Jacobian factors are included:
I propose that
rescaled_bivariate_kde
replace the existingkde
constructor for bivariate data.The text was updated successfully, but these errors were encountered: