generated from jtr13/cctemplate
-
Notifications
You must be signed in to change notification settings - Fork 139
/
improved_density_estimation.Rmd
347 lines (243 loc) · 13.6 KB
/
improved_density_estimation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
# Improved density estimation
Abel Perez Vargas
## Introduction
In the field of data analysis it is common to come across non-negative variables, such as waiting times, prices, distances, age and weight, as well as non-positive variables such as money due, level below the sea or below freezing point temperature.
Before thinking in further modeling of such variables, we may be first interested in exploring and visualizing the data. And among the first techniques one may want to implement, there is univariate visualization.
Density plots are a very useful technique that unveils the distributional form of the variables, and it may help us identify patterns that may be hidden in a traditional histogram (e.g. "gaps" in the data that may happen to be inside one specific histogram class).
However, density plots may sometimes give the false impression that a know-to-be non-negative variable has some negative values like in this example using simulations of a $\Gamma(2,2)$ distribution.
```{r}
set.seed(123)
plot(density(rgamma(100, 2, 2) ), lty=2, lwd=2, col="red", main="Densities")
curve(dgamma(x, 2, 2), from=-1, to=5, lwd=2, add=T)
legend("topright", legend=c("True density", "Estimated density"),
lty=c(1, 2), col=c("black", "red"))
```
It can also be the case that the data exhibits values exactly equal to 0, in which case the standard density plot cannot help us detect it and we would have to do further exploration in order to do it.
```{r}
set.seed(123)
data <- c(rep(0, 10), rgamma(100, 1, 1))
plot(density( data ), lty=2, lwd=2, col="red", main="Densities")
legend("topright", legend="Estimated density",
lty=2, col="red")
print(paste("Frequency of 0:", sum(data==0) ))
```
## Generalizing Kernel Smoothing for Densities
In this tutorial we focus on density plots, or more formally, density estimation based on kernels (Venables & Ripley, 2002) for data that may come from a mixture of discrete and continuous random variables (e.g. data exhibiting 0's), and avoid positive density below or above 0 if the data exhibits such behavior.
For this, we first perform an algorithm that detects repeated values in the data and then tests their significance with a Binomial test for the proportion with the following hypothesis:
$$H_0: p_i=\frac{1}{n} ~~ vs ~~ H_1: p_i>\frac{1}{n},$$
for $i$ an index for such values with repetitions.
If repeated values are found to be statistically significant (that is, when $H_0$ is rejected under $\alpha=0.05$), we then model the data as a mixture of a discrete random variable with support on the repeated values, and a continuous random variable for the data without repetitions, which implies the following density:
$$ f_X(x)=\sum_{i^*}\hat{p}_{i^*} \delta_{ \{ x_{i^*} \} }(x) ~ + ~ \left( 1-\sum_{i^*}\hat{p}_{i^*} \right)g(x),$$
where $g$ is the density estimate of the data without the significant repeated values: $x_{i^*}$, $\hat{p}_{i^*}$ represents the estimated probability mass (proportion) of such values, and $\delta_{ \{ k \}}(x)$ is Dirac's delta, a function that is equal to infinity in $k$, $0$ otherwise, and integrates to one.
The previous representation is useful to keep $f_X$ as a valid density function, but for visualization purposes we will only plot $\hat{p}_{i^*}$ as Dirac's delta would be represented as a vertical infinite line.
The density $g$ will be modified from the standard `stats::density` function in `R` to avoid positive density in non-admissible regions (like in the negative numbers if all the data are all positive). For that end, we will slightly modify the kernel method of density estimation, by generalizing the gaussian kernel to truncated normal for the smoothing. In the case of non-negative data, the kernel will be truncated to be distinct than $0$ only for $x\geq0$ (and a similar case for non-positive data).
## plot_density Function
We introduce `plot_density`, a function that adjusts itself in the presence of non-negative, non-positive or real data to avoid assigning density to unlikely values, and also automatically detects the presence of significant repeated values in the data.
```{r}
# plot_density function
#
# Inputs:
#
# x: data, a numeric vector possibly with repeated values
# k: split parameter (default=0), the value for which the function will
# evaluate whether or not the data is all "to the left", "to the right"
# or neither from k
# bw: bandwidth, either the character "nrd0" as in stats::density, or a number
# for manual bandwidth selection
# main, xlab, ylab: graphical parameters to be passed to base::plot
#
# Output: data density plot with displayed relative frequency of repeated vales (if that's the case)
plot_density <- function(x, thr=0, bw="nrd0",
main="Density Plot", xlab="x", ylab="density"){
if(sum(is.na(x))>0 ){
stop("NAs found")
}else{
# Repeated Values detection
n <- length(x)
x <- sort(x)
y <- x[-1][x[-1]==x[-n]]
vals <- NULL
# Smallest difference distinct than 0 to be used for plot margins
diff <- x[-1]-x[-n]
diff <- min(diff[diff>0])
if(length(y)>0 ){
rep <- table(y) + 1# frequency of the repeated values
M <- sapply(rep, binom.test, n=n, p=1/n,
alternative="greater")# binomial test p-values
M <- unlist(M[row.names(M)=="p.value", ])
vals <- names(rep)[M<0.05]
freq <- rep[M<0.05]
# If significant values where detected, they are stored in 'vals'
if(length(vals)>0){
vals <- as.numeric(vals)
}
}
# banwidth
if(bw=="nrd0"){
# Silverman's rule of thumb: Silverman (1986, page 48, eqn (3.31)))
IQ <- quantile(x, 0.75) - quantile(x, 0.25)
if(IQ>0){
bw <- 0.9*min((sd(x)/1.34*n)^(-1/5), (IQ/1.34*n)^(-1/5))
}else{
bw <- 0.9*(sd(x)/1.34*n)^(-1/5)
}
}
# Density plotting
if( min(x)>=thr ){
# Non-negative case: Gaussian kernel modified to positive truncated Normal
# Truncated normal kernel
k <- function(x, m, sd){
if(x<0){
0
}else{
dnorm(x, m, sd)/(1-pnorm(0, m, sd))
}
}
if(length(y)>0 & length(vals)>0){
# Discrete probability mass will be plotted
p <- freq/n# mass probabilities
c <- x[!x %in% vals]# continuous component
# continous density component re-scaled with the mass probability
d <- function(x, c, bw){
(1-sum(p))*mean(k(x, c, bw))
}
# Plot
grid <- c( min(x)-diff, c, max(x)+diff )
plot(grid, sapply( grid, d, c, bw ), main=main, xlab=xlab,
ylab=ylab, col="darkgray", type="l")
points(vals, p, pch=16)
for(j in 1:length(vals)){lines(c(vals[j], vals[j]), c(0, p[j]) )}
legend("topright", legend="Mass Probability", col="black", pch=16)
}else{
# No probability mass found
# continous density
d <- function(x, c, bw){
mean(k(x, c, bw))
}
# Plot
grid <- c( min(x)-diff, x, max(x)+diff )
plot(grid, sapply( grid, d, x, bw ), main=main, xlab=xlab,
ylab=ylab, col="gray", type="l")
}
}else{
if( max(x)<=thr ){
# Non-positive case: Gaussian kernel modified to negative truncated Normal
# Truncated normal kernel
k <- function(x, m, sd){
if(x>0){
0
}else{
dnorm(x, m, sd)/pnorm(0, m, sd)
}
}
if(length(y)>0 & length(vals)>0){
# Discrete probability mass will be plotted
p <- freq/n# mass probabilities
c <- x[!x %in% vals]# continuous component
# continous density component re-scaled with the mass probability
d <- function(x, c, bw){
(1-sum(p))*mean(k(x, c, bw))
}
# Plot
grid <- c( min(x)-diff, c, max(x)+diff )
plot(grid, sapply( grid, d, c, bw ), main=main, xlab=xlab,
ylab=ylab, col="gray", type="l")
points(vals, p, pch=16)
for(j in 1:length(vals)){lines(c(vals[j], vals[j]), c(0, p[j]) )}
legend("topright", legend="Mass Probability", col="black", pch=16)
}else{
# No probability mass found
# continous density
d <- function(x, c, bw){
mean(k(x, c, bw))
}
# Plot
grid <- c( min(x)-diff, x, max(x)+diff )
plot(grid, sapply( grid, d, x, bw ), main=main, xlab=xlab,
ylab=ylab, col="gray", type="l")
}
}else{
# Real case: Gaussian kernel kept
if(length(y)>0 & length(vals)>0){
# Discrete probability mass will be plotted
p <- freq/n# mass probabilities
c <- x[!x %in% vals]# continuous component
# continous density component re-scaled with the mass probability
d <- function(x, c, bw){
(1-sum(p))*mean(dnorm(x, c, bw))
}
# Plot
grid <- c( min(x)-diff, c, max(x)+diff )
plot(grid, sapply( grid, d, c, bw ), main=main, xlab=xlab,
ylab=ylab, col="gray", type="l")
points(vals, p, pch=16)
for(j in 1:length(vals)){lines(c(vals[j], vals[j]), c(0, p[j]) )}
legend("topright", legend="Mass Probability", col="black", pch=16)
}else{
# No probability mass found
# continous density
d <- function(x, c, bw){
mean(dnorm(x, c, bw))
}
# Plot
grid <- c( min(x)-diff, x, max(x)+diff )
plot(grid, sapply( grid, d, x, bw ), main=main, xlab=xlab,
ylab=ylab, col="gray", type="l")
}
}
}
}
}
```
## Examples
Now we show some examples in which the function would be useful.
### Non-negative data with repetitions
Imagine data such as the waiting time of people in a bank branch. In this context there may be observations exactly equal to 0 indicating clients that got attended directly after.
```{r}
set.seed(123)
times <- c(rep(0, 15), rexp(200))
plot_density(times)
lines(density(times), col="blue", lty=2)
legend("top", legend=c("plot_density", "Usual density"),
col=c("darkgray", "blue"), lty=c(1, 2), cex=0.7)
```
In the previous example the data is non-negative and exhibits repetitions in 0. `plot_density` detects such repetitions as statistically significant and displays the proportion as a single dot connected with a black line. The density now doesn't exhibits positive probability in the negative numbers (as opossed to `stats::density` in blue).
### Non-positive data with repetitions
We now center our attention to negative data, e.g. credit due. In such cases density in positive values would be counter intuitive.
```{r}
set.seed(123)
credit <- c(rep(0, 20), rep(-1, 30), -rgamma(300, 8, 1))
plot_density(credit, bw=0.5)
lines(density(credit), col="blue", lty=2)
legend("topleft", legend=c("plot_density", "Usual density"),
col=c("darkgray", "blue"), lty=c(1, 2), cex=0.7)
```
In this case there are two probability mass points at 0 and -1, which may be associated with all the cards with 0 balance, and many having 1 due to, fo example, a cash back promotion for using $1,000 in credit (assume data is in thousands).
`plot_density` detects such repetitions and excludes them from the other data giving a broader picture of them, the standard density method gives the impression of having a bimodality around -1 when they are actually two repeated values repeated.
### Real data with repetitions
We now center in the case of real data that may have repetitions, e.g. mortgage loan amounts in which it is frequent to observe loans in multiples of $1,000.
```{r}
set.seed(123)
loans <- c(rep(1000, 50), rep(2000, 40), rep(3000, 60), rep(4000, 35), rep(5000, 45), rnorm(500, 3000, 800) )
plot_density(loans, bw=100)
lines(density(loans), col="blue", lty=2)
legend("topleft", legend=c("plot_density", "Usual density"),
col=c("darkgray", "blue"), lty=c(1, 2), cex=0.7)
```
In this case we also had to manually tune the bandwidth by changing the *bw* parameter. Due to the span of the data, the density is lower than the probability mass, however, the lines are still plotted and they let us know there are repetitions by displaying. This helps to avoid the false impression of many values around the first and last modalities displayed by the blue density.
### Real data without repetitions
We finally focus on the general case of real data with no repetitions, in which case `plot_density` closely resembles `stats::density`:
```{r}
set.seed(123)
data <- rt(1000, 3)
plot_density(data)
lines(density(data), col="blue", lty=2)
legend("topright", legend=c("plot_density", "Usual density"),
col=c("darkgray", "blue"), lty=c(1, 2), cex=0.7)
```
## Conclussion
We introduced a function, `plot_density`, that can be helpful for univariate visualization in cases with strictly non-negative or non-positive data and in the possible pressence of repetitions.
It generalizes `stats::density` by using truncated normal kernels for the continuous density, and mixing it with discrete probability mass components if repetitions are found.
## References
- Silverman, B. W. (1986). *Density Estimation*. London: Chapman and Hall.
- Venables, W.N. & Ripley, B.D. (2002). *Modern Applied Statistics with S*, Springer, 4th edition.