-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
204 lines (158 loc) · 8.27 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r preamble, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# BNPMIXcluster
<!-- badges: start -->
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://cran.r-project.org/web/licenses/MIT)
[![CRAN status](https://www.r-pkg.org/badges/version/BNPMIXcluster)](https://CRAN.R-project.org/package=BNPMIXcluster)
[![devel-version](https://img.shields.io/badge/devel%20version-1.3-blue.svg)](https://github.com/christianu7/BNPMIXcluster)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![doi](https://img.shields.io/badge/doi-10.1007/s11634--018--0313--6-yellow.svg)](https://doi.org/10.1007/s11634-018-0313-6)
<!-- badges: end -->
The BNPMIXcluster package provides a method for model-based clustering of multivariate data. It is capable of combining different types of variables (continuous, ordinal and nominal) and accommodates for different sampling probabilities in a complex survey design.
The model is based on a location mixture model with a Poisson-Dirichlet process prior on the location parameters of the associated latent variables.
## Citing
If you use BNPMIXcluster in a scientific publication, we encourage you to add
the following references to the related papers:
```bibtex
@article{Carmona2019,
title = {Model-based approach for household clustering with mixed scale variables},
author = {Carmona, Christian and Nieto-Barajas, Luis and Canale, Antonio},
journal = {Advances in Data Analysis and Classification},
year = {2019}
month = {jun},
volume = {13},
number = {2},
pages = {559--583},
doi = {10.1007/s11634-018-0313-6},
arxivId = {1612.00083},
url = {https:www.doi.org/10.1007/s11634-018-0313-6},
}
```
## Installation
You can install the released version of BNPMIXcluster from [CRAN](https://CRAN.R-project.org/package=BNPMIXcluster) with:
``` r
install.packages("BNPMIXcluster")
```
And the development version from [GitHub](https://github.com/christianu7/BNPMIXcluster) with:
``` r
# install.packages("devtools")
devtools::install_github("christianu7/BNPMIXcluster")
```
## Examples
```{r sim_study_5_1}
require(BNPMIXcluster)
```
### Simulation study 1
In this toy example, we evaluate the clustering allocation of our model using synthetic iid data. Details are discussed in section 5.1 of Carmona et al. (2017).
We sample 3-dimensional latent continuous vectors $z=(z_1, z_2, z_3)$ from a
3-components mixture of Gaussians with equal mixing probabilities and mixture components with means $\mu_1 = (2, 2, 5)$, $\mu_2 = (6, 4, 2)$ and $\mu_3 = (1, 6, 2)$, and covariance matrices
$\Sigma_1 = \text{diag}(1, 1, 1)$, $\Sigma_2 = \text{diag}(0.1, 2, 0.1)$ and $\Sigma_3 = \text{diag}(2, 0.1, 0.1)$.
In the next figure we show the "True" (generative) cluster allocation using different colors. Such grouping is unknown and we would like to discover it using our model.
```{r sim_study_5_1_data, echo=FALSE, out.width="70%", fig.align="center"}
require(scatterplot3d)
cluster_color <- c(rgb(1,0,0,alpha = 0.5),
rgb(0,0,1,alpha = 0.5),
rgb(0,0.5,0,alpha = 0.5))
cluster_color <- cluster_color[Z_latent_ex_5_1$cluster]
cluster_pch <- c(19,15,17)[Z_latent_ex_5_1$cluster]
par(mfrow=c(2,2))
par(mar=c(4,5,2,2))
scatterplot3d::scatterplot3d(x=Z_latent_ex_5_1$Z1,y = Z_latent_ex_5_1$Z2, z=Z_latent_ex_5_1$Z3,
color=cluster_color,pch=cluster_pch,
xlab="Z1",ylab="Z2",zlab="Z3",
main="Simulated data in 3 clusters"
)
par(mar=c(4,5,2,2))
plot(Z_latent_ex_5_1[,c("Z2","Z3")],col=cluster_color,pch=cluster_pch,xlab="Z2",ylab="Z3")
par(mar=c(4,5,2,2))
plot(Z_latent_ex_5_1[,c("Z1","Z3")],col=cluster_color,pch=cluster_pch,xlab="Z1",ylab="Z3")
par(mar=c(4,5,2,2))
plot(Z_latent_ex_5_1[,c("Z1","Z2")],col=cluster_color,pch=cluster_pch,xlab="Z1",ylab="Z2")
```
We test the method by considering different scenarios for the observed data.It is possible to recover the cluster structure even assuming that we can only observe discretized versions of the latent continuous data.
Here we illustrate the most straightforward scenario (I), where we assume that the three continuous variables are directly observable.
```{r sim_study_5_1_scenario}
# Observable data
# Choose scenario: 1, 2, or 3
ex_i <- 1
head( Y_ex_5_1[[ ex_i ]] )
```
We select hyper-parameters that define a prior specification which promotes a small number of groups.
```{r sim_study_5_1_hyperpars}
# Prior specification
# Choose "a", "b" or "c"
param_j <- "c"
meta_param_ex[ param_j, ]
```
The function `MIXclustering` performs MCMC to allocate clusters to each individual.
```{r sim_study_5_1_mcmc}
set.seed(0)
# Specify the data type that is being provided to the method
var_type_Y_ex_5_1 <- list( c("c","c","c"),
c("o","o"),
c("o","o","o","c") )
cluster_ex <- MIXclustering( Y = as.matrix(Y_ex_5_1[[ ex_i ]]),
var_type=var_type_Y_ex_5_1[[ ex_i ]],
n_iter_out=1500,
n_burn=200,
n_thin=3,
alpha = meta_param_ex[ param_j, "alpha" ],
d_0_a = meta_param_ex[ param_j, "d_0_a"],
d_1_a = meta_param_ex[ param_j, "d_1_a" ],
d_0_b = meta_param_ex[ param_j, "d_0_b" ],
d_1_b = meta_param_ex[ param_j, "d_1_b" ],
eta = meta_param_ex[ param_j, "eta" ],
kappa = meta_param_ex[ param_j, "kappa" ],
delta = meta_param_ex[ param_j, "delta" ],
d_0_z = meta_param_ex[ param_j, "d_0_z" ],
d_1_z = meta_param_ex[ param_j, "d_1_z" ],
d_0_mu = meta_param_ex[ param_j, "d_0_mu" ],
d_1_mu = meta_param_ex[ param_j, "d_1_mu" ] )
```
### Summary of clustering results
```{r sim_study_5_1_summary}
summary(cluster_ex)
```
### Visualizing clustering results.
Int the first two plots we show the results across all the MCMC samples. First, a heatmap showing the joint allocation of individuals to clusters. Second, the number of clusters obtained along iterations of
the MCMC.
```{r sim_study_5_1_plot_1, out.width="50%", fig.align="center"}
# Heatmap of group allocation
plot(cluster_ex,type="heatmap")
```
```{r sim_study_5_1_plot_2, out.width="50%", fig.align="center"}
# Number of clusters obtain across the MCMC chain
plot(cluster_ex,type="chain")
```
### Comparison of resulting cluster configurations.
Once the MCMC has converged, each iteration corresponds to a different clustering configuration for each data point. Here we show some configurations of interest.
The jittered scatter plot below compares:
1. X-axis: Labels from configuration that minimizes the distance with average MCMC iterations
2. Y-axis: Labels from configuration that minimizes an Heterogeneity Measure (HM)
The plot confirms that both configurations agree on the allocation of individuals to their corresponding cluster (Notes: jitter added for clarity, label switching of clusters is natural).
```{r sim_study_5_1_plot_3, out.width="50%", fig.align="center"}
plot( x=jitter(cluster_ex$cluster),y=jitter(cluster_ex$clusterHMmin), col="#FF000080", pch=20,
main=paste("Comparison of two relevant cluster configurations"),
xlab="minimizes distance to average MCMC grouping", ylab="minimizes Heterogeneity Measure" )
```
We compare the true (generative) cluster allocation to the configuration suggested by the model:
1. X-axis: Original Labels in the generative model.
2. Y-axis: Labels assigned by the clustering method.
```{r sim_study_5_1_plot_4, out.width="50%", fig.align="center"}
plot(x=jitter(Z_latent_ex_5_1$cluster),
y=jitter(cluster_ex$cluster),
main=paste("Comparison real configuration with the model results"),
xlab="True cluster",
ylab="Model cluster",
pch=19, col="#FF000080")
```