\section{Importance Sampling for Marginal Likelihood Estimation}
\label{sec:IS}
Beyond conjugate models, the integrals defining the marginal
likelihood in Equation (\ref{marginal_lik}) are generally intractable
and must be estimated by numerical methods. Importance sampling (IS)
is a well-known Monte Carlo technique for calculating the integrals
that arise in Bayesian inference \citep{Kloe:vanD:1978, Gewe:1989}. In
the case of (\ref{marginal_lik}), the marginal likelihood may be
re-expressed as an expectation with respect to an importance
distribution $Q$, with respect to which the target posterior
distribution is absolutely continuous,
\begin{equation}\label{IS_for_marginal_lik}
p(\v \mid \M_p ) = \mathbb{E}_Q\left[ \frac{p(\v \mid \theta, \M_p
)p(\theta \mid \M_p)} {q(\theta \mid \M_p)}\right]
\end{equation}
where the ``importance function'' $q(\theta \mid \M_p)$ is the
associated probability density function for distribution $Q$. To
simplify notation in the following, we will drop the explicit
dependence on $\M_p$. Ideally it is straightforward to
generate a sample of size $N$, $\theta^{(1)}, \ldots, \theta^{(N)}$,
from $Q$, leading to an unbiased and simulation-consistent Monte Carlo
estimate of the marginal likelihood
\begin{equation}\label{IS_solution_marlik}
\hat{p}(\v)=\frac{1}{N} \sum_{i=1}^N\frac{p(\v \mid \theta^{(i)})\,
p(\theta^{(i)})} {q(\theta^{(i)})} = \frac{1}{N}\sum_{i=1}^N
w_i = \bar{w}
\end{equation}
where $w_i$ are known as the IS weights.
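As a computational aside, the estimate (\ref{IS_solution_marlik}) is
most safely formed from log-weights via the log-sum-exp trick. The
sketch below is purely illustrative: the toy likelihood, prior and
importance distribution, and all variable names, are assumptions made
only for the example.
\begin{verbatim}
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Illustrative 2-d toy problem: Gaussian likelihood and prior, Gaussian Q.
log_lik   = lambda th: multivariate_normal([1.0, -0.5], 0.5*np.eye(2)).logpdf(th)
log_prior = lambda th: multivariate_normal([0.0,  0.0], 4.0*np.eye(2)).logpdf(th)
Q         = multivariate_normal([0.0, 0.0], 9.0*np.eye(2))

N = 10_000
theta = Q.rvs(size=N)                                        # theta^(1..N) ~ Q
log_w = log_lik(theta) + log_prior(theta) - Q.logpdf(theta)  # log IS weights
log_p_hat = logsumexp(log_w) - np.log(N)                     # log of the IS estimate of p(v)
print("log marginal likelihood estimate:", log_p_hat)
\end{verbatim}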
The variance of the IS estimator is
\begin{equation*}
  \mbox{Var}_Q\left[\hat{p}(\v) \right] =\frac{\sigma_q^2}{N}
\end{equation*} where
\begin{equation*}
\label{eq:ISvar}
\sigma_q^2
=\mbox{Var}_Q\left\{\frac{p(\v \mid \theta)\,p(\theta)}{q(\theta)}\right\}
= p(\v)^2\int \left(\frac{\pi(\theta
 \mid \v)-q(\theta)}{q(\theta)}\right)^2 \, q(\theta)\, d\theta
\end{equation*}
depends on how ``close'' the
importance function $q$ is to the posterior distribution $\pi(\theta \mid
\v)$.
For estimation of arbitrary functions,
\citet{Kong:Liu:Wong:1994} define the effective sample size (ESS) as
a measure of the efficiency of the importance sampler
based on the normalized weights
$w^*_i = w_i/ p(\v)$,
\begin{equation*}
\ESS \equiv \frac{N}{1 + \text{var}_q[w^*_i]}.
\end{equation*}
In general, this quantity cannot be computed directly because $p(\v)$
is unknown, but they show that $\text{var}_q[w^*_i]$ is equal to the
square of the coefficient of variation of the $w_i$. Substituting the
estimated coefficient of variation into the above expression and
simplifying leads to an approximate \ESS\ of the form
\begin{equation*}
\ESS \approx \frac{1}{\sum_i \tilde{w}_i^2}
\end{equation*}
where $\tilde{w}_i \equiv w_i/\sum_i w_i$ are the normalized weights.
If $q$ is equal to the target, then $\ESS = N$; thus the ratio
$\ESS/N$ provides a simple measure of the efficiency of the
importance function.
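Continuing the illustrative snippet above, the approximate \ESS\ is
obtained directly from the unnormalized log-weights.
\begin{verbatim}
# normalized weights and approximate effective sample size
w_tilde = np.exp(log_w - logsumexp(log_w))
ess = 1.0 / np.sum(w_tilde ** 2)
print("ESS/N:", ess / N)
\end{verbatim}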
If the importance function has lighter tails than the target, the IS
estimate will tend to be unstable because the weights can be
arbitrarily large, leading to a large variance and low \ESS. Heavy
tails can be obtained by using a multivariate $t$ density with location
equal to the maximum likelihood estimate and scale matrix based on the
inverse Fisher information, which often works well for unimodal
targets in low- to moderate-dimensional problems. For multimodal
posteriors, or posteriors exhibiting strong nonlinear structure, a single
multivariate $t$ distribution is not adequate. As any continuous
probability density function can be approximated by a mixture model,
an importance function $q$ may be constructed as a nonparametric
density estimate built from a mixture of parametric densities
\citep{Oh:Berg:1992,Oh:Berg:1993,West:1993}, such as $t$ densities with
$\delta$ degrees of freedom,
\begin{equation}
  \label{eq:mixIS}
  q(\cdot) = \sum_j \alpha_j t_\delta(\cdot; \mu_j, \Sigma_j),
\end{equation}
where $\alpha_j$ is the mixture weight, $\mu_j$ the location parameter,
and $\Sigma_j$ the scale matrix of the $j$th mixture component.
As samples from $q$ are obtained, the IS proposal $q$ can be adaptively
refined to match the target, leading to adaptive importance sampling
(AIS). The number of components in the mixture in
\citeauthor{Oh:Berg:1993} is fixed at the number of modes in the
posterior, which assumes that one can identify the important modes of
the posterior distribution, but the weights, locations and scale
matrices are adapted to minimize the coefficient of variation of the
weights, which can be seen as maximizing the \ESS.
\citeauthor{West:1993} allows the number of components to grow with
each new point being the center of a new Student $t$ density, but with
a ``global'' scale matrix estimated from the previous draws. The
number of components is determined by monitoring the relative entropy
of the normalized weights. Even with a mixture model, the importance
function may have much fatter tails than the target, so that $q$ is
locally over-dispersed relative to the posterior, leading to an
overabundance of small weights. To address this,
\citet{Give:Raft:1996} use a mixture of Student $t$ densities, but
utilize individual estimates of the scale matrix in each kernel,
providing ``local'' adaptation. \citet{Owen:Zhou:2000} suggest the use
of control variates to avoid the infinite variance that may arise
when $q$ is a close match to the target but decreases to zero faster
than the target. \citet{Liu:2001} suggests rejection control as a way
of avoiding small weights.
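To make the simplest heavy-tailed proposal described above concrete,
the following sketch locates the posterior mode numerically and uses
the BFGS inverse-Hessian approximation as the scale matrix of a
multivariate $t$ importance function. The toy log-posterior and all
names are illustrative assumptions, not part of the method.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import multivariate_t

# Hypothetical unnormalized log-posterior (stands in for log L + log prior).
def log_post(theta):
    return -0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2 / np.array([0.3, 2.0]))

# Mode and approximate inverse Hessian of -log_post at the mode (BFGS).
res = minimize(lambda th: -log_post(th), x0=np.zeros(2), method="BFGS")
mode, scale = res.x, res.hess_inv

# Heavy-tailed importance function: multivariate t with 5 degrees of freedom.
q = multivariate_t(loc=mode, shape=scale, df=5)

N = 5_000
theta = q.rvs(size=N)
log_w = np.array([log_post(t) for t in theta]) - q.logpdf(theta)
w = np.exp(log_w - logsumexp(log_w))
print("ESS/N:", 1.0 / np.sum(w ** 2) / N)
\end{verbatim}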
Population Monte Carlo \citep{cappe2004pmc} may be viewed as an
iterated version of these approaches in which a new population of draws
is obtained at each iteration from an importance function, which may
adapt based on the previously sampled populations. This allows MCMC
updates to be introduced, as in sequential importance
samplers and particle filters
\citep{MacE:Clyd:Liu:1999,Pitt:Shep:1999}, together with adaptive IS.
\citet{Douc:Guil:Mari:Robe:2006} derive sufficient convergence
conditions for adaptive mixtures of importance sampling schemes
and show that a Rao-Blackwellized version is
asymptotically optimal in terms of the Kullback-Leibler divergence between
the limiting importance function and the target.
Finally, the annealed adaptive IS (AAIS) method proposed in Section
\ref{sec:AAIS} has connections to a number of existing adaptive IS
methods in the literature, for example
\cite{evans1991chaining,cappe2008ais,ardia2008adaptive},
to name just a few. The key distinction is that our
method adopts an annealing strategy, which is well recognized as a
powerful approach for locating separated, and especially peaky, modes
of the posterior. From the mixture modeling point of view, the proposed
mixture adaptation technique can be seen as intermediate between
the completely nonparametric method in \cite{west1993mixture} and the
EM-based semi-parametric methods, as used in the adaptive
mixture importance sampling (AMIS) approach \citep{cappe2008ais}. The
AMIS approach employs a fixed number of mixture components, whereas our
method and the nonparametric method \citep{west1993mixture} let
the data determine how many components the mixture should contain.
\section{Annealed Adaptive IS Method for Constructing the Importance
  Function} \label{sec:AAIS}
In this section, we propose an adaptive mixture modeling approach
for constructing an importance function that resembles
the posterior, which is then used in (\ref{IS_solution_marlik}) to
calculate the marginal likelihood.
Let $\pi(\theta)$ denote the posterior distribution, which is
proportional to $L(y|\theta)p(\theta)$. Note that computing
$L(y|\theta)p(\theta)$ for any $\theta$ is feasible, but we are
not able to sample directly from the distribution it defines, i.e.,
$\pi(\theta)$. The aim is to find a distribution, $q(\theta)$, that
approximates $\pi(\theta)$ and that we are able to evaluate.
We first select an initial proposal distribution $q_0(\theta)$, then
define a series of intermediate distributions between $q_0(\theta)$
and $\pi(\theta)$ by
\begin{equation}
\pi_t(\theta)\propto
q_0(\theta)^{1-\lambda_t}\pi(\theta)^{\lambda_t}, \quad\quad
t=1,\ldots,T, \quad \mbox{and}\quad
0<\lambda_1<\lambda_2<\dots<\lambda_T=1.
\end{equation}
Finally we propose an iterative method to construct $q(\theta)$, as
shown in Algorithm 1.
\begin{center}
\textbf{Algorithm 1. Sequential Proposal Construction}
\end{center}
\begin{itemize}
\item Use $\pi_1(\theta)$ as the target distribution, with $q_0(\theta)$
  as the proposal. Tune the parameters of $q_0(\theta)$ to obtain an
  updated mixture density function, denoted $q_1(\theta)$, which
  approximates $\pi_1(\theta)$.
\item As in step 1, tune the parameters of $q_1(\theta)$ to
  obtain an updated proposal, $q_2(\theta)$, which is used to
  approximate $\pi_2(\theta)$.
\item Iterate the above process from $t=2$ to $t=T-2$, i.e., tune the
  parameters of $q_t(\theta)$ to obtain the updated mixture density
  function $q_{t+1}(\theta)$.
\item Use $\pi_T(\theta)$ as the target distribution and tune the
  parameters of $q_{T-1}(\theta)$ to obtain $q_{T}(\theta)$, which
  approximates $\pi_T(\theta)=\pi(\theta)$. Then $q_{T}(\theta)$ is the
  importance function that we seek.
\end{itemize}
The above iterative procedure thus adopts an annealing strategy. In
what follows, we propose an adaptive method to handle the task of
updating $q_{t-1}(\theta)$ to $q_{t}(\theta)$ in Algorithm 1.
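Before turning to the adaptation mechanism, the tempered sequence
$\pi_t$ can be sketched in log form as below. The linear schedule for
$\lambda_t$ and the toy densities are illustrative assumptions only;
the algorithm does not prescribe a particular schedule.
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative choices: a broad Gaussian initial proposal q0 and a toy
# (unnormalized) posterior pi; both are stand-ins for the real model.
q0 = multivariate_normal(np.zeros(2), 25.0 * np.eye(2))
log_pi = lambda th: multivariate_normal([3.0, -3.0], 0.1 * np.eye(2)).logpdf(th)

T = 10
lambdas = np.linspace(0.1, 1.0, T)   # 0 < lambda_1 < ... < lambda_T = 1

def log_pi_t(theta, t):
    """Unnormalized log density of the intermediate target pi_t."""
    lam = lambdas[t - 1]
    return (1.0 - lam) * q0.logpdf(theta) + lam * log_pi(theta)

print(log_pi_t(np.zeros(2), 1), log_pi_t(np.zeros(2), T))
\end{verbatim}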
\subsection{Multivariate Student's $t$-mixture Model}
Considering possible complexities in the structure of
$\pi_t(\theta)$, we employ a mixture model to define
$q_{t}(\theta)$. In particular, we use the Student's $t$ distribution
as the mixture component of $q_{t}(\theta)$, since its fat tails make
it easier to detect modes that are far apart. Let
$\mathcal{S}_d(\mu,\Sigma,v)$
denote a $d$-variate Student's $t$ distribution, where $\mu$ is the
center, $\Sigma$ the positive definite inner product matrix, and $v$
the degrees of freedom. We fix $v$ to a constant, e.g., $v=5$,
in what follows. The mixture model we use is:
\begin{equation}\label{Definition_mixture}
q(\theta|\psi)=\sum\limits_{m=1}^M \alpha_{m}
\mathcal{S}_d(\theta|\mu_m,\Sigma_m)
\end{equation}
where
$\psi=\{M,\alpha_1,\ldots,\alpha_M,\mu_1,\ldots,\mu_M,\Sigma_1,\ldots,\Sigma_M\}$
is the mixture parameter. Here $\alpha_m$ is the probability mass of
the $m$th component, satisfying $\alpha_{m}\geq
0$ and $\sum_{m=1}^M \alpha_{m}=1$. The task of fitting a $t$-mixture
model to a probability density thus translates into seeking suitable
parameter values for $\psi$.
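For reference, the mixture (\ref{Definition_mixture}) can be evaluated
and sampled as in the sketch below; the two-component parameter values
and the class name are arbitrary illustrations, not a prescription.
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_t

V = 5  # fixed degrees of freedom, as in the text

class TMixture:
    """Student's t-mixture q(theta | psi) with fixed degrees of freedom V."""
    def __init__(self, alphas, mus, Sigmas):
        self.alphas = np.asarray(alphas)
        self.comps = [multivariate_t(loc=m, shape=S, df=V)
                      for m, S in zip(mus, Sigmas)]

    def pdf(self, theta):
        return sum(a * c.pdf(theta) for a, c in zip(self.alphas, self.comps))

    def rvs(self, n, seed=0):
        rng = np.random.default_rng(seed)
        # draw a component label for each sample, then sample from it;
        # the labels also record each draw's component of origin
        idx = rng.choice(len(self.comps), size=n, p=self.alphas)
        draws = np.vstack([self.comps[k].rvs(random_state=rng) for k in idx])
        return draws, idx

# Two-component example in two dimensions (illustrative parameter values).
q = TMixture([0.4, 0.6],
             [np.zeros(2), np.array([4.0, 4.0])],
             [np.eye(2), 0.5 * np.eye(2)])
draws, origin = q.rvs(1000)
print(q.pdf(draws[:3]))
\end{verbatim}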
\subsection{Proposal Adaptation via IS plus EM}\label{sec:IS_EM}
Let us focus on iteration $t$ in Algorithm 1. We first simulate a
random sample, $\{\theta^n\}_{n=1}^N$, from $q_{t-1}(\theta)$, with the
importance weights calculated by
\begin{equation}
w^n=\frac{\pi_{t}(\theta^n)/q_{t-1}(\theta^n)}{\sum_{n'=1}^N\pi_{t}(\theta^{n'})/q_{t-1}(\theta^{n'})},
\end{equation}
where $N$ denotes the sample size. The resulting weighted particle
set can then be seen as a particle approximation to the current
target distribution $\pi_t(\theta)$, i.e.,
\begin{equation}\label{delta_mass_approx}
\pi_t(\theta)\simeq\sum\limits_{n=1}^N w^n\delta(\theta-\theta^n),
\end{equation}
where $\delta(\cdot)$ is the Dirac delta function.
Here we adopt the Kullback-Leibler (KL) divergence to measure the
distance between two distributions. The aim is then to obtain a
proposal, $q_t(\theta)$, that minimizes the KL divergence from
$\pi_t(\theta)$,
\begin{equation}\label{KL}
\mathcal{D}[\pi_t(\theta)||q_t(\theta|\psi)]=\mathbb{E}_{\pi_t(\theta)}\left[\log\frac{\pi_t(\theta)}{\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta|\mu_m,\Sigma_m)}\right],
\end{equation}
where $\mathbb{E}_f$ denotes expectation with respect to the
distribution $f$. Clearly, minimizing (\ref{KL}) over
$(\alpha,\mu,\Sigma)$ is equivalent to maximizing
\begin{equation}\label{MLE}
\int\pi_t(\theta)\log\left(\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta|\mu_m,\Sigma_m)\right)d\theta.
\end{equation}
Substituting Equation (\ref{delta_mass_approx}) into the above, we
obtain
\begin{equation}\label{Max_log_lik}
\sum_{n=1}^N w^n\log\left(\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta^n|\mu_m,\Sigma_m)\right).
\end{equation}
At this stage, the task translates into a mixture-density parameter
estimation problem: how to optimize the parameter values of
$\psi$ so that the resulting mixture density function fits the weighted
sample $\{\theta^n,w^n\}_{n=1}^N$ as well as possible.
Given a data point $\theta^n$, the posterior probability of each
mixture component can be obtained by Bayes' rule:
\begin{equation}\label{posterior_prob_component}
\rho_m(\theta^n)=\alpha_m
\mathcal{S}_d(\theta^n;\mu_m,\Sigma_m)/q(\theta^n),
\end{equation}
where $\alpha_m$ is regarded as the prior and
$\mathcal{S}_d(\theta^n;\mu_m,\Sigma_m)$ as the associated
likelihood. The mixing proportion of this component can then be
updated by
\begin{equation}\label{MLE_alpha}
\alpha_m=\sum\limits_{n=1}^N w^n \rho_m(\theta^n),
\end{equation}
and each pair $\{\mu_m,\Sigma_m\}$ can be updated by
\begin{equation}
\mu_m=\frac{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)\theta^n}{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)},
\end{equation}
\begin{equation}\label{MLE_Sigma}
\Sigma_m=\frac{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)C_n }{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)},
\end{equation}
where
\begin{equation}
u_m(\theta)=\frac{v+d}{v+(\theta-\mu_m)^T(\Sigma_m)^{-1}(\theta-\mu_m)}
\end{equation}
and $C_n=(\theta^n-\mu_m)(\theta^n-\mu_m)^T$, with $^T$ denoting
matrix transposition.
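These weighted updates can be sketched in code as follows. The sketch
is schematic only, not the authors' implementation: the function name
is hypothetical, and it assumes the convention that $u_m$ is computed
with the old $(\mu_m,\Sigma_m)$ while the scatter $C_n$ uses the
updated $\mu_m$.
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_t

def weighted_em_step(theta, w, alphas, mus, Sigmas, v=5):
    """One IS-weighted EM update for a Student's t-mixture;
    theta is (N, d), w are the normalized IS weights (length N)."""
    N, d = theta.shape
    M = len(alphas)
    # E-step: responsibilities rho_m(theta^n)
    dens = np.column_stack([alphas[m] * multivariate_t(mus[m], Sigmas[m], df=v).pdf(theta)
                            for m in range(M)])           # (N, M)
    rho = dens / dens.sum(axis=1, keepdims=True)
    new_alphas, new_mus, new_Sigmas = [], [], []
    for m in range(M):
        diff = theta - mus[m]
        maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigmas[m]), diff)
        u = (v + d) / (v + maha)                           # u_m(theta^n)
        wr = w * rho[:, m]                                 # w^n rho_m(theta^n)
        new_alphas.append(wr.sum())                        # mixing proportion
        mu = (wr * u) @ theta / np.sum(wr * u)             # location update
        C = np.einsum('n,ni,nj->ij', wr * u, theta - mu, theta - mu)
        new_Sigmas.append(C / wr.sum())                    # scale update
        new_mus.append(mu)
    return np.array(new_alphas), new_mus, new_Sigmas
\end{verbatim}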
The operations from (\ref{posterior_prob_component}) to
(\ref{MLE_Sigma}) constitute one iteration of the EM algorithm for
mixture-density parameter estimation
\citep{peel2000rmm}. Let $q_t(\theta)$ be the $t$-mixture model with
the updated parameter values $\{\alpha_m,\mu_m,\Sigma_m\}$; it is
expected to be an approximation to $\pi_t(\theta)$. The efficiency
of $q_t(\theta)$, as an approximation of $\pi_t(\theta)$, can be
measured by the KL divergence via (\ref{Max_log_lik}), or by the
effective sample size, defined as
\begin{equation}\label{ESS}
\ESS=\frac{1}{\sum_{n=1}^N (w^n)^2}.
\end{equation}
Note that these importance weights are those of a new random
sample drawn from $q_t(\theta)$. The \ESS\ is related to the
coefficient of variation (CV) of the importance weights (see
\cite{kong1994sequential}), and has been used in
\cite{ardia2008adaptive,cappe2008ais,cornebise2008adaptive}. The
\ESS\ is determined by the proposal and the
sample size, $N$; once $N$ is large enough, the ratio of \ESS\
to $N$ should be stable. In this paper we therefore use \ESS$/N$ as a
quantity to monitor the efficiency of $q_t(\theta)$. When \ESS$/N$
is large enough, Algorithm 1 proceeds to the next iteration. If \ESS$/N$ is
smaller than a given threshold, we continue to update $q_t(\theta)$
until the requirement is satisfied, i.e., until the resulting \ESS$/N$ is
large enough. In this way, we ensure that the finally obtained
$q_t(\theta)$ is indeed a good approximation to $\pi_t(\theta)$.
Suppose that the current \ESS$/N$ is not large enough; we then consider
two approaches for updating $q_t(\theta)$ again. First we
check whether the particle with the largest importance weight,
denoted $\theta^{W}$, $W\in\{1,\ldots,N\}$, is located in the tail of
the proposal. If it is, meaning that $q_t(\theta^{W})$ is smaller
than the proposal densities of most other particles, we
perform an operation called component-splitting (see Section
\ref{sec:split}); otherwise, we perform the IS plus EM procedure
again. The reason is that the IS plus EM procedure is
essentially a local search: it fixes the number of
mixture components and the EM itself is a local
optimization method. The component-splitting procedure, by contrast,
breaks the local model structure by splitting one mixture component
into two, thus facilitating the search for the global optimum or at
least a better local one (see Section \ref{sec:split}).
\subsection{Online Approach to Tune the Number of Mixture Components}
In the IS plus EM procedure above, the number of components $M$ is
held constant, an assumption that is generally too restrictive. To deal
with this issue, we propose a series of adaptive
mechanisms to adjust $M$ online, namely component merging,
component-splitting and component deletion.
For each of these procedures, we provide the conditions under which it
is performed.
\subsubsection{Component Merging}
This step merges mixture components that share a large amount of
mutual information. To begin with, we introduce the concept of mutual
information. Suppose that a set of
equally weighted data points $\{\theta^n\}_{n=1}^N$ is
available, whose distribution is modeled by a mixture density
function $q=\sum_{m=1}^M\alpha_m f_m$. Given any data point
$\theta^n$, the posterior probability of each mixture component
$f_m$ can be calculated by
\begin{equation}
\rho_m(\theta^n)=\alpha_m f_m(\theta^n)/q(\theta^n).
\end{equation}
If two mixture components $f_i$ and $f_j$ satisfy
$\rho_i(\theta^n)=\rho_j(\theta^n)$ for all
$n\in\{1,\ldots,N\}$, then $f_i$ and $f_j$ are regarded as containing
completely overlapping information. Inspired by the component
merging criterion proposed in \cite{ueda2000sam,wang2004estimation},
we define the mutual information between $f_i$ and $f_j$ as
\begin{equation}\label{mutual_info}
\MI(i,j)=\frac{(\Upsilon_i-\bar{\Upsilon}_i)^T(\Upsilon_j-\bar{\Upsilon}_j)}{\|\Upsilon_i-\bar{\Upsilon}_i\|\cdot\|\Upsilon_j-\bar{\Upsilon}_j\|},
\end{equation}
where
$\Upsilon_m=[\rho_m(\theta^1),\ldots,\rho_m(\theta^N)]^T$,
$\|\cdot\|$ denotes the Euclidean
norm, and $\bar{\Upsilon}_m=\left(\frac{1}{N}\sum_{n=1}^N
\rho_m(\theta^n)\right)\textbf{1}_N$, with $\textbf{1}_N$ denoting the
$N$-dimensional column vector of ones.
Note that $\MI(i,j)\in[-1,1]$, and $\MI(i,j)=1$ if and only if $f_i$
and $f_j$ contain completely overlapping information.
In our algorithmic framework, each data point $\theta^n$ is weighted
by $w^n$. Accordingly, $\bar{\Upsilon}_m$ becomes $\left(\sum_{n=1}^N
w^n \rho_m(\theta^n)\right)\textbf{1}_N$, and the mutual information
between components $i$ and $j$ becomes
\begin{equation}\label{Merge_criterion_correlation}
\MI(i,j)=\frac{(\Upsilon_i-\bar{\Upsilon}_i)^T
D(w)(\Upsilon_j-\bar{\Upsilon}_j)}{\sqrt{\sum_{n=1}^N
w^n(\rho_i(\theta^{n})-\bar{\rho}_i)^2}\sqrt{\sum_{n=1}^N
w^n(\rho_j(\theta^{n})-\bar{\rho}_j)^2}},
\end{equation}
where $\bar{\rho}_m=\sum_{n=1}^N w^n\rho_m(\theta^n)$ and
\begin{equation}
D(w)=\mbox{diag}([w^1,\ldots,w^N]).
\end{equation}
We judge whether two components, $i$ and $j$, need to be merged into
one by comparing $\MI(i,j)$ with a threshold $0<T_{merge}<1$. If
$\MI(i,j)>T_{merge}$, we merge the $i$th component into the $j$th
and set $M=M-1$; otherwise, we keep the original mixture. The
merging operation is
\begin{eqnarray}\label{Linear_combination_components}
\alpha_{j}&\leftarrow&\alpha_{i}+\alpha_{j},\\
\mu_{j}&\leftarrow&\frac{\alpha_{i}\mu_{i}+\alpha_{j}\mu_{j}}{\alpha_{i}+\alpha_{j}},\\
\Sigma_{j}&\leftarrow&\frac{\alpha_{i}\Sigma_{i}+\alpha_{j}\Sigma_{j}}{\alpha_{i}+\alpha_{j}},
\end{eqnarray}
i.e., a linear combination of the two old components, as
used in \cite{wang2004estimation}. We traverse every pair of the
original components, checking the merging condition and performing the
above merging operation whenever the condition is satisfied.
We execute the merging operation when superfluous components
may exist, e.g., when $q_0(\theta)$ is initialized with too many
overlapping components, or when too many new components appear in one
iteration of Algorithm 1 via component-splitting (see Section
\ref{sec:split}).
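The weighted criterion (\ref{Merge_criterion_correlation}) and the
merge itself can be sketched as follows. This is a schematic rendering
under the linear-combination merge written above; the function names
and the stored-as-lists convention are illustrative assumptions.
\begin{verbatim}
import numpy as np

def weighted_mi(rho_i, rho_j, w):
    """Weighted correlation of two responsibility vectors (MI criterion)."""
    mi_bar, mj_bar = np.sum(w * rho_i), np.sum(w * rho_j)
    num = np.sum(w * (rho_i - mi_bar) * (rho_j - mj_bar))
    den = np.sqrt(np.sum(w * (rho_i - mi_bar) ** 2)) * \
          np.sqrt(np.sum(w * (rho_j - mj_bar) ** 2))
    return num / den

def merge_pair(alphas, mus, Sigmas, i, j):
    """Merge component i into component j (linear combination), then drop i.
    The parameters are held in Python lists, which are modified in place."""
    a_i, a_j = alphas[i], alphas[j]
    alphas[j] = a_i + a_j
    mus[j] = (a_i * mus[i] + a_j * mus[j]) / (a_i + a_j)
    Sigmas[j] = (a_i * Sigmas[i] + a_j * Sigmas[j]) / (a_i + a_j)
    for lst in (alphas, mus, Sigmas):
        del lst[i]

# Usage: rho is an (N, M) matrix of responsibilities, w the normalized weights;
# merge whenever weighted_mi(rho[:, i], rho[:, j], w) > T_merge.
\end{verbatim}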
\subsubsection{Component-splitting}\label{sec:split}
As mentioned in Section \ref{sec:IS_EM}, it may happen that, after
performing the IS plus EM procedure, \ESS$/N$ is still not large
enough, while the maximum-weight particle is located in the tail of
$q_t(\theta)$. In that case, we perform the mechanism proposed here,
called component-splitting. This procedure breaks the
local structure of $q_t(\theta)$ by splitting one component into
two, whereby the number of components increases from $M$ to $M+1$.
The component-splitting approach is performed as follows. First,
when we simulate draws from $q_t(\theta)$ to evaluate its
efficiency, we record the component of origin of
each draw. We then identify the parent component of the maximum-weight
sample, $\theta^W$, denoted
$f_p=\mathcal{S}_d(\theta|\mu_p,\Sigma_p)$, and determine which
samples originated from $f_p$. We collect the samples that
come from $f_p$ into a data array,
$\theta_{local}=\{\theta^{1},\ldots,\theta^{pN}\}$, then
split $f_p$ into two components,
$f_{c1}=\mathcal{S}_d(\theta|\mu_{c1},\Sigma_{c1})$ and
$f_{c2}=\mathcal{S}_d(\theta|\mu_{c2},\Sigma_{c2})$, and finally
replace $f_p$ in $q_t(\theta)$ by $f_{c1}$ and $f_{c2}$.
Next we describe how to obtain
$\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$. We first initialize
$\mu_{c1}=\theta^W$, $\mu_{c2}=\mu_p$,
$\Sigma_{c1}=\Sigma_{c2}=\Sigma_p$, and then specify a local mixture
$q_{local}=\alpha_{c1}f_{c1}+\alpha_{c2}f_{c2}$, where
$\alpha_{c1}=\alpha_{c2}=0.5$; finally we perform a local IS plus EM
procedure to update
$\alpha_{c1},\alpha_{c2},\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$.
This local IS plus EM procedure starts by evaluating the local
weight of each $\theta^{i}$ in $\theta_{local}$ via
\begin{equation}
w^i=\frac{\pi_{t}(\theta^{i})/q_{local}(\theta^{i})}{\sum_{n=1}^{pN}\pi_{t}(\theta^n)/q_{local}(\theta^n)}.
\end{equation}
Substituting $\{\theta^i,w^i\}_{i=1}^{pN}$ and the local mixture
parameters,
$\alpha_{c1},\alpha_{c2},\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$,
into Equations (\ref{MLE_alpha}) to (\ref{MLE_Sigma}), we
obtain the updated parameter values of the local mixture
$q_{local}$. Suppose that $f_p$ has proportion mass
$\alpha_p$ in $q_t(\theta)$; we then assign $f_{c1}$ and $f_{c2}$
proportion masses $\alpha_p\alpha_{c1}$ and
$\alpha_p\alpha_{c2}$, respectively, in the updated $q_t(\theta)$,
in which $f_p$ is replaced by $f_{c1}$ and $f_{c2}$.
This component-splitting method makes use of the local behavior of
the target distribution around the maximum-weight sample. It adds
mixture proposal probability mass to those areas of the parameter
space near the maximum-weight sample, where there is
relatively little proposal probability mass. Since the local IS
plus EM procedure is a local mixture learning process, the local
structure around the maximum-weight sample is
modeled well by the updated local mixture, which in turn yields a
better global mixture representation of the whole target
distribution and may lead to a satisfactory \ESS$/N$ value. Note
that the above component-splitting procedure may be repeated as long
as the condition for executing it remains satisfied.
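A schematic of the initialization and local weighting used in the
split is given below; the subsequent local EM update can reuse the
weighted EM step sketched in Section \ref{sec:IS_EM}. All function and
variable names here are illustrative, not part of the specification.
\begin{verbatim}
import numpy as np

def init_split(theta_W, mu_p, Sigma_p):
    """Initialize the two children f_c1, f_c2 of the parent component f_p."""
    mus = [theta_W.copy(), mu_p.copy()]          # mu_c1 = theta^W, mu_c2 = mu_p
    Sigmas = [Sigma_p.copy(), Sigma_p.copy()]    # Sigma_c1 = Sigma_c2 = Sigma_p
    alphas = [0.5, 0.5]                          # alpha_c1 = alpha_c2 = 0.5
    return alphas, mus, Sigmas

def local_weights(theta_local, log_pi_t, q_local_pdf):
    """Self-normalized local IS weights of the samples originating from f_p."""
    lw = np.array([log_pi_t(t) for t in theta_local]) - np.log(q_local_pdf(theta_local))
    w = np.exp(lw - lw.max())
    return w / w.sum()
\end{verbatim}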
\subsubsection{Component-Deletion} \label{sec:deletion}
As mentioned above, when simulating draws from $q_t(\theta)$, we
record the component of origin of each sample. Any mixture
component that generates no sample is deemed to have no effect
in modeling the current target distribution, and is therefore deleted
from the current mixture proposal. Its probability mass is then
redistributed among, and thereby strengthens, the remaining components.
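In code, deletion reduces to dropping unused components and
renormalizing the remaining proportions; in this sketch the array
\texttt{origin} (an assumed bookkeeping variable, as in the sampling
sketch earlier) records which component generated each draw.
\begin{verbatim}
import numpy as np

def delete_unused(alphas, mus, Sigmas, origin, M):
    """Drop components that generated no sample; renormalize the rest."""
    used = np.isin(np.arange(M), np.unique(origin))
    alphas = np.asarray(alphas)[used]
    alphas = alphas / alphas.sum()               # redistribute probability mass
    mus = [m for m, keep in zip(mus, used) if keep]
    Sigmas = [S for S, keep in zip(Sigmas, used) if keep]
    return alphas, mus, Sigmas
\end{verbatim}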
\subsection{Practical Issues to be Considered}
The EM based mixture modeling approach provides a convenient and
flexible tool for estimating or approximating distributions,
but the mixture log-likelihood presents a well known problem,
namely degeneracy toward infinity \citep{biernacki2003degeneracy}.
(EM also suffers from slow convergence and may converge to a local
optimum \citep{biernacki2003degeneracy}; however, we need not worry
about that here, because on the one hand the proposed
component-splitting procedure facilitates the search for the global
optimum, and, on the other hand, the objective is only to find a good
mixture importance function, which need not be the global optimum.)
To prevent the mixture log-likelihood from diverging to infinity, we
assign an inverse-Wishart prior to the covariance matrix of each
mixture component and then perform EM to obtain the \emph{maximum a
posteriori} (instead of maximum likelihood) estimate of the covariance
matrix.
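For concreteness, one possible form of such a regularized update,
analogous to the MAP-EM covariance update for mixture models under an
inverse-Wishart $\mathcal{IW}(\Lambda_0,\nu_0)$ prior, replaces
(\ref{MLE_Sigma}) by
\begin{equation*}
\Sigma_m=\frac{\Lambda_0+\sum_{n=1}^N w^n\rho_m(\theta^n)u_m(\theta^n)C_n}{\nu_0+d+2+\sum_{n=1}^N w^n\rho_m(\theta^n)},
\end{equation*}
although the precise choice of $\Lambda_0$ and $\nu_0$, and the scaling
of the normalized weights, are implementation details that we do not
prescribe here.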
In the component-splitting step, the parameter updating of $f_{c1}$
and $f_{c2}$ is based upon the $pN$ particles originating from $f_p$. If
$pN$ is too small, e.g., smaller than a threshold
specified beforehand, denoted $N_s$, we generate another
$N_s-pN$ samples from $f_p$ and then perform the parameter updating of
$q_{local}$ based upon the $N_s$ particles. In this way, we
guarantee that a sufficient number of random draws is used to
capture the local structure of the posterior that is to be
represented by $q_{local}$.
In the component-splitting step described above, when we replace
$f_p$ in $q_t(\theta)$ by $f_{c1}$ and $f_{c2}$, we assign the
proportion mass of $f_p$ to $f_{c1}$ and $f_{c2}$ in proportion to
their local weights, i.e., $\alpha_{c1}$ and $\alpha_{c2}$,
respectively. Note that if $\alpha_p$ is very small, e.g.,
$0.0001$, the newly added mixture components, $f_{c1}$ and $f_{c2}$,
will have only an insignificant effect in the updated $q_t(\theta)$,
even though they may contain new and important
information about the posterior. To guarantee that $f_{c1}$ and
$f_{c2}$ play a role in $q_t(\theta)$, at least to some extent,
we propose the following operations. We first specify a threshold
$\alpha_{threshold}$, which may be, e.g., 0.1 or 0.2. Then, if
$\alpha_p>\alpha_{threshold}$, we assign weights to $f_{c1}$ and
$f_{c2}$ as described in Section \ref{sec:split}. Otherwise, we
assign $\alpha_{threshold}\alpha_{c1}$ and
$\alpha_{threshold}\alpha_{c2}$ to $f_{c1}$ and $f_{c2}$,
respectively, so that the sum of the weights of $f_{c1}$ and $f_{c2}$
in $q_t(\theta)$ is $\alpha_{threshold}$. To ensure that the sum of
the weights of all mixture components equals 1, we reduce the
weights of the components other than $f_{c1}$ and $f_{c2}$
proportionally to their original weights.
\subsection{Relationships to Existing Methods}
As the proposed AAIS method utilizes the idea of annealing as a
way of searching for isolated modes, it is closely related to a
series of MCMC methods based on the ``thermodynamic
integration'' technique for computing marginal likelihoods, which are
discussed under the names of bridge and path sampling in
\cite{gelman1998simulating}, parallel tempering MCMC in
\cite{gregory2010bayesian,gregory2005bayesian}, and Annealed
Importance Sampling (AIS) in \cite{neal2001ais}. The central idea of
these methods is to decompose the ratio of two normalization constants,
which we can think of for our purposes as the ratio of the marginal
likelihoods, $\Z_t$ and $\Z_0$, associated with
$\pi(\theta)$ and $q_0(\theta)$, respectively, into a series of
easier ratios, parameterised by an inverse temperature, $\lambda$. In
detail:
\begin{equation}
\frac{\Z_t}{\Z_0}=\frac{\Z_1}{\Z_0}\frac{\Z_2}{\Z_1}\dots\frac{\Z_t}{\Z_{t-1}}.
\end{equation}
Each of the ratios is estimated using samples from an MCMC method.
For example, to compute $\frac{\Z_t}{\Z_{t-1}}$, we need
samples drawn at equilibrium from the distribution
defined by
$\pi_{t-1}(\theta)\propto\pi(\theta)^{\lambda_{t-1}}q_0(\theta)^{1-\lambda_{t-1}}$.
These methods therefore require us to devise a suitable Metropolis
proposal at each temperature, which can be troublesome in
practice, since the structure of the target distribution at each
temperature is not known a priori. In addition, the involvement of
MCMC raises awkward questions at each temperature, e.g., how long the
burn-in period should last and how many samples are needed to obtain
a reliable estimate.
In contrast, the proposed AAIS is developed from the
perspective of mixture modeling, that is, how to adapt a mixture
model so that it resembles the posterior, while a similar tempering
idea, i.e., annealing, is employed to search for isolated,
peaky modes. We only need to select an initial mixture proposal
$q_0$, the annealing temperatures $\lambda_t$, and up to three
thresholds; after that, the mixture adaptation process is
completely automatic, and the marginal likelihood is finally
obtained simply by substituting the resulting mixture as the importance
function in (\ref{IS_for_marginal_lik}). This ease of
implementation results from the adaptive techniques involved
in the algorithm, including the EM based parameter estimation for
each mixture component and the proposed online approach for
adjusting the cardinality of the mixture, i.e., the number of mixture
components.
The AAIS algorithm falls within a general algorithmic framework
called the Sequential Monte Carlo Sampler (SMCS), proposed in
\cite{del2007sequential,del2006sequential}. However, SMCS utilizes
the same ``thermodynamic integration'' technique for calculating
marginal likelihoods, and so inherits all the drawbacks mentioned
above. More precisely, the proposed AAIS method can be seen as a
special case of SMCS that adopts adaptive independent mixture
proposals as the transition kernels (see
\cite{del2007sequential,del2006sequential} for details) and uses IS,
instead of MCMC, for particle generation.