AIS_MarLik.tex.bak

% Use pdfLatex to compile
\documentclass[12pt]{article}
\usepackage[dvipsnames]{xcolor}
\usepackage[letterpaper,margin=1in]{geometry}
\usepackage{charter}
\usepackage{latexsym,subfig, amssymb, amsmath, amsfonts, graphicx, color,fancyvrb,natbib}
\usepackage{amsthm,wrapfig,url,bm,rotating,multirow}
\usepackage{url,setspace,epsfig}
\def\ci{\perp\!\!\!\perp}
\def\bibfontsize{\small}
\def\authorfmt#1{\textsc{#1}}
\def\dcuand{\&}
\def\MI{\textsf{MI}}
\def\PI{\textsf{PI}}
\def\KL{\textsf{KL}}
\def\ESS{\textsf{ESS }}
\def\Z{\textsf{Z}}
\def\BF{\textsf{BF}}
\renewcommand{\baselinestretch}{1.0}

\makeatletter
%\newcommand{\figcaption}{\def\@captype{figure}\caption}
\newcommand{\tabcaption}{\def\@captype{table}\caption}
\makeatother

\begin{document}
\vspace{-1in}
\title{Adaptive Annealed Importance Sampling Method for Calculating Marginal likelihood with Application to Bayesian Exo-planets Detection}
\author{Bin Liu, Merlise Clyde, Tom Loredo, Jim Berger}
\maketitle

\begin{abstract}
Motivated by an astrophysics problem, namely, extra-solar planet
searching, we propose an adaptive annealed importance sampling
method, which facilitates simulating draws from multi-modal joint
posterior distribution, and provides an effective and
easy-to-implement way for estimating marginal likelihood that is
required for Bayesian model comparison. This method is inspired by
the idea of 'annealing' as a way of searching isolated peaky modes
in the posterior and the fact that any continuous probability
density can be approximated well by a finite mixture model, provided
that it is affiliated with a sufficient number of components along
with correctly estimated parameters. The annealing strategy is
performed within a sequential importance sampling (SIS) framework,
in which a mixture function that consists of a batch of student's T
densities is gradually fit to approximate the posterior. We estimate
the parameters of each mixture component by an
Expectation-Maximization (EM) algorithm. We also propose online
approaches to tune the number of the mixture components, whose
invocation is guided by the effective sample size (ESS) when
performing IS. Both simulation studies and real data analysis
demonstrate the great efficiency of this method.
\end{abstract}

\section{Introduction}
In Bayesian statistics, accurately calculating the marginal
likelihood required in Bayesian model selection can be difficult.
The importance sampling method, for example, often breaks down in
practice because its accuracy depends on designing an importance
function that mimics the integrand, and building such a function can
be quite difficult even in lower dimensional settings.

In this paper, we are concerned with the Importance sampling (IS)
method for calculating marginal likelihoods. In particular, we
propose an adaptive method for constructing an importance function
that resembles the posterior. Section 2 outlines the scientific
issues that inspired this work, namely inferring the number of
planets around distant stars. The exoplanet application uses models
with $2+5p$ parameters where $p$ is the number of planets in the
model, so even a single planet model has a seven dimensional
parameter space to integrate over. Section 3 describes in detail how
importance sampling can be used to calculate marginal likelihoods
for model selection. Section 4 explains the method we developed. Our
method employs the annealing idea to deal with isolated peaky modes
in the posterior and the mixture modeling technique to encapsulate
the information on the posterior obtained so far into a mixture
model. It starts by defining a set of intermediate target
distributions between an easy-to-handle distribution to the
posterior. It then adapts a mixture model to fit these distributions
one by one. The resulting mixture model will finally mimic the
posterior. We detail the relationships between our method and the
existing methods in the literature at the end. Section 5 presents
two simulations, one in three dimensions and the other in seven, in
which the target functions were built to be difficult to integrate.
Section 6 reports real data analysis results and Section 7 discusses
the method and then concludes this paper.

\section {Bayesian Exoplanet Searches Based on RV technique}
In this section, we describe the radial velocity models and priors
used in this paper. We then formulate the problem of exoplanet
searches as a Bayesian model selection problem. See details in
\citep{bullard2009edc,Crooks_2007,ford2006improving}).
\subsection {Radial Velocity Model}
In a two-body system such as a star and planet, the pair rotate
together about a point lying somewhere on the line connecting their
centers of mass. If one of the bodies (the star) radiates light, the
frequency of this light measured by a distant observer will
cyclically vary with a period equal to the orbital period. This
Dopplar Effect is understood well enough that astronomers can
translate between frequency shift and the star's velocity toward or
away from Earth. The data collected by astronomers are
$y=\{t_i,v_i,\sigma_i\}_{i=1}^o$, where $t_i$ is the time of the
$i$th observation, $v_i$ is the observed velocity component at time
$t_i$ and $\sigma_i$ is the estimated error of the velocity
observation. Additionally, there is a source of (presumably)
Gaussian error from "stellar jitter", the random fluctuations in a
star's luminosity. The observed velocities are related to the
orbital parameters through
\begin{equation}\label{Velocity_Model}
v_i\thicksim \mathcal{N}\left(C+\bigtriangleup
V(t_i|\phi),\sigma_i^2+s^2\right),
\end{equation}
where $C$ is the constant center-of-mass velocity of the star
relative to Earth, $\bigtriangleup V(t_i|\phi)$ is a function to be
defined below, and $s^2$ is the 'stellar jitter' variance. The full
set of parameters $\{C,s^2,\phi\}$ will be denoted $\theta$.

Given the data set $y$, we consider a set of competing models
$\mathcal{M}_i$, $i\in\{0,1,\ldots\}$, where $\mathcal{M}_i$ has $i$
planet(s).

If a star does not host any orbiting planets, then the RV
measurements will be roughly constant over any period of time,
varying only due to the stellar jitter. Hence the function
$\bigtriangleup V(t_i|\phi)$ is zero for any value of $t$:
\begin{equation}\label{Velocity_0p_Model}
\bigtriangleup V(t|\phi)=0, \forall t\in \mathbb{R}.
\end{equation}
In this case, $\theta$ only has two elements: $C$ and $s^2$, and
Equation (\ref{Velocity_Model}) becomes
\begin{equation}\label{0p_Model}
v_i\thicksim \mathcal{N}\left(C,\sigma_i^2+s^2\right).
\end{equation}

Because of the orbital eccentricity and the fact that we do not
normally observe planetary systems side-on, the velocity shift vs.
time plot will not but a sinusoid, but will be a member of a more
complicated family of curves parameterized by
$\phi=(K,P,e,\omega,\mu_0)$, where $K\geq 0$ is the velocity
semi-amplitude, $P>0$ is the orbital period, $0\leq e\leq1$ denotes
eccentricity, $0\leq\omega\leq2\pi$ is argument of periastron and
$0\leq \mu_0\leq2\pi$ denotes the mean anomaly at time $t=0$. Given
these parameters, the velocity shift, due to the single planet, is
then just
\begin{equation}\label{Velocity_1p_Model}
\bigtriangleup V(t|\phi)=K[\cos(\omega+T(t))+e\cos(\omega)].
\end{equation}
Here $T(t)$ is the "true anomaly at time $t$" and is itself given by
\begin{equation}\label{true_anomaly}
T(t)=2\arctan\left[\tan(\frac{E(t)}{2})\sqrt{\frac{1+e}{1-e}}\right].
\end{equation}
where $E(t)$ is called the "eccentric anomaly at time $t$"; it's the
solution to the transcendental equation
\begin{equation}\label{transcendental_equation}
E(t)-e\sin(E(t))=\mbox{mod}\left(\frac{2\pi}{P}t+\mu_0,2\pi\right).
\end{equation}

If there are $j\geq2$ planets, the overall velocity shift
$\bigtriangleup V$ is simply the sum of the independent velocity
shifts of the individual planets:
\begin{equation}\label{Velocity_1p_Model}
\bigtriangleup V(t|\phi_1,\ldots,\phi_j)=\sum_{i=1}^j
K_i[\cos(\omega_i+T_i(t))+e_i\cos(\omega_i)].
\end{equation}
Note that these planets' mutual gravitational interactions are
negligible here.

Of course we do not know how many planets there are \emph{a priori}
- indeed, finding the number is a major aim. If we assume there are
at most $J$ planets, then there are a total $2+5j$ parameters,
$\theta_j=\{\mathcal{C},s^2,\phi_1,\ldots,\phi_j\}$ for each
possible number of planets $j\in\{1,\dots,J\}$.

\subsection{Priors}
We adopt the priors recommended by \cite{ford2006bms}, which is
based in part on mathematical convenience, and in part on their
approximate realism.

For $\mathcal{M}_0$ in (\ref {0p_Model}), the prior is set such that
the parameters are independent, and their marginal distributions are
given by:
\begin{equation}\label{prior_C}
p_C(C)=\left\{\begin{aligned}
\frac{1}{C_{max}-C_{min}} &~ \quad\quad\mbox{for} &~C_{min}\leq C\leq C_{max}\\
0\quad\quad &~ \quad\quad\mbox{otherwise} & \\
\end{aligned}\right.
\end{equation}
and
\begin{equation}\label{prior_sigma}
p_{s}(s)=\left\{\begin{aligned}
\frac{1}{\log\left(1+\frac{s_{max}}{s_0}\right)}\cdot\frac{1}{s_0+s} &~ \quad\quad\mbox{for} &~0<s\leq s_{max}\\
0\quad\quad &~ \quad\quad\mbox{otherwise} & \\
\end{aligned}\right.
\end{equation}
Finally the complete prior is given by:
\begin{equation}\label{prior_0p}
p_0(\theta_0)=p_0(C,s)=\frac{1}{C_{max}-C_{min}}\cdot\frac{1}{\log\left(1+\frac{s_{max}}{s_0}\right)}\cdot\frac{1}{s_0+s},
\end{equation}
where $C_{min}\leq C\leq C_{max}, 0<s \leq s_{max}.$

Before introducing prior setting for the parameter of
$\mathcal{M}_1$, we present the parameter transformations
recommended by \citep {bullard2009edc}:
\begin{itemize}
%\item Translate the data $t_i$ (essentially reparameterizing $\mu_0$)
%so that $t=0$ occurs in the middle of the observations$-$here we
%place it at the weighted mean of the observation times, with the
%weights being inversely proportional to the measurement errors. This
%reduces correlations between $\mu_0$ and $P$.
\item Use Poincar$\acute{\mbox{e}}$ variables $x\equiv e\cos\omega$ and
$y\equiv e\sin\omega$ instead of $e$ and $\omega$ to reduce (very
strong) correlations between $\mu_0$ and $\omega$. This is
particularly important for low eccentricity orbits.
\item Use $z\equiv (\omega+\mu_0) \mbox{mod}2\pi$ instead of $\mu_0$ to further reduce correlations between these two
parameters when $e\ll1$; for larger $e$, it still doesn't hurt.
\end{itemize}
These transformations aim to reduce the correlations between the
elements in the parameter space defined by
$(C,K,P,e,\omega,\mu_0,s^2)$, and make the posterior as Gaussian as
possible \citep {bullard2009edc}. It's also recommended to work in
$\dot{P}\equiv \log P$ rather than in $P$, in $\dot{K}\equiv \log K$
rather than in $K$, and in $\dot{s}\equiv \log s$ rather than in $s$
\citep{bullard2009edc}, to make the posterior be more Gaussian. In
that case, the model parameter becomes
$\theta=\{C,\dot{K},\dot{P},x,y,z,\dot{s}\}$. We adopt the above
transformed parameters, but note that the efficiency of the proposed
method will not require the posterior to be Gaussian or depend on
any parameter transformation.

In the transformed parameter space, the prior is set as follows:
\begin{equation}\label{prior_1p}
p_1(\theta_1)=\kappa_1\cdot\exp\dot{K}\cdot\frac{1}{1+\frac{\exp\dot{K}}{K_0}}\cdot\frac{1}{\sqrt{x^2+y^2}}\cdot\exp\dot{s}\cdot\frac{1}{1+\frac{\exp\dot{s}}{s_0}},
\end{equation}
where
\begin{equation}\label{kappa_in_prior_1p}
\kappa_1=\frac{1}{C_{max}-C_{min}}\cdot\frac{1}{\log\left(1+\frac{K_{max}}{K_0}\right)}\cdot\frac{1}{K_0}\cdot\frac{1}{\log\left(\frac{P_{max}}{P_{min}}\right)}\cdot\left(\frac{1}{2\pi}\right)^2\cdot\frac{1}{\log\left(1+\frac{s_{max}}{s_0}\right)}\cdot\frac{1}{s_0},
\end{equation}
and $C_{min}\leq C\leq C_{max}$, $\dot{K}\leq\log(K_{max})$,
$\log(P_{min})\leq\dot{P}\leq\log(P_{max})$, $x^2+y^2<1$, $0\leq
z\leq2\pi$, and $\dot{s}\leq\log(s_{max})$.

The involved constants are set based upon the physical realities
(e.g., an orbit with too small a period will result in the planet
getting consumed by the star) \citep{bullard2009edc}, as follows,
\begin{eqnarray}
P_{min} & \equiv & 1\mbox{day}\\
P_{max} & \equiv & 1000\mbox{years} \\
K_0 & \equiv & 1\mbox{ms}^{-1}\\
K_{max} & \equiv & 2128 \mbox{ms}^{-1} \\
C_{min} & \equiv & -K_{max}\\
C_{max} & \equiv & K_{max}\\
s_0 &  \equiv & 1 \mbox{ms}^{-1}\\
s_{max} & \equiv & K_{max}
\end{eqnarray}
\subsection{Using Bayesian Method to Count Planets}
From an astronomer's point of view, the most important question is
how to schedule telescope time in order to maximize the probability
of doing something useful such as determining the total number of
planets in the system or detecting at least one planet. Let us
examine the first of these aims. Finding the number of planets in a
system is, from a statistics point of view, a model selection
problem. Bayesian model selection requires calculation of marginal
likelihoods of two or more models in order to calculate Bayes
Factors,
\begin{equation}
BF(\mathcal{M}_i,\mathcal{M}_j)=\frac{m(y|\mathcal{M}_i)}{m(y|\mathcal{M}_j)}.
\end{equation}
Here $m(y|\mathcal{M}_i)$ is the marginal likelihood of the model
with $i$ planets, which is the likelihood integrated over the prior:
\begin{equation}\label{marginal_lik}
m(y|\mathcal{M}_i=\int
L(y|\theta_i,\mathcal{M}_i)p(\theta_i|\mathcal{M}_i)d\theta_i.
\end{equation}

\section{Importance Sampling for Marginal Likelihood Estimation}
In many situations the integral in Equation (\ref{marginal_lik}) is
intractable. However, it can be calculated approximately using
importance sampling (IS). IS involves designing an easy-to-use
probability density called the importance function, $q(\theta)$.

Let $\Z=p(y|\mathcal{M})$, we have
\begin{equation}\label{IS_for_marginal_lik}
\Z=\mathbb{E}_q\left\{\frac{L(\theta|y)p(\theta|\mathcal{M})}{q(\theta)}\right\}
\end{equation}
Then, given a random sample from $q$, $Z$ can be unbiasedly and
consistently estimated by
\begin{equation}\label{IS_solution_marlik}
\hat{\Z}=N^{-1}\sum_{i=1}^N\frac{L(\theta^i|y)p(\theta^i|\mathcal{M})}{q(\theta^i)}=N^{-1}\sum_{i=1}^N
w^i, \mbox{where}\quad \theta^i\sim Q, i=1,\ldots,N
\end{equation}
Here we just let
$w^i=\frac{L(\theta^i|y)p(\theta^i|\mathcal{M})}{q(\theta^i)}$.
Therefore, for large values of $N$, $\hat{\Z}$ can be expected to be
a good approximation of $\Z$. The distribution of the error
$|\hat{\Z}-\Z|$ can be approximated via the central limit theorem
\begin{equation}
\mbox{Var}_q\left\{\hat{\Z}\right\}=\mathbb{E}_q\left\{\left(\hat{\Z}-\Z\right)^2\right\}=\frac{\sigma_q^2}{N}<\infty,
\end{equation}
when $\hat{\Z}$ has finite variance. Here
\begin{equation}
\sigma_q^2=\mbox{Var}_q\left\{\frac{L(\theta|y)p(\theta|\mathcal{M})}{q(\theta)}\right\}=\int_{R^d}\left(\frac{L(\theta|y)p(\theta|\mathcal{M})}{q(\theta)}-\Z\right)^2
q(\theta)d\theta=\Z^2\mathbb{E}_q\left\{\left(\frac{\pi(\theta)-q(\theta)}{q(\theta)}\right)^2\right\}.
\end{equation}
If the importance density is such that $\sigma_q^2<\infty$, then via
the central limit theorem
\begin{equation}
\frac{\hat{\Z}-\Z}{\sigma_q/\sqrt{N}}\rightarrow X
\end{equation}
as $N\rightarrow\infty$, where $X\sim \mathcal{N}(0,1)$. The above
results then implies that
$\hat{Z}-Z=\mathcal{O}(\frac{1}{\sqrt{N}})$.

One can construct confidence intervals to assess the quality of a
particular estimate of $\Z$ \citep{lefebvre2009path}. With a large
degree of certainty, the interval $\hat{\Z}\pm3\sigma_q/\sqrt{N}$
contains $\Z$ when $N$ is large and so the half length
$3\sigma_q/\sqrt{N}$ indicates how accurate $\hat{\Z}$ is. Since
$\sigma_q$ is unknown, a natural estimator of $\sigma_q^2$ is the
usual consistent variance estimator
\begin{equation}
\sigma_q^2\simeq\frac{1}{N}\sum\limits_{i=1}^N\left(w^i-\hat{\Z}\right)^2.
\end{equation}

The efficiency of importance sampling depends largely on how closely
the importance function mimics the shape of the target,
$L(y|\theta)p(\theta)$. A number of things can go wrong. If the
importance function has thinner tails than the target, the estimate
will tend to be unstable because the weights can be arbitrarily
large; on the other hand if the importance function has much fatter
tails than the target, too many of the $\theta_i$'s will be drawn
from the tails, resulting in a seemingly stable but biased estimate.
The best case is an importance function with slightly heavier tails
than the target. Of course asymptotically it should not matter which
importance function is used; for samples of practical size, though,
a poor choice can be disastrous.

These problems are only compounded in high-dimensional settings
where the target is often extremely complicated. Unless one is very
lucky indeed it is usually impossible to get accurate results in
real-life situations with dimension greater than 3.

\section{Annealed Adaptive IS Method for Constructing Importance Function}
In this section, we propose the adaptive mixture modeling approach,
which is used for constructing importance function that resembles
the posterior and then is used in (\ref{IS_solution_marlik}) for
calculating the marginal likelihood.

Let $\pi(\theta)$ denote the posterior distribution, which is
proportional to $L(y|\theta)p(\theta)$. Note that computing
$L(y|\theta)p(\theta)$ for any $\theta$ is feasible, but that we are
not able to directly sample from the distribution it defines, i.e.
$\pi(\theta)$. The aim is to find a distribution, $q(\theta)$, that
approximates $\pi(\theta)$, which we are also able to evaluate.

We first select an initial proposal distribution $q_0(\theta)$, then
define a series of intermediate distributions between $q_0(\theta)$
and $\pi(\theta)$, by
\begin{equation}
\pi_t(\theta)\propto
q_0(\theta)^{1-\lambda_t}\pi(\theta)^{\lambda_t}, \quad\quad
t=1,\ldots,T, \quad \mbox{and}\quad
0<\lambda_1<\lambda_2<\dots<\lambda_T=1.
\end{equation}
Finally we propose an iterative method to construct $q(\theta)$, as
shown in Algorithm 1.
\begin{center}
\textbf{Algorithm 1. Iterative Proposal Construction}
\end{center}
\begin{itemize}
\item Use $\pi_1(\theta)$ as the target distribution, with $q_0(\theta)$
as the proposal. Tune the parameters of $q_0(\theta)$ to obtain an
updated probability, denoted $q_1(\theta)$, that approximates
$\pi_1(\theta)$.
\item Similarly as in step 1, tune parameters of $q_1(\theta)$ to
obtain an updated proposal,$q_2(\theta)$, which is used to
approximate $\pi_2(\theta)$.
\item $\dots$.
\item Use $\pi_T(\theta)$ as the target distribution, tune parameters of
$q_{T-1}(\theta)$ to obtain $q_{T}(\theta)$, that approximates
$\pi_T(\theta)=\pi(\theta)$. Then $q_{T}(\theta)$ is actually the
importance function which we want.
\end{itemize}

We see that the annealing strategy is adopted in the above iterative
procedure. In what follows, we propose an adaptive method to handle
the task of updating $q_{t-1}(\theta)$ to $q_{t}(\theta)$ in
Algorithm 1.

\subsection{Multivariate Student's T-mixture Model}
Considering possible complexities in the structure of
$\pi_t(\theta)$, we employ a mixture model to define
$q_{t}(\theta)$. Note that any continuous probability density
function can be approximated by a finite mixture model, provided
that the mixture has a sufficient number of components along with
correctly estimated parameters
\citep{bishop2005neural,zeevi1997det}. In particular, we equip
$q_{t}(\theta)$ with the Student's T distribution as the mixture
component, making it easier to detect modes that are far apart
thanks to its fat tail structure. Let $\mathcal{S}_d(\mu,\Sigma,v)$
denote a $d$-variate Student's T distribution, where $\mu$ is the
center, $\Sigma$ the positive definite inner product matrix, and $v$
is the degrees of freedom. We just fix $v$ to be a constant, e.g., 5
in the following. So the mixture model we use is:
\begin{equation}\label{Definition_mixture}
q(\theta|\psi)=\sum\limits_{m=1}^M \alpha_{m}
\mathcal{S}_d(\theta|\mu_m,\Sigma_m)
\end{equation}
where
$\psi=\{M,\alpha_1,\ldots,\alpha_M,\mu_1,\ldots,\mu_M,\Sigma_1,\ldots,\Sigma_M\}$
is the mixture parameter. Here $\alpha_m$ is the probability mass of
the $m$th component. It satisfies that $\alpha_{m}\geq
0,\sum_{m=1}^M \alpha_{m}=1$. So the task of fitting a T-mixture
model to a probability density translates into seeking suitable
parameter values for $\psi$.

\subsection {Proposal Adaptation via IS plus EM}\label{sec:IS_EM}
Let's focus on iteration $t$ in Algorithm 1. We first simulate an
random sample, $\theta^n$, from $q_{t-1}(\theta)$, with the
importance weight calculated by
\begin{equation}
w^n=\frac{\frac{\pi_{t}(\theta^n)}{q_{t-1}(\theta^n)}}{\sum_{n=1}^N\frac{\pi_{t}(\theta^n)}{q_{t-1}(\theta^n)}},
\end{equation}
where $N$ denotes the sample size. The resulting weighted particle
set then can be seen as a particle approximation to the current
target distribution $\pi_t(\theta)$, i.e.
\begin{equation}\label{delta_mass_approx}
\pi_t(\theta)\simeq\sum\limits_{n=1}^Nw^i\delta(\theta^i),
\end{equation}
where $\delta(\cdot)$ is the delta mass function.

Here we adopt the Kullback-Leibler (KL) divergence to measure the
distance between any two distributions. Then the aim is to obtain a
proposal, $q_t(\theta)$, that minimizes the KL distance with respect
to $\pi_t(\theta)$, which is shown to be
\begin{equation}\label{KL}
\mathcal{D}[\pi_t(\theta)||q_t(\theta|\psi)]=\mathbb{E}_{\pi_t(\theta)}\left[\log\frac{\pi_t(\theta)}{\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta|\mu_m,\Sigma_m)}\right].
\end{equation}
where $\mathbb{E}_f$ denotes expectation with respect to a function
$f$. It's clear that minimizing (\ref{KL}) in $(\alpha,\mu,\Sigma)$
is equivalent to maximizing
\begin{equation}\label{MLE}
\int\pi_t(\theta)\log\left(\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta|\mu_m,\Sigma_m)\right).
\end{equation}
Substituting Equation (\ref{delta_mass_approx}) in the above, we
have
\begin{equation}\label{Max_log_lik}
\sum_{i=1}^N w^i\log\left(\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta^i|\mu_m,\Sigma_m)\right).
\end{equation}

At this stage, the task translates into a mixture-density parameter
estimation problem, that is, how to optimize the parameter values of
$\psi$ to make the resulting mixture density function fit the data
$\{\theta^n,w^n\}_{n=1}^N$ as well as possible.

Given data component $\theta^n$, the posterior probability of each
mixture component can be obtained according to the Bayes rule:
\begin{equation}\label{posterior_prob_component}
\rho_m(\theta^n)=\alpha_m
\mathcal{S}_d(\theta^n;\mu_m,\Sigma_m)/q(\theta^n),
\end{equation}
where $\alpha_m$ is regarded as the prior, and
$\mathcal{S}_d(\theta^n;\mu_m,\Sigma_m)$ is the associated
likelihood. The proportion mass of this component can then be
updated by
\begin{equation}\label{MLE_alpha}
\alpha_m=\sum\limits_{n=1}^N w^n \rho_m(\theta^n).
\end{equation}
And each pair of $\{\mu_m,\Sigma_m\}$ can be updated by the
following Equations:
\begin{equation}
\mu_m=\frac{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)\theta^n}{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)},
\end{equation}
\begin{equation}\label{MLE_Sigma}
\Sigma_m=\frac{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)C_n }{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)},
\end{equation}
where
\begin{equation}
u_m(\theta)=\frac{v+d}{v+(\theta-\mu_m)^T(\Sigma_m)^{-1}(\theta-\mu_m)}
\end{equation}
and $C_n=(\theta^n-\mu_m)(\theta^n-\mu_m)'$, where $'$ denotes
matrix transposition.

The operations from (\ref{posterior_prob_component}) to
(\ref{MLE_Sigma}) constitute one iteration of the EM algorithm,
which is used to do mixture density parameter estimation
\citep{peel2000rmm}. Let $q_t(\theta)$ be a T-mixture model, with
the updated parameter values $\{\alpha_m,\mu_m,\Sigma_m\}$, and it's
expected to be an approximation to $\pi_t(\theta)$. The efficiency
of $q_t(\theta)$, as an approximation of $\pi_t(\theta)$, can be
measured by the KL divergence using (\ref{Max_log_lik}), or the
effective sample size, which is defined to be
\begin{equation}\label{ESS}
\ESS=\frac{1}{\sum_{n=1}^N w_n^2}.
\end{equation}
Note that these importance weights are associated with the random
sample drawn from $q_t(\theta)$. The \ESS was originally proposed by
\cite{kong1994sequential} to measure the overall efficiency of an IS
algorithm. Heuristically it measures how many i.i.d samples are
equivalent to the $N$ weighted samples. The \ESS is related to the
coefficient of variation (CV) of the importance weights (see
\cite{kong1994sequential}), which finds applications in
\cite{ardia2008adaptive,cappe2008ais,cornebise2008adaptive}. It's
shown that \ESS is determined by the associated proposal and the
particle size, $N$, while, once $N$ is big enough, the ratio of \ESS
to $N$ should be stable. So in this paper, we use \ESS$/N$ as a
quantity to monitor the efficiency of $q_t(\theta)$. When \ESS$/N$
is big enough, let Algorithm 1 go to next iteration. If \ESS$/N$ is
smaller than a given threshold, we continue to update $q_t(\theta)$
until it satisfies the requirement, i.e. the resulting \ESS$/N$ is
big enough. In this way, we can make sure that the finally obtained
$q_t(\theta)$ is really a good approximation to $\pi_t(\theta)$.

Suppose that currently \ESS$/N$ is not big enough, we consider two
approaches that are used to update $q_t(\theta)$ again. First we
check whether the particle with the biggest importance weight,
denoted $\theta^{W}, W\in\{1,\ldots,N\}$, is located in the tail of
the proposal. If it is, which means $q_t(\theta^{W})$ is smaller
than the proposal densities of most other particles, we then select
to perform an operation called component-splitting (see Section
\ref{sec:split}), otherwise, we perform the IS plus EM procedure
again. The reason for doing this is that the IS plus EM procedure is
exactly a local search approach, because it fixes the number of
mixture components to be a constant and the EM itself is a local
optimization method, while the component-splitting procedure will
break the local model structure by splitting one mixture component
into two, thus facilitates to find the global or, at least a better
local optimum (see Section \ref{sec:split}).


\subsection {Online Approach to Tune the Number of the Mixture Components}

In the above IS plus EM procedure, the number of components $M$ is
imposed to be constant, while this assumption is of course not
reasonable. To deal with this issue, we propose a series of adaptive
mechanisms to adapt $M$ online, which include operations called
components-merging, components-splitting and components-deletion,
respectively. We'll provide respective conditions, on which each of
these procedures needs to be performed.

\subsubsection{Components Merging}
This step targets for merging mixture components that have much
mutual information among each other. To begin with, we first
introduce the concept, mutual information. Suppose that a set of
equally weighted data components $\{\theta^n\}_{n=1}^N$ are
available, whose distribution is modeled by a mixture density
function $q=\sum_{m=1}^M\alpha_mf_m$. Given any data component
$\theta^n$, the posterior probability of each mixture component
$f_m$ can be calculated by
\begin{equation}
\rho_m(\theta^n)=\alpha_m f_m(\theta^n)/q(\theta^n).
\end{equation}
If there are two mixture components $f_i$, $f_j$, for which it
satisfies that $\rho_i(\theta^n)=\rho_j(\theta^n)$ for any
$n\in\{1,\ldots,N\}$, then $f_i$ and $f_j$ are regarded to contain
completely overlapped information. Inspired by the components
merging criterion proposed in \cite{ueda2000sam,wang2004estimation},
we define the mutual information between $f_i$ and $f_j$ to be:
\begin{equation}\label{mutual_info}
\MI(i,j)=\frac{(\Upsilon_i-\bar{\Upsilon}_i)^T(\Upsilon_j-\bar{\Upsilon}_j)}{\|\Upsilon_i-\bar{\Upsilon}_i\|\cdot\|\Upsilon_j-\bar{\Upsilon}_j\|},
\end{equation}
 where
$\Upsilon_m=[\rho_m(\theta^1),\ldots,\rho_m(\theta^N)]'$, $'$
denotes transposition of a matrix, $\|\cdot\|$ denotes the Euclidean
norm, and $\bar{\Upsilon}_m=\frac{1}{N}\sum_{n=1}^N
\rho_m(\theta^n)\textbf{1}_N$. Here $\textbf{1}_n$ denotes the
$n-$dimensional column vector with all elements being $1$.

Note that $\MI(i,j)\in[-1,1]$, and $\MI(i,j)=1$ iff $f_i$ and $f_j$
contain completely overlapped information.

In our algorithm framework, each data point $\theta^n$ is weighted
by $w^n$. Accordingly, $\Upsilon_m$ should be $\sum_{n=1}^N w^n
\rho_m(\theta^n)$, and the mutual information between $i$, $j$ turns
out to be:
\begin{equation}\label{Merge_criterion_correlation}
\MI(i,j)=\frac{(\Upsilon_i-\bar{\Upsilon}_i)'
D(w)(\Upsilon_j-\bar{\Upsilon}_j)}{\sqrt{\sum_{n=1}^N
w^n(\Upsilon_i(\theta^{n})-\bar{\Upsilon}_i)^2}\sqrt{\sum_{n=1}^N
w^n(\Upsilon_j(\theta^{n})-\bar{\Upsilon}_j)^2}}
\end{equation}
where
\begin{equation}
D(w)=diag([w^1,\ldots,w^N])
\end{equation}
We judge if two components, $i$ and $j$, need to be merged into one
by comparing $\MI(i,j)$ with a threshold $0<T_{merge}<1$. If
$\MI(j,k)>T_{merge}$, we then merge the $i$th component into $j$,
and set $M=M-1$; otherwise, we maintain the original mixture. The
merging operation is specified to be:
\begin{eqnarray}\label{Linear_combination_components}
\alpha_{j}&=&\alpha_{i}+\alpha_{j}\\
\mu_{j}&=&\frac{\alpha_{i}\mu_{i}+\alpha_{j}\mu_{j}}{\alpha_{j}}\\
\Sigma_{j}&=&\frac{\Sigma_{i}\mu_{i}+\Sigma_{j}\mu_{j}}{\alpha_{j}}
\end{eqnarray}
which is a linear combination of the two old components, and has
been used in \cite{wang2004estimation}. We traverse each pair of the
original components to do the merging judgements and perform the
above merging operation if the merging condition satisfies.

We execute the above merging operation when superfluous components
may exist, e.g., when $q_0(\theta)$ is initialized with too many
overlapped components, or too many new components appear at one
iteration of Algorithm 1 via component-splitting (see Section
\ref{sec:split}).

\subsubsection{Component-splitting}\label{sec:split}
As mentioned in Section \ref{sec:IS_EM}, it may happen that, after
performing the IS plus EM procedure, \ESS$/N$ was not big enough,
and the maximum weight particle was located in the tail of
$q_t(\theta)$. In that case, we perform the mechanism proposed here,
called component-splitting. This procedure is used to break the
local structure of $q_t(\theta)$ by splitting one component into
two, whereby the component number, $M$, will increase to $M+1$.

This Component-splitting approach is performed as follows. First,
when we simulate draws from $q_t(\theta)$ for evaluating the
efficiency of $q_t(\theta)$, we record the component origination of
each draw. We then see the parent component of the maximum weight
sample, $\theta^W$, denoted
$f_p=\mathcal{S}_d(\theta|\mu_p,\Sigma_p)$, and know that which
samples were originated from $f_p$. We encapsulate the samples that
come from $f_p$ to a data array,
$\mathcal{\theta}_{local}=\{\theta^{1},\ldots,\theta^{pN}\}$, then
split $f_p$ into two components$-$
$f_{c1}=\mathcal{S}_d(\theta|\mu_{c1},\Sigma_{c1})$ and
$f_{c2}=\mathcal{S}_d(\theta|\mu_{c2},\Sigma_{c2})$, and finally
replace $f_p$ in $q_t(\theta)$ by $f_{c1}$ and $f_{c2}$.

Next we give operations to get
$\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$. We first initialize
$\mu_{c1}=\theta^W$, $\mu_{c2}=\mu_p$,
$\Sigma_{c1}=\Sigma_{c2}=\Sigma_p$, and then specify a local mixture
$q_{local}=\alpha_{c1}f_{c1}+\alpha_{c2}f_{c2}$, where
$\alpha_{c1}=\alpha_{c1}=0.5$, finally we perform a local IS plus EM
procedure to update
$\alpha_{c1},\alpha_{c2},\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$.
Such local IS plus EM procedure starts by evaluating the local
weight for each $\theta^{i}$ in $\mathcal{\theta}_{local}$ via
\begin{equation}
w^i=\frac{\frac{\pi_{t}(\theta^{i})}{q_{local}(\theta^{i})}}{\sum_{n=1}^{pN}\frac{\pi_{t}(\theta^n)}{q_{local}(\theta^n)}}.
\end{equation}
Substituting $\{\theta^i,w^i\}_{i=1}^{pN}$ and the local mixture
parameters,
$\alpha_{c1},\alpha_{c2},\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$
in the Equations from (\ref{MLE_alpha}) to (\ref{MLE_Sigma}), we
then can obtain the updated parameter values for the local mixture
$q_{local}$. Suppose that $f_p$ is assigned a proportion mass
$\alpha_p$ in $q_t(\theta)$, we assign $f_{c1}$ and $f_{c2}$
proportion masses $\alpha_{c1}=\alpha_p\alpha_{c1}$ and
$\alpha_p\alpha_{c2}$, respectively, for the updated $q_t(\theta)$,
in which $f_p$ is replaced by $f_{c1}$ and $f_{c2}$.

This component-splitting method makes use of the local behavior of
the target distribution around the maximum weight sample. It adds
mixture proposal probability mass to those areas of the parameter
space which is near the maximum weight sample, where there is
relatively little mixture proposal probability mass. As the local IS
plus EM procedure is actually a local mixture learning process, by
which the local structure around the maximum weight sample will be
modeled well by the updated local mixture, which then will elicit a
better global mixture representation of the whole target
distribution, which may lead to a satisfactory \ESS$/N$ value. Note
that the above component-splitting procedure may be repeated
provided that the condition of executing it satisfies.

\subsubsection{Component-Deletion} \label{sec:deletion}
As mentioned above, when simulating draws from $q_t(\theta)$, we can
record the component origination of each sample. For any mixture
component that generates no sample, we deem it has taken no effect
in modeling the current target distribution, thus we just delete it
in current mixture proposal. Its probability masses is then
redistributed strengthening the remaining components.

\subsection{Practical Issues to be Considered}
EM based mixture modeling approach provides a convenient and
flexible utility for estimating or approximating distributions,
while the mixture log-likelihood presents a well known problem,
i.e., degeneracy toward infinity
\citep{biernacki2003degeneracy,biernacki2007degeneracy} (actually EM
also suffers from the problem of slow convergence, so it may
converge to a local optimum \citep{biernacki2003degeneracy},
however, we don't need to worry about that, because on one side the
proposed component-splitting procedure facilitates to search the
global optimum, and, one the other side, the objective here is just
to find a good mixture importance function, which is not necessary
to be the global optimum). To prevent the mixture log-likelihood
from arising to infinity, we assign an inverse-wishart prior on the
covariance matrix of each mixture component, then perform EM to get
the \emph{maximum a posterior} (instead of maximum likelihood)
estimate of the covariance matrix.

In the component-splitting step, the parameter updating of $f_{c1}$
and $f_{c2}$ is based upon $pN$ particles originated from $f_p$. In
case of $pN$ being too small, e.g. smaller than a threshold
specified beforehand, denoted $N_s$, we need to generate another set
of $N_s-pN$ samples from $f_p$, then do the parameter updating for
$f_{local}$ based upon the $N_s$ particles. By doing this, we can
guarantee that we are using a sufficient number of random draws to
capture the local structure in the posterior, that will then be
represented by $f_{local}$.

In the above mentioned component-splitting step, when we replace
$f_p$ in $q_t(\theta)$ by $f_{c1}$ and $f_{c2}$, we assign the
proportion mass of $f_p$ to $f_{c1}$ and $f_{c2}$ proportionally to
their original weights, i.e. $\alpha_{c1}$ and $\alpha_{c2}$,
respectively. Note that if $\alpha_p$ is a very small value, e.g.,
$0.0001$, the new added mixture components, $f_{c1}$ and $f_{c2}$,
will only take an insignificant effect in the updated $q_t(\theta)$,
while $f_{c1}$ and $f_{c2}$ may contain new and important
information about the posterior. To guarantee that $f_{c1}$ and
$f_{c2}$ can play a role in $q_t(\theta)$, at lease to some extent,
we propose the following operations. We first specify a threshold
$\alpha_{threshold}$, which may be, e.g., 0.1 or 0.2. Then, if
$\alpha_p>\alpha_{threshold}$, we assign weights to $f_{c1}$ and
$f_{c2}$ as usual as shown in Section \ref{sec:split}. Otherwise, we
assign $\alpha_{threshold}\alpha_{c1}$ and
$\alpha_{threshold}\alpha_{c2}$ to $f_{c1}$ and $f_{c2}$,
respectively. So the sum of weights of $f_{c1}$ and $f_{c2}$ in
$q_t(\theta)$ will be $\alpha_{threshold}$. To make sure the sum of
the weights of the mixture components equals to 1, we reduce the
weights of the components, other than $f_{c1}$ and $f_{c2}$,
proportionally to their original weights.

\subsection{Relationships to Existing Methods}

As the proposed AAIS method utilizes the idea of 'annealing' as a
way of searching isolated modes, it has close relationships to a
series of MCMC methods, which are based on the "thermodynamic
integration" technique for computing marginal likelihoods, and are
discussed under the name of Bridge and Path sampling in
\cite{gelman1998simulating}, parallel tempering MCMC in
\cite{gregory2010bayesian,gregory2005bayesian} and Annealed
Importance Sampling (AIS) in \cite{neal2001ais}. The central idea of
these methods is to divide the ratio of two normalization constants,
which we can think of for our purposes as the ratio of marginal
likelihoods, $\Z_t$ and $\Z_0$, which are associated to
$\pi(\theta)$ and $q_0(\theta)$, respectively, into a series of
easier ones, parameterised by inverse temperature, $\lambda$. In
detail:
\begin{equation}
\frac{\Z_t}{\Z_0}=\frac{\Z_1}{\Z_0}\frac{\Z_2}{\Z_1}\dots\frac{\Z_t}{\Z_{t-1}}.
\end{equation}
Each of the ratios is estimated using samples from a MCMC method.
For example, to compute $\frac{\Z_t}{\Z_{t-1}}$, we should have
samples that are drawn from equilibrium from the distribution
defined by
$\pi_{t-1}(\theta)\propto\pi(\theta)^{\lambda_{t-1}}q_0(\theta)^{1-\lambda_{t-1}}$.
Therefore, these methods require us to devise a single Metropolis
proposal at each temperature, which should be a troublesome issue in
practice, since it is not prior known the target distribution's
structure at each temperature. In addition, the involvement of MCMC
results in annoying questions, e.g. how long should the 'burn-in'
period last, and how many samples are needed to get a reliable
estimate, which should be considered at each temperature.

Striking differently, the proposed AAIS is developed from the
perspective of mixture modeling, that is, how to adapt a mixture
model to resemble the posterior, while the similar temperature idea,
i.e., 'annealing', is also involved target for searching isolated
peaky modes. We only need to select an initial mixture proposal
$q_0$, the annealing temperatures $\lambda_t$, and up to three
thresholds, after which, the mixture adaptation process is
completely adaptive, and finally the marginal likelihood is simply
obtained by substituting the resulting mixture as the importance
function in (\ref{IS_for_marginal_lik}). Such property of
easy-to-implementation results from the adaptive techniques involved
in the algorithm, including the EM based parameter estimation for
each mixture component, and the proposed online approach used for
adjusting the cardinality of the mixture, i.e. the number of mixture
components.

This AAIS algorithm falls within a general algorithmic framework,
called Sequential Monte Carlo Sampler (SMCS), which is proposed in
\cite{del2007sequential,del2006sequential}. However, SMCS utilizes
the same "thermodynamic integration" technique for calculating
marginal likelihoods, so it inherits all drawbacks mentioned above.
Exactly, the proposed AAIS method can be seen as a special case of
SMCS, which adopts adaptive independent mixture proposals as the
transition kernels (see details in
\cite{del2007sequential,del2006sequential}), and the IS, instead of
MCMC, for particle generation.

Finally, AAIS has connections to a bunch of existing adaptive IS
methods in the literature, for example,
\cite{evans1991chaining,oh1993imf,cappe2008ais,cappe2004pmc,west1993mixture,ardia2008adaptive},
to name just a few. The most distinguishing point lies in that our
method adopts the annealing strategy, which is well recognized as a
powerful approach to search separated, especially peaky, modes in
the posterior. From the mixture modeling point of view, the proposed
mixture adaptation technique can be seen as an intermediate between
the completely non-parametric method in \cite{west1993mixture} and
the EM based semi-parametric methods, which is used in the adaptive
mixture importance sampling (AMIS) approach \citep{cappe2008ais}.
The AMIS approach employs a fixed number of mixture components,
while our method and the nonparametric method
\citep{west1993mixture} can let the data speak for themselves how
many components should be contained in the mixture.

\section{Simulation Tests}
\label{sec:simulation}

In this Section, we present two simulations, one in three dimensions
and the other in seven, in which the target functions were built to
be difficult to integrate, but their shapes can be inferred through
their mathematical expressions. The objective of the simulation
study is to evaluate the efficiency of the importance function
yielded by AAIS, via comparing its shape with that of the real
target, and comparing the calculated marginal likelihood with the
real answer.

\subsection{Example 1: Three Dimensional Flared
Helix}\label{sec:example1}

We used a cylindrical function with standard Gaussian cross-section
(in ($x$, $y$)) that has been deformed (in $z$) into a helix with
varying diameter. The helix performs three complete turns. The
target function is defined to be
\begin{equation}\label{helix_func}
p(x,y,z)=\delta(-30<z\leq30)\mathcal{N}([x,y]|\mu,\Sigma)
\end{equation}
where $\mu(1)=(z+35)\times\cos(\beta)$ and
$\mu(2)=(z+35)\times\sin(\beta)$, where $\beta=(z+30)\times\pi/10$.
Here $\Sigma=\mbox{diag}([1,1])$. Fig.\ref{fig:flared_helix_target}
shows the shape of this function. And the integration of this
function is shown to be
\begin{equation}
\int\int\int p(x,y,z)dxdydz=\int\int\int p(x,y|z)dxdy
p(z)dz=\int_{-30}^{30} 1 dz=60
\end{equation}
Treating (\ref{helix_func}) to be $L(y|\theta)p(\theta)$, the
marginal likelihood in this simulation case is then 60.

\begin{figure}[!htb]
\centerline{\includegraphics[width=3in,height=2.4in]{Fig/flared_helix_target.png}}
\caption{The simulated 3-D flared helix target distribution}
\label{fig:flared_helix_target}
\end{figure}

Before running the proposed Adaptive Annealed IS (AAIS) algorithm,
we select the initial proposal to be a T-mixture that is composed of
10 equally weighted components. The degree of freedom for each T
component is fixed to be 5. The centers of these components are
selected uniformly from a region restricted by
$x\in[-100,100],y\in[-100,100]$ and $z\in[-30,30]$. Then the
covariance matrix for every component is identical to
$\mbox{diag}[\sigma_x^2,\sigma_y^2,\sigma_z^2]$, where $\sigma_a$ is
the standard error of the argument $a$ in the components' centers
that has been just specified. The particle size $N=2000$,
$\lambda_t=0.1t$, and $t=1,\ldots,10$.

The adaptive mixture importance sampling (AMIS) approach
\cite{cappe2008ais} is involved for performance comparison. We
initialize it using the same setting as for the AAIS, and let it run
10 iterations to give the final mixture importance function.

Fig. \ref{fig:helix_scatter} shows the resulting posterior samples
(which are equally weighted via resampling) corresponding to AAIS
and AMIS, respectively. It's shown that the samples resulted from
AAIS are distributed in wider posterior area, which indicates that
the importance function given by AAIS captures more structures in
the posterior.

\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=3in,height=2.4in]{Fig/flared_helix_scatter.png}\includegraphics[width=3in,height=2.4in]{Fig/AIS_3d_felix_scatter.png}}
\end{tabular}
\caption{Three dimensional flared helix example. Left:posterior
samples produced by the proposed AAIS algorithm. Right:posterior
samples produced by the AMIS algorithm \citep{cappe2008ais}. In the
diagonal, the curves are kernel density estimates of the posterior
samples.} \label{fig:helix_scatter}
\end{figure}

A quantitative comparison between the importance functions obtained
by AAIS and AMIS is shown in Table \ref{comparison_3D}. Three
quantities are used for comparison, that are the \ESS, the KL
distance and the estimated marginal likelihood. Note that, we can
easily calculate the KL distance via a simple Monte Carlo for this
simulation case. As is shown in Table \ref{comparison_3D}, AAIS
gives correct answer to the marginal likelihood, closer KL distance
with respect to the real target, and bigger \ESS. So it further
demonstrates that the AAIS is much better than the AMIS in dealing
with this simulation case.

\begin{center}
\begin{tabular}{c||c|c|c}
 & \ESS/N & $\KL$ & Marginal Likelihood (real answer:60) \\
\hline AMIS \citep{cappe2008ais} & 0.1525 & 6.7830 & $47.2\pm3.3$\\
\hline AAIS proposed here & 0.4459 & 0.1586 & $59.7\pm2.0$\\
\hline
\end{tabular}
\tabcaption{Three dimensional flared helix example. Quantitative
Performance comparison.}\label{comparison_3D}
\end{center}

\subsection{Example 2: Outer Product of Seven Univariate Densities}
To test the concern whether good results from a 3-D example imply
good results in higher dimensions, we used as a target function the
outer product of seven univariate distributions normalized to
integrate to 1. These seven distributions are:
\begin{enumerate}
\item $\frac{3}{5}\mathcal{G}(10+x|2,3)+\frac{2}{5}\mathcal{G}(10-x|2,5)$
\item $\frac{3}{4}sk\mathcal{N}(x|3,1,5)+\frac{1}{4}sk\mathcal{N}(x|-3,3,-6)$
\item $\mathcal{S}(x|0,9,4)$
\item $\frac{1}{2}\mathcal{B}(x+3|3,3)+\frac{1}{2}\mathcal{N}(x|0,1)$
\item $\frac{1}{2}\varepsilon(x|1)+\frac{1}{2}\varepsilon(-x|1)$
\item $sk\mathcal{N}(x|0,8,-3)$
\item $\frac{1}{8}\mathcal{N}(x|-10,0.1)+\frac{1}{4}\mathcal{N}(x|0,0.15)+\frac{5}{8}\mathcal{N}(x|7,0.2)$
\end{enumerate}
Here $\mathcal{G}(\cdot|\alpha,\beta)$ denotes the gamma
distribution, $\mathcal{N}(\cdot|\mu,\sigma)$ denotes the normal
distribution, $sk\mathcal{N}(\cdot|\mu,\sigma,\alpha)$ denotes the
skew-normal distribution, $\mathcal{S}(\cdot|\mu,s,df)$ denotes the
student-T distribution, $\mathcal{B}(\cdot|\alpha,\beta)$ denotes
the beta distribution, and $\varepsilon(\cdot|\lambda)$ denotes the
exponential distribution. Dimension 2 has two modes bracketing a
deep ravine, dimension 4 has one low, broad mode that overlaps a
second sharper mode, and dimension 7 has three distinct,
well-separated modes. Only dimension 5 is symmetric. There is a
range of tail behavior as well, from Gaussian to heavy-tailed.

In this case, we initialize the proposed AAIS algorithm with a
T-mixture that is composed of 50 equally weighted components. The
degree of freedom for each T component is still fixed to be 5. The
centers of these components for each dimension are selected
uniformly $[-10,10]$. The procedure to initialize the covariance
matrix for every mixture component is identical to that used for the
example shown in Section \ref{sec:example1}. We specify a particle
size $N=8000$, and annealing schedule $\lambda_t=0.1t$, with
$t=1,\ldots,10$.

The AMIS method \citep{cappe2008ais} is also involved for
performance comparison. The AMIS is initialized identically as for
AAIS, and will run 10 iterations to give the final result.

Fig.\ref{fig:7D_scatter} depicts the scatterplot for the resulting
posterior samples (which has been equally weighted by resampling).
It's shown that, the AMIS missed modes in the 2nd and 7th dimension,
while the proposed AAIS manages to capture all the modes in the
posterior, which should be benefited from the annealing idea being
involved.

\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/seven_outer_product_scatter.png}\includegraphics[width=4in,height=3.2in]{Fig/AIS_7d_outter_scatter.png}}
\end{tabular}
\caption{Seven dimensional outer product example. Left:posterior
samples produced by the proposed AAIS algorithm. Right:posterior
samples produced by the AMIS algorithm \citep{cappe2008ais}. In the
diagonal, the curves are kernel density estimates of the posterior
samples.} \label{fig:7D_scatter}
\end{figure}

Similarly as for example 1, we give a quantitative comparison for
the involved algorithms in Table \ref{comparison_7d}. We see that
again the proposed AAIS algorithm leads to much better results for
every comparison criterion.

\begin{center}
\begin{tabular}{c||c|c|c}
 & \ESS/$N$ & $\KL$ & Marginal likelihood: (true value:1) \\
\hline AMIS \citep{cappe2008ais}& 0.2454 & 10.8804 & $0.4675\pm0.0246$\\
\hline AAIS proposed here & 0.4948 & 0.4075  & $1.0011\pm0.0303$\\
\hline
\end{tabular}
\tabcaption{Seven dimensional outer product example. Quantitative
Performance comparison}\label{comparison_7d}
\end{center}

\section{Real Data Tests}
In this section, we perform the proposed AAIS method to deal with
two real RV data set.

\subsection{HD73526 Data reported in \cite{tinney2003four}}

This data set contains 18 data components, and were claimed to
support an orbit of $190.5\pm3.0$ days \citep{tinney2003four}.
\cite{gregory2005bayesian} did a Bayesian re-analysis on this data
set using a parallel tempering MCMC algorithm, and reported three
possible orbits, with periods $127.88_{-0.09}^{+0.37}$,
$190.4_{-2.1}^{+1.8}$ and $376.2_{-4.3}^{+1.4}$ days, respectively.
\cite{gregory2005bayesian} also discussed the possibility of having
an additional planet, and reported that the Bayes factor is less
than 1 when comparing $\mathcal{M}_2$ with the $\mathcal{M}_1$.

We use the proposed AAIS algorithm to deal with $\mathcal{M}_0$,
$\mathcal{M}_1$ and $\mathcal{M}_2$, respectively. The marginal
likelihood estimation result is summarized in Table
\ref{marginal_likelihood_data1}. These estimates are reliable
indicated by their corresponding \ESS$/N$. The resulting Bayes
factors are shown in Table \ref{Bayes_factor_data1}. As indicated by
the result, this data set support the one-planet hypothesis, which
coincides the conclusions made by
\cite{tinney2003four,gregory2005bayesian}.

\begin{center}
\begin{tabular}{c|c|c}
 & Marginal Likelihood & \ESS$/N$\\
\hline $\mathcal{M}_0$ & $5.9013\times10^{-50}\pm5.1325\times10^{-52}$ & 0.9320\\
\hline $\mathcal{M}_1$ & $4.4886\times10^{-41}\pm3.2093\times10^{-42}$ & 0.5698\\
\hline $\mathcal{M}_2$ & $1.5511\times10^{-42}\pm3.2878\times10^{-43}$ & 0.3458\\
\hline
\end{tabular}
\tabcaption{HD73526 \cite{tinney2003four} Data Case. The calculated
marginal likelihoods}\label{marginal_likelihood_data1}
\end{center}

\begin{center}
\begin{tabular}{c|c}
 \BF$\{\mathcal{M}_1:\mathcal{M}_0\}$ & \BF$\{\mathcal{M}_2:\mathcal{M}_1\}$\\
\hline $7.606\times10^8$& 0.03456\\
\hline
\end{tabular}
\tabcaption{HD73526 \citep{tinney2003four} Data Case. The calculated
Bayes Factors}\label{Bayes_factor_data1}
\end{center}

Specifically for $\mathcal{M}_1$, we show the posterior samples
(equally weighted by resampling) in Fig. \ref{fig:scatter_1p_data1}
. We obtain two modes in $P$, that are $P_1=190.1_{-1.5}^{+2.2}$ and
$P_2=375.5_{-2.5}^{+2.0}$, respectively.

\begin{figure}[!htb]
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/scatter_1p_data1.png}}
\caption{HD73526 \cite{tinney2003four} Data Case. Scatter plot of
the posterior samples of
$\mathcal{M}_1$}\label{fig:scatter_1p_data1}
\end{figure}

\subsection{The HD73526 data set reported in \cite{tinney20062}}
This data set contains 30 RV data components and was claimed to have
two planets \citep{tinney20062}.

We calculate the marginal likelihoods using the proposed AAIS
algorithm, and the result is shown in Table
\ref{marginal_likelihood_data2}. Indicated by the criterion
\ESS$/N$, we deem that the result is reliable. The resulting Bayes
factors can be seen in Table \ref{Bayes_factor_data2}. So our result
supports the argument in \cite{tinney20062}$-$there are two planets
underlying this data set.

\begin{center}
\begin{tabular}{c|c|c}
 & Marginal Likelihood & \ESS$/N$\\
\hline $\mathcal{M}_0$ & $8.9566\times10^{-77}\pm1.0852\times10^{-78}$ & 0.9510\\
\hline $\mathcal{M}_1$ & $5.8519\times10^{-70}\pm1.7077\times10^{-71}$ & 0.6545\\
\hline $\mathcal{M}_2$ & $4.8122\times10^{-65}\pm1.4284\times10^{-66}$ & 0.2034 \\
\hline
\end{tabular}
\tabcaption{HD73526 \citep{tinney20062} Data Case. The calculated
marginal likelihoods}\label{marginal_likelihood_data2}
\end{center}

\begin{center}
\begin{tabular}{c|c}
 \BF$\{\mathcal{M}_1:\mathcal{M}_0\}$ & \BF$\{\mathcal{M}_2:\mathcal{M}_1\}$\\
\hline $6.534\times10^6$& $8.233\times10^4$\\
\hline
\end{tabular}
\tabcaption{HD73526 \citep{tinney20062} Data Case. The calculated
Bayes Factors}\label{Bayes_factor_data2}
\end{center}

The posterior samples (equally weighted by resampling) corresponding
to $\mathcal{M}_1$ is shown in Fig.\ref{fig:scatter_1p_data2}, and
the algorithm again captures two modes in $P$, that are
$P_1=193.1162_{-3.7}^{+2.1}$ and $P_2=374.8732_{-5.8}^{+6.9}$.

When dealing with $\mathcal{M}_2$, we still restrict the period of
the first planet to be smaller than that of the second planet.
Fig.\ref{fig:scatter_2p_data2} shows the posterior samples of
$\mathcal{M}_2$. Given these posterior samples, we get the periods
of the two planets, that are $P_1=187.9379_{-0.8}^{+2.1}$ and
$P_2=377.3030_{-4.5}^{+5.2}$, respectively.
\begin{figure}[!htb]
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/scatter_1p_data2.png}}
\caption{scatterplot of the posterior samples for $\mathcal{M}_1$
for the newer data set}\label{fig:scatter_1p_data2}
\end{figure}

\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/scatter1_2p_data2.png}\includegraphics[width=4in,height=3.2in]{Fig/scatter2_2p_data2.png}}\\
\end{tabular}
\caption{HD73526 \cite{tinney20062} Data Case. Scatter plot of the
posterior samples of $\mathcal{M}_2$. Left: Scatter plot of the
posterior samples of the first planet parameter; Right: Scatter plot
of the posterior samples of the second planet parameter}
\label{fig:scatter_2p_data2}
\end{figure}


Given samples from the posterior, we can easily get the minimum mean
squared estimate (MMSE) of the RV at any given time, and then plot
the RV curve. In Fig.\ref{fig:rv_comparison_data2}, we plot the
estimated RV curves on basis of $\mathcal{M}_0$, $\mathcal{M}_1$ and
$\mathcal{M}_2$, respectively, meanwhile the real RV observations
are depicted there for reference. As in shown, the RV estimation
yielded by $\mathcal{M}_2$ fits the real data best.

\begin{figure}[!htb]
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/rv_comparison_data2.png}}
\caption{Measured velocities (filled circles) vs. time for HD73526
\cite{tinney20062}. Error bars only includes internal uncertainties.
The curves are MMSE estimates of the RV yielded by $\mathcal{M}_0$,
$\mathcal{M}_1$ and $\mathcal{M}_2$,
respectively}\label{fig:rv_comparison_data2}
\end{figure}

\subsection{The HD217107 Data Case}
Star HD217107 was claimed to have two planets \citep{vogt2005five}.

We analyze the observed RV data for HD217107 using our AAIS method,
the calculated marginal likelihoods are shown in Table
\ref{marginal_likelihood_HD217107}, each corresponding to a
sufficiently big value of \ESS$/N$. The resulting Bayes factors can
be seen in Table \ref{Bayes_factor_HD217107}. It's shown that
$\mathcal{M}_2$ beats both $\mathcal{M}_0$ and $\mathcal{M}_1$.

\begin{center}
\begin{tabular}{c|c|c}
 & Marginal Likelihood & \ESS$/N$\\
\hline $\mathcal{M}_0$ & $3.6492\times10^{-168}\pm7.0150\times10^{-169}$ & 0.9644\\
\hline $\mathcal{M}_1$ & $3.7389\times10^{-141}\pm0.6077\times10^{-142}$ & 0.6448\\
\hline $\mathcal{M}_2$ & $3.0897\times10^{-108}\pm1.0011\times10^{-109}$ & 0.4119 \\
\hline
\end{tabular}
\tabcaption{HD217107 \citep{vogt2005five} Data Case. The calculated
marginal likelihoods}\label{marginal_likelihood_HD217107}
\end{center}

\begin{center}
\begin{tabular}{c|c}
 \BF$\{\mathcal{M}_1:\mathcal{M}_0\}$ & \BF$\{\mathcal{M}_2:\mathcal{M}_1\}$\\
\hline $1.025\times10^{27}$& $8.264\times10^{32}$\\
\hline
\end{tabular}
\tabcaption{HD217107 \citep{vogt2005five} Data Case. The calculated
Bayes Factors}\label{Bayes_factor_HD217107}
\end{center}

\subsection{The HD37124 Data Case}
Star HD37124 was claimed to have three planets \citep{vogt2005five}.

We analyze the associated RV data set using our AAIS method, the
marginal likelihoods calculated by our AAIS algorithm are shown in
Table \ref{marginal_likelihood_37124}. Indicated by the criterion
\ESS$/N$, the results are reliable. The resulting Bayes factors can
be seen in Table \ref{Bayes_factor_37124}, which support the
conclusion of \cite{tinney20062}, i.e. this data set supports the
three-planet hypothesis.

\begin{center}
\begin{tabular}{c|c|c}
 & Marginal Likelihood & \ESS$/N$\\
\hline $\mathcal{M}_0$ & $2.0889\times10^{-110}\pm2.1857\times10^{-112}$ & 0.9427\\
\hline $\mathcal{M}_1$ & $1.0717\times10^{-106}\pm1.3077\times10^{-107}$ & 0.1737\\
\hline $\mathcal{M}_2$ & $1.9830\times10^{-97}\pm1.1451\times10^{-98}$ & 0.1798 \\
\hline $\mathcal{M}_3$ & $2.9228\times10^{-84}\pm1.2563\times10^{-85}$ & 0.1305 \\
\hline
\end{tabular}
\tabcaption{HD37124 \citep{vogt2005five} Data Case. The calculated
marginal likelihoods}\label{marginal_likelihood_37124}
\end{center}

\begin{center}
\begin{tabular}{c|c|c}
\BF$\{\mathcal{M}_1:\mathcal{M}_0\}$ &\BF$\{\mathcal{M}_2:\mathcal{M}_1\}$ & \BF$\{\mathcal{M}_3:\mathcal{M}_2\}$\\
\hline $5.13\times10^3$ & $1.850\times10^9$ & $1.474\times10^{13}$\\
\hline
\end{tabular}
\tabcaption{HD37124 \citep{vogt2005five} Data Case. The calculated
Bayes Factors}\label{Bayes_factor_37124}
\end{center}

In Fig.\ref{fig:hd37124_fit}, we plot the estimated RV curves on
basis of $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$,
respectively, where the real RV observations is also depicted for
reference. As is shown, the RV estimation yielded by $\mathcal{M}_3$
fits the real data best.



\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/hd37124_1p_fit.png}\includegraphics[width=4in,height=3.2in]{Fig/hd37124_2p_fit.png}}\\
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/hd37124_3p_fit.png}}
\end{tabular}
\caption{Measured velocities (filled circles) vs. time for HD37124
\cite{tinney20062}. Error bars only includes internal uncertainties.
The MMSE estimates of the RV curves yielded by using
$\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$ are shown in
the top left, top right and the bottom panels, respectively.}
\label{fig:hd37124_fit}
\end{figure}

\subsection{The 47 Ursae Majoris (47 UMa) Data Case}
This 47 UMa data recently was analyzed in
\cite{gregory2010bayesian}.

We analyze the associated RV data set using our AAIS method, the
marginal likelihoods calculated by our AAIS algorithm are shown in
Table \ref{marginal_likelihood_uma47}. We obtain reliable estimates
for $\mathcal{M}_0$, $\mathcal{M}_1$ and $\mathcal{M}_2$, while the
estimate for $\mathcal{M}_3$ is unreliable, the \ESS$/N$ is 0.0089,
which is not big enough. Indicated by the criterion \ESS$/N$, the
results are reliable. So we argue that this data set supports
$\mathcal{M}_2$ more than $\mathcal{M}_0$ and $\mathcal{M}_1$, but
we are not sure about the relative strength between $\mathcal{M}_2$
and $\mathcal{M}_3$. \cite{gregory2010bayesian} concluded that this
data set has three planets, however, gave an much ambiguous estimate
for the third planet's period.

\begin{center}
\begin{tabular}{c|c|c}
 & Marginal Likelihood & \ESS$/N$\\
\hline $\mathcal{M}_0$ & $2.0198\times10^{-1004}\pm9.2572\times10^{-1006}$ & 0.1002\\
\hline $\mathcal{M}_1$ & $3.4400\times10^{-896}\pm3.10\times10^{-897}$ & 0.5643\\
\hline $\mathcal{M}_2$ & $1.3500\times10^{-816}\pm1.77\times10^{-817}$ & 0.3324 \\
\hline $\mathcal{M}_3$ & $2.8970\times10^{-825}\pm9.1623\times10^{-825}$ & 0.0089 \\
\hline
\end{tabular}
\tabcaption{uma47 \citep{gregory2010bayesian} Data Case. The
calculated marginal likelihoods}\label{marginal_likelihood_uma47}
\end{center}

\begin{center}
\begin{tabular}{c|c|c}
 \BF$\{\mathcal{M}_1:\mathcal{M}_0\}$ & \BF$\{\mathcal{M}_2:\mathcal{M}_1\}$ & \BF$\{\mathcal{M}_3:\mathcal{M}_2\}$\\
\hline $1.703\times10^{108}$ & $3.924\times10^{79}$ & $ ?$\\
\hline
\end{tabular}
\tabcaption{uma47 \citep{gregory2010bayesian} Data Case. The
calculated Bayes Factors}\label{Bayes_factor_uma47}
\end{center}

In Fig.\ref{fig:uma47_fit}, we plot the estimated RV curves on basis
of $\mathcal{M}_1$, $\mathcal{M}_2$ respectively, and again the real
RV observations are depicted there for reference. As is shown, the
RV estimation yielded by $\mathcal{M}_2$ fits better.

\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/uma47_1p_fit.png}\includegraphics[width=4in,height=3.2in]{Fig/uma47_2p_fit.png}}
\end{tabular}
\caption{Measured velocities (filled circles) vs. time for the 47
Ursae Majoris (47 UMa) \citep{gregory2010bayesian}. Error bars only
includes internal uncertainties. The MMSE estimates of the RV curves
yielded by using $\mathcal{M}_1$ and $\mathcal{M}_2$ are shown in
the left and the right panels, respectively.} \label{fig:uma47_fit}
\end{figure}

\section{Discussions and Conclusions}
In this paper, we propose an adaptive mixture modeling approach to
construct importance function for IS. The resulting mixture model is
demonstrated to be capable of capturing separated peaky structures
in the posterior, even when it is high-dimensional (an at most
17-dimensional posterior is considered in this paper).
Straightforwardly, this approach facilitates simulating draws from
multi-modal joint posterior distribution by IS, and provides an
effective and easy-to-implement way for estimating marginal
likelihood that is required for Bayesian model comparison.
\bibliographystyle{jasa}
\bibliography{sais_bib}
\end{document}