# (PART) Part 2: Foundations {-}
# Learning from Data: A Roadmap {#roadmap}
:::: {.infobox .tlverse data-latex=""}
:::{.center data-latex=""}
**Learning Objectives**
:::
1. Translate scientific questions to statistical questions.
2. Define a statistical model based on knowledge about the scientific experiment
or study that generated the data.
3. Identify a causal parameter as a function of the observed data distribution.
4. Explain the following statistical and causal assumptions alongside their
implications: independent and identically distributed (i.i.d.) data,
consistency, no unmeasured confounding, no interference, and positivity.
::::
## Introduction {-}
The roadmap of statistical learning is concerned with the process of translating
real-world scientific questions to mathematical formalisms
necessary for formulating relevant statistical inference problems.
This involves viewing data as a random variable (complete with its own
underlying probability distribution), incorporating scientific knowledge into
the choice of statistical model, selecting a statistical target parameter that
represents an answer to the scientific question of interest, and developing
efficient estimators of the statistical estimand.
## The Roadmap {#roadmap-steps}
The roadmap is a six-stage process:
1. Define the data as a random variable with a probability distribution, $O \sim
P_0$
2. Specify the statistical model $\M$ realistically, such that $P_0 \in \M$
3. Translate the scientific question of interest into a statistical target
parameter $\Psi$ and establish the target population
4. Choose an estimator $\hat{\Psi}$ for $\Psi$ under realistic $\M$
5. Construct a measure of uncertainty for the estimate $\hat{\Psi}(P_n)$
6. Make substantive conclusions
### (1) Data: A random variable with a probability distribution, $O \sim P_0$ {-}
The dataset we are confronted with is the collection of the results of a
scientific (or natural) experiment. We can view the data as a _random variable_;
that is, if the same experiment were to be repeated, we should expect to see a
different realization of the data generated by the same underlying law governing
the experiment in question. In particular, if the experiment were repeated many
times, the underlying probability distribution generating the data, $P_0$, would
be revealed. The observed data on a single unit, $O$, may be thought of as being
drawn from this probability distribution $P_0$. Most often, we have $n$
_independent and identically distributed_ (i.i.d.) observations of the random
variable $O$ in our dataset. Then, the observed data is the collection $O_1,
\ldots, O_n$, where the subscripts denote the individual observational units.
While not all data are i.i.d., this is certainly the most common case in applied
data analysis. There are a number of techniques for handling non-i.i.d. data,
including establishing conditional independence, such that conditional on some
variable (e.g., subject ID for repeated measures data) the i.i.d. assumption
holds, and incorporating inferential corrections for repeated or clustered
observations, to name but a few.
#### The empirical probability measure, $P_n$ {-}
With $n$ i.i.d. observations in hand, we can define an empirical probability
measure, $P_n$. The empirical probability measure is an approximation of the
true probability measure, $P_0$, allowing us to learn from the observed data.
For example, we can define the empirical probability measure of a set of
variables, say $W$, to be the proportion of observations that belong in $W$.
That is,
\begin{equation*}
P_n(W) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in W)
\end{equation*}
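As a minimal illustration (using simulated data, so all values here are
hypothetical), computing the empirical measure of a set reduces to computing a
proportion:
```{r empirical-measure-sketch}
# the empirical measure of a set is simply the proportion of observations in it
set.seed(11)
O <- rnorm(100)                  # n = 100 hypothetical observations
P_n_W <- mean(O > 0 & O <= 1)    # P_n of the set W = (0, 1]
P_n_W
```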
In order to understand the scope for learning from a particular dataset, we next
need to ask *"What do we know about the process that led to the data's
generation?"* This brings us on to Step 2.
### (2) Defining the statistical model $\M$ such that $P_0 \in \M$ {-}
The statistical model $\M$ is the set of all possible probability distributions
that could describe the process by which our observed data have been generated,
appropriately constrained by background scientific knowledge. Often, $\M$ is
necessarily very large (i.e., non-parametric), reflecting the fact that
statistical knowledge about $P_0$ is limited.
If $P_0$ is described by a finite number of parameters, then the statistical
model is referred to as _parametric_. Such an assumption is made, for example,
by the proposition that $O$ has a Normal distribution with mean $\mu$ and
variance $\sigma^2$. More generally, a parametric model may be defined as
\begin{equation*}
\M(\theta) = \{P_{\theta} : \theta \in \R^d \},
\end{equation*}
which describes a constrained statistical model consisting of all distributions
$P_{\theta}$ that are indexed by some finite, $d$-dimensional parameter
$\theta$.
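As a minimal sketch of a parametric model in action (the data are simulated, so
the Normal working model holds here by construction), the $d = 2$ parameters
$\theta = (\mu, \sigma^2)$ may be estimated by their empirical analogs:
```{r parametric-model-sketch}
# under the assumption O ~ N(mu, sigma^2), the entire distribution is indexed
# by just two parameters, estimated here by their empirical counterparts
set.seed(27)
O <- rnorm(500, mean = 2, sd = 1.5)
theta_hat <- c(mu = mean(O), sigma2 = var(O))
theta_hat
```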
The assumption that $P_0$ has a specific, parametric form is made quite
commonly. Unfortunately, this is even the case when such assumptions are not
supported by domain knowledge about the data-generating process. This practice
of oversimplification in the current, and traditional, culture of statistical
data analysis typically complicates or entirely thwarts any attempt to reliably
answer the scientific question at hand. Why, you ask? Consider how much
knowledge one must have to _know_ (beyond a shadow of a doubt) that the
data-generating distribution underlying a given dataset is, in fact, governed by
just two parameters, as is the case with the ubiquitously relied upon Normal
distribution. Similarly, main-terms Cox proportional hazards, logistic
regression, and linear models imply highly constrained statistical models, and
if any of their assumptions are unwarranted, then their results will be biased
(except in special cases, such as when treatment is randomized). The philosophy
used to justify
parametric assumptions is rooted in misinterpretations of the often-quoted
saying of George Box, that "All models are wrong but some are useful," which has
been irresponsibly used to encourage the data analyst to make arbitrary modeling
choices. However, when one makes such unfounded assumptions, it is more likely
that $\M$ does not contain $P_0$, in which case the statistical model is said to
be misspecified. Statistical model misspecification introduces a bias that leads
to misleading, unreliable results and inference.
The result of unwarranted assumptions and oversimplifications is a practice of
statistical data science in which starkly disparate answers to the same
scientific problem emerge. Practically, this is owed to the application of
distinct statistical techniques under differing modeling decisions and
assumptions made (but not communicated well) by different data analysts. Even in
the nascent days of statistical data analysis, it was recognized that it is "far
better [to develop] an approximate answer to the right question...than an exact
answer to the wrong question, which can always be made precise"
[@tukey1962future], though traditional statistics failed to heed this advice for
a number of decades [@donoho2017fifty]. The roadmap avoids this bias by defining
the statistical model through a representation of the true data-generating
distribution underlying the observed data. The ultimate goal is to formulate the
statistical estimation problem _precisely_ (up to the constraints imposed by
available scientific knowledge), so that one can then tailor the estimation
procedure to the motivating scientific problem.
It is crucial that the domain scientist(s) have absolute clarity about what is
_actually known_ about the process/experiment that generated the data, and that
this is communicated to data scientists with as much detail as possible. This
knowledge is rarely ground truth itself, but instead comes in the form of
scientific conventions, accepted hypotheses, and operational assumptions.
<!--
rp: This is vague. What's an example of scientific conventions, example of
accepted hypotheses, and example of operational assumptions that would be
incorporated in M?
-->
It is then the data scientist's responsibility to translate the domain knowledge
into statistical knowledge about $P_0$, and then to define the statistical model
$\M$ so that it respects what is known about $P_0$ and makes no further
restrictions. In this manner, we can ensure that $P_0$ is contained in $\M$,
which we refer to generally as defining a _realistic_ statistical model $\M$.
Defining $\M$ realistically requires a shift in the paradigm of statistical
problem solving. Instead of considering the methods/software one is familiar
with and then trying to solve most problems with that toolbox, one must obtain a
deep understanding of the experiment and scientific question first and then
formulate a plan for learning from the data in a way that respects this. This
requires statisticians to have not only solid methodological and theoretical
foundations, but good communication skills, as several meetings with domain
experts are typically required to review details of the study, possibly refine
the question of interest, translate technical details, and interpret the
findings in a way that is statistically correct and agreeable with
non-statistician domain experts. Unfortunately, communication between
statisticians and non-statistician researchers is often fraught with
misinterpretation. This is to be expected, as each has their own expertise, but
proper communication about the underlying science and the motivating study can
help to ensure that each has appropriate context for a given statistical data
analysis. The roadmap provides a principled mechanism for learning from data
realistically, so that what is learned from the data represents a reliable and
reproducible approximation of the answer to the scientific question of interest.
As the roadmap provides a rigorous method for translating scientific knowledge
and questions into a statistical framework that can be used to learn from data,
it is an invaluable tool to guide communication between statisticians and
non-statistician domain scientists. This brings us to our next step in the
roadmap, *"What are we trying to learn from the data?"*
### (3) The statistical target parameter $\Psi$ and statistical estimand $\psi_0$ {-}
The statistical target parameter, $\Psi$, is defined as a mapping from the
statistical model, $\M$, to the parameter space. Usually, the parameter space
is a subset of the real line (but not necessarily so), in which case we can
formally define the
target parameter as the mapping $\Psi: \M \rightarrow \R$. The statistical
estimand may be seen as a representation of the quantity that we wish to learn
from the data, the answer to a well-specified -- often causal -- question of
interest about a particular target population. In contrast to ordinary
statistical estimands, causal estimands require an extra set of assumptions to
allow for their _identification from the observed data_. Based on causal models
[@pearl2009causality; @hernan2022causal], identification assumptions are
untestable and must be justified through knowledge about the system under study
and/or the process by which the experiment was conducted. These
assumptions are described in greater detail in the following section on [causal
target parameters](#causal).
For a simple example, consider a dataset containing an observed survival time
for each adult in a population, for which our question of interest is "What is
the probability that an adult lives longer than five years?" We have,
\begin{equation*}
\psi_0 = \Psi(P_0) = \E_{P_0}[\I(O > 5)] = \int_5^{\infty} dP_0(o).
\end{equation*}
The answer to this question is the **statistical estimand,
$\Psi(P_0)=\psi_0$**, which is the quantity we wish to learn from the data. As
discussed above, back-and-forth communication between domain scientists and
statisticians is often required to define $\M$ realistically, and to finalize
$\Psi$ and the target population such that the question is supported in the
data. For instance, say we are interested in learning the average effect of a
headache medication for treating migraines in adults and we learn that no one
with high blood pressure can receive the medication. In the next meeting with
domain scientists, we might suggest that the target population be modified to
adults without high blood pressure, or recast the question in terms of a
dynamic treatment rule under which adults with high blood pressure are never
assigned treatment. Once we have defined $O$, a realistic $\M$, and $\Psi$, we
have formally defined the statistical
estimation problem. Next comes Step 4: "_How do we learn from the data the
approximate answer to the question of interest?_"
### (4) The estimator $\hat{\Psi}$ and estimate $\psi_n$ {-}
To obtain a good approximation of the statistical estimand, we need an estimator
$\hat{\Psi}$, an _a priori_-specified algorithm defined as a mapping from the
set of possible empirical distributions $P_n$ (which live in a
non-parametric statistical model $\M_{NP}$) to the parameter space for our
target parameter of interest: $\hat{\Psi} : \M_{NP} \rightarrow \R$. In other
words, $\hat{\Psi}$ is a function that takes as input the observed data, a
realization of $P_n$, and then outputs a value in the parameter space. Where the
estimator may be seen as an operator that maps the observed data's corresponding
empirical distribution to a value in the parameter space, the numerical output
produced by such a function is the **estimate, $\hat{\Psi}(P_n)=\psi_n$**. Thus,
$\psi_n$ is an element of the parameter space as informed by the empirical
probability distribution $P_n$ of the observed data $O_1, \ldots, O_n$. If we
plug in a realization of $P_n$ (based on a sample of size $n$ of the random
variable $O$), we get back an estimate $\psi_n$ of the true parameter value
$\psi_0$.
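Returning to the five-year survival example from Step 3, a minimal sketch of
this mapping, using simulated survival times (so the numbers are purely
illustrative), is the plug-in estimator that substitutes $P_n$ for $P_0$:
```{r plug-in-estimator-sketch}
# the estimator maps the observed data (a realization of P_n) to an estimate:
# here, the empirical proportion estimating P(O > 5)
set.seed(42)
surv_time <- rexp(200, rate = 1 / 6)   # hypothetical survival times, in years
psi_n <- mean(surv_time > 5)           # psi_n = Psi-hat(P_n)
psi_n
```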
<!--
nh: the use of $\hat{\Psi}(\cdot)$ seems to imply an approximate mapping, i.e.,
$\hat{\Psi}(\cdot)$ is _not_ $\Psi()$--almost sounds as though we're answering
an approximate version of the question represented by the mapping $\Psi(\cdot)$
rp: Yes, with $\hat{\Psi}$ we are approximating $\Psi$, $\hat{\Psi}$ is an
estimator of $\Psi$
-->
As we have motivated in step 2, it is imperative to consider realistic
statistical models for estimation. Therefore, flexible estimators that allow for
parts of the data-generating process to be unrestricted are necessary.
Semiparametric theory and empirical process theory provide a framework for
constructing, benchmarking, and understanding the behavior of estimators that
depend on flexible estimation strategies in realistic statistical models. In
general, desirable properties of an estimator are that it is regular
asymptotically linear (RAL) and efficient, thereby admitting a Normal limit
distribution that has minimal variance. Substitution/plug-in RAL estimators are
also advantageous: they are guaranteed to remain within the bounds of $\M$ and,
relative to estimators that are not plug-in, have improved bias and variance in
finite samples. In-depth discussion of the theory and these properties is
available in the literature [e.g., @kennedy2016semiparametric;
@vdl2011targeted]. We review a few key concepts in the following step.
In order to quantify the uncertainty in our estimate of the target parameter,
part of the process of conducting statistical inference, an understanding of the
sampling distribution of our estimator is necessary. This brings us to Step 5:
"_How confident should we be in our statistical answer to the scientific
question?_"
### (5) A measure of uncertainty for the estimate $\psi_n$ {-}
Since the estimator $\hat{\Psi}$ is a function of the empirical distribution
$P_n$, the estimator itself is a random variable with a sampling distribution.
Therefore, if we were to repeat the experiment of drawing $n$ observations, we
would end up with a different realization of our estimate each time. The
hypothetical
distribution of these estimates is the sampling distribution of the estimator.
A primary goal in the construction of estimators is to be able to derive their
asymptotic sampling distribution through a theoretical analysis involving
empirical process theory. In this regard, an important property of the
estimators on which we focus is their asymptotic linearity. In particular,
asymptotic linearity states that the difference between the estimator and the
target parameter (i.e., the truth) can be represented, asymptotically, as an
average of i.i.d. random variables plus an asymptotically negligible remainder
term:
\begin{equation*}
\hat{\Psi}(P_n) - \Psi(P_0) = \frac{1}{n} \sum_{i=1}^n IC(P_0)(O_i) +
o_p(n^{-1/2}),
\end{equation*}
where the influence curve (IC) is a function of the observed data $O$ but the
function itself is defined by the underlying data-generating distribution $P_0$.
Based on this asymptotic approximation, the Central Limit Theorem can be used to
show
\begin{equation*}
\sqrt{n} \left(\hat{\Psi}(P_n) - \Psi(P_0)\right) \sim N(0, \sigma^2_{IC}),
\end{equation*}
where $\sigma^2_{IC}$ is the variance of $IC(P_0)(O)$. Given an estimate of
$\sigma^2_{IC}$, it is then possible to construct classic, _asymptotically
accurate_ Wald-type confidence intervals (CIs) and hypothesis tests. For
example, a standard $(1 - \alpha)$ CI takes the form
\begin{equation*}
\psi_n \pm Z \frac{\hat{\sigma}_{IC}}{\sqrt{n}} \ ,
\end{equation*}
where $Z$ is the $(1 - \alpha / 2)^\text{th}$ quantile of the standard Normal
distribution. Following convention, we will often be interested in constructing
95% two-tailed CIs, corresponding to probability mass $\alpha/2 = 0.025$ in each
tail of the limit distribution; thus, we will take $Z \approx 1.96$ as the
quantile.
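As a minimal computational sketch, suppose we already have an estimate and a
vector of estimated influence curve values for each observation (how `ic_est`
is obtained is estimator-specific; the values below are stand-ins):
```{r wald-ci-sketch}
# Wald-type 95% CI from an estimate psi_n and estimated IC values
set.seed(7)
n <- 200
ic_est <- rnorm(n)               # placeholder for estimator-specific IC values
psi_n <- 0.4                     # placeholder estimate
se_n <- sqrt(var(ic_est) / n)    # sigma_hat_IC / sqrt(n)
psi_n + c(-1, 1) * qnorm(0.975) * se_n
```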
Steps (1)--(5) of the roadmap define the statistical analysis plan, all of which
can be done before any data is revealed. The last step of the roadmap involves
interpreting the results obtained in step (4) and (5) and therefore requires the
data to be analyzed; however, any additional analysis that may take place as
part of step (6) can be pre-specified as well. This final step of the roadmap
addresses the question, "_what is the interpretation and robustness of the
study's findings, and what conclusions can be drawn from them?_"
### (6) Make substantive conclusions {-}
Making substantive conclusions involves interpreting the study findings. It
also provides an opportunity to ask follow-up questions that might be addressed
later and/or discuss issues that can inform future studies. Statistical
estimands $\psi_0$ can have statistical (noncausal) and causal interpretations.
Both are often of interest and can be provided. The target population should be
clearly mentioned in the interpretation, regardless of whether it's a purely
statistical or causal interpretation, to curtail extrapolation of results.
The major distinction between statistical versus causal interpretations is that
the latter relies on untestable so-called "identifiability" assumptions. In the
following section, we review these assumptions one-by-one. Here, we focus on the
interpretation and robustness of the study findings with respect to them.
Specifically, causal target parameters cannot be estimated from observed data
without additional identifiability assumptions, and so the validity of a
result's causal interpretation hinges on them holding in the data. The more
severely these assumptions are violated, the larger the _causal gap_: the
difference between the statistical estimand and the causal estimand. In a
perfect randomized controlled trial with no loss to follow-up, the causal gap
is zero, as the statistical and causal estimands are equivalent. In
@diaz2013sensitivity, a non-parametric sensitivity analysis for assessing the
impact of a hypothesized causal gap on estimates and inference is proposed. In
@gruber2022targeted and
@gruber2022evaluating, there are example implementations of the methods proposed
in @diaz2013sensitivity; in particular, the difference between adjusted and
unadjusted effect estimates is used to define a range of possible causal gaps
relative to this difference. If the question of interest is causal, then such a
model-free sensitivity analysis (possibly as a complement to other sensitivity
analyses) is recommended to assess the robustness of the study findings.
## Summary of the Roadmap {#roadmap-summary}
Data collected across $n$ i.i.d. units, $O_1, \ldots, O_n$, may be viewed as a
collection of random variables arising from the same underlying probability
distribution $P_0$. This is expressed by denoting the collection of data as
being generated as $O_1, \ldots, O_n \sim P_0$. Domain knowledge about the
experiment that generated the data (e.g., if the treatment was randomized, if
the treatment decision or loss to follow-up depended on a subset of covariates,
time ordering in which the variables were added to the data) is translated by
the statistician / data scientist to define the statistical model $\M$, a
postulated space of candidate probability distributions that is supposed to
contain $P_0$. In particular, the roadmap emphasizes the critical role of
defining $\M$ such that $P_0$ is guaranteed to be encapsulated by it, $P_0 \in
\M$. By only limiting $\M$ based on domain knowledge about the experiment (i.e.,
reality) --- as opposed to constraining it unrealistically (e.g., assuming a
restrictive functional form, like a main terms linear/logistic model, describes
$P_0$) --- it can be ensured that $P_0 \in \M$, and we refer to this as defining
a realistic statistical model. Often, knowledge that can be used to constrain
$\M$ is very limited, and so $\M$ must be very large to define it such that $P_0
\in \M$; hence, realistic statistical models are often termed semi- or
non-parametric, since they are too large to be indexed by a finite-dimensional
set of parameters. Necessarily, our statistical query must begin with, "What are
we trying to learn from the data?", a question whose answer is captured by the
statistical target parameter, $\Psi$, a mapping from $\M$ to the parameter
space that, when evaluated at the true data-generating distribution $P_0$,
yields the statistical estimand, $\psi_0$. At this stage, the statistical
estimation problem is formally defined, allowing for the use of statistical
theory to guide the construction of estimators, which are algorithms that
approximate the answer to the question of interest by learning from the data.
Desirable properties of an
estimator are that it is unbiased, efficient, plug-in, and robust in finite
samples. If the question of interest is causal, then a model-free sensitivity
analysis is recommended to assess the robustness of the study's findings under
various hypothesized causal gaps.
## Causal Target Parameters {#causal}
In many cases, we are interested in problems that ask questions regarding the
_causal effect_ of an intervention, whether an assigned treatment (e.g., a
prescribed drug) or a "naturally occurring" exposure (e.g., pollution from a
nearby factory), on a future outcome of interest. These causal effects may be
defined as summaries of the population of interest (e.g., population mean of a
particular outcome) under contrasting interventions (e.g., comparing the treated
to the untreated condition). For example, a causal effect could be defined as
the mean difference of a disease outcome between two _causal contrasts_,
counterfactual cases in which the study population were set to uniformly
experience low pollution levels for some pollutant, and in which the same
population were set to uniformly experience high levels of the same pollutant.
There are different ways of operationalizing the theoretical experiments that
generate the counterfactual data necessary for describing such causal contrasts
of interest. We could simply assume that the counterfactual outcomes exist in
theory for all treatment contrasts of interest [@neyman1938contribution;
@rubin2005causal; @imbens2015causal], which may be encoded in so-called "science
tables". Alternatively, we could consider interventions on structural causal
models (SCMs) [@pearl1995causal; @pearl2009causality], which may be represented
by directed acyclic graphs (DAGs). Both frameworks allow for the known or
hypothesized set of relationships between variables in the system under study to
be encoded and mathematically formalized.
### The Causal Model {-}
Throughout, we will focus on the use of DAGs and SCMs for the description of
causal parameters. Estimators of statistical parameters that correspond, under
standard but untestable _identifiability_ assumptions, to these causal
parameters are introduced below. DAGs are a particularly useful tool for
visually expressing what we know about the causal relations among variables in
the system under study. Ignoring exogenous $U$ terms (explained below), we
assume the following ordering of the variables that compose the observed data
$O$. We demonstrate the construction of a DAG below using `DAGitty`
[@textor2011dagitty]:
```{r simple-DAG, out.width = "60%"}
library(dagitty)
library(ggdag)
# make DAG by specifying dependence structure
dag <- dagitty("dag{
W -> A ;
W -> Y ;
A -> Y ;
W [confounders]
A [exposure]
Y [outcome]
}")
tidy_dag <- tidy_dagitty(dag)
# visualize DAG
ggdag(tidy_dag) +
theme_dag()
```
While DAGs like the above provide a convenient means by which to express the
causal relations between variables, these same causal relations can be
equivalently represented by an SCM:
$$\begin{align*}
W &= f_W(U_W) \\
A &= f_A(W, U_A) \\
Y &= f_Y(W, A, U_Y),
\end{align*}$$
where the $f$'s are unspecified deterministic functions that generate the
corresponding random variables as a function of the variable's "parents" (i.e.,
upstream nodes with arrows into the given random variable) in the DAG, and the
unobserved, exogenous error terms (i.e., the $U$'s). An SCM may be thought of as
a representation of the algorithm that produces the data, $O$, in the population
of interest. Much of statistics and data science is devoted to discovering
properties of this system of equations (e.g., estimation of the functional form
$f_Y$ governing the outcome variable $Y$).
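To make the SCM concrete, consider a toy simulation in which we choose simple
forms for $f_W$, $f_A$, and $f_Y$ (these choices are entirely hypothetical; the
SCM itself leaves the $f$'s unspecified):
```{r scm-sim-sketch}
# a hypothetical data-generating process consistent with the SCM above
set.seed(59)
n <- 1e4
U_W <- runif(n); U_A <- runif(n); U_Y <- runif(n)   # exogenous errors
W <- as.numeric(U_W < 0.5)                   # W = f_W(U_W)
A <- as.numeric(U_A < plogis(-1 + W))        # A = f_A(W, U_A)
Y <- as.numeric(U_Y < plogis(-2 + A + W))    # Y = f_Y(W, A, U_Y)
head(data.frame(W, A, Y))
```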
<!--
where $U_W$, $U_A$, and $U_Y$ represent the unmeasured exogenous background
characteristics that influence the value of each variable. In the NPSEM, $f_W$,
$f_A$ and $f_Y$ denote that each variable (for $W$, $A$ and $Y$, respectively)
is a function of its parents and unmeasured background characteristics, but
note that there is no imposition of any particular functional constraints(e.g.,
linear, logit-linear, only one interaction, etc.). For this reason, they are
called non-parametric structural equation models (NPSEMs). The DAG and set of
nonparametric structural equations represent exactly the same information and
so may be used interchangeably.
-->
The first hypothetical experiment we will consider is assigning exposure to the
entire population and observing the outcome, and then withholding exposure to
the same population and observing the outcome. This corresponds to a comparison
of the outcome distribution in the population under two distinct interventions:
1. $A$ is set to $1$ for all individuals, and
2. $A$ is set to $0$ for all individuals.
These interventions may be thought of as operations that imply changes
to the structural equations in the system under study. For the case $A = 1$, we
have
$$\begin{align*}
W &= f_W(U_W) \\
A &= 1 \\
Y(1) &= f_Y(W, 1, U_Y) \ ,
\end{align*}$$
while, for the case $A=0$,
$$\begin{align*}
W &= f_W(U_W) \\
A &= 0 \\
Y(0) &= f_Y(W, 0, U_Y) \ .
\end{align*}$$
In these equations, $A$ is no longer a function of $W$ because the intervention
on the system set $A$ deterministically
to one of the values $1$ or $0$ consistent with the intervention performed. The
new symbols $Y(1)$ and $Y(0)$ indicate the values the outcome variable would
take in the population of interest when it is generated by removing the
contribution of $A$ to $f_Y$ and instead setting $A$ to the values $1$ and $0$,
respectively. The variables $Y(1)$ and $Y(0)$ are often called counterfactuals
(since they arise from interventions that run contrary to fact) and are, in
other frameworks, called the _potential outcomes_ of $Y$
[@neyman1938contribution; @rubin2005causal; @imbens2015causal]. The difference
in the counterfactual means of the outcome under these two interventions defines
a well-known causal parameter that is most often called the "average treatment
effect" (ATE) and is denoted
\begin{equation}
ATE = \E_X[Y(1) - Y(0)],
(\#eq:ate)
\end{equation}
where $\E_X(\cdot)$ is the expectation taken over the theoretical (unobservable)
full data (i.e., $X = (W, Y(1), Y(0))$) distribution $P_X$. Note that the full
data structure $X$ is, by its very definition, unobservable since one can never
observe both of $Y(1)$ and $Y(0)$ for the same observational unit.
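Continuing the toy simulation above, intervening amounts to replacing $f_A$
with a constant while holding the exogenous errors fixed, which yields the
counterfactuals and hence the ATE of that hypothetical system:
```{r ate-sim-sketch}
# intervene on the toy SCM: set A deterministically, reuse the same U_Y
Y1 <- as.numeric(U_Y < plogis(-2 + 1 + W))   # Y(1) = f_Y(W, 1, U_Y)
Y0 <- as.numeric(U_Y < plogis(-2 + 0 + W))   # Y(0) = f_Y(W, 0, U_Y)
mean(Y1 - Y0)                                # ATE in this toy system
```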
We can define much more complicated interventions on SCMs, such as interventions
based upon dynamic rules (which assign particular interventions based on a
function of the covariates $W$), stochastic rules (which can even account for
the natural value of $A$ observed in the absence of the intervention), and much
more. Each results in a different target causal parameter and entails different
identifiability assumptions discussed below.
### Identifiability {-}
Since we can never simultaneously observe $Y(0)$, the counterfactual outcome
when $A=0$, and $Y(1)$, the counterfactual outcome when $A=1$, we cannot
estimate their difference $Y(1) - Y(0)$ (the individual treatment effect), which
appears in Equation \@ref(eq:ate) (inside the expectation $\E_X(\cdot)$ that
defines ATE). This is called the _Fundamental Problem of Causal Inference_
[@holland1986statistics]. Thus, one of the primary activities in causal
inference is to _identify_ the assumptions necessary to express causal
quantities of interest as functions of the data-generating distribution of the
observed data. To do this, we must make assumptions under which such quantities
may be estimated from the observed data $O \sim P_0$ and its corresponding
data-generating distribution $P_0$. Fortunately, given the causal model
specified in the SCM above, we can, with a handful of untestable assumptions,
estimate the ATE from observational data. These assumptions may be summarized as
follows.
::: {#consist-ass .definition name="Consistency"}
The outcome for unit $i$ is $Y_i(a)$ whenever $A_i = a$, which may be thought of
as "no other versions of treatment" or "no side effects of treatment."
:::
::: {#interf-ass .definition name="No Interference"}
The outcome for unit $i$, $Y_i$, cannot be affected by the exposure of unit $j$,
$A_j$, for all $i \neq j$.
:::
::: {#noconf-ass .definition name="No Unmeasured Confounding"}
$A \perp Y(a) \mid W$ for all $a \in \mathcal{A}$, which states that the
potential outcomes $(Y(a) : a \in \mathcal{A})$ arise independently from
exposure status $A$, conditional on the observed covariates $W$. This is the
analog of the _randomization_ assumption in data arising from natural
experiments, ensuring that the effect of $A$ on $Y$ can be disentangled from
that of $W$ on $Y$, even though $W$ affects both.
:::
::: {#posit-ass .definition name="Positivity/Overlap"}
All observed units, across strata defined by $W$, must have a probability of
receiving treatment that is bounded away from zero and one -- that is,
$\epsilon < \P(A = a \mid W) < 1 - \epsilon$ for all $a$ and $W$ and for some
$\epsilon > 0$.
:::
Technically speaking, only the latter two of these assumptions are necessary
when working within the SCM framework, as the first two are implied properties
of an SCM for i.i.d. data (if you're really curious, see this commentary of
@pearl2010brief for an extended discussion). We introduce all four
identification assumptions because they are most often considered together, and
all four are necessary when working within the potential outcomes framework
[@rubin2005causal; @imbens2015causal].
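While these assumptions are untestable, positivity is the one whose practical
plausibility can at least be probed in the data by inspecting estimated
treatment probabilities. A minimal sketch using the toy simulation above (where
the main-terms logistic model happens to be correct by construction):
```{r positivity-sketch}
# estimate g(W) = P(A = 1 | W) and check it is bounded away from 0 and 1
g_fit <- glm(A ~ W, family = binomial())
g_hat <- predict(g_fit, type = "response")
range(g_hat)   # values very near 0 or 1 would flag positivity concerns
```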
Under these assumptions, the ATE may be re-written as a function of $P_0$, the
distribution of the observed data:
$$\begin{align}
\psi_{\text{ATE}} &= \E_0[Y(1) - Y(0)] \\ \nonumber
&= \E_0 [\E_0[Y \mid A = 1, W] -
\E_0[Y \mid A = 0, W]] \ .
(\#eq:estimand)
\end{align}$$
In words, the ATE is the mean difference in the predicted outcome values for
each subject, under the contrast of treatment conditions ($A = 0$ versus $A =
1$), in the population (when averaged over all observations). Thus, a parameter
of a theoretical complete (or "full") data distribution can be represented as an
estimand of the observed data distribution. Significantly, there is nothing
about the representation in Equation \@ref(eq:estimand) that requires
parametric assumptions; thus, the regression functions on the right-hand side
may be estimated without restrictive assumptions about their underlying
functional forms. Different target parameters entail potentially different
identifiability assumptions, and the resulting estimands can be functions of
different components of $P_0$. We discuss several more complex
estimands in subsequent chapters.
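Equation \@ref(eq:estimand) suggests a simple plug-in ("g-computation") sketch:
fit a regression of the outcome on treatment and covariates, predict for every
unit under both treatment levels, and average the difference. In the toy
simulation above, where the main-terms logistic model is correct by
construction:
```{r gcomp-sketch}
# plug-in estimator of E_W[E(Y | A = 1, W) - E(Y | A = 0, W)] in the toy data
Q_fit <- glm(Y ~ A + W, family = binomial())
Q1 <- predict(Q_fit, newdata = data.frame(A = 1, W = W), type = "response")
Q0 <- predict(Q_fit, newdata = data.frame(A = 0, W = W), type = "response")
mean(Q1 - Q0)   # should approximate the simulated ATE above
```
In practice the functional form of these regressions is unknown, which is
precisely why the flexible, realistic estimation strategies motivated in Step 4
are needed.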
<!--
### Missingness {-}
TODO: Ask Mark to add note about this.
-Ben (09/20/2022)-
As a specific methodological issue, why impute using median/mode, for example
compared to a multiple imputation approach? I suggest that the statistical
reviewer is consulted on the implications of this. Regardless of the
appropriateness of the method, for transparency there should be a table that
shows extent of missingness by cohort and variable. Also for transparency, a
complete case analysis should be presented as a sensitivity analysis.
-Mark (09/20/2022)-
Yes, that is the clear argument: i.e. if we have A is independent of Y_a given
W*, then there is no missigness issue, where W* is W_j,Deltaj, across all
components j, Delta_j indicator of W_j being measured. So then estimand is E
E(Y|A=1,W*)-E(Y|A=0,W*)) and it equals the ATE. If we can make the case that
this assumption is violated but for some particular component W_1, we would
have the randomizaiton given W*,W_1, but not given W*, then one can view W_1 as
subject to missing. In that case, we want to learn E_{W*,W_1}
E(Y|A=1,W*,W_1)-E(Y|A=1,W*,W_1), which is not yet a function of distribution of
data. So we need to assume that missingness of W_1 is MAR e..g misisngness
indicator Delta_{W_1} is a functio of observed Y and A,W* (e.g two stage
designs), or only a function of W*,A, beyond independent noise. We can then
obtain the desired identification result for this quantity and develop the
resulting TMLE. Or one could use the TMLE in world where W_1 is measured, with
inverse prob of missing weights (IPCW TMLE, e.g Rose, vdL for two stage
designs).
Regarding reference, just refer to ATE identification result and make clear
that it is not unreasonable to make this RA assumption given W*. Or that even
when there is a concern, there is no particular reason to think that given the
full W it would be resolved. If we feel that RA assumption is violated, and no
satisfactory few W_1 full observed resolve the problem, then one could address
it in the sensitivity analysis.
I would think various articles, e.g. Neugebauer, Schnitzer, Petersen, Hubbard,
Mertens, etc involving real data analysis, have done this by now and can thus
be referenced as well.
-Ben (09/20/2022)-
Regarding Mark's thorough explanation, I think I am comfortable with us simply
making the assumption:
if we have A is independent of Y_a given W*, then there is no missigness issue,
where W* is W_j,Deltaj, across all components j, Delta_j indicator of W_j being
measured. So then estimand is E E(Y|A=1,W*)-E(Y|A=0,W*)) and it equals the ATE.
-->