Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #64: Add vignette explaining models #514

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
189 changes: 189 additions & 0 deletions vignettes/model.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,189 @@
---
title: "Guide to the statistical models implemented in epidist"
output:
bookdown::html_document2:
fig_caption: yes
code_folding: show
number_sections: true
pkgdown:
as_is: true
# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl
link-citations: true
vignette: >
%\VignetteIndexEntry{Model guide}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r setup, include=FALSE}
# exclude compile warnings from cmdstanr
knitr::opts_chunk$set(
fig.path = file.path("figures", "epidist-"),
cache = TRUE,
collapse = TRUE,
comment = "#>",
message = FALSE,
warning = FALSE,
error = FALSE
)
```

The `epidist` package enables users to estimate delay distributions while accounting for common reporting biases.
This vignette first introduces the background [@park2024estimating] required to understand the models implemented in `epidist`.
It then goes on to explain each model in turn.

# Background {#maths}

Estimating a delay distribution may appear to be simple: one could imagine fitting a probability distribution to a set of observed delays.
However, observed delays are often biased during an ongoing outbereak.
We begin by presenting a formalism for characterizing delay distributions as well as two main biases (truncation and censoring) affecting them.
We then present statistical models for correcting for these biases.

Any epidemiological delay requires a primary (starting) and a secondary (ending) event.
For example, the incubation period measures the time between infection (primary event) and symptom onset (secondary event).
Here, we use $p$ to denote the time of primary event, and $s$ to denote time of secondary event.

We can measure any delay distribution in two different ways.
First, we can measure the forward distribution $f_p(\tau)$, starting from a cohort of individuals who experienced the primary at the same time $p$ and looking at when they experienced their secondary event.
Second, we can measure the backward distribution $b_s(\tau)$, starting from a cohort of individuals who experienced the secondary at the same time $s$ and looking at when they experienced their secondary event.
While the length of each delay $\tau = p-s$ remains constant whether we look at it forward or backward, the shape of the distribution is affected by the differences in perspectives.

To illustrate their differences, let's assume that primaryand secondary events occur at at rates, or equivalently incidences, $\mathcal{P}(p)$ and $\mathcal{S}(s)$, respectively.
Then, the total density $\mathcal{T}(p, s)$ of individuals with primary event at time $p$ and secondary event at time $s$ can be expressed equivalently in terms of both forward and backward distributions:
$$
\mathcal{T}(p, s) = \mathcal{P}(p) f_p(s - p) = \mathcal{S}(s) b_s(s - p). (\#eq:forwards-backwards)
$$
Rearranging Equation \@ref(eq:forwards-backwards) gives
$$
b_s(\tau) = \frac{\mathcal{P}(s - \tau) f_{s - \tau}(\tau)}{\mathcal{S}(s)}. (\#eq:forwards-backwards2)
$$
The denominator of Equation \@ref(eq:forwards-backwards2), which corresponds to the incidence of secondary events, may be expressed as the integral over all possible delays
$$
\mathcal{S}(s) = \int_{-\infty}^\infty \mathcal{P}(s - \tau) f_{s - \tau}(\tau) \text{d} \tau,
$$
such that
$$
b_s(\tau) = \frac{\mathcal{P}(s - \tau) f_{s - \tau}(\tau)}{\int_{-\infty}^\infty \mathcal{P}(s - \tau) f_{s - \tau}(\tau) \text{d} \tau}.
$$
Here, we see that $b_s(\tau)$ depends not only on $f_{s - \tau}(\tau)$ but also on $\mathcal{P}(s - \tau)$, meaning that past changes in the incidence pattern will affect the shape of the distribution.
Particularly, when an epidemic is growing, we are more likely to observe shorter delays, causing underestimation of the mean delay.
Therefore, we always want to characterize epidemiological delays from the forward perspective and estimate the forward distribution.
Hereafter, we focus on biases that affect the estimation of the forward distribution.

## Right truncation

Right truncation refers to the bias arising from inability to observe future events.
For example, assume the data are right truncated and we don't observe secondary events past time $T$.
Then, we will only observe delays whose secondary events occurred before time $T$, causing us to underestimate the mean of the distribution.

Let $P$ and $S$ be random variables.
Let $F_p$ be the forward cumulative distribution.
Then, the probability of observing a delay of length $\tau$ given that the primary event occurred at time $p$ and a truncation at time $T$ can be written as:
$$
\begin{aligned}
\mathbb{P}(S = P + \tau \, | \, P = p, S < T) &= \frac{\mathbb{P}(S = P + \tau, P = p, S < T)}{\mathbb{P}(P = p, S < T)} \\
&= \frac{\mathbb{P}(S = P + \tau < T \, | \, P = p)\mathbb{P}(P = p)}{\mathbb{P}(S < T \, | \, P = p)\mathbb{P}(P = p)} \\
&= \frac{\mathbb{P}(S = P + \tau < T \, | \, P = p)}{\mathbb{P}(S < T \, | \, P = p)} \\
&= \frac{f_p(\tau)}{\int_0^{T - p} f_p(x) \text{d}x} = \frac{f_p(\tau)}{F_p(T - p)}, \quad p + \tau < T. (\#eq:right-truncation)
\end{aligned}
$$

Examining Equation \@ref(eq:right-truncation) illustrates that (right) truncation renormalises the density by the values which are possible.
For example, if the distribution $x \sim \text{Unif}(0, 1)$ were right truncated by $T = 0.5$ then $x \sim \text{Unif}(0, 0.5)$.

## Interval censoring

The exact timing of epidemiological events is often unknown.
Instead, we may only know that the event happened within certain interval.
We refer to this as interval censoring.

Assume the secondary event $S$ is censored and so we don't know when the event exactly happened.
Instead, we only know that the secondary event happened between $S_L$ and $S_R$.
Then,
$$
\begin{aligned}
\mathbb{P}(S_L < S < S_R \, | \, P = p) &= \int_{S_L}^{S_R} f_p(y-p) dy\\
&= F_p(S_R-p) - F_p(S_L-p)
\end{aligned}
$$

Now, assume that both primary $P$ and secondary $S$ events are truncated.
We only know that the primary event happened between $P_L$ and $P_R$ and the secondary event happened between $S_L$ and $S_R$.
We now write $g_P$ to denote the unconditional distribution of primary events.
Then,
$$
\begin{aligned}
\mathbb{P}(S_L < S < S_R \, | \, P_L < P < P_R) &= \mathbb{P}(P_L < P < P_R, S_L < S < S_R \, | \, P_L < P < P_R)\\
&= \frac{\mathbb{P}(P_L < P < P_R, S_L < S < S_R)}{\mathbb{P}(P_L < P < P_R)}\\
&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z)\, dz }\\
&= \int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x\,|\,P_L,P_R) f_x(y-x) \,dy\, dx
\end{aligned}
$$
where $ g_P(x\,|\,P_L,P_R)$ represents the conditional distribution of primary event given lower $P_L$ and upper $P_R$ bounds.

# The naive model

The simplest approach to modeling epidemiological delay distributions is to ignore both truncation and censoring biases.
Then, the likelihood of observing a delay $\mathbf{Y}_i$ given parameter $\boldsymbol{\theta}$ is straightforward:
$$
\mathcal{L}(\mathbf{Y}_i \, | \, \boldsymbol{\theta}) = f(y_i - x_i).
$$
where $y_i$ and $x_i$ are the observed primary and secondary event times.

# The latent model

Now, we incorporate both truncation and double censoring:
$$
\begin{aligned}
\mathbb{P}(S_L < S < S_R \, | \, P_L < P < P_R, S<T) &= \mathbb{P}(P_L < P < P_R, S_L < S < S_R, S<T \, | \, P_L < P < P_R, S<T)\\
&= \frac{\mathbb{P}(P_L < P < P_R, S_L < S < S_R, S<T)}{\mathbb{P}(P_L < P < P_R, S<T)}\\
&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} \int_{z}^T g_P(z) f_z(w-z) \, dz \,dw }\\
&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z) F_z(T-z) \,dw }\\
&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x|P_L, P_R) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z|P_L, P_R) F_z(T-z) \,dw }\\
\end{aligned}
$$
Using latent, variables, we can now rewrite down the observation likelihood as:
$$
\begin{aligned}
x_i &\sim g_P(x_i \, | \, p_{L, i}, p_{R, i}) \\
y_i &\sim \text{Unif}(s_{L, i}, s_{R, i}) \\
\mathcal{L}(\mathbf{Y} \, | \, \boldsymbol{\theta}) &= \prod_i \left[ \frac{f_{x_i}(y_i - x_i)}{\int_{P_{L, i}}^{P_{R, i}} g_P(z \, | \, p_{L, i}, p_{R, i}) F_z(T - z) \text{d}z} \right].
\end{aligned}
$$
Modelling $g_P$ $\iff$ modelling incidence in primary events.
This is a tough thing to do.

<!--Let $\mathbf{Y}$ be the data vector and $\boldsymbol{\theta}$ be parameters.
Then
$$
\mathcal{L}(\mathbf{Y} \, | \, \boldsymbol{\theta}) = \prod_i \mathbb{P}(S_{L, i} < S < S_{R, i} \, | \, P_{L, i} < P < P_{R, i}, S < T).
$$
>

<!-- Shouldn't this be a joint likelihood? So it includes the prior? I think I took this from the paper, and the paper might have an issue -->

<!-- This is just writing the model likelihood so I think it's ok to not have prior? -->

Instead, the approach of the latent model [@ward2022transmission] is to:

1. Assume a uniformly distributed primary event times
$$
\begin{aligned}
x_i &\sim \text{Unif}(p_{L, i}, p_{R, i}) \\
y_i &\sim \text{Unif}(s_{L, i}, s_{R, i}).
\end{aligned}
$$

2. Assume that censoring interval is narrow such that
$$
\int_{P_{L, i}}^{P_{R, i}} g_P(z \, | \, p_{L, i}, p_{R, i}) F_{x_i}(T - z) \text{d}z \approx F_{x_i}(T - x_i).
$$

Then
$$
\mathcal{L}(\mathbf{Y}_i \, | \, \boldsymbol{\theta}) = \frac{f_x(y_i - x_i)}{F_x(T - x_i)}.
$$


## Bibliography {-}
Loading