epinowcast · parksw3 · Jan 28, 2025
diff --git a/vignettes/model.Rmd b/vignettes/model.Rmd
@@ -0,0 +1,189 @@
+---
+title: "Guide to the statistical models implemented in epidist"
+output:
+  bookdown::html_document2:
+    fig_caption: yes
+    code_folding: show
+    number_sections: true
+pkgdown:
+  as_is: true
+# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl
+link-citations: true
+vignette: >
+  %\VignetteIndexEntry{Model guide}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+bibliography: references.bib
+---
+
+```{r setup, include=FALSE}
+# exclude compile warnings from cmdstanr
+knitr::opts_chunk$set(
+  fig.path = file.path("figures", "epidist-"),
+  cache = TRUE,
+  collapse = TRUE,
+  comment = "#>",
+  message = FALSE,
+  warning = FALSE,
+  error = FALSE
+)
+```
+
+The `epidist` package enables users to estimate delay distributions while accounting for common reporting biases.
+This vignette first introduces the background [@park2024estimating] required to understand the models implemented in `epidist`.
+It then goes on to explain each model in turn.
+
+# Background {#maths}
+
+Estimating a delay distribution may appear to be simple: one could imagine fitting a probability distribution to a set of observed delays.
+However, observed delays are often biased during an ongoing outbereak. 
+We begin by presenting a formalism for characterizing delay distributions as well as two main biases (truncation and censoring) affecting them.
+We then present statistical models for correcting for these biases.
+
+Any epidemiological delay requires a primary (starting) and a secondary (ending) event.
+For example, the incubation period measures the time between infection (primary event) and symptom onset (secondary event).
+Here, we use $p$ to denote the time of primary event, and $s$ to denote time of secondary event.
+
+We can measure any delay distribution in two different ways.
+First, we can measure the forward distribution $f_p(\tau)$, starting from a cohort of individuals who experienced the primary at the same time $p$ and looking at when they experienced their secondary event.
+Second, we can measure the backward distribution $b_s(\tau)$, starting from a cohort of individuals who experienced the secondary at the same time $s$ and looking at when they experienced their secondary event.
+While the length of each delay $\tau = p-s$ remains constant whether we look at it forward or backward, the shape of the distribution is affected by the differences in perspectives.
+
+To illustrate their differences, let's assume that primaryand secondary events occur at at rates, or equivalently incidences, $\mathcal{P}(p)$ and $\mathcal{S}(s)$, respectively.
+Then, the total density $\mathcal{T}(p, s)$ of individuals with primary event at time $p$ and secondary event at time $s$ can be expressed equivalently in terms of both forward and backward distributions:
+$$
+\mathcal{T}(p, s) = \mathcal{P}(p) f_p(s - p)  = \mathcal{S}(s) b_s(s - p).  (\#eq:forwards-backwards)
+$$
+Rearranging Equation \@ref(eq:forwards-backwards) gives
+$$
+b_s(\tau) = \frac{\mathcal{P}(s - \tau) f_{s - \tau}(\tau)}{\mathcal{S}(s)}. (\#eq:forwards-backwards2)
+$$
+The denominator of Equation \@ref(eq:forwards-backwards2), which corresponds to the incidence of secondary events, may be expressed as the integral over all possible delays
+$$
+\mathcal{S}(s) = \int_{-\infty}^\infty \mathcal{P}(s - \tau) f_{s - \tau}(\tau) \text{d} \tau,
+$$
+such that
+$$
+b_s(\tau) = \frac{\mathcal{P}(s - \tau) f_{s - \tau}(\tau)}{\int_{-\infty}^\infty \mathcal{P}(s - \tau) f_{s - \tau}(\tau) \text{d} \tau}.
+$$
+Here, we see that $b_s(\tau)$ depends not only on $f_{s - \tau}(\tau)$ but also on $\mathcal{P}(s - \tau)$, meaning that past changes in the incidence pattern will affect the shape of the distribution.
+Particularly, when an epidemic is growing, we are more likely to observe shorter delays, causing underestimation of the mean delay.
+Therefore, we always want to characterize epidemiological delays from the forward perspective and estimate the forward distribution.
+Hereafter, we focus on biases that affect the estimation of the forward distribution.
+
+## Right truncation
+
+Right truncation refers to the bias arising from inability to observe future events.
+For example, assume the data are right truncated and we don't observe secondary events past time $T$.
+Then, we will only observe delays whose secondary events occurred before time $T$, causing us to underestimate the mean of the distribution.
+
+Let $P$ and $S$ be random variables.
+Let $F_p$ be the forward cumulative distribution.
+Then, the probability of observing a delay of length $\tau$ given that the primary event occurred at time $p$ and a truncation at time $T$ can be written as:
+$$
+\begin{aligned}
+\mathbb{P}(S = P + \tau \, | \, P = p, S < T) &= \frac{\mathbb{P}(S = P + \tau, P = p, S < T)}{\mathbb{P}(P = p, S < T)} \\
+&= \frac{\mathbb{P}(S = P + \tau < T \, | \, P = p)\mathbb{P}(P = p)}{\mathbb{P}(S < T \, | \, P = p)\mathbb{P}(P = p)} \\
+&= \frac{\mathbb{P}(S = P + \tau < T \, | \, P = p)}{\mathbb{P}(S < T \, | \, P = p)} \\
+&= \frac{f_p(\tau)}{\int_0^{T - p} f_p(x) \text{d}x} = \frac{f_p(\tau)}{F_p(T - p)}, \quad p + \tau < T. (\#eq:right-truncation)
+\end{aligned}
+$$
+
+Examining Equation \@ref(eq:right-truncation) illustrates that (right) truncation renormalises the density by the values which are possible.
+For example, if the distribution $x \sim \text{Unif}(0, 1)$ were right truncated by $T = 0.5$ then $x \sim \text{Unif}(0, 0.5)$.
+
+## Interval censoring
+
+The exact timing of epidemiological events is often unknown.
+Instead, we may only know that the event happened within certain interval.
+We refer to this as interval censoring.
+
+Assume the secondary event $S$ is censored and so we don't know when the event exactly happened.
+Instead, we only know that the secondary event happened between $S_L$ and $S_R$.
+Then,
+$$
+\begin{aligned}
+\mathbb{P}(S_L < S < S_R \, | \, P = p) &= \int_{S_L}^{S_R} f_p(y-p) dy\\
+&= F_p(S_R-p) - F_p(S_L-p)
+\end{aligned}
+$$
+
+Now, assume that both primary $P$ and secondary $S$ events are truncated.
+We only know that the primary event happened between $P_L$ and $P_R$ and the secondary event happened between $S_L$ and $S_R$.
+We now write $g_P$ to denote the unconditional distribution of primary events.
+Then,
+$$
+\begin{aligned}
+\mathbb{P}(S_L < S < S_R \, | \, P_L < P < P_R) &= \mathbb{P}(P_L < P < P_R, S_L < S < S_R \, | \, P_L < P < P_R)\\
+&= \frac{\mathbb{P}(P_L < P < P_R, S_L < S < S_R)}{\mathbb{P}(P_L < P < P_R)}\\
+&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z)\, dz  }\\
+&= \int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x\,|\,P_L,P_R) f_x(y-x) \,dy\, dx
+\end{aligned}
+$$
+where $ g_P(x\,|\,P_L,P_R)$ represents the conditional distribution of primary event given lower $P_L$ and upper $P_R$ bounds.
+
+# The naive model
+
+The simplest approach to modeling epidemiological delay distributions is to ignore both truncation and censoring biases.
+Then, the likelihood of observing a delay $\mathbf{Y}_i$ given parameter $\boldsymbol{\theta}$ is straightforward:
+$$
+\mathcal{L}(\mathbf{Y}_i \, | \, \boldsymbol{\theta}) = f(y_i - x_i).
+$$
+where $y_i$ and $x_i$ are the observed primary and secondary event times.
+
+# The latent model
+
+Now, we incorporate both truncation and double censoring:
+$$
+\begin{aligned}
+\mathbb{P}(S_L < S < S_R \, | \, P_L < P < P_R, S<T) &= \mathbb{P}(P_L < P < P_R, S_L < S < S_R, S<T \, | \, P_L < P < P_R, S<T)\\
+&= \frac{\mathbb{P}(P_L < P < P_R, S_L < S < S_R, S<T)}{\mathbb{P}(P_L < P < P_R, S<T)}\\
+&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} \int_{z}^T g_P(z) f_z(w-z) \, dz \,dw  }\\
+&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z) F_z(T-z) \,dw  }\\
+&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x|P_L, P_R) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z|P_L, P_R) F_z(T-z) \,dw  }\\
+\end{aligned}
+$$
+Using latent, variables, we can now rewrite down the observation likelihood as:
+$$
+\begin{aligned}
+x_i &\sim g_P(x_i \, | \, p_{L, i}, p_{R, i}) \\
+y_i &\sim \text{Unif}(s_{L, i}, s_{R, i}) \\
+\mathcal{L}(\mathbf{Y} \, | \, \boldsymbol{\theta}) &= \prod_i \left[ \frac{f_{x_i}(y_i - x_i)}{\int_{P_{L, i}}^{P_{R, i}} g_P(z \, | \, p_{L, i}, p_{R, i}) F_z(T - z) \text{d}z} \right].
+\end{aligned}
+$$
+Modelling $g_P$ $\iff$ modelling incidence in primary events.
+This is a tough thing to do.
+
+<!--Let $\mathbf{Y}$ be the data vector and $\boldsymbol{\theta}$ be parameters.
+Then
+$$
+\mathcal{L}(\mathbf{Y} \, | \, \boldsymbol{\theta}) = \prod_i \mathbb{P}(S_{L, i} < S < S_{R, i} \, | \, P_{L, i} < P < P_{R, i}, S < T).
+$$
+>
+
+<!-- Shouldn't this be a joint likelihood? So it includes the prior? I think I took this from the paper, and the paper might have an issue -->
+
+<!-- This is just writing the model likelihood so I think it's ok to not have prior? -->
+
+Instead, the approach of the latent model [@ward2022transmission] is to:
+
+1. Assume a uniformly distributed primary event times
+$$
+\begin{aligned}
+x_i &\sim \text{Unif}(p_{L, i}, p_{R, i}) \\
+y_i &\sim \text{Unif}(s_{L, i}, s_{R, i}).
+\end{aligned}
+$$
+
+2. Assume that censoring interval is narrow such that
+$$
+\int_{P_{L, i}}^{P_{R, i}} g_P(z \, | \, p_{L, i}, p_{R, i}) F_{x_i}(T - z) \text{d}z \approx F_{x_i}(T - x_i).
+$$
+
+Then
+$$
+\mathcal{L}(\mathbf{Y}_i \, | \, \boldsymbol{\theta}) = \frac{f_x(y_i - x_i)}{F_x(T - x_i)}.
+$$
+
+
+## Bibliography {-}