diff --git a/vignettes/model.Rmd b/vignettes/model.Rmd new file mode 100644 index 000000000..d0054c5af --- /dev/null +++ b/vignettes/model.Rmd @@ -0,0 +1,189 @@ +--- +title: "Guide to the statistical models implemented in epidist" +output: + bookdown::html_document2: + fig_caption: yes + code_folding: show + number_sections: true +pkgdown: + as_is: true +# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +link-citations: true +vignette: > + %\VignetteIndexEntry{Model guide} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +bibliography: references.bib +--- + +```{r setup, include=FALSE} +# exclude compile warnings from cmdstanr +knitr::opts_chunk$set( + fig.path = file.path("figures", "epidist-"), + cache = TRUE, + collapse = TRUE, + comment = "#>", + message = FALSE, + warning = FALSE, + error = FALSE +) +``` + +The `epidist` package enables users to estimate delay distributions while accounting for common reporting biases. +This vignette first introduces the background [@park2024estimating] required to understand the models implemented in `epidist`. +It then goes on to explain each model in turn. + +# Background {#maths} + +Estimating a delay distribution may appear to be simple: one could imagine fitting a probability distribution to a set of observed delays. +However, observed delays are often biased during an ongoing outbereak. +We begin by presenting a formalism for characterizing delay distributions as well as two main biases (truncation and censoring) affecting them. +We then present statistical models for correcting for these biases. + +Any epidemiological delay requires a primary (starting) and a secondary (ending) event. +For example, the incubation period measures the time between infection (primary event) and symptom onset (secondary event). +Here, we use $p$ to denote the time of primary event, and $s$ to denote time of secondary event. + +We can measure any delay distribution in two different ways. +First, we can measure the forward distribution $f_p(\tau)$, starting from a cohort of individuals who experienced the primary at the same time $p$ and looking at when they experienced their secondary event. +Second, we can measure the backward distribution $b_s(\tau)$, starting from a cohort of individuals who experienced the secondary at the same time $s$ and looking at when they experienced their secondary event. +While the length of each delay $\tau = p-s$ remains constant whether we look at it forward or backward, the shape of the distribution is affected by the differences in perspectives. + +To illustrate their differences, let's assume that primaryand secondary events occur at at rates, or equivalently incidences, $\mathcal{P}(p)$ and $\mathcal{S}(s)$, respectively. +Then, the total density $\mathcal{T}(p, s)$ of individuals with primary event at time $p$ and secondary event at time $s$ can be expressed equivalently in terms of both forward and backward distributions: +$$ +\mathcal{T}(p, s) = \mathcal{P}(p) f_p(s - p) = \mathcal{S}(s) b_s(s - p). (\#eq:forwards-backwards) +$$ +Rearranging Equation \@ref(eq:forwards-backwards) gives +$$ +b_s(\tau) = \frac{\mathcal{P}(s - \tau) f_{s - \tau}(\tau)}{\mathcal{S}(s)}. (\#eq:forwards-backwards2) +$$ +The denominator of Equation \@ref(eq:forwards-backwards2), which corresponds to the incidence of secondary events, may be expressed as the integral over all possible delays +$$ +\mathcal{S}(s) = \int_{-\infty}^\infty \mathcal{P}(s - \tau) f_{s - \tau}(\tau) \text{d} \tau, +$$ +such that +$$ +b_s(\tau) = \frac{\mathcal{P}(s - \tau) f_{s - \tau}(\tau)}{\int_{-\infty}^\infty \mathcal{P}(s - \tau) f_{s - \tau}(\tau) \text{d} \tau}. +$$ +Here, we see that $b_s(\tau)$ depends not only on $f_{s - \tau}(\tau)$ but also on $\mathcal{P}(s - \tau)$, meaning that past changes in the incidence pattern will affect the shape of the distribution. +Particularly, when an epidemic is growing, we are more likely to observe shorter delays, causing underestimation of the mean delay. +Therefore, we always want to characterize epidemiological delays from the forward perspective and estimate the forward distribution. +Hereafter, we focus on biases that affect the estimation of the forward distribution. + +## Right truncation + +Right truncation refers to the bias arising from inability to observe future events. +For example, assume the data are right truncated and we don't observe secondary events past time $T$. +Then, we will only observe delays whose secondary events occurred before time $T$, causing us to underestimate the mean of the distribution. + +Let $P$ and $S$ be random variables. +Let $F_p$ be the forward cumulative distribution. +Then, the probability of observing a delay of length $\tau$ given that the primary event occurred at time $p$ and a truncation at time $T$ can be written as: +$$ +\begin{aligned} +\mathbb{P}(S = P + \tau \, | \, P = p, S < T) &= \frac{\mathbb{P}(S = P + \tau, P = p, S < T)}{\mathbb{P}(P = p, S < T)} \\ +&= \frac{\mathbb{P}(S = P + \tau < T \, | \, P = p)\mathbb{P}(P = p)}{\mathbb{P}(S < T \, | \, P = p)\mathbb{P}(P = p)} \\ +&= \frac{\mathbb{P}(S = P + \tau < T \, | \, P = p)}{\mathbb{P}(S < T \, | \, P = p)} \\ +&= \frac{f_p(\tau)}{\int_0^{T - p} f_p(x) \text{d}x} = \frac{f_p(\tau)}{F_p(T - p)}, \quad p + \tau < T. (\#eq:right-truncation) +\end{aligned} +$$ + +Examining Equation \@ref(eq:right-truncation) illustrates that (right) truncation renormalises the density by the values which are possible. +For example, if the distribution $x \sim \text{Unif}(0, 1)$ were right truncated by $T = 0.5$ then $x \sim \text{Unif}(0, 0.5)$. + +## Interval censoring + +The exact timing of epidemiological events is often unknown. +Instead, we may only know that the event happened within certain interval. +We refer to this as interval censoring. + +Assume the secondary event $S$ is censored and so we don't know when the event exactly happened. +Instead, we only know that the secondary event happened between $S_L$ and $S_R$. +Then, +$$ +\begin{aligned} +\mathbb{P}(S_L < S < S_R \, | \, P = p) &= \int_{S_L}^{S_R} f_p(y-p) dy\\ +&= F_p(S_R-p) - F_p(S_L-p) +\end{aligned} +$$ + +Now, assume that both primary $P$ and secondary $S$ events are truncated. +We only know that the primary event happened between $P_L$ and $P_R$ and the secondary event happened between $S_L$ and $S_R$. +We now write $g_P$ to denote the unconditional distribution of primary events. +Then, +$$ +\begin{aligned} +\mathbb{P}(S_L < S < S_R \, | \, P_L < P < P_R) &= \mathbb{P}(P_L < P < P_R, S_L < S < S_R \, | \, P_L < P < P_R)\\ +&= \frac{\mathbb{P}(P_L < P < P_R, S_L < S < S_R)}{\mathbb{P}(P_L < P < P_R)}\\ +&= \frac{\int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x) f_x(y-x) \,dy\, dx}{\int_{P_L}^{P_R} g_P(z)\, dz }\\ +&= \int_{P_L}^{P_R} \int_{S_L}^{S_R} g_P(x\,|\,P_L,P_R) f_x(y-x) \,dy\, dx +\end{aligned} +$$ +where $ g_P(x\,|\,P_L,P_R)$ represents the conditional distribution of primary event given lower $P_L$ and upper $P_R$ bounds. + +# The naive model + +The simplest approach to modeling epidemiological delay distributions is to ignore both truncation and censoring biases. +Then, the likelihood of observing a delay $\mathbf{Y}_i$ given parameter $\boldsymbol{\theta}$ is straightforward: +$$ +\mathcal{L}(\mathbf{Y}_i \, | \, \boldsymbol{\theta}) = f(y_i - x_i). +$$ +where $y_i$ and $x_i$ are the observed primary and secondary event times. + +# The latent model + +Now, we incorporate both truncation and double censoring: +$$ +\begin{aligned} +\mathbb{P}(S_L < S < S_R \, | \, P_L < P < P_R, S + + + + + +Instead, the approach of the latent model [@ward2022transmission] is to: + +1. Assume a uniformly distributed primary event times +$$ +\begin{aligned} +x_i &\sim \text{Unif}(p_{L, i}, p_{R, i}) \\ +y_i &\sim \text{Unif}(s_{L, i}, s_{R, i}). +\end{aligned} +$$ + +2. Assume that censoring interval is narrow such that +$$ +\int_{P_{L, i}}^{P_{R, i}} g_P(z \, | \, p_{L, i}, p_{R, i}) F_{x_i}(T - z) \text{d}z \approx F_{x_i}(T - x_i). +$$ + +Then +$$ +\mathcal{L}(\mathbf{Y}_i \, | \, \boldsymbol{\theta}) = \frac{f_x(y_i - x_i)}{F_x(T - x_i)}. +$$ + + +## Bibliography {-}