simulation.Rmd

---
title: A sylized scenario
author: Gaebler et al.
date: Tuesday, October 11, 2020
output:
  github_document:
    pandoc_args: --webtex
---

```{r setup}
library(tidymodels)
library(tidyverse)
library(furrr)

# Set number of workers.
plan(multicore, workers = 4)

# Set the ggplot theme.
theme_set(theme_bw(base_size = 20))
```

# Overview

We construct a stylized example demonstrating that subset ignorability is
sufficient to estimate the $\text{SATE}_M$, the second-stage sample average treatment
effect. Our simulation models a city where police officers encounter many
residents. When an officer encounters a resident, the officer observes whether
the resident is actively engaging in a drug transaction ($A$) and forms an
impression of their race ($D$). The officer then decides whether or not to
arrest the individual ($M$). The officer generates a report about the encounter
which indictes whether there was a drug transaction ($R$). If the officer
arrests the resident, the report, in which the officer also records their
perception of the resident's race, is transmitted to the prosecutor, along with
the resident's criminal history ($X$). The prosecutor then makes a decision
about whether to charge the resident with a crime ($Y$) based only on the
information contained in these documents, viz., the individual's criminal
history, whether the individual was reported to be engaged in a drug
transaction, and the race indicated in the report ($Z$).

The object is to understand whether the prosecutor's charging decisions are
motivated by racial animus, subject to the limitation that only data on arrested
individuals ($M = 1$) are available.

# Setup

A "book of life" is a draw of each potential outcome for a population of a given
size. We'll define a function that generates a book of life for a given set of
parameters. The meanings of the parameters matches the manuscript, and are
re-explained as we go over each of the three steps in the data generating
process.


```{r book_of_life}
gen_book_of_life <- function(
  pop_size,
  mu_T,
  mu_X,
  delta,
  mu_A,
  gamma_,
  alpha_0,
  alpha_A,
  alpha_black,
  lambda_0,
  lambda_A,
  lambda_black,
  beta_0,
  beta_X,
  beta_R,
  beta_black,
  ...
) {
  # Define the structural equations.
  f_T <- identity
  f_D <- identity
  f_Z <- identity

  f_A <- function(t, u_A){
    val <- mu_A + t*gamma_
    return(as.integer(u_A <= val))
  }
  
  f_X <- function(t, u_X){
    val <- mu_X + t*delta
    return(as.integer(u_X <= val))
  }
  
  f_M <- function(d, a, u_M) {
    val <- alpha_0 + a*alpha_A + d*alpha_black
    return(as.integer(u_M <= val))
  }
  
  f_R <- function(d, a, u_R) {
    val <- lambda_0 + a*lambda_A + d*lambda_black
    return(u_R <= val)
  }
  
  f_Y <- function(z, m, r, x, u_Y) {
    val <- beta_0 + x*beta_X + r*beta_R + z*beta_black
    return(m * (u_Y <= val))
  }
  
  # Generate the book of life.
  book_of_life <- tibble(
      # Generate the exogenous variables.
      U_T = rbinom(pop_size, 1, mu_T),
      U_A = runif(pop_size),
      U_X = runif(pop_size),
      U_R = runif(pop_size),
      U_M = runif(pop_size),
      U_Y = runif(pop_size),
      
      # Generate the endogenous variables.
      TT = f_T(U_T),
      D = f_D(TT),
      Z = f_Z(D),
      A = f_A(TT, U_A),
      X = f_X(TT, U_X),
      M = f_M(D, A, U_M),
      R = f_R(D, A, U_R),
      Y = f_Y(Z, M, R, X, U_Y),

      # Generate the potential outcomes.
      M_w = f_M(0, A, U_M),
      M_b = f_M(1, A, U_M),
      Y_w_1 = f_Y(0, 1, R, X, U_Y),
      Y_b_1 = f_Y(1, 1, R, X, U_Y)
  )
}
```

Next, we define two convenience function which, given the book of life, produce
the stratified difference-in-means and linear regression estimates, both when
the covariate R is and is not observed. These functions also return the standard
error, computed according to Eq. (8) in the case of the stratified
difference-in-means estimator and according to the standard calculation for
linear models.

```{r convenience_functions}
extract_sdm <- function(book_of_life) {
  confounded <- book_of_life %>%
    filter(M == 1) %>%
    group_by(Z, X) %>%
    summarize(c_zx = mean(Y), S_zx = n(), .groups = 'drop') %>%
    group_by(X) %>%
    summarize(
      n_x = sum(S_zx),
      se_x = sum(c_zx * (1 - c_zx) / S_zx),
      diff_in_means = sum(Z * c_zx - (1 - Z) * c_zx),
      .groups='drop'
    ) %>%
    summarize(
      estimate = sum(n_x * diff_in_means) / sum(n_x),
      se = sqrt(sum((n_x / sum(n_x))^2 * se_x))
    )
  
  unconfounded <- book_of_life %>%
    filter(M == 1) %>%
    group_by(Z, X, R) %>%
    summarize(c_zxr = mean(Y), S_zxr = n(), .groups='drop') %>%
    group_by(X, R) %>%
    summarize(
      n_xr = sum(S_zxr),
      se_xr = sum(c_zxr * (1 - c_zxr) / S_zxr),
      diff_in_means = sum(Z * c_zxr - (1 - Z) * c_zxr),
      .groups='drop'
    ) %>%
    summarize(
      estimate = sum(n_xr * diff_in_means) / sum(n_xr),
      se = sqrt(sum((n_xr / sum(n_xr))^2 * se_xr))
    )
  
  rbind(confounded, unconfounded) %>%
    mutate(confounded = c('Confounded', 'Unconfounded'))
}

extract_lpm <- function(book_of_life) {
  confounded <- book_of_life %>%
    filter(M == 1) %>%
    lm(formula = Y ~ X + Z, data = .) %>%
    tidy() %>%
    filter(term == "Z") %>%
    select(estimate = estimate, se = std.error)
  
  unconfounded <- book_of_life %>%
    filter(M == 1) %>%
    lm(formula = Y ~ X + R + Z, data = .) %>%
    tidy() %>%
    filter(term == "Z") %>%
    select(estimate = estimate, se = std.error)
  
  rbind(confounded, unconfounded) %>%
    mutate(confounded = c('Confounded', 'Unconfounded'))
}
```

# Simulation

We now turn to the simulation itself.

## Estimates

We'll generate estimates using the following grid of parameters.

```{r parameters}
params <- expand_grid(
  pop_size = 1e5,
  mu_T = 0.3,
  mu_X = 0.3,
  delta = 0.1,
  mu_A = 0.3,
  gamma_ = -0.1,
  alpha_0 = 0.1,
  alpha_A = 0.3,
  alpha_black = seq(from = 0.2, to = 0.4, by = 0.05),
  lambda_0 = 0.2,
  lambda_A = 0.6,
  lambda_black = 0.1,
  beta_0 = 0.2,
  beta_X = 0.2,
  beta_R = 0.2,
  beta_black = seq(from = 0.2, to = 0.4, by = 0.05)
)
```

Our estimates are then as follows:

```{r estimates}
gen_sim_results <- function(params, n = 1) {
  one_run <- function(...) {
    params %>% 
      mutate(
        books_of_life = pmap(., gen_book_of_life),
        sdm = map(books_of_life, extract_sdm),
        lpm = map(books_of_life, extract_lpm),
      ) %>%
      unnest(cols = c("sdm", "lpm"), names_sep = "_") %>%
      select(-books_of_life, -sdm_confounded) %>%
      rename(confounded = lpm_confounded)
  }
  
  results <- future_map_dfr(
    1:n,
    one_run,
    .options = future_options(seed = TRUE)
  )
}

set.seed(1)
unisim <- gen_sim_results(params)
```

We can plot them to see how well our estimates perform. (Note that in this
setup, the $\text{SATE}_M$ is exactly equal to $\beta_{\text{black}}$.) We use a normal
approximation to generate our 95% confidence intervals.

```{r plot_single, fig.height = 6, fig.width = 12}
unisim %>%
  ggplot(aes(x = alpha_black, color = as_factor(beta_black))) +
  facet_wrap(vars(confounded)) + 
  geom_hline(aes(yintercept = beta_black, color = as_factor(beta_black))) + 
  geom_pointrange(
    aes(
      y = sdm_estimate,
      ymin = sdm_estimate - 1.96 * sdm_se,
      ymax = sdm_estimate + 1.96 * sdm_se
    ),
    position = position_nudge(x = 0.007)
  ) +
  geom_pointrange(
    aes(
      y = lpm_estimate,
      ymin = lpm_estimate - 1.96 * lpm_se,
      ymax = lpm_estimate + 1.96 * lpm_se
    ),
    position = position_nudge(x = -0.007),
    shape = 21,
    fill = "white"
  ) +
  geom_text(
    mapping = aes(
      x = max(params$alpha_black),
      y = beta_black,
      label = str_c("beta[black] == ", beta_black)
    ),
    hjust = 0,
    vjust = 1.5,
    size = 6,
    parse = TRUE,
    position = position_nudge(x = 0.05)
  ) +
  scale_x_continuous(
    limits = c(min(params$alpha_black) - 0.03, max(params$alpha_black) + 0.13),
    breaks = sort(unique(params$alpha_black))
  ) +
  scale_y_continuous(
    limits = c(min(params$beta_black) - 0.07, max(params$beta_black) + 0.03),
    breaks = sort(unique(params$beta_black))
  ) + 
  labs(
    x = expr(alpha[black]),
    y = expr(widehat(SATE)[M])
  ) +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 20, face = "bold"),
    axis.text = element_text(color = "black"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank()
  )
```

We also want to plot the results of many simulations to see what the
distribution of the estimator is. We'll do 10,000 simuations of size 100,000 to
see what happens.

```{r plot_many, fig.height = 6, fig.width = 12}
set.seed(2)
millisim <- gen_sim_results(params, 1e4)

# Calculate 95% centered coverage and plot for the estimates.
millisim %>%
  group_by(alpha_black, beta_black, confounded) %>%
  summarize(
    sdm_lower = quantile(sdm_estimate, probs = 0.025),
    sdm_upper = quantile(sdm_estimate, probs = 0.975),
    sdm_estimate = mean(sdm_estimate),
    lpm_lower = quantile(lpm_estimate, probs = 0.025),
    lpm_upper = quantile(lpm_estimate, probs = 0.975),
    lpm_estimate = mean(lpm_estimate),
    .groups = 'drop'
  ) %>%
  ggplot(aes(x = alpha_black, color = as_factor(beta_black))) +
  facet_wrap(vars(confounded)) +
  geom_hline(aes(yintercept = beta_black, color = as_factor(beta_black))) + 
  geom_pointrange(
    aes(y = sdm_estimate, ymin = sdm_lower, ymax = sdm_upper),
    position = position_nudge(x = 0.007)
  ) +
  geom_pointrange(
    aes(y = lpm_estimate, ymin = lpm_lower, ymax = lpm_upper),
    position = position_nudge(x = -0.007),
    shape = 21,
    fill = "white"
  ) +
  geom_text(
    mapping = aes(
      x = max(params$alpha_black),
      y = beta_black,
      label = str_c("beta[black] == ", beta_black)
    ),
    hjust = 0,
    vjust = 1.5,
    size = 6,
    parse = TRUE,
    position = position_nudge(x = 0.05)
  ) +
  scale_x_continuous(
    limits = c(min(params$alpha_black) - 0.03, max(params$alpha_black) + 0.13),
    breaks = sort(unique(params$alpha_black))
  ) +
  scale_y_continuous(
    limits = c(min(params$beta_black) - 0.07, max(params$beta_black) + 0.03),
    breaks = sort(unique(params$beta_black))
  ) + 
  labs(
    x = expr(alpha[black]),
    y = expr(widehat(SATE)[M])
  ) +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 20, face = "bold"),
    axis.text = element_text(color = "black"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank()
  )

ggsave('figures/millisim.pdf', height = 6, width = 12)
```

## Confidence Intervals

We want to check the true coverage rate of the confidence intervals we've
calculated. For each of the 10,000 simulations of each of the 25 
parameter combinations, we check whether the calculated value falls within 
the 95% confidence interval.

```{r confidence_intervals}
coverage <- millisim %>%
  group_by(confounded, alpha_black, beta_black) %>%
  mutate(
    sdm_lower = sdm_estimate - 1.96 * sdm_se,
    sdm_upper = sdm_estimate + 1.96 * sdm_se,
    sdm_coverage = sdm_lower <= beta_black & sdm_upper >= beta_black,
    lpm_lower = lpm_estimate - 1.96 * lpm_se,
    lpm_upper = lpm_estimate + 1.96 * lpm_se,
    lpm_coverage = lpm_lower <= beta_black & lpm_upper >= beta_black
  ) %>%
  summarize_at(vars(sdm_coverage, lpm_coverage), mean)

coverage %>% print(n=Inf)
```

Briefly reviewing our table of estimates shows that our coverage is roughly 95%
for all of the confidence intervals in the unconfounded scenario. In the confounded
scenario, the intervals rarely cover the true values, as expected.