Skip to content

Latest commit

 

History

History
508 lines (407 loc) · 14.6 KB

chicago_pollution.md

File metadata and controls

508 lines (407 loc) · 14.6 KB
title author date output editor_options
Many Models in Base-R
Brian High and Rachel Shaffer
09 May, 2020
ioslides_presentation
fig_caption fig_retina fig_width fig_height keep_md smaller incremental logo css template
true
1
5
3
true
true
false
img/logo_128.png
inc/deohs-ioslides-theme.css
inc/deohs-default-ioslides.html
chunk_output_type
console

Objectives

Many models:

  • Automate bootstrapping and "many models" using Chicago pollution data.
  • Produce a table of model estimates for various combinations of covariates.

Background processing:

  • Demonstrate alternative ways to render (knit) Rmd files.
  • Explore the use of the screen utility.

Performance tuning:

  • Compare performance of base-R functions with some more modern alternatives.
  • Compare performance when using multiple processors in parallel.

Modular programming:

  • Demonstrate use of the source function to read code from other R scripts.
  • Compare using source for functions to creating and using packages.

Setup

# Set options
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
options("kableExtra.html.bsTable" = TRUE, mc.cores = 1)
# Load packages
pacman::p_load(dlnm, ThermIndex, kableExtra)

Get the data

Import the chicagoNMMAPS dataset from the dlnm package:

data(chicagoNMMAPS)
df <- chicagoNMMAPS

From help('chicagoNMMAPS', dlnm):

date:   Date in the period 1987-2000.
time:   The sequence of observations
year:   Year
month:  Month (numeric)
doy:    Day of the year
dow:    Day of the week (factor)
death:  Counts of all cause mortality excluding accident
cvd:    Cardiovascular Deaths
resp:   Respiratory Deaths
temp:   Mean temperature (in Celsius degrees)
dptp:   Dew point temperature
rhum:   Mean relative humidity
pm10:   PM10
o3:     Ozone

Prepare the base models

Create base model formulas from vectors of outcomes and exposures.

# Calculate humidex
df$hmdx <- with(df, humidex(temp, rhum))

# Make vector of outcomes
outcomes <- c("cvd", "death", "resp")

# Make vector of exposures
exposures <- c("o3", "pm10")

# Make formula of outcomes associated w/exposures for each possible combination
base_models <- paste0(outcomes, " ~ ", rep(exposures, each = length(outcomes)))

Here are the resulting base model formulas:

base_models
## [1] "cvd ~ o3"     "death ~ o3"   "resp ~ o3"    "cvd ~ pm10"   "death ~ pm10"
## [6] "resp ~ pm10"

Add the covariates to models

From a vector of covariates, create a vector of model formulas.

# Make a vector of model covariates expanded as "A", "A + B", "A + B + C", etc.
sep <- ' + '
covariate <- c("temp", "hmdx", "dow")
covariates <- sapply(seq_along(covariate), 
                     function(i) paste(covariate[1:i], collapse = sep))

# Develop all combinations of exposures, outcomes, covariates 
models <- paste0(rep(base_models, each = length(covariates)), sep, covariates)

# Prepare a dataframe of models to display as a two-column table
models_df <- data.frame(`Exposure = O3` = models[grepl('o3', models)],
                        `Exposure = PM10` = models[grepl('pm10', models)], 
                        check.names = FALSE)

Define a function to show a table

Define a function that displays a dataframe as an HTML table.

show_html_table <- function(x) {
  rownames(x) <- NULL
  x_html <- knitr::kable(x = x, format = 'html')
  kable_styling(x_html, full_width = TRUE, bootstrap_options = 'condensed')
}

View the models

show_html_table(models_df)
Exposure = O3 Exposure = PM10
cvd ~ o3 + temp cvd ~ pm10 + temp
cvd ~ o3 + temp + hmdx cvd ~ pm10 + temp + hmdx
cvd ~ o3 + temp + hmdx + dow cvd ~ pm10 + temp + hmdx + dow
death ~ o3 + temp death ~ pm10 + temp
death ~ o3 + temp + hmdx death ~ pm10 + temp + hmdx
death ~ o3 + temp + hmdx + dow death ~ pm10 + temp + hmdx + dow
resp ~ o3 + temp resp ~ pm10 + temp
resp ~ o3 + temp + hmdx resp ~ pm10 + temp + hmdx
resp ~ o3 + temp + hmdx + dow resp ~ pm10 + temp + hmdx + dow

Create the bootstrap samples

Create boot_samples with sample function using lapply.

nBoot <- 10
set.seed(12345)
boot_samples <- lapply(1:nBoot, 
                       function(x) df[sample(1:nrow(df), replace = TRUE), ])

Examine the resulting list of dataframes.

length(boot_samples)
## [1] 10
dim(boot_samples[[1]])
## [1] 5114   15

Run the models

Define functions to run the models and combine the results.

get_coefs <- function(x, .formula) coef(lm(.formula, x))

get_stats <- function(x, alpha = 0.05) 
  c(mean = mean(x), CI = quantile(x, c(alpha/2,1 - alpha/2)))

get_model_results <- function(.data, .formula, alpha = 0.05) {
  coefs <- as.data.frame(t(sapply(.data, get_coefs, .formula)))
  stats <- t(sapply(coefs, get_stats, alpha))
  data.frame(model = rep(.formula, ncol(coefs)), variable = colnames(coefs),
             cbind(stats), check.names = FALSE)
}

Run the models.

df_results <- do.call('rbind', lapply(models, function(model)
  get_model_results(boot_samples, model)))

O3 exposure model estimates

show_html_table(df_results[df_results$variable == 'o3', ])
model variable mean CI.2.5% CI.97.5%
cvd ~ o3 + temp o3 0.0892533 0.0559730 0.1165423
cvd ~ o3 + temp + hmdx o3 0.0931739 0.0535099 0.1301772
cvd ~ o3 + temp + hmdx + dow o3 0.1017909 0.0633519 0.1402830
death ~ o3 + temp o3 0.0897555 0.0529863 0.1264115
death ~ o3 + temp + hmdx o3 0.0984191 0.0543167 0.1486057
death ~ o3 + temp + hmdx + dow o3 0.1136979 0.0676347 0.1620248
resp ~ o3 + temp o3 0.0108445 0.0052996 0.0192536
resp ~ o3 + temp + hmdx o3 0.0119815 0.0015424 0.0201529
resp ~ o3 + temp + hmdx + dow o3 0.0128436 0.0024441 0.0203854

PM10 exposure model estimates

show_html_table(df_results[df_results$variable == 'pm10', ])
model variable mean CI.2.5% CI.97.5%
cvd ~ pm10 + temp pm10 0.0520236 0.0393346 0.0668540
cvd ~ pm10 + temp + hmdx pm10 0.0427465 0.0304881 0.0552883
cvd ~ pm10 + temp + hmdx + dow pm10 0.0425363 0.0296936 0.0542660
death ~ pm10 + temp pm10 0.1145648 0.1014364 0.1313425
death ~ pm10 + temp + hmdx pm10 0.0974559 0.0801501 0.1137221
death ~ pm10 + temp + hmdx + dow pm10 0.0967002 0.0801824 0.1122437
resp ~ pm10 + temp pm10 0.0104443 0.0076588 0.0131852
resp ~ pm10 + temp + hmdx pm10 0.0095284 0.0059913 0.0129597
resp ~ pm10 + temp + hmdx + dow pm10 0.0098003 0.0062184 0.0129570

Appendix A: Alternate ways to render

In the R console:

  • rmarkdown::render("chicago_pollution.Rmd")
  • RStudio provides only a single console.

In the Terminal (Bash):

  • Rscript -e 'rmarkdown::render("chicago_pollution.Rmd")'
  • RStudio allows you to open multiple terminals.

In the Jobs tab:

  • Create a R script containing: rmarkdown::render("chicago_pollution.Rmd")
  • In the Jobs tab, click "Start Local Job" and configure to run R script.
  • You can run multiple jobs at once and you can launch jobs from the console.
  • rstudioapi::jobRunScript(path = "render.R", importEnv = TRUE)

Appendix B: The "screen" utility

With the "screen" utility you can run multiple shell processes simultaneously.

  • screen creates a new screen. (Use Ctrl+A D to detach.)
  • screen -S "NAME" creates a new screen named "NAME".
  • screen -dmS "NAME" CMD creates a screen "NAME", runs CMD, and detaches.
  • screen -r connects to a running screen. (Or use screen -r "NAME".)
  • screen -list or screen -ls shows running screens.

Use Ctrl+A D to detach from a screen.

Appendix C: Performance tuning

Tidyverse and data.table alternatives

This script might run a little faster if some base-R functions are replaced with more modern equivalents.

For example, use of dplyr::bind_rows() or data.table::rbindlist() in place of do.call('rbind', ...) might make the script run a few percent faster. However, in our tests, we found these changes to the get_results chunk increased the run time by 2% and 4%, respectively. (nBoot <- 400).

Further tests using rsample, purrr, and broom packages can be found in chicago_pollution_modeling.R.

Parallel processing with parallel

Further gains may come from running repetitive tasks in parallel. Using the parallel package and the mclapply() function in place of lapply(), we found that the run time was reduced by 34% with two processors (cores). With four cores it ran about twice as fast. (nBoot <- 400).

Appendix D: Read code with "source"

As an alternative to defining the function in the Rmd file as we did above, we can also save the function in a separate file and import it using the source function.

source('model_lib.R')

This might be useful if the Rmd was exceedingly long and unwieldy to work with.

However, storing code separately like this may make it harder for others to follow or to verify reproducibility. One school of thought favors modularity for managability and code reuse, while the other prefers a monolithic approach.

Discussion: "source" versus packages

If your functions are generally useful independently of this project, it may be wise to share them as a package insted of using source. What do you think?