title

author

date

output

editor_options

Many Models in Base-R

Brian High and Rachel Shaffer

09 May, 2020

ioslides_presentation

fig_caption	fig_retina	fig_width	fig_height	keep_md	smaller	incremental	logo	css	template
true	1	5	3	true	true	false	img/logo_128.png	inc/deohs-ioslides-theme.css	inc/deohs-default-ioslides.html

chunk_output_type
console

Objectives

Many models:

Automate bootstrapping and "many models" using Chicago pollution data.
Produce a table of model estimates for various combinations of covariates.

Background processing:

Demonstrate alternative ways to render (knit) Rmd files.
Explore the use of the screen utility.

Performance tuning:

Compare performance of base-R functions with some more modern alternatives.
Compare performance when using multiple processors in parallel.

Modular programming:

Demonstrate use of the source function to read code from other R scripts.
Compare using source for functions to creating and using packages.

Setup

# Set options
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
options("kableExtra.html.bsTable" = TRUE, mc.cores = 1)

# Load packages
pacman::p_load(dlnm, ThermIndex, kableExtra)

Get the data

Import the chicagoNMMAPS dataset from the dlnm package:

data(chicagoNMMAPS)
df <- chicagoNMMAPS

From help('chicagoNMMAPS', dlnm):

date:   Date in the period 1987-2000.
time:   The sequence of observations
year:   Year
month:  Month (numeric)
doy:    Day of the year
dow:    Day of the week (factor)
death:  Counts of all cause mortality excluding accident
cvd:    Cardiovascular Deaths
resp:   Respiratory Deaths
temp:   Mean temperature (in Celsius degrees)
dptp:   Dew point temperature
rhum:   Mean relative humidity
pm10:   PM10
o3:     Ozone

Prepare the base models

Create base model formulas from vectors of outcomes and exposures.

# Calculate humidex
df$hmdx <- with(df, humidex(temp, rhum))

# Make vector of outcomes
outcomes <- c("cvd", "death", "resp")

# Make vector of exposures
exposures <- c("o3", "pm10")

# Make formula of outcomes associated w/exposures for each possible combination
base_models <- paste0(outcomes, " ~ ", rep(exposures, each = length(outcomes)))

Here are the resulting base model formulas:

base_models

## [1] "cvd ~ o3"     "death ~ o3"   "resp ~ o3"    "cvd ~ pm10"   "death ~ pm10"
## [6] "resp ~ pm10"

Add the covariates to models

From a vector of covariates, create a vector of model formulas.

# Make a vector of model covariates expanded as "A", "A + B", "A + B + C", etc.
sep <- ' + '
covariate <- c("temp", "hmdx", "dow")
covariates <- sapply(seq_along(covariate), 
                     function(i) paste(covariate[1:i], collapse = sep))

# Develop all combinations of exposures, outcomes, covariates 
models <- paste0(rep(base_models, each = length(covariates)), sep, covariates)

# Prepare a dataframe of models to display as a two-column table
models_df <- data.frame(`Exposure = O3` = models[grepl('o3', models)],
                        `Exposure = PM10` = models[grepl('pm10', models)], 
                        check.names = FALSE)

Define a function to show a table

Define a function that displays a dataframe as an HTML table.

show_html_table <- function(x) {
  rownames(x) <- NULL
  x_html <- knitr::kable(x = x, format = 'html')
  kable_styling(x_html, full_width = TRUE, bootstrap_options = 'condensed')
}

View the models

show_html_table(models_df)

Exposure = O3	Exposure = PM10
cvd ~ o3 + temp	cvd ~ pm10 + temp
cvd ~ o3 + temp + hmdx	cvd ~ pm10 + temp + hmdx
cvd ~ o3 + temp + hmdx + dow	cvd ~ pm10 + temp + hmdx + dow
death ~ o3 + temp	death ~ pm10 + temp
death ~ o3 + temp + hmdx	death ~ pm10 + temp + hmdx
death ~ o3 + temp + hmdx + dow	death ~ pm10 + temp + hmdx + dow
resp ~ o3 + temp	resp ~ pm10 + temp
resp ~ o3 + temp + hmdx	resp ~ pm10 + temp + hmdx
resp ~ o3 + temp + hmdx + dow	resp ~ pm10 + temp + hmdx + dow

Create the bootstrap samples

Create boot_samples with sample function using lapply.

nBoot <- 10
set.seed(12345)
boot_samples <- lapply(1:nBoot, 
                       function(x) df[sample(1:nrow(df), replace = TRUE), ])

Examine the resulting list of dataframes.

length(boot_samples)

## [1] 10

dim(boot_samples[[1]])

## [1] 5114   15

Run the models

Define functions to run the models and combine the results.

get_coefs <- function(x, .formula) coef(lm(.formula, x))

get_stats <- function(x, alpha = 0.05) 
  c(mean = mean(x), CI = quantile(x, c(alpha/2,1 - alpha/2)))

get_model_results <- function(.data, .formula, alpha = 0.05) {
  coefs <- as.data.frame(t(sapply(.data, get_coefs, .formula)))
  stats <- t(sapply(coefs, get_stats, alpha))
  data.frame(model = rep(.formula, ncol(coefs)), variable = colnames(coefs),
             cbind(stats), check.names = FALSE)
}

Run the models.

df_results <- do.call('rbind', lapply(models, function(model)
  get_model_results(boot_samples, model)))

O3 exposure model estimates

show_html_table(df_results[df_results$variable == 'o3', ])

model	variable	mean	CI.2.5%	CI.97.5%
cvd ~ o3 + temp	o3	0.0892533	0.0559730	0.1165423
cvd ~ o3 + temp + hmdx	o3	0.0931739	0.0535099	0.1301772
cvd ~ o3 + temp + hmdx + dow	o3	0.1017909	0.0633519	0.1402830
death ~ o3 + temp	o3	0.0897555	0.0529863	0.1264115
death ~ o3 + temp + hmdx	o3	0.0984191	0.0543167	0.1486057
death ~ o3 + temp + hmdx + dow	o3	0.1136979	0.0676347	0.1620248
resp ~ o3 + temp	o3	0.0108445	0.0052996	0.0192536
resp ~ o3 + temp + hmdx	o3	0.0119815	0.0015424	0.0201529
resp ~ o3 + temp + hmdx + dow	o3	0.0128436	0.0024441	0.0203854

PM10 exposure model estimates

show_html_table(df_results[df_results$variable == 'pm10', ])

model	variable	mean	CI.2.5%	CI.97.5%
cvd ~ pm10 + temp	pm10	0.0520236	0.0393346	0.0668540
cvd ~ pm10 + temp + hmdx	pm10	0.0427465	0.0304881	0.0552883
cvd ~ pm10 + temp + hmdx + dow	pm10	0.0425363	0.0296936	0.0542660
death ~ pm10 + temp	pm10	0.1145648	0.1014364	0.1313425
death ~ pm10 + temp + hmdx	pm10	0.0974559	0.0801501	0.1137221
death ~ pm10 + temp + hmdx + dow	pm10	0.0967002	0.0801824	0.1122437
resp ~ pm10 + temp	pm10	0.0104443	0.0076588	0.0131852
resp ~ pm10 + temp + hmdx	pm10	0.0095284	0.0059913	0.0129597
resp ~ pm10 + temp + hmdx + dow	pm10	0.0098003	0.0062184	0.0129570

Appendix A: Alternate ways to render

In the R console:

rmarkdown::render("chicago_pollution.Rmd")
RStudio provides only a single console.

In the Terminal (Bash):

Rscript -e 'rmarkdown::render("chicago_pollution.Rmd")'
RStudio allows you to open multiple terminals.

In the Jobs tab:

Create a R script containing: rmarkdown::render("chicago_pollution.Rmd")
In the Jobs tab, click "Start Local Job" and configure to run R script.
You can run multiple jobs at once and you can launch jobs from the console.
rstudioapi::jobRunScript(path = "render.R", importEnv = TRUE)

Appendix B: The "screen" utility

With the "screen" utility you can run multiple shell processes simultaneously.

screen creates a new screen. (Use Ctrl+A D to detach.)
screen -S "NAME" creates a new screen named "NAME".
screen -dmS "NAME" CMD creates a screen "NAME", runs CMD, and detaches.
screen -r connects to a running screen. (Or use screen -r "NAME".)
screen -list or screen -ls shows running screens.

Appendix C: Performance tuning

Tidyverse and data.table alternatives

This script might run a little faster if some base-R functions are replaced with more modern equivalents.

For example, use of dplyr::bind_rows() or data.table::rbindlist() in place of do.call('rbind', ...) might make the script run a few percent faster. However, in our tests, we found these changes to the get_results chunk increased the run time by 2% and 4%, respectively. (nBoot <- 400).

Further tests using rsample, purrr, and broom packages can be found in chicago_pollution_modeling.R.

Parallel processing with parallel

Further gains may come from running repetitive tasks in parallel. Using the parallel package and the mclapply() function in place of lapply(), we found that the run time was reduced by 34% with two processors (cores). With four cores it ran about twice as fast. (nBoot <- 400).

Appendix D: Read code with "source"

As an alternative to defining the function in the Rmd file as we did above, we can also save the function in a separate file and import it using the source function.

source('model_lib.R')

This might be useful if the Rmd was exceedingly long and unwieldy to work with.

However, storing code separately like this may make it harder for others to follow or to verify reproducibility. One school of thought favors modularity for managability and code reuse, while the other prefers a monolithic approach.

Discussion: "source" versus packages

If your functions are generally useful independently of this project, it may be wise to share them as a package insted of using source. What do you think?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chicago_pollution.md

chicago_pollution.md

Objectives

Setup

Get the data

Prepare the base models

Add the covariates to models

Define a function to show a table

View the models

Create the bootstrap samples

Run the models

O3 exposure model estimates

PM10 exposure model estimates

Appendix A: Alternate ways to render

In the R console:

In the Terminal (Bash):

In the Jobs tab:

Appendix B: The "screen" utility

Appendix C: Performance tuning

Tidyverse and data.table alternatives

Parallel processing with parallel

Appendix D: Read code with "source"

Discussion: "source" versus packages

Files

chicago_pollution.md

Latest commit

History

chicago_pollution.md

File metadata and controls

Objectives

Setup

Get the data

Prepare the base models

Add the covariates to models

Define a function to show a table

View the models

Create the bootstrap samples

Run the models

O3 exposure model estimates

PM10 exposure model estimates

Appendix A: Alternate ways to render

In the R console:

In the Terminal (Bash):

In the Jobs tab:

Appendix B: The "screen" utility

Appendix C: Performance tuning

Tidyverse and data.table alternatives

Parallel processing with parallel

Appendix D: Read code with "source"

Discussion: "source" versus packages