title | author | date | output | editor_options | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Many Models in Base-R |
Brian High and Rachel Shaffer |
09 May, 2020 |
|
|
Many models:
- Automate bootstrapping and "many models" using Chicago pollution data.
- Produce a table of model estimates for various combinations of covariates.
Background processing:
- Demonstrate alternative ways to render (knit) Rmd files.
- Explore the use of the
screen
utility.
Performance tuning:
- Compare performance of base-R functions with some more modern alternatives.
- Compare performance when using multiple processors in parallel.
Modular programming:
- Demonstrate use of the
source
function to read code from other R scripts. - Compare using
source
for functions to creating and using packages.
# Set options
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
options("kableExtra.html.bsTable" = TRUE, mc.cores = 1)
# Load packages
pacman::p_load(dlnm, ThermIndex, kableExtra)
Import the chicagoNMMAPS
dataset from the dlnm
package:
data(chicagoNMMAPS)
df <- chicagoNMMAPS
From help('chicagoNMMAPS', dlnm)
:
date: Date in the period 1987-2000.
time: The sequence of observations
year: Year
month: Month (numeric)
doy: Day of the year
dow: Day of the week (factor)
death: Counts of all cause mortality excluding accident
cvd: Cardiovascular Deaths
resp: Respiratory Deaths
temp: Mean temperature (in Celsius degrees)
dptp: Dew point temperature
rhum: Mean relative humidity
pm10: PM10
o3: Ozone
Create base model formulas from vectors of outcomes and exposures.
# Calculate humidex
df$hmdx <- with(df, humidex(temp, rhum))
# Make vector of outcomes
outcomes <- c("cvd", "death", "resp")
# Make vector of exposures
exposures <- c("o3", "pm10")
# Make formula of outcomes associated w/exposures for each possible combination
base_models <- paste0(outcomes, " ~ ", rep(exposures, each = length(outcomes)))
Here are the resulting base model formulas:
base_models
## [1] "cvd ~ o3" "death ~ o3" "resp ~ o3" "cvd ~ pm10" "death ~ pm10"
## [6] "resp ~ pm10"
From a vector of covariates, create a vector of model formulas.
# Make a vector of model covariates expanded as "A", "A + B", "A + B + C", etc.
sep <- ' + '
covariate <- c("temp", "hmdx", "dow")
covariates <- sapply(seq_along(covariate),
function(i) paste(covariate[1:i], collapse = sep))
# Develop all combinations of exposures, outcomes, covariates
models <- paste0(rep(base_models, each = length(covariates)), sep, covariates)
# Prepare a dataframe of models to display as a two-column table
models_df <- data.frame(`Exposure = O3` = models[grepl('o3', models)],
`Exposure = PM10` = models[grepl('pm10', models)],
check.names = FALSE)
Define a function that displays a dataframe as an HTML table.
show_html_table <- function(x) {
rownames(x) <- NULL
x_html <- knitr::kable(x = x, format = 'html')
kable_styling(x_html, full_width = TRUE, bootstrap_options = 'condensed')
}
show_html_table(models_df)
Exposure = O3 | Exposure = PM10 |
---|---|
cvd ~ o3 + temp | cvd ~ pm10 + temp |
cvd ~ o3 + temp + hmdx | cvd ~ pm10 + temp + hmdx |
cvd ~ o3 + temp + hmdx + dow | cvd ~ pm10 + temp + hmdx + dow |
death ~ o3 + temp | death ~ pm10 + temp |
death ~ o3 + temp + hmdx | death ~ pm10 + temp + hmdx |
death ~ o3 + temp + hmdx + dow | death ~ pm10 + temp + hmdx + dow |
resp ~ o3 + temp | resp ~ pm10 + temp |
resp ~ o3 + temp + hmdx | resp ~ pm10 + temp + hmdx |
resp ~ o3 + temp + hmdx + dow | resp ~ pm10 + temp + hmdx + dow |
Create boot_samples
with sample
function using lapply
.
nBoot <- 10
set.seed(12345)
boot_samples <- lapply(1:nBoot,
function(x) df[sample(1:nrow(df), replace = TRUE), ])
Examine the resulting list of dataframes.
length(boot_samples)
## [1] 10
dim(boot_samples[[1]])
## [1] 5114 15
Define functions to run the models and combine the results.
get_coefs <- function(x, .formula) coef(lm(.formula, x))
get_stats <- function(x, alpha = 0.05)
c(mean = mean(x), CI = quantile(x, c(alpha/2,1 - alpha/2)))
get_model_results <- function(.data, .formula, alpha = 0.05) {
coefs <- as.data.frame(t(sapply(.data, get_coefs, .formula)))
stats <- t(sapply(coefs, get_stats, alpha))
data.frame(model = rep(.formula, ncol(coefs)), variable = colnames(coefs),
cbind(stats), check.names = FALSE)
}
Run the models.
df_results <- do.call('rbind', lapply(models, function(model)
get_model_results(boot_samples, model)))
show_html_table(df_results[df_results$variable == 'o3', ])
model | variable | mean | CI.2.5% | CI.97.5% |
---|---|---|---|---|
cvd ~ o3 + temp | o3 | 0.0892533 | 0.0559730 | 0.1165423 |
cvd ~ o3 + temp + hmdx | o3 | 0.0931739 | 0.0535099 | 0.1301772 |
cvd ~ o3 + temp + hmdx + dow | o3 | 0.1017909 | 0.0633519 | 0.1402830 |
death ~ o3 + temp | o3 | 0.0897555 | 0.0529863 | 0.1264115 |
death ~ o3 + temp + hmdx | o3 | 0.0984191 | 0.0543167 | 0.1486057 |
death ~ o3 + temp + hmdx + dow | o3 | 0.1136979 | 0.0676347 | 0.1620248 |
resp ~ o3 + temp | o3 | 0.0108445 | 0.0052996 | 0.0192536 |
resp ~ o3 + temp + hmdx | o3 | 0.0119815 | 0.0015424 | 0.0201529 |
resp ~ o3 + temp + hmdx + dow | o3 | 0.0128436 | 0.0024441 | 0.0203854 |
show_html_table(df_results[df_results$variable == 'pm10', ])
model | variable | mean | CI.2.5% | CI.97.5% |
---|---|---|---|---|
cvd ~ pm10 + temp | pm10 | 0.0520236 | 0.0393346 | 0.0668540 |
cvd ~ pm10 + temp + hmdx | pm10 | 0.0427465 | 0.0304881 | 0.0552883 |
cvd ~ pm10 + temp + hmdx + dow | pm10 | 0.0425363 | 0.0296936 | 0.0542660 |
death ~ pm10 + temp | pm10 | 0.1145648 | 0.1014364 | 0.1313425 |
death ~ pm10 + temp + hmdx | pm10 | 0.0974559 | 0.0801501 | 0.1137221 |
death ~ pm10 + temp + hmdx + dow | pm10 | 0.0967002 | 0.0801824 | 0.1122437 |
resp ~ pm10 + temp | pm10 | 0.0104443 | 0.0076588 | 0.0131852 |
resp ~ pm10 + temp + hmdx | pm10 | 0.0095284 | 0.0059913 | 0.0129597 |
resp ~ pm10 + temp + hmdx + dow | pm10 | 0.0098003 | 0.0062184 | 0.0129570 |
rmarkdown::render("chicago_pollution.Rmd")
- RStudio provides only a single console.
Rscript -e 'rmarkdown::render("chicago_pollution.Rmd")'
- RStudio allows you to open multiple terminals.
- Create a R script containing:
rmarkdown::render("chicago_pollution.Rmd")
- In the Jobs tab, click "Start Local Job" and configure to run R script.
- You can run multiple jobs at once and you can launch jobs from the console.
rstudioapi::jobRunScript(path = "render.R", importEnv = TRUE)
With the "screen" utility you can run multiple shell processes simultaneously.
screen
creates a new screen. (Use Ctrl+A D to detach.)screen -S "NAME"
creates a new screen named "NAME".screen -dmS "NAME" CMD
creates a screen "NAME", runs CMD, and detaches.screen -r
connects to a running screen. (Or usescreen -r "NAME"
.)screen -list
orscreen -ls
shows running screens.
This script might run a little faster if some base-R functions are replaced with more modern equivalents.
For example, use of dplyr::bind_rows()
or data.table::rbindlist()
in
place of do.call('rbind', ...)
might make the script run a few percent
faster. However, in our tests, we found these changes to the get_results
chunk increased the run time by 2% and 4%, respectively. (nBoot <- 400
).
Further tests using rsample, purrr, and broom packages can be found in chicago_pollution_modeling.R.
Further gains may come from running repetitive tasks in parallel. Using the
parallel package and the mclapply()
function in place of lapply()
, we
found that the run time was reduced by 34% with two processors (cores). With
four cores it ran about twice as fast. (nBoot <- 400
).
As an alternative to defining the function in the Rmd file as we did above, we
can also save the function in a separate file and import it using the source
function.
source('model_lib.R')
This might be useful if the Rmd was exceedingly long and unwieldy to work with.
However, storing code separately like this may make it harder for others to follow or to verify reproducibility. One school of thought favors modularity for managability and code reuse, while the other prefers a monolithic approach.
If your functions are generally useful independently of this project, it may
be wise to share them as a package insted of using source
. What do you think?