index.Rmd

--- 
title: "40 Days and 40 Nights"
author:
- Martin Morgan^[Roswell Park Comprehensive Cancer Center, Martin.Morgan@RoswellPark.org]
- L. Shawn Matott^[Roswell Park Comprehensive Cancer Center]
date: "`r Sys.Date()`"
bibliography:
- book.bib
- packages.bib
description: "An ad hoc course to learn R for bioinformatics, using the \nCOVID-19
  epidemic as an excuse.\n"
documentclass: book
link-citations: yes
site: bookdown::bookdown_site
biblio-style: apalike
---

# Motivation {-}

This is a WORK IN PROGRESS.

This course was suggested and enabled by Adam Kisailus and Richard Hershberger. It is available for Roswell Park graduate students.

## Introduction {-}

The word '[quarantine][quarantine-def]' is from the 1660's and refers to the fourty days (Italian _quaranta giorni_) a ship suspected of carrying disease was kept in isolation.

What to do in a quarantine? The astronaut Scott Kelly spent nearly a year on the International Space Station. In a New York Times [opinion piece][] he says, among other things, that 'you need a hobby', and what better hobby than a useful one? Let's take the opportunity provided by COVID-19 to learn R for statistical analysis and comprehension of data. Who knows, it may be useful after all this is over!

[quarantine-def]: https://www.etymonline.com/word/quarantine
[opinion piece]: https://www.nytimes.com/2020/03/21/opinion/scott-kelly-coronavirus-isolation.html

## What to expect {-}

We'll meet via zoom twice a week, Mondays and Fridays, for one hour. We'll use this time to make sure everyone is making progress, and to introduce new or more difficult topics. Other days we'll have short exercises and activities that hopefully provide an opportunity to learn at your own speed.

We haven't thought this through much, but roughly we might cover:

- Week \@ref(one): We'll start with the basics of installing and using R. We'll set up _R_ and _RStudio_ on your local computer, or if that doesn't work use a cloud-based RStudio. We'll learn the basics of _R_ -- numeric, character, logical, and other vectors; variables; and slightly more complicated representations of 'factors' and dates. We'll also use _RStudio_ to write a script that allows us to easily re-create an analysis, illustrating the power concept of *reproducible research*.

  ```{r}
  activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
  minutes_per_activity <- c(20, 30, 60, 60, 60)
  minutes_per_activity >= 60
  activity[minutes_per_activity >= 60]
  ```

- Week \@ref(two): The `data.frame`. This week is all about _R_'s `data.frame`, a versatile way of representing and manipulating a table (like an Excel spreadsheet) of data. We'll learn how to create, write, and read a `data.frame`; how to go from data in a spreadsheet in Excel to a `data.frame` in _R_; and how to perform simple manipulations on a `data.frame`, like creating a subset of data, summarizing values in a column, and summarizing values in one column based on a grouping variable in another column.

  ```{r}
  url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
  cases <- read.csv(url)
  erie <- subset(cases, county == "Erie" & state == "New York")
  tail(erie)
  ```


- Week \@ref(three): Packages for extending _R_. A great strength of _R_ is its extensibility through packages. We'll learn about [CRAN][], and install and use the 'tidyverse' suite of packages. The tidyverse provides us with an alternative set of tools for working with tabular data, and We'll use publicly available data to explore the spread of COVID-19 in the US. We'll read, filter, mutate (change), and select subsets of the data, and group data by one column (e.g., 'state') to create summaries (e.g., cases per state). We'll also start to explore data visualization, creating our first plots of the spread of COVID-19.

  ```{r, message = FALSE}
  library(dplyr)
  library(ggplot2)
  ## ...additional commands
  ```

  ```{r, echo = FALSE, fig = TRUE, message = FALSE, warning = FALSE}
  covid = readr::read_csv(url) %>%
      mutate(date = lubridate::ymd(date))
  
  erie =
      covid %>% 
      filter(
          county %in% c("Erie"),
          state == "New York"
      ) %>% 
      mutate(
          new_cases = c(diff(cases + deaths), NA)
      ) %>%
      head(-1)
  
  ggplot(erie, aes(x = date, y = new_cases)) +
      scale_y_log10() + 
      geom_point() +
      geom_smooth() +
      ggtitle("Erie County, New York")
  ```

- Week \@ref(four): Machine learning. This week will develop basic machine learning models for exploring data. 

- Week \@ref(five): Bioinformatic analysis with [Bioconductor][]. _Bioconductor_ is a collection of more than 1800 _R_ packages for the statistical analysis and comprehension of high-throughput genomic data. We'll use _Bioconductor_ to look at COVID-19 genome sequences, and to explore emerging genomic data relevant to the virus.

    ```{r, out.width = "48%", echo = FALSE}
    knitr::include_graphics("images/05-SARS-CoV-2-phylogeny.png")
    knitr::include_graphics("images/05-ACE2.png")
    ```

- Week \@ref(six): COVID-19 has really shown the value of open data and collaboration. In the final week of our quarantine, we'll explore collaboration. We'll learn about writing 'markdown' vignettes (reports to) share our results with others, such as our lab colleagues. We'll write and document functions so that we can easily re-do steps in an analysis. And we will synthesize the vignettes and functions into a package for documenting and sharing our work.

[CRAN]: https://cran.r-project.org
[Bioconductor]: https://bioconductor.org