3.workflows.Rmd

---
title: "Towards Analysis Workflows in R"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: html_document
---

# Overview of this section

- Introducing piping
- **filter** verb
- **arrange** verb

```{r echo=FALSE}
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(tidyr)
patients <- tbl_df(read.delim("patient-data.txt"))
```


We've ended up with a long chain of steps to perform on our data. It is quite common to nest commands in R into a single line;

```{r}
patients <- tbl_df(read.delim("patient-data.txt"))
```

we read as "
> apply the `tbl_df` to the result of reading the file `patient-data.txt`

Could also do the same for our `mutate` statements, although this would quickly become tricky...


```{r}
patients_clean <- mutate(mutate(patients, Sex = factor(str_trim(Sex))),
                         ID = str_pad(ID,pad="0",width=3))
```

We always have to work out what the first statement was and work forwards from that.

Alternatively, we could write each command as a separate line

```{r}
patients_clean<- mutate(patients, Sex = factor(str_trim(Sex)))
patients_clean <- mutate(patients_clean, ID=str_pad(patients_clean$ID,pad = "0",width=3))
patients_clean <- mutate(patients_clean, Height= str_replace_all(patients_clean$Height,pattern = "cm",""))
```


- prone to error if we copy-and paste
- notice how the output of one line is the input to the following line


## Introducing piping

The output of one operations gets used as the input of the next

In computing, this is referring to as *piping* 
- unix commands use the `|` symbol

## magrittr

![not-a-pipe](images/pipe.jpeg)

![also-not-a-pipe](https://upload.wikimedia.org/wikipedia/en/b/b9/MagrittePipe.jpg)

- the magrittr library implements this in R

## Simple example

Read the file `patient-data` and pass the result to the `head` function.
```{r eval=FALSE}
patients <- read.delim("patient-data.txt") %>% 
  tbl_df
```

> read the file `patient-data.txt` ***and then*** use the `tbl_df` function

We can re-write our steps from above;

```{r}
patients_clean <- read.delim("patient-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Height= as.numeric(str_replace_all(patients_clean$Height,pattern = "cm","")))
```

> read the file `patient-data.txt` ***and then*** use the `tbl_df` function ***and then*** trim the whitespace from the Sex variable ***and then*** replace cm with blank characters in the Height variable

## Exercise: workflow-exercise.Rmd

******

Take the steps used to clean the patients dataset and calculate BMI (see template for the code)
- Re-write in the piping framework
- Add a step to print just the ID, Name, Date of Birth, Smokes and Overweight columns

******

```{r}
mutate(patients_clean, Weight = as.numeric(str_replace_all(patients_clean$Weight,"kg",""))) %>% 
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>% 
  mutate(Smokes = str_replace_all(Smokes, "Yes", TRUE)) %>% 
  select(ID, Name, Birth,BMI,Smokes,Overweight)

```

Now having displayed the relevant information for our patients, we want to extract rows of interest from the data frame.


## Selecting rows: The `filter` verb

![filter](images/filter.png)

The **`filter``** verb is used to select rows from the data frame. The criteria we use to select can use the comparisons `==`, `>`, `<`, `!=`

e.g. select all the males

```{r}
filter(patients, Sex == "Male") 
```

In base R, we would do

```{r eval=FALSE}
patients[patients$Sex == "Male",]
```

Again, to non R-users, this is less intuitive

```{r echo=FALSE}
head(patients[patients$Sex == "Male",])
```

Combining conditions can be achieved using `&` (and) `|` (or)

```{r}
filter(patients, Sex == "Male" & Died)
```

```{r eval=FALSE}
patients[patients$Sex == "Male" & patients$Died,]
```


```{r echo=FALSE}
patients[patients$Sex == "Male" & patients$Died,]
```


```{r}
filter(patients, Sex == "Female" | Grade_Level > 1)
```

A really convenient function is `top_n`

```{r}
top_n(patients_clean,10,Height)
top_n(patients_clean,10,Weight)
```


******

Modify the workflow to select the candidates (overweight smokers)
Write the result to a file

******


## Ordering rows: The `arrange` verb

```{r}
arrange(patients, Height)
```

## Re-usable pipelines

Imagine we have a second dataset that we want to process; `cohort-data.txt`. 


Take a moment to try and read the data into Excel (/LibreOffice) and consider if you would want to work with these data....


```{r eval=FALSE}
 read.delim("cohort-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Weight = as.numeric(str_replace_all(patients_clean$Weight,"kg",""))) %>% 
  mutate(Height= as.numeric(str_replace_all(patients_clean$Height,pattern = "cm",""))) %>% 
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "Yes", TRUE))) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
  filter(Smokes & Overweight) %>% 
  select(ID, Name, Birth, Smokes,Overweight)  %>% 
  write.table("study-candidates.txt")
```

As the file is quite large, we might want to switch to `readr` for smarter and faster reading

```{r eval=FALSE}
library(readr)
 read_tsv("cohort-data.txt") %>% 
  tbl_df %>% 
  mutate(Sex = factor(str_trim(Sex))) %>% 
  mutate(Weight = as.numeric(str_replace_all(patients_clean$Weight,"kg",""))) %>% 
  mutate(Height= as.numeric(str_replace_all(patients_clean$Height,pattern = "cm",""))) %>% 
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>% 
  mutate(Smokes = str_replace_all(Smokes, "Yes", TRUE)) %>% 
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", FALSE))) %>%
  filter(Smokes & Overweight) %>% 
  select(ID, Name, Smokes,Overweight)  %>% 
  write.table("study-candidates.txt")
```