---
title: "Towards Analysis Workflows in R"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: html_document
---
# Overview of this section
- Introducing piping
- **filter** verb
- **arrange** verb
```{r echo=FALSE}
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(tidyr)
patients <- tbl_df(read.delim("patient-data.txt"))
```
We've ended up with a long chain of steps to perform on our data. It is quite common to nest commands in R into a single line;
```{r}
patients <- tbl_df(read.delim("patient-data.txt"))
```
which we read as:
> apply the `tbl_df` function to the result of reading the file `patient-data.txt`

We could also do the same for our `mutate` statements, although this quickly becomes difficult to read.
```{r}
patients_clean <- mutate(mutate(patients, Sex = factor(str_trim(Sex))),
ID = str_pad(ID,pad="0",width=3))
```
To read nested code like this, we always have to find the innermost statement and work outwards from there.
Alternatively, we could write each command as a separate line
```{r}
patients_clean <- mutate(patients, Sex = factor(str_trim(Sex)))
patients_clean <- mutate(patients_clean, ID = str_pad(ID, pad = "0", width = 3))
patients_clean <- mutate(patients_clean, Height = str_replace_all(Height, pattern = "cm", ""))
```
- prone to error if we copy-and-paste
- notice how the output of one line is the input to the following line
## Introducing piping
The output of one operation is used as the input of the next.
In computing, this is referred to as *piping*
- unix commands use the `|` symbol
## magrittr


- the magrittr library implements this in R
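
To make the idea concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The vector `heights_cm` is invented for illustration and is not part of the course data.

```{r}
library(magrittr)

# Toy values, invented for illustration (not part of the course data)
heights_cm <- c("182cm", "165cm", "171cm")

# Nested form: has to be read from the inside out
nested <- mean(as.numeric(sub("cm", "", heights_cm)))

# Piped form: reads from top to bottom
piped <- heights_cm %>%
  sub("cm", "", .) %>%   # the dot marks where the piped value is placed
  as.numeric() %>%
  mean()

# Both forms give the same answer
identical(nested, piped)
```

By default `%>%` passes the left-hand side as the first argument of the function on the right. The dot is only needed when the value belongs elsewhere, as with `sub` here.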
## Simple example
Read the file `patient-data.txt` and pass the result to the `tbl_df` function.
```{r eval=FALSE}
patients <- read.delim("patient-data.txt") %>%
tbl_df
```
> read the file `patient-data.txt` ***and then*** use the `tbl_df` function
We can re-write our steps from above;
```{r}
patients_clean <- read.delim("patient-data.txt") %>%
  tbl_df %>%
  mutate(Sex = factor(str_trim(Sex))) %>%
  # refer to the piped column directly rather than patients_clean$Height
  mutate(Height = as.numeric(str_replace_all(Height, pattern = "cm", "")))
```
> read the file `patient-data.txt` ***and then*** use the `tbl_df` function ***and then*** trim the whitespace from the Sex variable ***and then*** replace cm with blank characters in the Height variable
## Exercise: workflow-exercise.Rmd
******
Take the steps used to clean the patients dataset and calculate BMI (see template for the code)
- Re-write in the piping framework
- Add a step to print just the ID, Name, Date of Birth, Smokes and Overweight columns
******
```{r}
mutate(patients_clean, Weight = as.numeric(str_replace_all(Weight, "kg", ""))) %>%
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>%
  # the replacement should be a string, so "TRUE" is quoted
  mutate(Smokes = str_replace_all(Smokes, "Yes", "TRUE")) %>%
  select(ID, Name, Birth, BMI, Smokes, Overweight)
```
Now having displayed the relevant information for our patients, we want to extract rows of interest from the data frame.
## Selecting rows: The `filter` verb

The **`filter`** verb is used to select rows from the data frame. The selection criteria can use the comparisons `==`, `>`, `<` and `!=`.
For example, to select all the males:
```{r}
filter(patients, Sex == "Male")
```
In base R, we would do
```{r eval=FALSE}
patients[patients$Sex == "Male",]
```
Again, to non R-users, this is less intuitive
```{r echo=FALSE}
head(patients[patients$Sex == "Male",])
```
Conditions can be combined using `&` (and) and `|` (or)
```{r}
filter(patients, Sex == "Male" & Died)
```
```{r eval=FALSE}
patients[patients$Sex == "Male" & patients$Died,]
```
```{r echo=FALSE}
patients[patients$Sex == "Male" & patients$Died,]
```
```{r}
filter(patients, Sex == "Female" | Grade_Level > 1)
```
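
The difference between `&` and `|` is easier to see on a small table where the expected rows can be checked by eye. This is only a sketch: the data frame `toy` is invented for illustration and does not come from `patient-data.txt`.

```{r}
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  Name = c("Ann", "Bob", "Cat", "Dan"),
  Sex = c("Female", "Male", "Female", "Male"),
  Died = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Both conditions must hold: only Bob
both <- filter(toy, Sex == "Male" & Died)

# Conditions given as separate arguments are combined with "and"
both_commas <- filter(toy, Sex == "Male", Died)

# Either condition may hold: Bob, Cat and Dan
either <- filter(toy, Sex == "Male" | Died)
```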
A really convenient function is `top_n`, which selects the rows with the largest values of a given variable
```{r}
top_n(patients_clean,10,Height)
top_n(patients_clean,10,Weight)
```
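
As a small sketch of what `top_n` returns, the example below uses an invented data frame (`toy_heights`, not part of the course data). Note that the selected rows are not sorted by the chosen variable, so a sort can be added if the order matters.

```{r}
library(dplyr)

# Toy data, invented for illustration
toy_heights <- data.frame(
  Name = c("Ann", "Bob", "Cat", "Dan"),
  Height = c(165, 182, 171, 176),
  stringsAsFactors = FALSE
)

# The two tallest people: Bob and Dan
tallest <- top_n(toy_heights, 2, Height)

# Sort the selected rows so that the tallest comes first
tallest_sorted <- toy_heights %>%
  top_n(2, Height) %>%
  arrange(desc(Height))
```

Recent versions of the dplyr documentation describe `top_n` as superseded by `slice_max`, so the latter may be preferable in new code.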
******
Modify the workflow to select the candidates (overweight smokers)
Write the result to a file
******
## Ordering rows: The `arrange` verb
```{r}
arrange(patients, Height)
```
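
By default `arrange` sorts in increasing order. A minimal sketch of the common variations, again on an invented data frame (`toy_sizes` is an assumption for illustration only):

```{r}
library(dplyr)

# Toy data, invented for illustration
toy_sizes <- data.frame(
  Name = c("Ann", "Bob", "Cat", "Dan"),
  Sex = c("Female", "Male", "Female", "Male"),
  Height = c(165, 182, 171, 176),
  stringsAsFactors = FALSE
)

# Increasing order of Height (the default)
shortest_first <- arrange(toy_sizes, Height)

# Decreasing order, using desc()
tallest_first <- arrange(toy_sizes, desc(Height))

# Sort by Sex first, then by decreasing Height within each Sex
by_sex_then_height <- arrange(toy_sizes, Sex, desc(Height))
```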
## Re-usable pipelines
Imagine we have a second dataset that we want to process; `cohort-data.txt`.
Take a moment to try and read the data into Excel (/LibreOffice) and consider if you would want to work with these data....
```{r eval=FALSE}
read.delim("cohort-data.txt") %>%
  tbl_df %>%
  mutate(Sex = factor(str_trim(Sex))) %>%
  # use the piped columns, not those of patients_clean
  mutate(Weight = as.numeric(str_replace_all(Weight, "kg", ""))) %>%
  mutate(Height = as.numeric(str_replace_all(Height, pattern = "cm", ""))) %>%
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>%
  # convert to logical only after both "Yes" and "No" have been replaced
  mutate(Smokes = str_replace_all(Smokes, "Yes", "TRUE")) %>%
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", "FALSE"))) %>%
  filter(Smokes & Overweight) %>%
  select(ID, Name, Birth, Smokes, Overweight) %>%
  write.table("study-candidates.txt")
```
As the file is quite large, we might want to switch to `readr` for smarter and faster reading
```{r eval=FALSE}
library(readr)
read_tsv("cohort-data.txt") %>%
  tbl_df %>%
  mutate(Sex = factor(str_trim(Sex))) %>%
  # use the piped columns, not those of patients_clean
  mutate(Weight = as.numeric(str_replace_all(Weight, "kg", ""))) %>%
  mutate(Height = as.numeric(str_replace_all(Height, pattern = "cm", ""))) %>%
  mutate(BMI = (Weight/(Height/100)^2), Overweight = BMI > 25) %>%
  mutate(Smokes = str_replace_all(Smokes, "Yes", "TRUE")) %>%
  mutate(Smokes = as.logical(str_replace_all(Smokes, "No", "FALSE"))) %>%
  filter(Smokes & Overweight) %>%
  select(ID, Name, Smokes, Overweight) %>%
  write.table("study-candidates.txt")
```
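
One way to make a pipeline like this re-usable is to wrap the steps in a function, so that the same code can be applied to any dataset with the same columns. The sketch below is only an illustration: the function name `find_candidates` and the data frame `toy_cohort` are hypothetical, and the mix of "Yes"/"No" and "TRUE"/"FALSE" values in `Smokes` is an assumption about how the real data are coded.

```{r}
library(dplyr)
library(stringr)

# Hypothetical helper that applies the cleaning and filtering steps
# to any data frame with the expected columns
find_candidates <- function(data) {
  data %>%
    mutate(Sex = factor(str_trim(Sex))) %>%
    mutate(Weight = as.numeric(str_replace_all(Weight, "kg", ""))) %>%
    mutate(Height = as.numeric(str_replace_all(Height, "cm", ""))) %>%
    mutate(BMI = Weight / (Height / 100)^2, Overweight = BMI > 25) %>%
    mutate(Smokes = str_replace_all(Smokes, "Yes", "TRUE")) %>%
    mutate(Smokes = as.logical(str_replace_all(Smokes, "No", "FALSE"))) %>%
    filter(Smokes & Overweight) %>%
    select(ID, Name, Smokes, Overweight)
}

# Toy data, invented for illustration (the coding of Smokes is assumed)
toy_cohort <- data.frame(
  ID = c("001", "002", "003"),
  Name = c("Ann", "Bob", "Cat"),
  Sex = c(" Female", "Male ", "Female"),
  Smokes = c("Yes", "No", "TRUE"),
  Height = c("165cm", "182cm", "171cm"),
  Weight = c("80kg", "95kg", "60kg"),
  stringsAsFactors = FALSE
)

# Only Ann is both a smoker and overweight
candidates <- find_candidates(toy_cohort)
```

The same function could then be called on the result of `read.delim` or `read_tsv`, and its output piped to `write.table`, without repeating the individual steps.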