-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathr_intro_tidyverse_literate_programming.Rmd
413 lines (268 loc) · 11.3 KB
/
r_intro_tidyverse_literate_programming.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
---
title: "Literate Programming & Tidyverse Introduction"
author: "Brian High, Nancy Carmona & Chris Zuidema"
date: "![CC BY-SA 4.0](images/cc_by-sa_4.png)"
output:
ioslides_presentation:
fig_caption: yes
fig_height: 3
fig_retina: 1
fig_width: 5
keep_md: yes
logo: images/logo_128.png
smaller: yes
editor_options:
chunk_output_type: console
---
```{r set_knitr_options, echo=FALSE, message=FALSE, warning=FALSE}
suppressMessages(library(knitr))
opts_chunk$set(tidy=FALSE, cache=TRUE, echo=TRUE, message=FALSE)
```
## Learning Objectives
You will learn:
* What literate programming is and why it's important
* How to use R Markdown for literate programming
* What the "tidyverse" is
* What data "wrangling" is
* Packages and functions used to tidy and wrangle data in R
* How to tidy and wrangle data with R
## Reproducibility
For the purpose of this discussion, let's define:
* **_Replicate_**: to repeat a study (with new samples and data)
* **_Reproduce_**: to regenerate the results of an analysis (with the same data)
Our studies may be difficult to fully *replicate* ... but our analysis should always be *reproducible*.
Or in other words:
We should always be able to go from raw data to final results at any time ...
*and always get the same results* from the same data and the same analysis*.
`*` Even if you are doing modeling with randomly-generated values, you can set
the *seed* to produce the same values each time you run the analysis. There are
situations where you may encounter small rounding differences, however.
## Literate Programming
In the light of reproducibility, Literate Programming is:
Interspersing explanatory text with code to make a report that is executable
and improves reproducibility.
It allows a recipient to re-render the report to review the code/analysis and
verify the results (including plots and figures).
## Literate Programming Tools
There are two main platforms for literate programming:
* **_Python_** and other languages: Jupyter Notebook (a future topic!)
* **_R_**: R Markdown with *knitr* (today's focus) and [R Notebooks](https://blog.rstudio.com/2016/10/05/r-notebooks)
Both platforms can be used with other programming languages or a mix of them.
## R Markdown and *knitr*
To use R Markdown and *knitr* for Literate Programming, all you need is a text
editor, R, and a few packages such as *knitr*. These tools are well integrated into
RStudio.
* R Markdown: The text notation you use for formatting, based on Markdown
* *knitr*: A package by Yihui Xie used to render R Markdown documents
![Markdown document rendered as a webpage. Image by: [RStudio Blog](https://blog.rstudio.org/2014/06/)](images/storms-004.png)
## Installing R Markdown and *knitr*
You need the R packages *htmltools*, *rmarkdown* and *knitr*.
The first time to try to "knit" in RStudio, you will likely be prompted to
install these packages. If so, go ahead and install them as prompted.
Otherwise, you can install them manually:
```{r, eval=FALSE}
install.packages(c("htmltools", "rmarkdown", "knitr"), dependencies = TRUE)
```
## R Markdown Syntax
[R Markdown](https://rmarkdown.rstudio.com/) is a "wiki" syntax based on
[Markdown](https://daringfireball.net/projects/markdown/). Here is an
example:
![](images/rmarkdown_document.png)
## R Markdown Cheat Sheet
[R Markdown cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf)
![](images/cheatsheet.png)
## *knitr*
The R package which renders the R Markdown document is *knitr*. It uses other
packages and utilities as needed to produce different document formats.
In RStudio, you would normally run *knitr* using the *Knit* button.
If your document has been configured in R Markdown to generate a PDF document,
pressing the *Knit* button will create a `.pdf` file.
If you click the little black arrow button next to the *Knit* button, you can
select alternative output options.
You can also use the *File* -> *Knit Document* or *File* -> *Compile Notebook*
menu options.
## Creating an `.Rmd` file
Let's make our first `.Rmd` file. This is similar to making a script:
File -> New File -> R Markdown...
![](images/Rmd_save.png)
Be sure to save it! The filename "onramp_example.Rmd" will work for now.
## Features of the Markdown Document
* YAML Header
* Text
* Code chunks
![](images/rmd_example.png)
## Introduction to the Tidyverse
Before we get into data wrangling, let's look at the [Tidyverse](https://www.tidyverse.org/)
<div class="columns-2">
![](images/tidyverse.png)
- "The tidyverse is an opinionated collection of R packages designed for data science."
- "All packages share an underlying design philosophy, grammar, and data structures."
- There are tidyverse packages for data wrangling, modeling, and visualization.
</div>
## Data wrangling with *dplyr*
"[*dplyr*](https://dplyr.tidyverse.org/) is a grammar of data manipulation,
providing a consistent set of verbs that help you solve the most common data
manipulation challenges"
Common functions include:
* Choosing variables based on their name with `select()`
* Choosing rows based on a condition with `filter()`
* Grouping data by a variable with `group_by()`
* Creating new variables or changing old variables with `mutate()`
## Loading packages with *pacman*
* *pacman* will load packages and automatically install missing packages
* *pacman* is not a tidyverse package, but it works well with the tidyverse
```{r, message = FALSE}
# Load pacman, installing if needed
if (!require("pacman"))
install.packages("pacman", repos = "http://cran.r-project.org")
# Load other packages, installing as needed.
pacman::p_load(dplyr, ggplot2)
```
This will load these packages (installing first if neccessary):
- *dplyr* for data wrangling
- *ggplot2* for data visualization
## Example Dataset `airquality`
R has many built-in datasets - let's get an air quality dataset from New York in 1973.
```{r}
# Load the "airquality" dataset into the environment.
data(airquality)
```
We can show the structure this dataset with `str()`:
```{r}
# Show the structure of the dataset
str(airquality)
```
## Example Dataset `airquality`
Alternatively, you can use `class()`, `dim()`, and `head()`:
```{r}
# Show the class and dimensions (153 rows and 6 columns) of the dataset
class(airquality)
dim(airquality)
# Display the first three rows of data
head(airquality, n = 3)
```
## Wrangle *airquality*
Common data tasks are simplified with dplyr, once you learn the verbs.
Let's add a variable to airquality using `mutate()`
```{r}
# Add the year of the measurements and look at the first 10 rows
head(mutate(airquality, Year = 1973), n = 10)
```
## Wrangle *airquality*
Common data tasks are simplified with *dplyr*, once you learn the verbs.
Let's add a variable to `airquality` using `mutate()`
```{r}
# Add the year of the measurements and look at the first 10 rows
head(mutate(airquality, Year = 1973), n = 10)
```
## Sneak preview: Using "pipes"
Alternatively, we can do the same thing with "pipes", `%>%`:
```{r}
# Add the year of the measurements and look at the first 10 rows
airquality %>% mutate(Year = 1973) %>% head(n = 10)
```
Is this code with pipes more readable? Why? We'll see pipes again a little later...
## Wrangle *airquality*
Let's take a look at the new `airquality` dataframe with `head()`.
```{r}
# Show the dataframe, looking at the first few rows
head(airquality)
```
What happened here? Where did the `Year` column go?
In our previous step, a dataframe printed to the console (if you entered the command at the prompt) or displayed below the code chunk (if you ran the command from the Rmarkdown document) with the new `Year` variable.
We used `airquality` object, but since we did not make a new assignment (e.g., with `<-`), the change was not "saved" to the object in the environment.
## Wrangle *airquality*
Now we'll add `Year` and make a new assignment to save it to memory.
```{r}
# Add year to the dataframe
airquality <- mutate(airquality, Year = 1973)
# Take a look!
head(airquality)
```
Better!
## Wrangle *airquality*
Now let's create a date column. We'll create a character string first to break the steps down.
```{r}
# Make a character variable, then a date variable.
airquality <- mutate(airquality, Date_char = paste(Year, Month, Day, sep = "-"))
airquality <- mutate(airquality, Date = as.Date(Date_char, format = "%Y-%m-%d"))
head(airquality)
# Check the class.
class(airquality$Date)
```
## Wrangle *airquality*
Say we're only interested in data from May. Let's filter our data now.
The filter function takes a logical condition, requiring `==` (is equal to?)
which is different than `=` when we are assigning variables within functions.
```{r}
airquality <- filter(airquality, Month == 5)
head(airquality)
dim(airquality)
```
## Wrangle *airquality*
Next, let's now consider using the pipe operator, `%>%`. It will help us
streamline these operations to be more clear and succinct.
The pipe operator feeds the output of one function into the input of the next, avoiding having to make multiple assignments.
```{r}
# Reload dataset to start fresh.
data(airquality)
# Wrangle airquality with pipe operators starting with original dataframe.
airquality_may <- airquality %>%
# Add year and date variables using mutate.
mutate(Year = 1973,
Date_char = paste(Year, Month, Day, sep = "-"),
Date = as.Date(Date_char, format = "%Y-%m-%d")) %>%
# Filter dataframe to the month of May.
filter(Month == 5) %>%
# Select the variables of interest.
select(Ozone, Temp, Date)
```
## Wrangle *airquality*
Let's take a look.
```{r}
# Show the dataframe.
head(airquality_may)
```
## Wrangle *airquality*
What if we're interested in the average ozone and temperature by month? Would we have to repeat this process for each month? *dplyr* has easy and powerful functions to perform data wrangling tasks on groups: `group_by()` and `summarise()`
```{r}
# Average by month, using the argument `na.rm = TRUE` to ignore the NA values.
# If you omit `na.rm = TRUE`, then any NAs will cause the mean to be NA.
airquality_by_month <- airquality %>%
# Group dataframe by month.
group_by(Month) %>%
# Calculate average ozone concentration and temperature.
summarise(Ozone_avg = mean(Ozone, na.rm = TRUE),
Temp_avg = mean(Temp, na.rm = TRUE),
.groups = "keep")
```
The last argument, `.groups` specifies how to leave the grouping when the
operation is finished. In this case, we "keep" the grouping as it is. (This is
an "experimental" feature as of *dplyr* 1.0.2.)
## Wrangle *airquality*
```{r}
# Show new dataframe.
airquality_by_month
```
## Make a plot using *ggplot2*
Now let's make a simple plot so we have an example for our rendered document. The goal here is just to introduce the main plotting tool of the Tidyverse - we'll be covering *ggplot2* in more detail in the future.
```{r}
# Create a ggplot object and save it as "p".
p <- ggplot(data = airquality_may) +
# Specify the type of geometry and the aesthetic features.
geom_line(aes(x = Date, y = Ozone)) +
# Specify the labels.
labs(y = "Ozone Conc. (ppb)") +
# Choose the theme.
theme_classic()
```
## Make a plot using *ggplot2*
```{r}
# Show the plot.
print(p)
```
## Render Markdown Document
Next, let's render the document using the "Knit" button.
##
```{r child = 'images/questions.html'}
```