-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathr_intro_tidyverse_wrangling.Rmd
306 lines (210 loc) · 7.84 KB
/
r_intro_tidyverse_wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
---
title: "Introduction to the Tidyverse and Data Wrangling"
author: "Brian High, Nancy Carmona & Chris Zuidema"
date: "![CC BY-SA 4.0](images/cc_by-sa_4.png)"
output:
ioslides_presentation:
fig_caption: yes
fig_height: 3
fig_retina: 1
fig_width: 5
keep_md: yes
logo: images/logo_128.png
smaller: yes
editor_options:
chunk_output_type: console
---
```{r set_knitr_options, echo=FALSE, message=FALSE, warning=FALSE}
suppressMessages(library(knitr))
opts_chunk$set(tidy=FALSE, cache=TRUE, echo=TRUE, message=FALSE)
```
## Learning Objectives
You will learn:
* What the "tidyverse" is
* What data "wrangling" is
* Packages and functions used to tidy and wrangle data in R
* How to tidy and wrangle data with R
## Introduction to the Tidyverse
Before we get into data wrangling, let's look at the [Tidyverse](https://www.tidyverse.org/)
<div class="columns-2">
![](images/tidyverse.png)
- "The tidyverse is an opinionated collection of R packages designed for data science."
- "All packages share an underlying design philosophy, grammar, and data structures."
- There are tidyverse packages for data wrangling, modeling, and visualization.
</div>
## Data wrangling with *dplyr*
"[*dplyr*](https://dplyr.tidyverse.org/) is a grammar of data manipulation,
providing a consistent set of verbs that help you solve the most common data
manipulation challenges"
Common functions include:
* Choosing variables based on their name with `select()`
* Choosing rows based on a condition with `filter()`
* Grouping data by a variable with `group_by()`
* Creating new variables or changing old variables with `mutate()`
## Loading packages with *pacman*
* *pacman* will load packages and automatically install missing packages
* *pacman* is not a tidyverse package, but it works well with the tidyverse
```{r, message = FALSE}
# Load pacman, installing if needed
if (!require(pacman)){ install.packages("pacman") }
# Load other packages, installing as needed
pacman::p_load(dplyr, ggplot2)
```
This will load these packages (installing first if neccessary):
- *dplyr* for data wrangling
- *ggplot2* for data visualization
## Example Dataset *airquality*
R has many built-in datasets - let's get an air quality dataset from New York in 1973.
```{r}
# Load the "airquality" dataset into the environment
data(airquality)
```
We can show the structure this dataset with `str()`:
```{r}
# Show the structure of the dataset
str(airquality)
```
## Example Dataset *airquality*
Alternatively, you can use `class()`, `dim()`, and `head()`:
```{r}
# Show the class and dimensions (153 rows and 6 columns) of the dataset
class(airquality)
dim(airquality)
# Display the first three rows of data
head(airquality, n = 3)
```
## Wrangle *airquality*
Common data tasks are simplified with *dplyr*, once you learn the verbs.
Let's add a variable to *airquality* using `mutate()`
```{r}
# Add the year of the measurements and look at the first 10 rows
head(mutate(airquality, Year = 1973), n = 10)
```
## Sneak preview: Using "pipes"
Alternatively, we can do the same thing with "pipes," `|>`.
```{r}
# Add the year of the measurements and look at the first 10 rows
airquality |> mutate(Year = 1973) |> head(n = 10)
```
Is this code with pipes more readable? Why? We'll see pipes again a little later...
## Wrangle *airquality*
Let's take a look at the new *airquality* dataframe with `head()`.
```{r}
# Show the dataframe, looking at the first few rows
head(airquality)
```
What happened here? Where did the `Year` column go?
In our previous step, a dataframe printed to the console (if you entered the
command at the prompt) or displayed below the code chunk (if you ran the command
from the Rmarkdown document) with the new `Year` variable.
We used the *airquality* object, but since we did not make a new assignment
(e.g., with `<-`), the change was not "saved" to the object in the environment.
## Wrangle *airquality*
Now we'll add `Year` and make a new assignment to save it to memory.
```{r}
# Add year to the dataframe
airquality <- mutate(airquality, Year = 1973)
# Take a look!
head(airquality)
```
Better!
## Wrangle *airquality*
Now let's create a date column. We'll create a character string first to break
the steps down.
```{r}
# Make a character variable, then a date variable.
airquality <- mutate(airquality, Date_char = paste(Year, Month, Day, sep = "-"))
airquality <- mutate(airquality, Date = as.Date(Date_char, format = "%Y-%m-%d"))
head(airquality)
# Check the class.
class(airquality$Date)
```
## Wrangle *airquality*
Say we're only interested in data from May. Let's filter our data now.
The filter function takes a logical condition, requiring `==` (is equal to?)
which is different than `=` when we are assigning variables within functions.
```{r}
airquality <- filter(airquality, Month == 5)
head(airquality)
dim(airquality)
```
## The pipe operator
Next, let's consider wrangling *airquality* using the pipe operator. There are
two common pipe operators in R:
* The "native" pipe operator, `|>` (starting with R version 4.1.0)
* The *magrittr* pipe operator, `%>%`
These two pipes are largely the same, but there are some
[differences](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) to
be aware of. You're also likely to encounter both types of pipes.
Pipes help to streamline consecutive operations to be more clear and succinct.
Pipe operators feed the output of one function into the input of the next,
avoiding having to make multiple assignments. In the next code chunk, we'll use
the *magrittr* pipe operator, `%>%`.
## Wrangle *airquality*
### with the [*magrittr*](https://magrittr.tidyverse.org/) pipe operator
```{r}
# Reload dataset to start fresh.
data(airquality)
# Wrangle airquality with pipe operators starting with original dataframe
airquality_may <- airquality %>%
# Add year and date variables using mutate
mutate(Year = 1973,
Date_char = paste(Year, Month, Day, sep = "-"),
Date = as.Date(Date_char, format = "%Y-%m-%d")) %>%
# Filter dataframe to the month of May
filter(Month == 5) %>%
# Select the variables of interest.
select(Ozone, Temp, Date)
```
## Wrangle *airquality*
Let's take a look.
```{r}
# Show the dataframe.
head(airquality_may)
```
## Wrangle *airquality*
What if we're interested in the average ozone and temperature by month? Would we
have to repeat this process for each month? *dplyr* has easy and powerful
functions to perform data wrangling tasks on groups: `group_by()` and
`summarise()`.
```{r}
# Average by month, using the argument `na.rm = TRUE` to ignore the NA values
# If you omit `na.rm = TRUE`, then any NAs will cause the mean to be NA
airquality_by_month <- airquality %>%
# Group dataframe by month.
group_by(Month) %>%
# Calculate average ozone concentration and temperature.
summarise(Ozone_avg = mean(Ozone, na.rm = TRUE),
Temp_avg = mean(Temp, na.rm = TRUE),
.groups = "keep")
```
The last argument, `.groups` specifies how to leave the grouping when the
operation is finished. In this case, we "keep" the grouping as it is. (This is
an "experimental" feature as of *dplyr*.)
## Wrangle *airquality*
```{r}
# Show new dataframe
airquality_by_month
```
## Make a plot using *ggplot2*
Now let's make a simple plot so we have an example for our rendered document.
The goal here is just to introduce the main plotting tool of the Tidyverse -
we'll be covering *ggplot2* in more detail in the future.
```{r}
# Create a ggplot object and save it as "p"
p <- ggplot(data = airquality_may) +
# Specify the type of geometry and the aesthetic features
geom_line(aes(x = Date, y = Ozone)) +
# Specify the labels.
labs(y = "Ozone Conc. (ppb)") +
# Choose the theme.
theme_classic()
```
## Make a plot using *ggplot2*
```{r show_plot}
# Show the plot.
print(p)
```
##
```{r child = 'images/questions.html'}
```