---
title: "Data Manipulation Exercises"
author: "Joscelin Rocha Hidalgo"
output:
  html_document:
    css: slides/style.css
    toc: true
    toc_depth: 1
    toc_float: true
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Load Packages
Load the `tidyverse` package and the `ggplot2` package.
```{r}
```
# Import data: chds6162_data
Import the `chds6162_data` dataset into a data frame called `data`.
```{r}
```
# select
![](slides/images/select.png)
![](https://swcarpentry.github.io/r-novice-gapminder/fig/13-dplyr-fig1.png)
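As a warm-up, here is a minimal sketch of the main `select` patterns. It uses a made-up tibble called `toy` whose column names are illustrative assumptions, not columns from the exercise dataset.

```r
library(dplyr)

# Made-up data: these column names are hypothetical, not from chds6162_data
toy <- tibble(
  id     = 1:3,
  height = c(160, 172, 181),
  weight = c(55, 70, 82),
  city   = c("Lima", "Quito", "Bogota")
)

one_col   <- toy %>% select(height)              # a single column
two_cols  <- toy %>% select(id, city)            # several columns by name
col_range <- toy %>% select(height:city)         # a contiguous range of columns
to_end    <- toy %>% select(weight:last_col())   # from `weight` to the last column
dropped   <- toy %>% select(-city)               # everything except `city`
drop_rng  <- toy %>% select(-(height:weight))    # everything except a range
```

The same patterns should carry over to the exercises below once the column names are swapped for the ones in `data`.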
Use `select` to show just the `marital` variable.
```{r}
```
`select` for multiple variables:
Use `select` to show `marital` and `ed` variables.
```{r}
```
Use `select` for a range of columns.
`select` all the variables from `wt` to the end.
```{r}
```
Drop the `race` variable.
```{r}
```
Drop the father's variables, from `drace` to `dwt`.
```{r}
```
# mutate
![](slides/images/mutate.png)
![art by @allison_horst](slides/images/dplyr_mutate.png)
art by @allison_horst
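The three uses of `mutate` practiced in this section can be sketched on a small made-up tibble. The `toy` data and its column names are assumptions for illustration only.

```r
library(dplyr)

# Made-up data: these column names are hypothetical, not from chds6162_data
toy <- tibble(
  id      = 1:3,
  minutes = c(90, 125, 47),
  weight  = c(150, 180, 200)   # pretend this is recorded in pounds
)

toy_mutated <- toy %>%
  mutate(
    source = "survey",                # new variable with one fixed value
    hours  = round(minutes / 60, 2),  # new variable computed from another one
    weight = weight / 2.205           # overwrite an existing variable (pounds to kilograms)
  )
```

Reusing an existing name on the left of `=` overwrites that column, while a new name adds a column at the end.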
Create a **new variable with a specific value**
Create a new variable called `country` and fill that variable with "US".
```{r}
```
Create a **new variable based on other variables**
Create a new variable called `gestation_weeks` that gives gestation length in weeks rather than days (try rounding this number to 2 decimal places). Remember that the `gestation` variable is a measure of the **gestation length** in days. Then `select` both variables.
```{r}
```
Change an **existing variable**
Convert the `dwt` variable to kilos by dividing by 2.205 (it's in pounds now). Then, use `select` to show only the `dwt` variable.
```{r}
```
# case_when
![art by @allison_horst](slides/images/dplyr_case_when.png)
art by @allison_horst
Using `mutate` and `case_when`, let's create a variable called `age_group`. This variable recodes the mom's age variable into decades, so that moms aged 20 to 29 become "20s", moms aged 30 to 39 become "30s", and so on. Then select the `id` variable and the new variable.
Tip: Want to know the age range in your sample? Use `range(dataframe$variable, na.rm = TRUE)`.
Tip: Use `%in%` (you just learned it 😉).
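Here is a minimal sketch of how `range`, `%in%`, and `case_when` fit together. The `toy` tibble, the `score` column, and the bin labels are made up for illustration.

```r
library(dplyr)

# Made-up data: `score` is a hypothetical column, not from chds6162_data
toy <- tibble(id = 1:5, score = c(3, 12, 25, 18, NA))

# Smallest and largest observed values, ignoring NAs
score_range <- range(toy$score, na.rm = TRUE)

toy_binned <- toy %>%
  mutate(
    score_group = case_when(
      score %in% 0:9   ~ "0s",
      score %in% 10:19 ~ "10s",
      score %in% 20:29 ~ "20s"
      # rows that match no condition (including NA) end up as NA
    )
  ) %>%
  select(id, score_group)
```

Checking the range first tells you which bins you actually need to write.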
```{r}
```
# filter
![](slides/images/filter.png)
![art by @allison_horst](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_filter.jpg)
art by @allison_horst
We use `filter` to choose a subset of cases.
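For instance, here is a small sketch of the comparison styles used in this section, run on a made-up tibble. The `toy` data and the meaning of its codes are assumptions for illustration.

```r
library(dplyr)

# Made-up data: `status` and `level` are hypothetical coded variables
toy <- tibble(
  id     = 1:6,
  status = c(1, 3, 2, 3, 1, NA),
  level  = c(5, 5, 2, 4, 5, 1)
)

only_3 <- toy %>% filter(status == 3)           # keep rows where status equals 3
not_3  <- toy %>% filter(status != 3)           # keep rows where status differs from 3
in_set <- toy %>% filter(status %in% c(1, 2))   # keep rows whose status is in a set

# Filters can be chained, then narrowed to the columns of interest
chained <- toy %>%
  filter(level == 5) %>%
  filter(status != 3) %>%
  select(level, status)
```

Note that `filter` drops rows where the condition evaluates to `NA`, so the row with a missing `status` appears in none of the results above.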
Use `filter` to keep only respondents who are divorced (`marital` = 3).
```{r}
```
Use `filter` to keep only respondents who are **not** divorced.
```{r}
```
Use `%in%` within the `filter` function to keep only those whose mothers' race was reported as white. In this dataset, values from 0 to 5 represent "white" under the `race` variable.
```{r}
```
Create a chain that keeps only those who are college grads (`ed` value of 5). Then, `filter` to keep only those who are not married. Finally, use `select` to show only the `ed` and `marital` variables.
```{r}
```
# summarize or summarise
![](slides/images/summarize.png)
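A minimal sketch of `summarise` collapsing a column into one row of statistics, using a made-up `toy` tibble (the `weight` column is an assumption for illustration):

```r
library(dplyr)

# Made-up data with one missing value
toy <- tibble(weight = c(60, 75, NA, 90))

toy_summary <- toy %>%
  summarise(
    mean_w   = mean(weight, na.rm = TRUE),
    median_w = median(weight, na.rm = TRUE),
    min_w    = min(weight, na.rm = TRUE),
    max_w    = max(weight, na.rm = TRUE)
  )
```

Without `na.rm = TRUE`, each of these statistics would come back as `NA` because of the missing value.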
Get the mean, median, min, and max of the fathers' weight (`dwt`).
```{r}
```
# group_by
![](slides/images/group-by.png)
![art by @allison_horst](slides/images/group_by_ungroup.png)
art by @allison_horst
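As a quick illustration, `group_by` followed by `summarise` returns one row per group. The `toy` tibble and its columns are made up for this sketch.

```r
library(dplyr)

# Made-up data: `group` and `value` are hypothetical columns
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(10, NA, 4, 6, 8)
)

by_group <- toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value, na.rm = TRUE))  # NAs ignored within each group
```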
Calculate the mean weight for fathers based on their education levels (the `ded` variable has some NAs, so don't forget to remove them with `na.rm=`!)
```{r}
```
# across
![art by @allison_horst](slides/images/dplyr_across.png)
art by @allison_horst
We messed up: we found out that our collaborators need the mean of all the dad variables, grouped by the dads' race category.
Use `across` to get them all!
* Don't forget to remove NAs otherwise it will yell at you!
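One way to sketch this pattern is to pair `across` with a tidyselect helper inside `summarise`. The `toy` tibble and the `d_` prefix are assumptions for illustration; the exercise dataset's column names differ.

```r
library(dplyr)

# Made-up data: columns prefixed with d_ stand in for one family of variables
toy <- tibble(
  group    = c("a", "a", "b", "b"),
  d_height = c(170, 180, NA, 175),
  d_weight = c(70, 80, 90, NA),
  m_height = c(160, 165, 158, 170)
)

group_means <- toy %>%
  group_by(group) %>%
  summarise(across(starts_with("d_"), ~ mean(.x, na.rm = TRUE)))
```

The formula `~ mean(.x, na.rm = TRUE)` is applied to every selected column, so `na.rm` only has to be written once.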
```{r}
```
# Big exercise
Create a new data frame called `mothers_below30` that:
1. Uses `filter` to include only those younger than 30
2. Uses `mutate` to transform the `gestation` variable into weeks rather than days
3. Uses `group_by` to create mom education groups
4. Uses `summarise` to calculate a new variable called `mean_gestation_w`
5. Make sure to export it as a csv!
+ Once you have that data frame, move the variables to a different position using `relocate`
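The shape of such a chain can be sketched on made-up data before you try it on `data`. Everything here (the `toy` tibble, its column names, and the output file name) is an assumption for illustration, and the file is written to a temporary directory.

```r
library(dplyr)
library(readr)

# Made-up data: hypothetical columns, not from chds6162_data
toy <- tibble(
  id    = 1:6,
  age   = c(22, 35, 28, 19, 41, 25),
  days  = c(14, 21, 28, 7, 35, 21),
  level = c(1, 2, 1, 2, 1, 2)
)

young <- toy %>%
  filter(age < 30) %>%                                   # keep a subset of rows
  mutate(weeks = days / 7) %>%                           # convert days to weeks
  group_by(level) %>%                                    # one group per level
  summarise(mean_weeks = mean(weeks, na.rm = TRUE)) %>%  # one row per group
  relocate(mean_weeks, .before = level)                  # reorder the columns

# Export to a csv file in a temporary directory
out_path <- file.path(tempdir(), "young.csv")
write_csv(young, out_path)
```

`relocate` only changes column order, so it can go anywhere after the columns it names exist.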
```{r}
```