---
title: "Lab Session 4: Read and manipulate data"
author: "Ari Anisfeld"
date: "8/29/2021"
output: pdf_document
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(dplyr.summarise.inform = FALSE)
```
Before diving into this lab, please open the pdf and read the background and data sections!
# Getting the data
Download the data [here](https://github.com/harris-coding-lab/harris-coding-lab.github.io/raw/master/data/data_traffic.csv). You can save the file directly from your browser using ctrl + s or cmd + s. Alternatively, you can read the csv directly from the internet, as we saw in lab 0, using the link https://github.com/harris-coding-lab/harris-coding-lab.github.io/raw/master/data/data_traffic.csv
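If you take the direct route, a minimal sketch (assuming the `tidyverse` is installed) looks like this:
```{r, eval = FALSE}
library(tidyverse)

# read_csv() accepts a URL, so no manual download is needed
traffic_data <-
  read_csv("https://github.com/harris-coding-lab/harris-coding-lab.github.io/raw/master/data/data_traffic.csv")
```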
# Warm-up
1. Open the `Rmd` and save it in your coding lab folder; if you downloaded the data, move your data file to your preferred data location.
2. In your `Rmd`, write code to load your packages. If you load packages in the console, you will get an error when you knit because knitting starts a fresh R session.
3. Load `data_traffic.csv` and assign it to the name `traffic_data`. This data was scraped from the UCPD website and partially cleaned by Prof. Jones.
```{r}
```
4. Recall that `group_by()` operates silently. Below I create a new data frame called `grouped_data`.
```{r}
grouped_data <-
  traffic_data %>%
  group_by(Race, Gender)
```
a. How can you tell `grouped_data` is different from `traffic_data`?
b. How many groups (Race-Gender pairs) are in the data? (This information should be available without writing additional code!)
c. Without running the code, predict the dimensions (number of rows by number of columns) of the tibbles created by `traffic_data %>% summarize(n = n())` and `grouped_data %>% summarize(n = n())`.
d. Now check your intuition by running the code.
```{r}
```
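For reference, here is what the comparison looks like:
```{r, eval = FALSE}
traffic_data %>% summarize(n = n())  # one row: the whole ungrouped tibble collapses to one summary
grouped_data %>% summarize(n = n())  # one row per Race-Gender pair
```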
5. Use `group_by()` and `summarize()` to recreate the table in the pdf.
```{r}
```
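If you get stuck, here is a sketch of the `group_by()` + `summarize()` route:
```{r, eval = FALSE}
traffic_data %>%
  group_by(Race, Gender) %>%
  summarize(n = n())
```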
6. Use `count()` to produce the same table.
```{r}
```
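`count()` collapses the grouping and summarizing into a single call; a sketch:
```{r, eval = FALSE}
# count(Race, Gender) is shorthand for group_by(Race, Gender) %>% summarize(n = n())
traffic_data %>%
  count(Race, Gender)
```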
## Moving beyond counts
1. Raw counts are okay, but frequencies (or proportions) are easier to compare across data sets. Add a column with frequencies and assign the new tibble to the name `traffic_stop_freq`. The result should be identical to Prof. Jones's analysis on Twitter.
Try it on your own first. If you're not sure how to add a frequency, you could google "add a proportion to count with tidyverse" and find this [stackoverflow post](https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr). Follow the advice of the top answer; the green checkmark and large number of upvotes indicate the answer is likely reliable.
```{r}
```
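A sketch along the lines of the accepted answer, assuming you want each Race-Gender pair's share of all stops (Prof. Jones may have grouped on different columns):
```{r, eval = FALSE}
traffic_stop_freq <-
  traffic_data %>%
  count(Race, Gender) %>%
  # sum(n) is the total number of stops because the tibble is ungrouped here
  mutate(freq = n / sum(n))

traffic_stop_freq
```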
2. The frequencies out of context are not super insightful. What additional information do we need to argue the police are disproportionately stopping members of a certain group? (Hint: Prof. Jones shares the information in his tweets.)
3. For the problem above, your groupmate tried the following code. Explain why the frequencies are all 1. (Hint: This is a lesson about `group_by()`!)
```{r}
traffic_stop_freq_bad <-
  traffic_data %>%
  group_by(Race) %>%
  summarize(n = n(),
            freq = n / sum(n))

traffic_stop_freq_bad
```
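As a hint, recall that `summarize()` evaluates each expression within a group, where `n` is a single number. A sketch of one fix, which lets `mutate()` see all of the rows at once:
```{r, eval = FALSE}
traffic_data %>%
  group_by(Race) %>%
  summarize(n = n()) %>%      # summarize() drops the Race grouping here
  mutate(freq = n / sum(n))   # so sum(n) is now the grand total
```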
4. Now we want to go a step further.^[The analysis that follows is partially inspired by Eric Langowski, a Harris alum, who was also inspired to investigate by the existence of this data (You may have seen Prof. Jones retweet him at the end of the thread.)] Do outcomes differ by race? In the first code block below, I provide code so you can visualize disposition by race. "Disposition" is police jargon that means the current status or final outcome of a police interaction.
```{r}
citation_strings <- c("citation issued", "citations issued")

arrest_strings <- c("citation issued, arrested on active warrant",
                    "citation issued; arrested on warrant",
                    "arrested by cpd",
                    "arrested on warrant",
                    "arrested",
                    "arrest")

disposition_by_race <-
  traffic_data %>%
  mutate(Disposition = str_to_lower(Disposition),
         Disposition =
           case_when(Disposition %in% citation_strings ~ "citation",
                     Disposition %in% arrest_strings ~ "arrest",
                     TRUE ~ Disposition)) %>%
  count(Race, Disposition) %>%
  group_by(Race) %>%
  mutate(freq = round(n / sum(n), 3))

disposition_by_race %>%
  filter(n > 5, Disposition == "citation") %>%
  ggplot(aes(y = freq, x = Race)) +
  geom_col() +
  labs(y = "Citation Rate Once Stopped", x = "", title = "Traffic Citation Rate") +
  theme_minimal()
```
Let's break down how we got to this code. First, I ran `traffic_data %>% count(Race, Disposition)` and noticed that we have a lot of variety in how officers enter information into the system.^[Try it yourself!] I knew I could deal with some of the issues by standardizing capitalization.
a. In the console, try out `str_to_lower(...)` by replacing the `...` with different strings. The name may be clear enough, but what does `str_to_lower()` do? This function comes from the `stringr` package. Check out `?str_to_lower` to learn about some related functions.
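For example:
```{r, eval = FALSE}
str_to_lower("Citation Issued")  # returns "citation issued"
```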
After using `mutate` with `str_to_lower()`, I piped into `count()` again and looked for strings that represent the same `Disposition`. I stored terms in character vectors (e.g. `citation_strings`). The purpose is to make the `case_when()` easier to code and read. Once I got that right, I added frequencies to finalize `disposition_by_race`.
5. To make the graph, I first tried to get all the disposition data on the same plot.
```{r}
disposition_by_race %>%
  ggplot(aes(y = freq, x = Race, fill = Disposition)) +
  geom_col()
```
By default, the bar graph is stacked. Look at the resulting graph and discuss the pros and cons of this plot with your group.
6. I decided I would focus on citations only and added the `filter(n > 5, Disposition == "citation")` to the code.^[Notice that I get the data exactly how I want it using `dplyr` verbs and then try to make the graph.] What is the impact of filtering based on `n > 5`? Would you make the same choice? This question doesn't have a "right" answer. You should try different options and reflect.
7. Now, you can create a similar plot called "Search Rate" using the `Search` variable. Write code to reproduce this plot. (If you get stuck, a sketch follows the code chunk below.)
```{r}
# write your code here
```
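If you want a starting point, here is a sketch that mirrors the citation code. It assumes `Search` is recorded as yes/no text; check the actual coding with `traffic_data %>% count(Search)` and adjust the filter accordingly.
```{r, eval = FALSE}
search_by_race <-
  traffic_data %>%
  mutate(Search = str_to_lower(Search)) %>%
  count(Race, Search) %>%
  group_by(Race) %>%
  mutate(freq = round(n / sum(n), 3))

search_by_race %>%
  filter(n > 5, Search == "yes") %>%   # adjust "yes" to match the data's coding
  ggplot(aes(y = freq, x = Race)) +
  geom_col() +
  labs(y = "Search Rate Once Stopped", x = "", title = "Traffic Search Rate") +
  theme_minimal()
```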
## Extension: Revisiting world inequality data
When we explored the World Inequality Database data in lab 1, we mimicked grouped analysis by filtering the data to France and then repeating the analysis for Russia. Using `group_by()`, we can complete the analysis for each country simultaneously.
1. Read in the `wid_data`.
```{r}
wid_data_raw <-
  # You will likely have to adjust the file path
  readxl::read_xlsx("../data/world_wealth_inequality.xlsx",
                    col_names = c("country", "indicator", "percentile", "year", "value")) %>%
  separate(indicator, sep = "[\\r]?\\n", into = c("row_tag", "type", "notes"))

wid_data <- wid_data_raw %>%
  select(-row_tag) %>%
  select(-notes, everything()) %>%
  # some students had trouble because excel added "\r" to the end
  # of each string. mutate() standardizes the string across platforms.
  mutate(type = ifelse(str_detect(type, "Net personal wealth"),
                       "Net personal wealth", type)) %>%
  filter(type == "Net personal wealth")
```
2. Create a table that tells us, for each country, the number of years observed and the first and last year we have data^[i.e. not `NA`s.].^[Hint: `?summarize` lists "Useful functions" for summarizing data. Look at the "Range" or "Position" functions. If you are going to use the "Position" functions, make sure the data is sorted properly.] For example, India has 6 years of observations, with data from 1961 to 2012.
```{r}
```
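One possible approach, sketched below; dropping the `NA` values first means `min()` and `max()` only see observed years:
```{r, eval = FALSE}
wid_data %>%
  filter(!is.na(value)) %>%
  group_by(country) %>%
  summarize(n_years = n_distinct(year),
            first_year = min(year),
            last_year = max(year))
```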
3. Create a table that provides the mean and standard deviation of the share of wealth owned by the top 10 percent and top 1 percent for each country. Call the resulting tibble `mean_share_per_country`.
```{r}
```
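A sketch, assuming the top 10 and top 1 percent are coded `p90p100` and `p99p100` (verify with `wid_data %>% count(percentile)`). The bar chart below expects columns named `country`, `percentile`, and `mean_share`:
```{r, eval = FALSE}
mean_share_per_country <-
  wid_data %>%
  # these percentile codes are an assumption; check them against your data
  filter(percentile %in% c("p90p100", "p99p100")) %>%
  group_by(country, percentile) %>%
  summarize(mean_share = mean(value, na.rm = TRUE),
            sd_share = sd(value, na.rm = TRUE))
```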
a. Which country has the smallest standard deviation in share of wealth owned by the top 10 percent? Use `arrange()` to order the countries by standard deviation. Compare the order to your results above about the number of observations and time horizon.
b. If your code worked, you should be able to make this bar chart.
```{r}
mean_share_per_country %>%
  mutate(country = case_when(country == "Russian Federation" ~ "Russia",
                             country == "United Kingdom" ~ "UK",
                             country == "South Africa" ~ "S Africa",
                             TRUE ~ country)) %>%
  ggplot(aes(x = country, y = mean_share, fill = percentile)) +
  geom_col(position = "dodge2") +
  labs(y = "Mean share of national wealth", x = "", fill = "Wealth\npercentile")
```
4. **Challenge** Write code to create `mean_share_per_country_with_time`, a tibble that produces the following graph, which lets us see how the share of national wealth held by the top 10 and top 1 percent changes over time.^[Hint: use `case_when()` or several `ifelse()`s to create a new column called `time_period` that labels data as "1959 and earlier", "1960 to 1979", "1980 to 1999", or "2000 to present". Then, add `time_period` to your `group_by()` along with other relevant grouping variables.]
```{r}
mean_share_per_country_with_time <- NULL
```
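A sketch following the footnote's hint; `case_when()` checks the conditions in order, so each row gets the first label whose condition is true:
```{r, eval = FALSE}
mean_share_per_country_with_time <-
  wid_data %>%
  filter(percentile %in% c("p90p100", "p99p100")) %>%  # same percentile-code assumption as above
  mutate(time_period = case_when(year <= 1959 ~ "1959 and earlier",
                                 year <= 1979 ~ "1960 to 1979",
                                 year <= 1999 ~ "1980 to 1999",
                                 TRUE ~ "2000 to present")) %>%
  group_by(country, percentile, time_period) %>%
  summarize(mean_share = mean(value, na.rm = TRUE))
```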
```{r}
mean_share_per_country_with_time %>%
  ggplot(aes(x = country, y = mean_share, fill = percentile)) +
  geom_col(position = "dodge2") +
  facet_wrap(~time_period)
```
Want to improve this tutorial? Report any suggestions/bugs/improvements [here](mailto:[email protected])! We’re interested in learning from you how we can make this tutorial better.