---
title: 'Data Import: WA WQI'
author: "Brian High"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    fig_width: 5
    fig_height: 3
    keep_md: yes
    smaller: yes
    template: ../../../templates/ioslides_template.html
  html_document:
    template: ../../../templates/html_template.html
editor_options:
  chunk_output_type: console
---
```{r header, include=FALSE}
# Filename: get_wa_wqi_alt.Rmd
# Copyright (c) University of Washington
# License: MIT https://opensource.org/licenses/MIT (See LICENSE file.)
# Repository: https://github.com/deohs/coders
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r get_template, include=FALSE}
# Copy template from package folder to work around Pandoc bug on Windows.
# If file is missing from template_dir, then first rendering attempt may fail.
template_pkg <- file.path(system.file(package = 'rmarkdown'),
'rmd', 'ioslides', 'default.html')
if (.Platform$OS.type == "windows" && grepl(' ', template_pkg)) {
template_dir <- file.path('..', '..', '..', 'templates')
dir.create(template_dir, showWarnings = FALSE)
template_loc <- file.path(template_dir, 'ioslides_template.html')
if (!file.exists(template_loc)) {
file.copy(template_pkg, template_loc, copy.mode = FALSE)
}
}
```
## Data Import and Cleanup Demo
Today's example demonstrates these objectives:
* Use a public dataset freely available on the web.
* Automate data processing from start to finish (importing to reporting).
* Explore alternatives to "base" functions.
* Use data "pipelines" for improved readability.
* Use "regular expressions" to simplify data manipulation.
* Use "literate programming" to provide a reproducible report.
* Use a consistent coding [style](https://google.github.io/styleguide/Rguide.xml).
* Share code through a public [repository](https://github.com/deohs/coders) to
facilitate collaboration.
We will be using the R language, but several other tools could do the job.
The code and this presentation are free to share and modify according to the
[MIT License](https://github.com/deohs/coders/blob/master/LICENSE).
## WA Water Quality Index Scores
We would like to get Washington water quality index (WQI) data and visualize it.
We will download [WQI Parameter Scores 1994-2013](https://catalog.data.gov/dataset/wqi-parameter-scores-1994-2013-b0941) from `data.gov`,
then import it, clean it, and plot it on a map.
To find the data, we can search this page: https://catalog.data.gov/dataset
... for these terms: `WQI Parameter Scores 1994-2013`
The result for "WQI Parameter Scores 1994-2013" provides a link to a
CSV file.
We will use this data.wa.gov [link](https://data.wa.gov/api/views/dn4d-x42e/rows.csv?accessType=DOWNLOAD) to import the data into R.
## Access & Use Information
The data.wa.gov page says:
* Public: This dataset is intended for public access and use.
* Non-Federal: This dataset is covered by different Terms of Use than Data.gov.
* License: No license information was provided.
The data page also says:
* The WQI ranges from 1 (poor quality) to 100 (good quality).
Lastly, the authors of the `ggmap` package ask that users provide a citation:
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal,
5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
## Setup
Load packages with `pacman` to auto-install any missing packages.
```{r}
if (! suppressPackageStartupMessages(require(pacman))) {
install.packages('pacman', repos = 'http://cran.us.r-project.org')
}
pacman::p_load(readr, dplyr, tidyr, ggmap)
```
We are loading:
* `readr` for `read_csv()` -- a [tidyverse](https://www.tidyverse.org/) replacement for `read.csv()`
* `dplyr` for several functions -- a [tidyverse](https://www.tidyverse.org/) package for data manipulation
* `tidyr` for `separate()` -- a [tidyverse](https://www.tidyverse.org/) function for column splitting
* `ggmap` for `ggmap()` -- a function, similar to `ggplot()`, to create maps
## Get the Data
Import the data with `read_csv()` as an alternative to `read.csv()`.
```{r}
url <- 'https://data.wa.gov/api/views/dn4d-x42e/rows.csv?accessType=DOWNLOAD'
wa_wqi <- read_csv(url)
```
## View the Location Data
Look at some of the values of `Location 1`.
```{r}
head(wa_wqi$`Location 1`)
```
We will want to split out the longitude and latitude into their own variables.
## Clean Up the Data
Parse the location column into latitude and longitude columns using a "pipeline".
* Use `mutate()` to create a "cleaned" `Location.1` variable.
* Use a "regular expression" with `gsub()` to remove unwanted text.
* Use `separate()` to split `Location.1` into `lon` and `lat` variables.
* Use `convert = TRUE` to auto-convert the `character` values to `numeric`.
```{r}
wa_wqi <- wa_wqi %>%
mutate(Location.1 = gsub('POINT |[()]', '', `Location 1`)) %>%
separate(col = Location.1, into = c('lon', 'lat'), sep = ' ', convert = TRUE)
```
Note: The `sep` parameter of `separate()` will also accept a "regular expression".
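
To illustrate the note above, here is a minimal sketch of `separate()` with a regular expression as `sep`. The `toy` data frame and its values are made up for this example and are not taken from the WQI dataset.

```r
library(tidyr)

# Made-up coordinates with inconsistent delimiters (assumption for illustration).
toy <- data.frame(loc = c('-122.30,47.60', '-117.43  47.66'),
                  stringsAsFactors = FALSE)

# The pattern '[ ,]+' matches a run of one or more spaces or commas.
toy_split <- separate(toy, col = loc, into = c('lon', 'lat'),
                      sep = '[ ,]+', convert = TRUE)
toy_split
```

Because `convert = TRUE`, the resulting `lon` and `lat` columns should be `numeric` rather than `character`.
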
## Our Regular Expression
When we used `gsub()` for cleaning the `Location 1` variable, we replaced
matching character strings with `''` in order to remove them.
```{r, eval=FALSE}
gsub(pattern = 'POINT |[()]', replacement = '', x = `Location 1`)
```
The **pattern match** was made using a [regular expression](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).
The expression was: `POINT |[()]`
* The `POINT ` is a literal string (including a literal space after it).
* The `|` (vertical bar) symbol means "or" in this context.
* The `[` and `]` (square brackets) define a character set in this context.
* The `(` and `)` (parentheses) are the characters in that set.
Translating the whole expression we get:
* Either `POINT ` or `(` or `)`.
So, any character strings which matched these patterns were removed.
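
As a quick check of this pattern, the sketch below applies it to two made-up strings. The values in `toy_loc` are assumptions that mimic the `POINT (lon lat)` form and are not copied from the dataset. It also shows why `gsub()` is used rather than `sub()`, which replaces only the first match.

```r
# Made-up values in the 'POINT (lon lat)' form (assumption for illustration).
toy_loc <- c('POINT (-122.3 47.6)', 'POINT (-117.4 47.7)')

# gsub() removes every match: 'POINT ', '(' and ')'.
cleaned <- gsub(pattern = 'POINT |[()]', replacement = '', x = toy_loc)
cleaned

# sub() removes only the first match ('POINT '), so the parentheses remain.
first_only <- sub(pattern = 'POINT |[()]', replacement = '', x = toy_loc)
first_only
```
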
## View the Cleaned Data
View some of the location data. Show the first few rows.
```{r}
options(pillar.sigfig = 7)
wa_wqi %>% select(`Location 1`, lon, lat) %>% head()
```
## Check Ranges: Summary
```{r}
wa_wqi %>% select(Year, `Location 1`, lon, lat) %>% summary()
```
It looks like the Min. `Year` is a longitude, the Max. `lon` is a latitude, and
the Max. `lat` is a year.
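
Besides reading a summary, we could flag suspect rows directly. The sketch below uses a made-up `toy` table, and the range limits are rough assumptions chosen for this example only, not official bounds.

```r
library(dplyr)

# Made-up rows: station 'B' has its values shifted into the wrong columns.
toy <- tibble(
  Station = c('A', 'B'),
  Year    = c(2013, -122.5),
  lon     = c(-120.1, 47.2),
  lat     = c(46.9, 2013)
)

# Keep rows where any value falls outside its assumed plausible range.
suspect <- toy %>%
  filter(!between(Year, 1994, 2013) |
           !between(lon, -125, -116) |
           !between(lat, 45, 50))
suspect
```
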
## Check Ranges: Count
Count the number of rows for each unique `lon` per `Station` where `Year` is less than zero.
```{r}
wa_wqi %>% filter(Year < 0) %>% group_by(Station, Year, lon) %>%
summarize(Count = n()) %>% knitr::kable()
```
There are 4 Stations with 19 rows (Years) per location that have values in
the wrong columns.
## Investigate Errors: List Stations
Get a list of the Stations with values in the wrong columns.
```{r}
error_sta <- wa_wqi %>% filter(Year < 0) %>% select(Station) %>% unique() %>%
pull(Station)
error_sta
```
## Investigate Errors: List Locations
List the affected stations and their locations in a table. Show the first
few rows.
```{r}
wa_wqi %>% filter(Station %in% error_sta) %>%
select(Year, `Location 1`, lon, lat) %>% head() %>% knitr::kable()
```
## Fix Errors
For these Stations, it appears that we need to move `lat` values to `Year`,
`Year` to `lon`, and `lon` to `lat`.
```{r}
wa_wqi[wa_wqi$Year < 0, c('Year', 'lon', 'lat')] <-
wa_wqi[wa_wqi$Year < 0, c('lat', 'Year', 'lon')]
```
We could also have swapped these by column index, but doing so by name is
clearer and less error-prone.
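
To see how this kind of assignment behaves, here is a minimal sketch on a made-up data frame (the values in `toy` are invented for illustration). The right-hand side is evaluated before anything is assigned, and its columns are matched to the left-hand columns by position, so all three values move in one step.

```r
# Made-up rows: the second row has its values in the wrong columns.
toy <- data.frame(
  Year = c(2013, -122.5),
  lon  = c(-120.1, 47.2),
  lat  = c(46.9, 2013)
)

# Select the affected rows, then move lat -> Year, Year -> lon, lon -> lat.
bad <- toy$Year < 0
toy[bad, c('Year', 'lon', 'lat')] <- toy[bad, c('lat', 'Year', 'lon')]
toy
```
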
## Verify Changes: List Locations
List the affected stations and their locations in a table. Show the first few
rows.
```{r}
wa_wqi %>% filter(Station %in% error_sta) %>%
select(Year, `Location 1`, lon, lat) %>% head() %>% knitr::kable()
```
## Verify Changes: Summary
Show a summary of the ranges of values for the affected variables.
```{r}
wa_wqi %>% select(Year, `Location 1`, lon, lat) %>% summary()
```
## Create a Map
Define a bounding box.
```{r}
height <- max(wa_wqi$lat) - min(wa_wqi$lat)
width <- max(wa_wqi$lon) - min(wa_wqi$lon)
bbox <- c(
min(wa_wqi$lon) - 0.15 * width,
min(wa_wqi$lat) - 0.15 * height,
max(wa_wqi$lon) + 0.15 * width,
max(wa_wqi$lat) + 0.15 * height
)
names(bbox) <- c('left', 'bottom', 'right', 'top')
```
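
The same padding arithmetic can be wrapped in a small helper. This is only a sketch: `padded_bbox()` is a hypothetical function written for this example, not part of `ggmap`, and the coordinates passed to it are made up.

```r
# Hypothetical helper: pad the coordinate ranges by a fraction of their extent.
padded_bbox <- function(lon, lat, pad = 0.15) {
  width  <- diff(range(lon))
  height <- diff(range(lat))
  c(left   = min(lon) - pad * width,
    bottom = min(lat) - pad * height,
    right  = max(lon) + pad * width,
    top    = max(lat) + pad * height)
}

# Made-up extent: 4 degrees wide and 2 degrees tall.
toy_bbox <- padded_bbox(lon = c(-124, -120), lat = c(46, 48))
toy_bbox
```

With a 15% pad, the box grows by 0.6 degrees on the left and right and by 0.3 degrees on the top and bottom.
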
Make a map base layer of "Stamen" tiles.
```{r}
map <- suppressMessages(
get_stamenmap(bbox, zoom = 8, maptype = "toner-background"))
```
Make the map image from the tiles using `ggmap`.
```{r}
g <- ggmap(map, darken = c(0.3, "white")) + theme_void()
```
## Add to the Map
Add points, a legend, and a title to the map.
```{r}
wa_wqi_2013 <- wa_wqi %>% filter(Year == 2013)
g <- g + geom_point(aes(x = lon, y = lat, fill = `Overall WQI`),
data = wa_wqi_2013, pch = 21, size = 3) +
scale_fill_gradient(name = "WQI", low = "red", high = "green") +
ggtitle(label = paste("Washington State",
"River and Stream Water Quality Index (WQI)", sep = " "),
subtitle = paste("Source: River and Stream Monitoring Program,",
"WA State Department of Ecology (2013)")) +
theme(legend.position = c(.98, .02), legend.justification = c(1, 0))
```
## View the Map
```{r view_map, fig.height=5, fig.width=7, echo=FALSE}
g
```
## Exercises
A. Look more carefully at this dataset and see if there are other errors which
you could fix. For example, we did not examine all of the variables. Are there
other variables which have unexpected values? How would you fix those?
B. What alternatives to the regular expression `POINT |[()]` could we use?
C. Instead of reading the file directly from the web, how could we download
the file first? How could we code so we only download the file if we do not
already have a local copy of it?
D. Consider downloading a different dataset, then import, clean, and plot the data.
For example, compare your results with the [Annual 2013 Water Quality Index Scores](https://catalog.data.gov/dataset/annual-2013-water-quality-index-scores-4d1fd)
dataset from `data.gov`. Do you see any differences? How would you account for them? Check for incorrect values as we did earlier in this tutorial. Try to
fix any obvious errors. What errors did you find and how did you find and fix
them? Or were there no errors? How can you be sure?