---
title: 'Data Import: WA WQI'
author: "Brian High"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    fig_width: 5
    fig_height: 3
    keep_md: yes
    smaller: yes
    template: ../../../templates/ioslides_template.html
  html_document:
    template: ../../../templates/html_template.html
---
```{r header, include=FALSE}
# Filename: get_wa_wqi.Rmd
# Copyright (c) University of Washington
# License: MIT https://opensource.org/licenses/MIT (See LICENSE file.)
# Repository: https://github.com/deohs/coders
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
```
```{r get_template, include=FALSE}
# Copy template from package folder to work around Pandoc bug on Windows.
# If file is missing from template_dir, then first rendering attempt may fail.
template_pkg <- file.path(system.file(package = 'rmarkdown'),
                          'rmd', 'ioslides', 'default.html')
if (.Platform$OS.type == "windows" & length(grep(' ', template_pkg)) > 0) {
  template_dir <- file.path('..', '..', '..', 'templates')
  dir.create(template_dir, showWarnings = FALSE)
  template_loc <- file.path(template_dir, 'ioslides_template.html')
  if (!file.exists(template_loc)) {
    file.copy(template_pkg, template_loc, copy.mode = FALSE)
  }
}
```

## Data Import and Cleanup Demo

Today's example demonstrates these objectives:

* Use a public dataset freely available on the web.
* Automate data processing from start to finish (importing to reporting).
* Explore alternatives to "base" functions.
* Use data "pipelines" for improved readability.
* Use "regular expressions" to simplify data manipulation.
* Use "literate programming" to provide a reproducible report.
* Use a consistent coding [style](https://google.github.io/styleguide/Rguide.xml).
* Share code through a public [repository](https://github.com/deohs/coders) to facilitate collaboration.

We will be using the R language, but several other tools could do the job.
The code and this presentation are free to share and modify according to the
[MIT License](https://github.com/deohs/coders/blob/master/LICENSE).

## WA Water Quality Index Scores

We would like to get Washington water quality index (WQI) data and visualize it.

We will download [Annual 2013 Water Quality Index Scores](https://catalog.data.gov/dataset/annual-2013-water-quality-index-scores-4d1fd) from `data.gov`, then import, clean, and plot it on a map.

To find the data, we can search this page: https://catalog.data.gov/dataset

... for these terms: `washington state wqi stream monitoring`

The result for "Annual 2013 Water Quality Index Scores" provides a link to a CSV file.

We will use this data.wa.gov [link](https://data.wa.gov/api/views/h7j9-vgr3/rows.csv?accessType=DOWNLOAD) to import the data into R.

## Access & Use Information

The data.wa.gov page says:

* Public: This dataset is intended for public access and use.
* Non-Federal: This dataset is covered by different Terms of Use than Data.gov.
* License: No license information was provided.

The [River & stream water quality index](https://ecology.wa.gov/Research-Data/Monitoring-assessment/River-stream-monitoring/Water-quality-monitoring/River-stream-water-quality-index) page also says:
Data presented in the Freshwater Quality Index helps indicate whether
water quality is good, meeting standards to protect aquatic life, whether
it is of moderate concern or is poor and doesn't meet expectations. The
index ranges from 1 to 100; a higher number indicates better water quality.

Lastly, the authors of the `ggmap` package ask that users provide a citation:
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal,
5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
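
Once `ggmap` is installed, this citation can also be printed from within R:

```{r, eval=FALSE}
# Print the citation entry supplied by the installed ggmap package.
citation("ggmap")
```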

## Setup

Load packages with `pacman` to auto-install any missing packages.

```{r}
if (! suppressPackageStartupMessages(require(pacman))) {
  install.packages('pacman', repos = 'http://cran.us.r-project.org')
}
pacman::p_load(readr, dplyr, tidyr, ggmap)
```

We are loading:

* `readr` for `read_csv()` -- a [tidyverse](https://www.tidyverse.org/) replacement for `read.csv()`
* `dplyr` for `mutate()` -- a [tidyverse](https://www.tidyverse.org/) function for data modification
* `tidyr` for `separate()` -- a [tidyverse](https://www.tidyverse.org/) function for column splitting
* `ggmap` for `ggmap()` -- a function, similar to `ggplot()`, to create maps

## Get the Data

Import the data with `read_csv()` as an alternative to `read.csv()`.
```{r}
url <- 'https://data.wa.gov/api/views/h7j9-vgr3/rows.csv?accessType=DOWNLOAD'
wa_wqi <- read_csv(url)
```
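
Optionally, we can confirm the import with a quick look at the dimensions and column names:

```{r, eval=FALSE}
# Optional check: number of rows/columns and the column names of the tibble.
dim(wa_wqi)
names(wa_wqi)
```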
## View the Location Data
Look at some of the values of `Location 1`.
```{r}
head(wa_wqi$`Location 1`)
```
We will want to split out the longitude and latitude into their own variables.

## Clean Up the Data

Parse the location column to get latitude and longitude columns using a "pipeline".

* Use `mutate()` to create a "cleaned" `Location.1` variable.
* Use a "regular expression" with `gsub()` to remove unwanted text.
* Use `separate()` to split `Location.1` into `lon` and `lat` variables.
* Use `convert = TRUE` to auto-convert the `character` values to `numeric`.
```{r}
wa_wqi <- wa_wqi %>%
  mutate(Location.1 = gsub('POINT |[()]', '', `Location 1`)) %>%
  separate(col = Location.1, into = c('lon', 'lat'), sep = ' ', convert = TRUE)
```
Note: The `sep` parameter of `separate()` will also accept a "regular expression".
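
As an illustration (a sketch, not the approach used above), the freshly imported data could be split in one step by letting `separate()` treat spaces and parentheses as separators:

```{r, eval=FALSE}
# Sketch only: split `Location 1` directly on spaces and parentheses.
# NA in `into` drops the leading "POINT" piece; extra = 'drop' discards the
# empty piece left by the closing parenthesis.
wa_wqi %>%
  separate(col = `Location 1`, into = c(NA, 'lon', 'lat'), sep = '[ ()]+',
           convert = TRUE, extra = 'drop', remove = FALSE)
```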
## Our Regular Expression
When we used `gsub()` for cleaning the `Location 1` variable, we replaced
matching character strings with `''` in order to remove them.
```{r, eval=FALSE}
gsub(pattern = 'POINT |[()]', replacement = '', x = `Location 1`)
```
The **pattern match** was made using a [regular expression](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).

The expression was: `POINT |[()]`

* The `POINT ` is a literal string (including a literal space after it).
* The `|` (vertical bar) means "or" in this context.
* The `[` and `]` (square brackets) define a character set in this context.
* The `(` and `)` (parentheses) are the characters in that set.

Translating the whole expression, we get:

* Either `POINT ` or `(` or `)`.

So, any character strings that matched this pattern were removed.
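
To see the effect on a single (hypothetical) value in the same format as `Location 1`:

```{r}
# Example value in the same "POINT (lon lat)" format as `Location 1`.
x <- "POINT (-122.90417 46.97427)"
gsub('POINT |[()]', '', x)
```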
## View the Cleaned Data
View some of the location data.
```{r}
options(pillar.sigfig = 7)
wa_wqi %>% select(`Location 1`, lon, lat) %>% head()
```
## Create a Map
Define a bounding box.
```{r}
height <- max(wa_wqi$lat) - min(wa_wqi$lat)
width <- max(wa_wqi$lon) - min(wa_wqi$lon)
bbox <- c(
  min(wa_wqi$lon) - 0.15 * width,
  min(wa_wqi$lat) - 0.15 * height,
  max(wa_wqi$lon) + 0.15 * width,
  max(wa_wqi$lat) + 0.15 * height
)
names(bbox) <- c('left', 'bottom', 'right', 'top')
```
Make a map base layer of "Stamen" tiles.
```{r}
map <- suppressMessages(
  get_stamenmap(bbox, zoom = 8, maptype = "toner-background"))
```
Make the map image from the tiles using `ggmap`.
```{r}
g <- ggmap(map, darken = c(0.3, "white")) + theme_void()
```
## Add to the Map
Add points, a legend, and a title to the map.
```{r}
g <- g + geom_point(aes(x = lon, y = lat, fill = `OVERALLWQI 2013`),
                    data = wa_wqi, pch = 21, size = 3) +
  scale_fill_gradient(name = "WQI", low = "red", high = "green") +
  ggtitle(label = paste("Washington State",
                        "River and Stream Water Quality Index (WQI)", sep = " "),
          subtitle = paste("Source: River and Stream Monitoring Program,",
                           "WA State Department of Ecology (2013)")) +
  theme(legend.position = c(.98, .02), legend.justification = c(1, 0))
```
## View the Map
```{r view_map, fig.height=5, fig.width=7, echo=FALSE}
g
```
## Exercises
A. We used the `separate()` function from the `tidyr` package to parse the
location variable to get `lon` and `lat`. Would this be easier with base-R
functions? Which ones? Give this a try and compare results. You could also
modify the example to use base-R functions in place of all `tidyverse`
functions used here, such as `read_csv()` and `mutate()`. Which approach is
easier to code? Which is easier to read?
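
One possible starting point for Exercise A, using only base-R functions (a sketch, not the only answer; `wa_wqi_base`, `loc`, and `coords` are just names chosen here):

```{r, eval=FALSE}
# Sketch: base-R import and coordinate parsing (one of several approaches).
wa_wqi_base <- read.csv(url, check.names = FALSE, stringsAsFactors = FALSE)
loc <- gsub('POINT |[()]', '', wa_wqi_base[['Location 1']])
coords <- do.call(rbind, strsplit(loc, ' ', fixed = TRUE))
wa_wqi_base$lon <- as.numeric(coords[, 1])
wa_wqi_base$lat <- as.numeric(coords[, 2])
```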
B. What alternatives to the regular expression `POINT |[()]` could we use?
C. Consider downloading a different dataset to import, clean, and plot.
For example, search [data.gov](https://catalog.data.gov/dataset) for:
`WQI Parameter Scores 1994-2013`.
You will find a dataset with that same title. If you download it, you will see
that it also contains some WQI data for 2014, despite the year range listed in
the filename. What else is unusual about this dataset?
Modify the above code to work with this dataset to produce a WQI map for
Washington using the 2013 data. You can use the `filter()` function in `dplyr`
to filter the dataset by year. You can incorporate this into your "pipeline".
How does the resulting map compare with the one produced in this demo?
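
A hypothetical sketch of that filtering step, assuming the downloaded data has been read into `wqi_scores` and has a year column named `Year` (check the actual column names first):

```{r, eval=FALSE}
# Sketch only: keep the 2013 rows before cleaning and mapping.
# `wqi_scores` and `Year` are assumed names; adjust to the real dataset.
wqi_2013 <- wqi_scores %>% filter(Year == 2013)
```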