-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathApplied_SQL.Rpres
396 lines (297 loc) · 14.4 KB
/
Applied_SQL.Rpres
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
Data Management
========================================================
For Scientific Research
[//]: # (author: Brian High, UW DEOHS)
[//]: # (date: 2014-05-23)
[//]: # (license: CC0 1.0 Universal,linked-content/images)
[//]: # (note: License does not apply to external content such as quoted material, linked web pages, images, or videos. These are licensed separately by their authors, publishers or other copyright holders. See attribution links for details.)
[//]: # (note: Any of the trademarks, service marks, collective marks, design rights, personality rights, or similar rights that are mentioned, used, or cited in the presentations and wiki of the Data Management For Scientific Research workshop/course are the property of their respective owners.)
[//]: # (homepage: https://github.com/brianhigh/data-workshop)
<p style="width: 600px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://www.stanza.co.uk/body/stanza_+BODY-copy.jpg" alt="stanza body copy" style="padding-bottom:0.5em;" />Photo: © <a href="http://www.stanza.co.uk/body/index.html">Stanza</a>. Used with permission.</p>
Session 7: Applied SQL
========================================================
Today we will be applying our data cleanup and SQL skills to our current research projects.
<p style="width: 500px; float: left; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/SQL_ANATOMY_wiki.svg/500px-SQL_ANATOMY_wiki.svg.png" alt="SQL anatomy" style="padding-bottom:0.5em;" />Image: <a href="http://commons.wikimedia.org/wiki/File:SQL_ANATOMY_wiki.svg">SqlPac / Ferdna / Wikimedia</a></p>
Example: Estimated Pesticide Use
========================================================
We want to download some files from the USGS listing estimated pesticide use in the United States.
http://pubs.usgs.gov/ds/752/
From these files, we want to ...
* Produce statewide yearly pesticide totals
* Compare "high" and "low" [chlorpyrifos](http://en.wikipedia.org/wiki/Chlorpyrifos) (CPF) use
---
<p style="width: 320px; float: left; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Chlorpyrifos-3D-vdW.png/320px-Chlorpyrifos-3D-vdW.png" alt="CPF molecule" style="padding-bottom:0.5em;" />Image: <a href="http://en.wikipedia.org/wiki/File:Chlorpyrifos-3D-vdW.png"> Benjah-bmm27 / Wikimedia</a></p>
Download the Files
========================================================
While we could manually download each text file from the website, on by one, we are going to automate this process.
Here is one way to grab all of those text files with one Bash shell command:
```
wget -q --no-parent \
-e robots=off \
--recursive \
--accept=txt \
--no-directories \
http://pubs.usgs.gov/ds/752/
```
This can be pasted into the Bash shell and run as one multiline command.
You will download 14 text files totalling about 266 MB in size.
Combine the Files
========================================================
We can use a Bash command loop to combine the "high" pesticide use and "low" use files.
"High" and "low" represent the two different [EPest Methods](http://water.usgs.gov/nawqa/pnsp/usage/maps/about.php) that we want to compare.
In our ```for``` loop we will ```use tail -n +2``` to skip the first line of each file.
```
for i in high low; do \
tail -q -n +2 \
EPest.${i}.*.table*.txt > \
EPest.${i}.county.estimates.txt
done
```
Now we have a file for "high" and another file for "low".
Create Database Tables
========================================================
Run these commands in MySQL:
```
CREATE TABLE IF NOT EXISTS `epest_high` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`compound` VARCHAR(45) NULL,
`year` YEAR(4) NULL,
`state_fips_code` SMALLINT NULL,
`county_fips_code` SMALLINT NULL,
`high_use_kg` NUMERIC(8,1) NULL,
PRIMARY KEY (`id`) )
ENGINE = InnoDB;
CREATE TABLE IF NOT EXISTS `epest_low` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`compound` VARCHAR(45) NULL,
`year` YEAR(4) NULL,
`state_fips_code` SMALLINT NULL,
`county_fips_code` SMALLINT NULL,
`low_use_kg` NUMERIC(8,1) NULL,
PRIMARY KEY (`id`) )
ENGINE = InnoDB;
```
Import the Data
========================================================
Copy the two combined files to /Data/high/ and import them.
Bash:
```
cp EPest.*.county.estimates.txt /Data/high/
```
MySQL:
```
LOAD DATA INFILE
'/Data/high/EPest.high.county.estimates.txt'
INTO TABLE epest_high
(compound, year, state_fips_code, county_fips_code, high_use_kg);
LOAD DATA INFILE
'/Data/high/EPest.low.county.estimates.txt'
INTO TABLE epest_low
(compound, year, state_fips_code, county_fips_code, low_use_kg);
```
Add Indexes to Tables for Performance
========================================================
This will speed up our later queries.
MySQL:
```
ALTER TABLE epest_high
ADD INDEX year_state_county
(`year`, `state_fips_code`, `county_fips_code`);
ALTER TABLE epest_low
ADD INDEX year_state_county
(`year`, `state_fips_code`, `county_fips_code`);
```
Find Annual "EPest Low" Use in WA
========================================================
```
SELECT year,
SUM(low_use_kg) AS `wa_kg_low_sum`,
AVG(low_use_kg) AS `wa_kg_low_mean`,
COUNT(low_use_kg) AS `wa_kg_low_cnt`
FROM epest_low
WHERE state_fips_code = 53
GROUP BY year
ORDER BY year;
```
![epest low wa by year view](images/epest_low_wa_by_year_view.png)
Save that Query as a View
========================================================
```
CREATE VIEW epest_low_wa_by_year_view AS
SELECT year,
SUM(low_use_kg) AS `wa_kg_low_sum`,
AVG(low_use_kg) AS `wa_kg_low_mean`,
COUNT(low_use_kg) AS `wa_kg_low_cnt`
FROM epest_low
WHERE state_fips_code = 53
GROUP BY year
ORDER BY year;
SELECT * FROM epest_low_wa_by_year_view;
```
![epest low wa by year view](images/epest_low_wa_by_year_view.png)
Create a View for EPest Low WA CPF Use
========================================================
```
CREATE VIEW epest_cpf_low_wa_by_year_view AS
SELECT year,
SUM(low_use_kg) AS `wa_kg_low_sum`,
AVG(low_use_kg) AS `wa_kg_low_mean`,
COUNT(low_use_kg) AS `wa_kg_low_cnt`
FROM epest_low
WHERE compound = 'CHLORPYRIFOS'
AND state_fips_code = 53
GROUP BY year
ORDER BY year;
SELECT * FROM epest_cpf_low_wa_by_year_view;
```
![epest cpf low wa by year view](images/epest_cpf_low_wa_by_year_view.png)
Analyze EPest Low CPF use in Yakima Co.
=========================================================
```
SELECT year,
SUM(low_use_kg) AS `yakima_kg_low_sum`,
AVG(low_use_kg) AS `yakima_kg_low_mean`,
COUNT(low_use_kg) AS `yakima_kg_low_cnt`
FROM epest_low
WHERE state_fips_code = 53
AND county_fips_code = 77
GROUP BY year
ORDER BY year;
```
![epest low cpf yakima](images/epest_low_cpf_yakima.png)
Compare High and Low for Yakima Co.
=========================================================
```
CREATE VIEW epest_cpf_yakima_by_year_view AS
SELECT l.year AS year,
l.low_use_kg AS low_use_kg,
h.high_use_kg AS high_use_kg
FROM epest_low AS l
INNER JOIN epest_high AS h
ON (l.compound = h.compound
AND l.year = h.year
AND l.state_fips_code = h.state_fips_code
AND l.county_fips_code = h.county_fips_code)
WHERE l.state_fips_code = 53
AND l.county_fips_code = 77
AND l.compound = 'CHLORPYRIFOS'
ORDER BY year;
SELECT * FROM epest_cpf_yakima_by_year_view;
```
![epest cpf yakima by year view](images/epest_cpf_yakima_by_year_view.png)
Compare Yakima CPF Use to WA
=========================================================
```
SELECT y.year AS year,
y.low_use_kg AS yakima_kg,
t.wa_kg_low_sum AS wa_kg,
CONCAT(ROUND(( y.low_use_kg/t.wa_kg_low_sum * 100 ),2),'%') AS `perc`
FROM epest_cpf_yakima_by_year_view y
JOIN epest_cpf_low_wa_by_year_view t
ON t.year = y.year
ORDER BY year;
```
![low cpf yakima wa](images/epest_low_cpf_yakima_percent_wa.png)
Hands-on Group Exercise
========================================================
We will work some more with this database and produce some plots.
![r ggplot](images/rggplot.png)
Discussion
========================================================
We will discuss your SQL queries and plots.
<p style="width: 275px; float: left; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/e/eb/User_journey_discussion.png" alt="discussion" style="padding-bottom:0.5em;" />Graphic: <a href="http://upload.wikimedia.org/wikipedia/commons/e/eb/User_journey_discussion.png">Jagbirlehl / Wikimedia</a></p>
Connecting from R
========================================================
Here are two ways to connect to the MySQL database from R.
Using RMySQL:
```{sql}
install.packages("RMySQL")
library("RMySQL")
drv <- dbDriver("MySQL")
myconn <- dbConnect(drv, host="plasmid",
dbname="dataman", user="USERNAME",
password="PASSWORD")
```
Using RODBC and a DSN:
```{sql}
install.packages("RODBC")
library(RODBC)
myconn <- odbcConnect("plasmid-dataman")
```
R Script Using RMySQL and ggplot2
=======================================================
![r ggplot script](images/rggplot_code.png)
As you can see, one downside of using a DSN-less connection is that the username and password is in the script.
R Plot Using ggplot2
=======================================================
![r ggplot](images/rggplot.png)
```
ggplot(lowhighdat, aes(x=YEAR, y=KG, colour=EPest, group=EPest)) +
geom_line() + ggtitle(title)
```
---
![epest cpf yakima by year view complete](images/epest_cpf_yakima_by_year_view_complete.png)
R Script Using RODBC and ggplot2
=======================================================
![r ggplot script in Rstudio Windows](images/rggplot_code_Rstudio_win.png)
In the Coming Sessions...
========================================================
* Project Management
* Version Control
* Data Security
* Systems Administration
Action Items (videos, readings, and tasks)
========================================================
<table>
<tr border=0>
<td width="128" valign="middle"><img width="128" height="128" alt="watching" src="images/watching.jpg">
</td>
<td valign="middle">
<ul>
<li><a href="https://www.youtube.com/watch?v=aWY3nTVR8vk">Redmine A Guided Tour</a> and Graphical Git Clients (<a href="https://www.youtube.com/watch?v=f-Wy2bob2vQ">Windows</a> and <a href="https://www.youtube.com/watch?v=u12zHGRfiKU">Mac</a>)
<li>Beginning Git and GitHub: <a href="https://www.youtube.com/watch?v=9WC4P8t7T8w">[1]</a>, <a href="https://www.youtube.com/watch?v=DVDLoe_2MBc">[2]</a>, and <a href="https://www.youtube.com/watch?v=LXoWxrTdXkM">[3]</a>
<li>Branching and Merging: <a href="https://www.youtube.com/watch?v=QQhD5VveZhw">[1]</a> and <a href="https://www.youtube.com/watch?v=uR-9NGrpU-c">[2]</a> - and Forking <a href="https://www.youtube.com/watch?v=1S_526C8Gkw">[3]</a> and Pulling <a href="https://www.youtube.com/watch?v=NnBb9NTk-To">[4]</a>
<li><a href="https://www.youtube.com/channel/UCP7RrmoueENv9TZts3HXXtw">GitHub Training & Guides</a> (watch a few -- <a href="https://www.youtube.com/watch?v=FyfwLX4HAxM&list=PLg7s6cbtAD15G8lNyoaYDuKZSKyJrgwB-">like this playlist</a>)
</ul>
</td>
</tr>
<tr>
<td width="128" valign="middle"><img width="128" height="128" alt="readings" src="images/reading.jpg">
</td>
<td valign="middle">
<ul>
<li><A href="http://practicalcomputing.org/about">PCfB</a> textbook: Chapter 20. Working on Remote Computers
<li>Skim: <a href="http://seattle.bibliocommons.com/item/show/2897906030_sams_teach_yourself_sql_in_10_minutes,_fourth_edition">SAMS Teach Yourself SQL on 10 Minutes</a>: Ch. 9-14
<li>Optional- Skim: <a href="http://www.amazon.com/dp/0123747309">RDDaI3CE</a> textbook: Chapter 17
<li>Optional- Skim: <a href="http://www.amazon.com/dp/0123756979">SQLCE3</a> textbook: Chapters 5-8
</ul>
</td>
</tr>
<tr>
<td width="128" valign="middle"><img width="128" height="128" alt="tasks" src="images/tasks.jpg"></td>
<td valign="middle">
<ul>
<li>Clean up any external data files you may have.
<li>Import your external data into your database.
<li>Check your database tables and resolve any errors.
<li>Query your database for to answer research questions.
</ul>
</td>
</tr>
</table>
See Also
========================================================
To prepare for next week...
* [Project Management](http://en.wikipedia.org/wiki/Project_management), [Version Control](http://en.wikipedia.org/wiki/Version_control), [diff](http://en.wikipedia.org/wiki/Diff), and [patch](http://en.wikipedia.org/wiki/Patch_%28Unix%29)
* [Mastering Redmine](http://seattle.bibliocommons.com/item/show/2920711030_mastering_redmine) eBook
* [Version Control With Git, 2nd Edition](http://seattle.bibliocommons.com/item/show/2847665030_version_control_with_git,_2nd_edition) eBook
* [Git - Version Control for Everyone](http://seattle.bibliocommons.com/item/show/2937299030_git_-_version_control_for_everyone) eBook
* [Git Magic](http://www-cs-students.stanford.edu/~blynn/gitmagic/) eBook
* [Getting Started with GitHub + Git](https://www.youtube.com/watch?v=DVDLoe_2MBc) video
* Project Pages: [Redmine](http://www.redmine.org/), [Git](http://git-scm.com/), [GitHub](https://github.com/)
* [Implementing Redmine for Secure Project Management](http://www.sans.org/reading-room/whitepapers/testing/implementing-redmine-secure-project-management-34122)
* [Code School - Try Git](https://try.github.io/levels/1/challenges/1) - 15 minute online Git tutorial
* [6 Useful Graphical Git Client for Linux](http://www.maketecheasier.com/6-useful-graphical-git-client-for-linux/) Guide
Questions and Comments
========================================================
<p style="width: 380px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/23/Happy_Question.svg/380px-Happy_Question.svg.png" alt="question" style="padding-bottom:0.5em;" />Image: <a href=http://commons.wikimedia.org/wiki/File:Happy_Question.svg">© Nevit Dilmen</a> / Wikimedia</p>