For Scientific Research
Photo: © Stanza. Used with permission.
Today we will be applying our data cleanup and SQL skills to our current research projects.
Image: SqlPac / Ferdna / Wikimedia
We want to download a set of files from the USGS that list estimated pesticide use in the United States.
From these files, we want to ...
- Produce statewide yearly pesticide totals
- Compare "high" and "low" chlorpyrifos (CPF) use
Image: Benjah-bmm27 / Wikimedia
While we could manually download each text file from the website, one by one, we are going to automate this process.
Here is one way to grab all of those text files with one Bash shell command:
wget -q --no-parent \
-e robots=off \
--recursive \
--accept=txt \
--no-directories \
http://pubs.usgs.gov/ds/752/
This can be pasted into the Bash shell and run as one multiline command.
You will download 14 text files totaling about 266 MB.
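A quick sanity check from the same directory, assuming nothing else matching *.txt is present; you should see 14 files and a combined size of roughly 266 MB:
# Count the downloaded text files and report their combined size
ls -1 *.txt | wc -l
du -ch *.txt | tail -n 1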
We can use a Bash command loop to combine the "high" pesticide use and "low" use files.
"High" and "low" represent the two different EPest Methods that we want to compare.
In our for loop we will use tail -n +2 to skip the first line of each file.
for i in high low; do \
tail -q -n +2 \
EPest.${i}.*.table*.txt > \
EPest.${i}.county.estimates.txt
done
Now we have a file for "high" and another file for "low".
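If you want to confirm that the header rows were dropped, compare line counts; the same check works for the "low" files (the file names follow the globs used in the loop above):
# The combined file should hold every line of the source files
# except their one header line each
wc -l EPest.high.*.table*.txt
wc -l EPest.high.county.estimates.txt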
Run these commands in MySQL:
CREATE TABLE IF NOT EXISTS `epest_high` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`compound` VARCHAR(45) NULL,
`year` YEAR(4) NULL,
`state_fips_code` SMALLINT NULL,
`county_fips_code` SMALLINT NULL,
`high_use_kg` NUMERIC(8,1) NULL,
PRIMARY KEY (`id`) )
ENGINE = InnoDB;
CREATE TABLE IF NOT EXISTS `epest_low` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`compound` VARCHAR(45) NULL,
`year` YEAR(4) NULL,
`state_fips_code` SMALLINT NULL,
`county_fips_code` SMALLINT NULL,
`low_use_kg` NUMERIC(8,1) NULL,
PRIMARY KEY (`id`) )
ENGINE = InnoDB;
Copy the two combined files to /Data/high/ and import them.
Bash:
cp EPest.*.county.estimates.txt /Data/high/
MySQL:
LOAD DATA INFILE
'/Data/high/EPest.high.county.estimates.txt'
INTO TABLE epest_high
(compound, year, state_fips_code, county_fips_code, high_use_kg);
LOAD DATA INFILE
'/Data/high/EPest.low.county.estimates.txt'
INTO TABLE epest_low
(compound, year, state_fips_code, county_fips_code, low_use_kg);
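Before going further, it is worth a quick check that the rows actually loaded; the counts should match the number of data lines in the combined text files:
-- Row counts for each table
SELECT COUNT(*) FROM epest_high;
SELECT COUNT(*) FROM epest_low;
-- Spot-check a few imported rows
SELECT * FROM epest_low LIMIT 5;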
Add an index to each table to speed up our later queries; the EXPLAIN example below shows how to check whether an index is actually used.
MySQL:
ALTER TABLE epest_high
ADD INDEX year_state_county
(`year`, `state_fips_code`, `county_fips_code`);
ALTER TABLE epest_low
ADD INDEX year_state_county
(`year`, `state_fips_code`, `county_fips_code`);
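One way to see whether MySQL will actually use the new index for a given query is to prefix the query with EXPLAIN; the plan it reports depends on your MySQL version and your data, so treat this as a quick check rather than a guarantee:
-- The `key` column of the EXPLAIN output shows which index, if any,
-- the optimizer chooses for this statewide summary
EXPLAIN
SELECT year, SUM(low_use_kg)
FROM epest_low
WHERE state_fips_code = 53
GROUP BY year;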
Statewide yearly totals for Washington (state FIPS code 53), using the low estimates:
SELECT year,
SUM(low_use_kg) AS `wa_kg_low_sum`,
AVG(low_use_kg) AS `wa_kg_low_mean`,
COUNT(low_use_kg) AS `wa_kg_low_cnt`
FROM epest_low
WHERE state_fips_code = 53
GROUP BY year
ORDER BY year;
Save this query as a view so we can reuse it:
CREATE VIEW epest_low_wa_by_year_view AS
SELECT year,
SUM(low_use_kg) AS `wa_kg_low_sum`,
AVG(low_use_kg) AS `wa_kg_low_mean`,
COUNT(low_use_kg) AS `wa_kg_low_cnt`
FROM epest_low
WHERE state_fips_code = 53
GROUP BY year
ORDER BY year;
SELECT * FROM epest_low_wa_by_year_view;
Create a second view restricted to chlorpyrifos (CPF):
CREATE VIEW epest_cpf_low_wa_by_year_view AS
SELECT year,
SUM(low_use_kg) AS `wa_kg_low_sum`,
AVG(low_use_kg) AS `wa_kg_low_mean`,
COUNT(low_use_kg) AS `wa_kg_low_cnt`
FROM epest_low
WHERE compound = 'CHLORPYRIFOS'
AND state_fips_code = 53
GROUP BY year
ORDER BY year;
SELECT * FROM epest_cpf_low_wa_by_year_view;
The same yearly summary, limited to Yakima County (county FIPS code 77):
SELECT year,
SUM(low_use_kg) AS `yakima_kg_low_sum`,
AVG(low_use_kg) AS `yakima_kg_low_mean`,
COUNT(low_use_kg) AS `yakima_kg_low_cnt`
FROM epest_low
WHERE state_fips_code = 53
AND county_fips_code = 77
GROUP BY year
ORDER BY year;
Create a view joining the low and high CPF estimates for Yakima County:
CREATE VIEW epest_cpf_yakima_by_year_view AS
SELECT l.year AS year,
l.low_use_kg AS low_use_kg,
h.high_use_kg AS high_use_kg
FROM epest_low AS l
INNER JOIN epest_high AS h
ON (l.compound = h.compound
AND l.year = h.year
AND l.state_fips_code = h.state_fips_code
AND l.county_fips_code = h.county_fips_code)
WHERE l.state_fips_code = 53
AND l.county_fips_code = 77
AND l.compound = 'CHLORPYRIFOS'
ORDER BY year;
SELECT * FROM epest_cpf_yakima_by_year_view;
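Because the view carries both estimates side by side, comparing "high" and "low" CPF use is now a simple query; the difference column below is just one example of what you might compute:
-- Yearly gap between the high and low chlorpyrifos estimates in Yakima County
SELECT year,
  low_use_kg,
  high_use_kg,
  high_use_kg - low_use_kg AS `kg_diff`
FROM epest_cpf_yakima_by_year_view
ORDER BY year;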
Finally, compare Yakima County's low-estimate CPF use to the statewide total:
SELECT y.year AS year,
y.low_use_kg AS yakima_kg,
t.wa_kg_low_sum AS wa_kg,
CONCAT(ROUND(( y.low_use_kg/t.wa_kg_low_sum * 100 ),2),'%') AS `perc`
FROM epest_cpf_yakima_by_year_view y
JOIN epest_cpf_low_wa_by_year_view t
ON t.year = y.year
ORDER BY year;
We will work some more with this database and produce some plots.
We will discuss your SQL queries and plots.
Graphic: Jagbirlehl / Wikimedia
Here are two ways to connect to the MySQL database from R.
Using RMySQL:
install.packages("RMySQL")
library("RMySQL")
drv <- dbDriver("MySQL")
myconn <- dbConnect(drv, host="plasmid",
dbname="dataman", user="USERNAME",
password="PASSWORD")
Using RODBC and a DSN:
install.packages("RODBC")
library(RODBC)
myconn <- odbcConnect("plasmid-dataman")
As you can see, one downside of a DSN-less connection (like the RMySQL example above) is that the username and password are in the script.
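The plotting call below expects a "long" data frame named lowhighdat with YEAR, KG, and EPest columns, plus a title string. Here is a minimal sketch of how it might be built, assuming the RMySQL connection myconn from above and the epest_cpf_yakima_by_year_view view; the object names and the title text are illustrative:
library(ggplot2)
# Pull the Yakima County chlorpyrifos estimates from the view
yakima <- dbGetQuery(myconn,
  "SELECT * FROM epest_cpf_yakima_by_year_view")
# Stack the low and high estimates into one long data frame for plotting
lowhighdat <- rbind(
  data.frame(YEAR = yakima$year, KG = yakima$low_use_kg, EPest = "low"),
  data.frame(YEAR = yakima$year, KG = yakima$high_use_kg, EPest = "high"))
title <- "Estimated chlorpyrifos use, Yakima County, WA"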
ggplot(lowhighdat, aes(x=YEAR, y=KG, colour=EPest, group=EPest)) +
geom_line() + ggtitle(title)
- Project Management
- Version Control
- Data Security
- Systems Administration
To prepare for next week...
- Project Management, Version Control, diff, and patch
- Mastering Redmine eBook
- Version Control With Git, 2nd Edition eBook
- Git - Version Control for Everyone eBook
- Git Magic eBook
- Getting Started with GitHub + Git video
- Project Pages: Redmine, Git, GitHub
- Implementing Redmine for Secure Project Management
- Code School - Try Git - 15 minute online Git tutorial
- 6 Useful Graphical Git Client for Linux Guide
Image: © Nevit Dilmen / Wikimedia