Data Management

For Scientific Research

Photo: © Stanza. Used with permission.

Session 7: Applied SQL

Today we will be applying our data cleanup and SQL skills to our current research projects.

Image: SqlPac / Ferdna / Wikimedia

Example: Estimated Pesticide Use

We want to download some files from the USGS listing estimated pesticide use in the United States.

http://pubs.usgs.gov/ds/752/

From these files, we want to ...

Produce statewide yearly pesticide totals
Compare "high" and "low" chlorpyrifos (CPF) use

Image: Benjah-bmm27 / Wikimedia

Download the Files

While we could manually download each text file from the website, on by one, we are going to automate this process.

Here is one way to grab all of those text files with one Bash shell command:

wget -q --no-parent \
     -e robots=off \
     --recursive \
     --accept=txt \
     --no-directories \
     http://pubs.usgs.gov/ds/752/

This can be pasted into the Bash shell and run as one multiline command.

You will download 14 text files totalling about 266 MB in size.

Combine the Files

We can use a Bash command loop to combine the "high" pesticide use and "low" use files.

"High" and "low" represent the two different EPest Methods that we want to compare.

In our for loop we will use tail -n +2 to skip the first line of each file.

for i in high low; do \
    tail -q -n +2 \
        EPest.${i}.*.table*.txt > \
        EPest.${i}.county.estimates.txt
done

Now we have a file for "high" and another file for "low".

Create Database Tables

Run these commands in MySQL:

CREATE  TABLE IF NOT EXISTS `epest_high` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `compound` VARCHAR(45) NULL,
  `year` YEAR(4) NULL,
  `state_fips_code` SMALLINT NULL,
  `county_fips_code` SMALLINT NULL,
  `high_use_kg` NUMERIC(8,1) NULL,
  PRIMARY KEY (`id`) )
ENGINE = InnoDB;

CREATE  TABLE IF NOT EXISTS `epest_low` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `compound` VARCHAR(45) NULL,
  `year` YEAR(4) NULL,
  `state_fips_code` SMALLINT NULL,
  `county_fips_code` SMALLINT NULL,
  `low_use_kg` NUMERIC(8,1) NULL,
  PRIMARY KEY (`id`) )
ENGINE = InnoDB;

Import the Data

Copy the two combined files to /Data/high/ and import them.

Bash:

cp EPest.*.county.estimates.txt /Data/high/

MySQL:

LOAD DATA INFILE 
'/Data/high/EPest.high.county.estimates.txt' 
INTO TABLE epest_high 
(compound, year, state_fips_code, county_fips_code, high_use_kg);

LOAD DATA INFILE 
'/Data/high/EPest.low.county.estimates.txt' 
INTO TABLE epest_low 
(compound, year, state_fips_code, county_fips_code, low_use_kg);

Add Indexes to Tables for Performance

This will speed up our later queries.

MySQL:

ALTER TABLE epest_high 
ADD INDEX year_state_county 
(`year`, `state_fips_code`, `county_fips_code`);

ALTER TABLE epest_low 
ADD INDEX year_state_county 
(`year`, `state_fips_code`, `county_fips_code`);

Find Annual "EPest Low" Use in WA

SELECT year, 
SUM(low_use_kg) AS `wa_kg_low_sum`, 
AVG(low_use_kg) AS `wa_kg_low_mean`, 
COUNT(low_use_kg) AS `wa_kg_low_cnt` 
FROM epest_low  
WHERE state_fips_code = 53
GROUP BY year 
ORDER BY year;

Save that Query as a View

CREATE VIEW epest_low_wa_by_year_view AS 
SELECT year, 
SUM(low_use_kg) AS `wa_kg_low_sum`, 
AVG(low_use_kg) AS `wa_kg_low_mean`, 
COUNT(low_use_kg) AS `wa_kg_low_cnt` 
FROM epest_low  
WHERE state_fips_code = 53
GROUP BY year 
ORDER BY year;

SELECT * FROM epest_low_wa_by_year_view;

Create a View for EPest Low WA CPF Use

CREATE VIEW epest_cpf_low_wa_by_year_view AS 
SELECT year, 
SUM(low_use_kg) AS `wa_kg_low_sum`, 
AVG(low_use_kg) AS `wa_kg_low_mean`, 
COUNT(low_use_kg) AS `wa_kg_low_cnt` 
FROM epest_low  
WHERE compound = 'CHLORPYRIFOS' 
AND state_fips_code = 53
GROUP BY year 
ORDER BY year;

SELECT * FROM epest_cpf_low_wa_by_year_view;

Analyze EPest Low CPF use in Yakima Co.

SELECT year, 
SUM(low_use_kg) AS `yakima_kg_low_sum`, 
AVG(low_use_kg) AS `yakima_kg_low_mean`, 
COUNT(low_use_kg) AS `yakima_kg_low_cnt` 
FROM epest_low 
WHERE state_fips_code = 53 
    AND county_fips_code = 77 
GROUP BY year 
ORDER BY year;

Compare High and Low for Yakima Co.

CREATE VIEW epest_cpf_yakima_by_year_view AS 
SELECT l.year AS year,  
    l.low_use_kg AS low_use_kg, 
    h.high_use_kg AS high_use_kg 
FROM epest_low AS l 
    INNER JOIN epest_high AS h 
ON (l.compound = h.compound 
    AND l.year = h.year 
    AND l.state_fips_code = h.state_fips_code 
    AND l.county_fips_code = h.county_fips_code) 
WHERE l.state_fips_code = 53 
    AND l.county_fips_code = 77 
    AND l.compound = 'CHLORPYRIFOS' 
ORDER BY year;

SELECT * FROM epest_cpf_yakima_by_year_view;

Compare Yakima CPF Use to WA

SELECT y.year AS year, 
    y.low_use_kg AS yakima_kg, 
    t.wa_kg_low_sum AS wa_kg, 
    CONCAT(ROUND(( y.low_use_kg/t.wa_kg_low_sum * 100 ),2),'%') AS `perc`
FROM epest_cpf_yakima_by_year_view y 
JOIN epest_cpf_low_wa_by_year_view t 
ON t.year = y.year 
ORDER BY year;

Hands-on Group Exercise

We will work some more with this database and produce some plots.

Discussion

We will discuss your SQL queries and plots.

Graphic: Jagbirlehl / Wikimedia

Connecting from R

Here are two ways to connect to the MySQL database from R.

Using RMySQL:

install.packages("RMySQL")
library("RMySQL")
drv <- dbDriver("MySQL")
myconn <- dbConnect(drv, host="plasmid", 
  dbname="dataman", user="USERNAME", 
  password="PASSWORD")

Using RODBC and a DSN:

install.packages("RODBC")
library(RODBC)
myconn <- odbcConnect("plasmid-dataman")

R Script Using RMySQL and ggplot2

As you can see, one downside of using a DSN-less connection is that the username and password is in the script.

R Plot Using ggplot2

ggplot(lowhighdat, aes(x=YEAR, y=KG, colour=EPest, group=EPest)) +
  geom_line() + ggtitle(title)

R Script Using RODBC and ggplot2

In the Coming Sessions...

Project Management
Version Control
Data Security
Systems Administration

Action Items (videos, readings, and tasks)

	Redmine A Guided Tour and Graphical Git Clients (Windows and Mac) Beginning Git and GitHub: [1], [2], and [3] Branching and Merging: [1] and [2] - and Forking [3] and Pulling [4] GitHub Training & Guides (watch a few -- like this playlist)
	PCfB textbook: Chapter 20. Working on Remote Computers Skim: SAMS Teach Yourself SQL on 10 Minutes: Ch. 9-14 Optional- Skim: RDDaI3CE textbook: Chapter 17 Optional- Skim: SQLCE3 textbook: Chapters 5-8
	Clean up any external data files you may have. Import your external data into your database. Check your database tables and resolve any errors. Query your database for to answer research questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!