Syllabus

Data Management for Researchers

This is a tentative schedule for a collaborative, hands-on course in data management for academic scientific research projects.

Please see the rationale page for why we are running this course.

We will use a casual, study group approach for Guerrilla Education: No tuition, no tests, no grades, no credits - just fun and learning!

We will meet once a week for a presentation and workshop, followed by a quick summary discussion. Before parting, we will agree on action items (i.e., "homework") to prepare for the next meeting.

 5 min. - Review of last meeting and "homework"
15 min. - Presentation of new material (see outline below)
20 min. - HoE: A guided "hands-on exercise" (laptop or pen/paper)
10 min. - Discussion: share exercise results and choose action items
-------
50 min.

We will be using the following as a textbook for our workshop sessions:

Practical Computing for Biologists

The handy reference tables from the appendices can be downloaded freely here:

We will not have time to review much of this material during our workshops. Instead, we will assign readings from this text and will refer to (and use) the information and techniques it describes. Ideally, this material would have been covered in a previous course, as it lays a foundation in the computer skills needed for data management and analysis. These skills include navigating filesystems, using a command-line interface (CLI) known as the "shell" (Terminal), matching text with regular expressions, creating data pipelines, shell scripting, and installing software. We will have some time in our meetings to answer questions about these topics. The chapter on relational databases, however, will be covered in our workshops and expanded upon with material from other sources.

Most other course materials will be available freely over the Internet. Some resources, however, will be accessed as eBooks through the Seattle Public Library. If you do not already have an SPL card, you can register for one here:

http://www.spl.org/using-the-library/get-started/get-a-library-card

Learning Objectives

Participants in this course should expect to learn:

  • When to consider the use of a database system for scientific research projects
  • How to determine project requirements and anticipate disk, memory and processing needs
  • The basics of data security in networked environments
  • Practical skills in managing, converting, and processing data files
  • A working knowledge of command-line interface (CLI) skills
  • Basic database programming skills using the SQL language
  • How to design and implement a relational database
  • How to connect to and use a database from various statistical applications
  • How websites are built on (and from) database systems (and other web technologies)
  • Basic systems administration skills such as installing software and configuring services
  • Familiarity with virtual machine (VM) technology and how to use it for data system development
  • How to use collaborative project management applications and revision control systems

Course Outline

Exact topics, exercises, dates and times TBD.

Session 1: Data System Essentials

  1. Types of data systems: their pluses and minuses, with examples
  2. Sidebar: Most interactive websites are "database-driven" (Google, Facebook, Twitter, etc.)
  3. Databases (manageable, scalable) versus spreadsheets (convenient)
  4. Sidebar: File types, line-endings, human/computer readable formats, and text editors
  5. Data System Requirements: Capacity Planning, Modes of Access, Security
  6. HoE: Cloud DBs: eScience SQLShare, Google Fusion Tables, YQL Data Tables
  7. Discussion
  8. Action Items (readings, videos and tasks)
  9. See also:

Session 2: Database Analysis and Design

  1. Database Analysis: Investigating what you have
  2. Database Design: Inventing what you need
  3. Case Study: A real example from a research project
  4. Data Flow Diagrams (DFDs)
  5. HoE: Create a DFD: using Creately, Gliffy, Visio, or Dia
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:

Session 3: Introduction to Relational Databases

  1. Identifying entities, relationships, and keys
  2. Entity-Relationship Diagrams (ERDs), Schemas, and Data Dictionaries (DDs)
  3. Sidebar: Firefox history is a relational database
  4. Normalization (for ease of maintenance and performance; see the sketch after this list)
  5. Example: Given a DFD, now create an ERD
  6. HoE: Design a relational database (ERD) (Creately, Gliffy, MySQL Workbench)
  7. Discussion
  8. Action Items (readings, videos and tasks)
  9. See also:
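
To make normalization concrete before the hands-on exercise, here is a minimal sketch in R, using the RSQLite package (with dbExecute from a recent version of DBI). The subjects and samples tables, and all column names, are invented for illustration:

library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")  # throwaway in-memory database

# Store each subject once, instead of repeating subject details in
# every sample row, and link samples back to subjects by key.
dbExecute(con, "CREATE TABLE subjects (
   subject_id INTEGER PRIMARY KEY,
   name TEXT NOT NULL)")
dbExecute(con, "CREATE TABLE samples (
   sample_id INTEGER PRIMARY KEY,
   subject_id INTEGER REFERENCES subjects(subject_id),
   collected TEXT)")

dbExecute(con, "INSERT INTO subjects VALUES (1, 'Alice'), (2, 'Bob')")
dbExecute(con, "INSERT INTO samples VALUES (1, 1, '2014-03-01'),
   (2, 1, '2014-03-08'), (3, 2, '2014-03-08')")

# A JOIN reassembles the "flat" view on demand
print(dbGetQuery(con, "SELECT name, collected
   FROM samples JOIN subjects ON samples.subject_id = subjects.subject_id"))

dbDisconnect(con)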

Session 4: Building Database Tables

  1. Defining data types and constraints (see the sketch after this list)
  2. Creating a Schema given an ERD
  3. Database design tools (Examples: MySQL Workbench, pgAdmin)
  4. Example: Create some tables (and relations) from ERD
  5. HoE: Creating more tables from ERD
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:
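
As a taste of what defining types and constraints looks like, here is a minimal sketch, again in R with RSQLite (the measurements table and its columns are assumptions for illustration). Declaring constraints lets the database itself reject bad rows:

library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")

dbExecute(con, "CREATE TABLE measurements (
   id INTEGER PRIMARY KEY,         -- unique row identifier
   analyte TEXT NOT NULL,          -- required text field
   value REAL CHECK (value >= 0),  -- no negative concentrations
   units TEXT DEFAULT 'ppm')")

# This row satisfies all constraints:
dbExecute(con, "INSERT INTO measurements (analyte, value) VALUES ('lead', 1.2)")

# This row violates the CHECK constraint, so SQLite raises an error:
try(dbExecute(con, "INSERT INTO measurements (analyte, value) VALUES ('lead', -5)"))

dbDisconnect(con)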

Session 5: Database Applications

  1. Data entry forms, "business logic", and reports
  2. Connecting to databases with ODBC (see the sketch after this list)
  3. Database queries with Structured Query Language (SQL)
  4. Example: Entering/importing data and running queries
  5. HoE: Connect to a database, enter data, and run queries
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:
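
For a preview of what an ODBC connection looks like from R, here is a minimal sketch using the RODBC package. It assumes a data source name (DSN) has already been configured in your ODBC driver manager; the DSN "mydb" and the samples table are invented for illustration:

library(RODBC)
ch <- odbcConnect("mydb")  # "mydb" is a hypothetical, pre-configured DSN
results <- sqlQuery(ch, "SELECT * FROM samples LIMIT 10")
str(results)               # query results arrive as an R data frame
odbcClose(ch)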

Session 6: Structured Query Language (SQL)

  1. CRUD: Create, Read, Update and Delete
  2. Creating tables with SQL (CREATE TABLE)
  3. SELECT, WHERE, and GROUP BY (see the GROUP BY sketch below)
  4. Various kinds of JOINs
  5. HoE: Try some SQL on your tables
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:

Example 1: INNER JOIN with Firefox History

Here is one way to query multiple tables, using the WHERE clause of a SELECT statement:

SELECT SUBSTR(moz_places.url,0,50) AS `URL`, 
   datetime(moz_historyvisits.visit_date/1000000,"unixepoch") AS `TimeStamp`
FROM moz_places, moz_historyvisits 
WHERE URL LIKE "%youtube.com%" 
   AND moz_places.id = moz_historyvisits.place_id
ORDER BY TimeStamp DESC 
LIMIT 3;

This finds the dates and times of the three most recent visits to youtube.com.

The same result can be obtained using the INNER JOIN syntax:

SELECT SUBSTR(moz_places.url,0,50) AS `URL`, 
   datetime(moz_historyvisits.visit_date/1000000,"unixepoch") AS `TimeStamp`
FROM moz_places INNER JOIN moz_historyvisits
   ON moz_places.id = moz_historyvisits.place_id 
WHERE URL LIKE "%youtube.com%" 
ORDER BY TimeStamp DESC 
LIMIT 3;

In both cases, you will see output similar to this in sqlite3 (with .header on, .mode column and .width 50):

URL                                                 TimeStamp          
--------------------------------------------------  -------------------
https://www.youtube.com/watch?v=z2kbsG8zsLM         2014-03-23 17:02:38
https://www.youtube.com/watch?v=zoXLU86ohmw         2014-03-23 17:02:33
https://www.youtube.com/watch?v=KA4rRnihLII         2014-03-23 17:02:27
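
Session 6 also covers GROUP BY. Here is a sketch over the same two tables (the schema is as above, but this query is illustrative and your counts will differ), listing the three most-visited pages:

SELECT SUBSTR(moz_places.url,0,50) AS `URL`, COUNT(*) AS `Visits`
FROM moz_places INNER JOIN moz_historyvisits
   ON moz_places.id = moz_historyvisits.place_id
GROUP BY moz_places.id
ORDER BY `Visits` DESC
LIMIT 3;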

See also:

Session 7: Embedded SQL

  1. Embedding SQL in other environments (Stata, R, etc.)
  2. Using a SQL query to populate a data structure
  3. Examples with RStudio and Stata
  4. HoE: Try embedded SQL with R, Stata, SAS, SPSS, Python, etc.
  5. Discussion
  6. Action Items (readings, videos and tasks)
  7. See also:

Example 1: Running a SQL query of Firefox history from R

> # First copy places.sqlite from your Firefox profile folder
> install.packages("RSQLite")
> library(RSQLite)
> hist <- dbConnect(SQLite(), 'places.sqlite')
> dbListTables(hist)
 [1] "moz_anno_attributes" "moz_annos"           "moz_bookmarks"      
 [4] "moz_bookmarks_roots" "moz_favicons"        "moz_historyvisits"  
 [7] "moz_hosts"           "moz_inputhistory"    "moz_items_annos"    
[10] "moz_keywords"        "moz_places"          "sqlite_sequence"    
[13] "sqlite_stat1"       
> dbListFields(hist,'moz_places')
 [1] "id"              "url"             "title"           "rev_host"       
 [5] "visit_count"     "hidden"          "typed"           "favicon_id"     
 [9] "frecency"        "last_visit_date" "guid"           
> dbGetQuery(hist,'SELECT substr(url,0,26) as link,frecency from moz_places where link like "http%" order by frecency desc limit 3')
                         link frecency
1     https://www.google.com/     2100
2  http://www.washington.edu/     2000
3    http://www.slashdot.org/      150

See also:

Example 2: Using R to create a database from a "data.frame"

We will be running this SQL command:

SELECT row_names AS State, Murder, Assault, 
   ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio 
FROM arrests 
ORDER BY MurderAssaultRatio DESC 
LIMIT 10

Here is how we will create the database and run the query in R.

> str(USArrests)
'data.frame':	50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> library("RSQLite")
Loading required package: DBI
> drv <- dbDriver("SQLite")
> sqlfile <- tempfile(tmpdir="~", fileext=".sqlite")
> sqlfile
[1] "~/file1ea87682fba.sqlite"
> con <- dbConnect(drv, dbname = sqlfile)
> data(USArrests)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> system("file ~/file1ea87682fba.sqlite")
/home/brianhigh/file1ea87682fba.sqlite: SQLite 3.x database
> system("sqlite3 ~/file1ea87682fba.sqlite .dump | grep -A6 '^CREATE TABLE arrests'")
CREATE TABLE arrests 
( row_names TEXT,
	Murder REAL,
	Assault INTEGER,
	UrbanPop INTEGER,
	Rape REAL 
);
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder"    "Assault"   "UrbanPop"  "Rape"     
> dbGetQuery(con, "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10")
           State Murder Assault MurderAssaultRatio
1         Hawaii    5.3      46               11.5
2       Kentucky    9.7     109                8.9
3        Georgia   17.4     211                8.2
4      Tennessee   13.2     188                7.0
5  West Virginia    5.7      81                7.0
6        Indiana    7.2     113                6.4
7          Texas   12.7     201                6.3
8      Louisiana   15.4     249                6.2
9    Mississippi   16.1     259                6.2
10          Ohio    7.3     120                6.1

And we can see that the same query result can be produced from the Bash shell with the sqlite3 command, using the same SQL SELECT statement. First we will set .header on, .mode column, and .width 15 so that the output formatting will be similar. We do this with an "init file", which we create with echo, a shell command that prints a text string (here redirected into a file with >). The commands look like this:

brianhigh@twisty:~$ echo -e ".header on\n.mode column\n.width 15" > sqlite.init
brianhigh@twisty:~$ sqlite3 -init sqlite.init ~/file1ea87682fba.sqlite "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10"
-- Loading resources from sqlite.init
State            Murder      Assault     MurderAssaultRatio
---------------  ----------  ----------  ------------------
Hawaii           5.3         46          11.5              
Kentucky         9.7         109         8.9               
Georgia          17.4        211         8.2               
Tennessee        13.2        188         7.0               
West Virginia    5.7         81          7.0               
Indiana          7.2         113         6.4               
Texas            12.7        201         6.3               
Louisiana        15.4        249         6.2               
Mississippi      16.1        259         6.2               
Ohio             7.3         120         6.1

See also:

Session 8: Mobile Data Collection

  1. ODK Collect (Android App) and Aggregate (Server)
  2. How to connect to ODK and create forms
  3. Examples using DEOHS ODK Server
  4. HoE: Create some ODK forms
  5. Discussion
  6. Action Items (readings, videos and tasks)
  7. See also:

Session 9: Server Administration

  1. Different ways to run ODK Aggregate
  2. Example: Creating an ODK virtual machine (VM)
  3. Security: accounts, passwords, services, firewalls, updates
  4. HoE: Create an ODK VM (on Linux) with a VirtualBox "appliance"
  5. Discussion
  6. Action Items (readings, videos and tasks)
  7. See also:

Session 10: Project Management (PM) and Version Control Systems (VCS)

  1. Project Management and Collaboration Tools
  2. Version Control Systems (Git, Subversion, Mercurial)
  3. Cloud-based PM and VCS: Redmine, GitHub, Google Code
  4. Example: Exploring Redmine and Git
  5. HoE: Using Redmine and Git to manage your code, docs, and project
  6. See also:

Session 11: Web-enabled Data: Applications and Frameworks

  1. From CGI to Frameworks: A brief history of web apps
  2. Web languages and their associated frameworks (RoR, Cake, Django)
  3. Content Management Systems (CMS): Drupal, Joomla, WordPress
  4. Example: Demo of an R and Shiny web app (in "showcase" mode)
  5. HoE: Build a simple Shiny web site (see the sketch after this list)
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:
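
To give a flavor of the hands-on exercise, here is a minimal, self-contained Shiny app sketch (assuming the shiny package is installed; it reuses the built-in USArrests data from Example 2 above):

library(shiny)

# Make the state names (row names) an ordinary column
arrests <- data.frame(State = rownames(USArrests), USArrests)

ui <- fluidPage(
   titlePanel("USArrests Explorer"),
   sliderInput("n", "Number of states to show:", min = 1, max = 50, value = 10),
   tableOutput("top")
)

server <- function(input, output) {
   output$top <- renderTable({
      # The n states with the highest murder rates
      head(arrests[order(-arrests$Murder), ], input$n)
   })
}

shinyApp(ui = ui, server = server)  # or save as app.R and launch with runApp()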

Additional Materials: eBooks

Some of the eBooks we will be using (a page here, a section there) are:

If you prefer paper books, purchase any of the above, or consider:

... both by Jan L. Harrington, who really does "clearly explain" things. The used prices for these are very affordable - $8 to $12 each.

Participant Profiles

Participant 1:

1. Primarily a Windows user.
2. I have some experience with Stata, very little with R (not super comfortable without being able to search Google and reuse my past do-files).
3. I have my data in Stata and Excel.
4. I would like to learn more about what I am doing while managing data (not just how to do it).
5. I have to do 5 more sampling "runs" by June 22nd, 2014, and I have to defend my proposal by December 12th, 2014. Otherwise I do not have any long-term goals.

Participant 2:

1. Primarily a Mac user. However, I use Windows on a daily basis and for all programming tasks (although I am curious about AppleScript and Linux).
2. Most experience with SAS and Stata programming. Very little R experience. I also use ArcGIS, but only via menus/GUI. I'd really like to learn Python for scripting in ArcGIS (and everywhere else, it seems).
3. My data are in .xls, .xlsx, .csv, .mdb, .dta, and .sas7bdat formats, with both character and numeric data. These data originated from previous projects that I was not involved with, and their content and structure vary. I am currently creating data dictionaries in an effort to keep myself organized.
4. While I am in survival mode for some tasks (and thus need to know just enough to finish them on time), I prefer to have a more complete understanding because that is (usually) more edifying. I suppose this is a case-by-case decision for me.
5. For one project, I already have the data described in #3, which I'd like to have at least partially analyzed by the end of spring quarter. For another project, I will be collecting interview data in May and June. I would like to have those data analyzed by August 30 if at all possible.