Syllabus

Data Management for Researchers

This is a tentative schedule for a collaborative, hands-on course in data management for academic scientific research projects.

Please see the rationale page for why we are running this course.

We will use a casual, study group approach for Guerrilla Education: No tuition, no tests, no grades, no credits - just fun and learning!

We will meet once a week for a presentation and workshop, followed by a quick summary discussion. Before parting, we will agree on action items (i.e., "homework") to prepare for the next meeting.

 5 min. - Review of last meeting and "homework"
15 min. - Presentation of new material (see outline below)
20 min. - HoE: A guided "hands-on exercise" (laptop or pen/paper)
10 min. - Discussion: share exercise results and choose action items
-------
50 min.

We will be using the following as a textbook for our workshop sessions:

Practical Computing for Biologists

The handy reference tables from the appendices can be downloaded freely here:

We will not have time to review much of this material during our workshops. Instead, we will assign readings from this text and will refer to (and use) the information and techniques it describes. Ideally, this material would have been covered in a previous course, as it lays a foundation in the computer skills needed for data management and analysis. These skills include navigating filesystems, using a command-line interface (CLI) known as the "shell" (Terminal), matching text with regular expressions, creating data pipelines, shell scripting, and installing software. We will have some time in our meetings to answer questions about these topics. The chapter on relational databases, however, will be covered in our workshops and expanded upon with material from other sources.

Most other course materials will be available freely over the Internet. Some resources, however, will be accessed as eBooks through the Seattle Public Library. If you do not already have an SPL card, you can register for one here:

http://www.spl.org/using-the-library/get-started/get-a-library-card

Learning Objectives

Participants in this course should expect to learn:

  • When to consider the use of a database system for scientific research projects
  • How to determine project requirements and anticipate disk, memory and processing needs
  • The basics of data security in networked environments
  • Practical skills in managing, converting, and processing data files
  • A working knowledge of command-line interface (CLI) skills
  • Basic database programming skills using the SQL language
  • How to design and implement a relational database
  • How to connect to and use a database from various statistical applications
  • How websites are built on (and from) database systems (and other web technologies)
  • Basic systems administration skills such as installing software and configuring services
  • Familiarity with virtual machine (VM) technology and how to use it for data system development
  • How to use collaborative project management applications and revision control systems

Course Outline

Exact topics, exercises, dates and times TBD.

Session 1: Data System Essentials

  1. Types of data systems: their pluses and minuses, with examples
  2. Sidebar: Most interactive websites are "database-driven" (Google, Facebook, Twitter, etc.)
  3. Databases (manageable, scalable) versus spreadsheets (convenient)
  4. Sidebar: File types, line-endings, human/computer readable formats, and text editors
  5. Data System Requirements: Capacity Planning, Modes of Access, Security
  6. HoE: Cloud DBs: eScience SQLShare, Google Fusion Tables, YQL Data Tables
  7. Discussion
  8. Action Items (readings, videos and tasks)
  9. See also:

Session 2: Database Analysis and Design

  1. Database Analysis: Investigating what you have
  2. Database Design: Inventing what you need
  3. Case Study: A real example from a research project
  4. Data Flow Diagrams (DFDs)
  5. HoE: Create a DFD: using Creately, Gliffy, Visio, or Dia
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:

Session 3: Introduction to Relational Databases

  1. Identifying entities, relationships, and keys
  2. Entity-Relationship Diagrams (ERDs), Schemas, and Data Dictionaries (DDs)
  3. Sidebar: Firefox history is a relational database
  4. Normalization (for ease of maintenance and performance; see the sketch after this list)
  5. Example: Given a DFD, now create an ERD
  6. HoE: Design a relational database (ERD) (Creately, Gliffy, MySQL Workbench)
  7. Discussion
  8. Action Items (readings, videos and tasks)
  9. See also:
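
To make normalization concrete before the hands-on exercise, here is a minimal sketch in R, using the RSQLite package (with dbExecute from a recent version of DBI). The subjects and samples tables, and all column names, are invented for illustration:

library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")  # throwaway in-memory database

# Store each subject once, instead of repeating subject details in
# every sample row, and link samples back to subjects by key.
dbExecute(con, "CREATE TABLE subjects (
   subject_id INTEGER PRIMARY KEY,
   name TEXT NOT NULL)")
dbExecute(con, "CREATE TABLE samples (
   sample_id INTEGER PRIMARY KEY,
   subject_id INTEGER REFERENCES subjects(subject_id),
   collected TEXT)")

dbExecute(con, "INSERT INTO subjects VALUES (1, 'Alice'), (2, 'Bob')")
dbExecute(con, "INSERT INTO samples VALUES (1, 1, '2014-03-01'),
   (2, 1, '2014-03-08'), (3, 2, '2014-03-08')")

# A JOIN reassembles the "flat" view on demand
print(dbGetQuery(con, "SELECT name, collected
   FROM samples JOIN subjects ON samples.subject_id = subjects.subject_id"))

dbDisconnect(con)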

Session 4: Building Database Tables

  1. Defining data types and constraints (see the sketch after this list)
  2. Creating a Schema given an ERD
  3. Database design tools (Examples: MySQL Workbench, pgAdmin)
  4. Example: Create some tables (and relations) from ERD
  5. HoE: Creating more tables from ERD
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:
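
As a taste of what defining types and constraints looks like, here is a minimal sketch, again in R with RSQLite (the measurements table and its columns are assumptions for illustration). Declaring constraints lets the database itself reject bad rows:

library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")

dbExecute(con, "CREATE TABLE measurements (
   id INTEGER PRIMARY KEY,         -- unique row identifier
   analyte TEXT NOT NULL,          -- required text field
   value REAL CHECK (value >= 0),  -- no negative concentrations
   units TEXT DEFAULT 'ppm')")

# This row satisfies all constraints:
dbExecute(con, "INSERT INTO measurements (analyte, value) VALUES ('lead', 1.2)")

# This row violates the CHECK constraint, so SQLite raises an error:
try(dbExecute(con, "INSERT INTO measurements (analyte, value) VALUES ('lead', -5)"))

dbDisconnect(con)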

Session 5: Database Applications

  1. Data entry forms, "business logic", and reports
  2. Connecting to databases with ODBC (see the sketch after this list)
  3. Database queries with Structured Query Language (SQL)
  4. Example: Entering/importing data and running queries
  5. HoE: Connect to a database, enter data, and run queries
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:
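
For a preview of what an ODBC connection looks like from R, here is a minimal sketch using the RODBC package. It assumes a data source name (DSN) has already been configured in your ODBC driver manager; the DSN "mydb" and the samples table are invented for illustration:

library(RODBC)
ch <- odbcConnect("mydb")  # "mydb" is a hypothetical, pre-configured DSN
results <- sqlQuery(ch, "SELECT * FROM samples LIMIT 10")
str(results)               # query results arrive as an R data frame
odbcClose(ch)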

Session 6: Structured Query Language (SQL)

  1. CRUD: Create, Read, Update and Delete
  2. Creating tables with SQL (CREATE TABLE)
  3. SELECT, WHERE, and GROUP BY (see the GROUP BY sketch below)
  4. Various kinds of JOINs
  5. HoE: Try some SQL on your tables
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:

Example 1: INNER JOIN with Firefox History

Here is one way to query multiple tables, using the WHERE clause of a SELECT statement:

SELECT SUBSTR(moz_places.url,0,50) AS `URL`, 
   datetime(moz_historyvisits.visit_date/1000000,"unixepoch") AS `TimeStamp`
FROM moz_places, moz_historyvisits 
WHERE URL LIKE "%youtube.com%" 
   AND moz_places.id = moz_historyvisits.place_id
ORDER BY TimeStamp DESC 
LIMIT 3;

This finds the dates and times of the three most recent visits to youtube.com.

The same result can be obtained using the INNER JOIN syntax:

SELECT SUBSTR(moz_places.url,0,50) AS `URL`, 
   datetime(moz_historyvisits.visit_date/1000000,"unixepoch") AS `TimeStamp`
FROM moz_places INNER JOIN moz_historyvisits
   ON moz_places.id = moz_historyvisits.place_id 
WHERE URL LIKE "%youtube.com%" 
ORDER BY TimeStamp DESC 
LIMIT 3;

In both cases, you will see output similar to this in sqlite3 (with .header on, .mode column and .width 50):

URL                                                 TimeStamp          
--------------------------------------------------  -------------------
https://www.youtube.com/watch?v=z2kbsG8zsLM         2014-03-23 17:02:38
https://www.youtube.com/watch?v=zoXLU86ohmw         2014-03-23 17:02:33
https://www.youtube.com/watch?v=KA4rRnihLII         2014-03-23 17:02:27
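
Session 6 also covers GROUP BY. Here is a sketch over the same two tables (the schema is as above, but this query is illustrative and your counts will differ), listing the three most-visited pages:

SELECT SUBSTR(moz_places.url,0,50) AS `URL`, COUNT(*) AS `Visits`
FROM moz_places INNER JOIN moz_historyvisits
   ON moz_places.id = moz_historyvisits.place_id
GROUP BY moz_places.id
ORDER BY `Visits` DESC
LIMIT 3;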

See also:

Session 7: Embedded SQL

  1. Embedding SQL in other environments (Stata, R, etc.)
  2. Using a SQL query to populate a data structure
  3. Examples with RStudio and Stata
  4. HoE: Try embedded SQL with R, Stata, SAS, SPSS, Python, etc.
  5. Discussion
  6. Action Items (readings, videos and tasks)
  7. See also:

Example 1: Running a SQL query of Firefox history from R

> # First copy places.sqlite from your Firefox profile folder
> install.packages("RSQLite")
> library(RSQLite)
> hist <- dbConnect(SQLite(), 'places.sqlite')
> dbListTables(hist)
 [1] "moz_anno_attributes" "moz_annos"           "moz_bookmarks"      
 [4] "moz_bookmarks_roots" "moz_favicons"        "moz_historyvisits"  
 [7] "moz_hosts"           "moz_inputhistory"    "moz_items_annos"    
[10] "moz_keywords"        "moz_places"          "sqlite_sequence"    
[13] "sqlite_stat1"       
> dbListFields(hist,'moz_places')
 [1] "id"              "url"             "title"           "rev_host"       
 [5] "visit_count"     "hidden"          "typed"           "favicon_id"     
 [9] "frecency"        "last_visit_date" "guid"           
> dbGetQuery(hist,'SELECT substr(url,0,26) as link,frecency from moz_places where link like "http%" order by frecency desc limit 3')
                         link frecency
1     https://www.google.com/     2100
2  http://www.washington.edu/     2000
3    http://www.slashdot.org/      150

See also:

Example 2: Using R to create a database from a "data.frame"

We will be running this SQL command:

SELECT row_names AS State, Murder, Assault, 
   ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio 
FROM arrests 
ORDER BY MurderAssaultRatio DESC 
LIMIT 10

Here is how we will create the database and run the query in R.

> str(USArrests)
'data.frame':	50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> library("RSQLite")
Loading required package: DBI
> drv <- dbDriver("SQLite")
> sqlfile <- tempfile(tmpdir="~", fileext=".sqlite")
> sqlfile
[1] "~/file1ea87682fba.sqlite"
> con <- dbConnect(drv, dbname = sqlfile)
> data(USArrests)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> system("file ~/file1ea87682fba.sqlite")
/home/brianhigh/file1ea87682fba.sqlite: SQLite 3.x database
> system("sqlite3 ~/file1ea87682fba.sqlite .dump | grep -A6 '^CREATE TABLE arrests'")
CREATE TABLE arrests 
( row_names TEXT,
	Murder REAL,
	Assault INTEGER,
	UrbanPop INTEGER,
	Rape REAL 
);
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder"    "Assault"   "UrbanPop"  "Rape"     
> dbGetQuery(con, "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10")
           State Murder Assault MurderAssaultRatio
1         Hawaii    5.3      46               11.5
2       Kentucky    9.7     109                8.9
3        Georgia   17.4     211                8.2
4      Tennessee   13.2     188                7.0
5  West Virginia    5.7      81                7.0
6        Indiana    7.2     113                6.4
7          Texas   12.7     201                6.3
8      Louisiana   15.4     249                6.2
9    Mississippi   16.1     259                6.2
10          Ohio    7.3     120                6.1

And we can see that the same query result can be produced from the Bash shell with the sqlite3 command, using the same SQL SELECT statement. First we will set .header on, .mode column, and .width 15 so that the output formatting will be similar. We do this with an "init file", which we create with echo, a shell command that prints a text string (here redirected into a file with >). The commands look like this:

brianhigh@twisty:~$ echo -e ".header on\n.mode column\n.width 15" > sqlite.init
brianhigh@twisty:~$ sqlite3 -init sqlite.init ~/file1ea87682fba.sqlite "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10"
-- Loading resources from sqlite.init
State            Murder      Assault     MurderAssaultRatio
---------------  ----------  ----------  ------------------
Hawaii           5.3         46          11.5              
Kentucky         9.7         109         8.9               
Georgia          17.4        211         8.2               
Tennessee        13.2        188         7.0               
West Virginia    5.7         81          7.0               
Indiana          7.2         113         6.4               
Texas            12.7        201         6.3               
Louisiana        15.4        249         6.2               
Mississippi      16.1        259         6.2               
Ohio             7.3         120         6.1

See also:

Session 8: Mobile Data Collection

  1. ODK Collect (Android App) and Aggregate (Server)
  2. How to connect to ODK and create forms
  3. Examples using DEOHS ODK Server
  4. HoE: Create some ODK forms
  5. Discussion
  6. Action Items (readings, videos and tasks)
  7. See also:

Session 9: Server Administration

  1. Different ways to run ODK Aggregate
  2. Example: Creating an ODK virtual machine (VM)
  3. Security: accounts, passwords, services, firewalls, updates
  4. HoE: Create an ODK VM (on Linux) with a VirtualBox "appliance"
  5. Discussion
  6. Action Items (readings, videos and tasks)
  7. See also:

Session 10: Project Management (PM) and Version Control Systems (VCS)

  1. Project Management and Collaboration Tools
  2. Version Control Systems (Git, Subversion, Mercurial)
  3. Cloud-based PM and VCS: Redmine, GitHub, Google Code
  4. Example: Exploring Redmine and Git
  5. HoE: Using Redmine and Git to manage your code, docs, and project
  6. See also:

Session 11: Web-enabled Data: Applications and Frameworks

  1. From CGI to Frameworks: A brief history of web apps
  2. Web languages and their associated frameworks (RoR, Cake, Django)
  3. Content Management Systems (CMS): Drupal, Joomla, WordPress
  4. Example: Demo of an R and Shiny web app (in "showcase" mode)
  5. HoE: Build a simple Shiny web site (see the sketch after this list)
  6. Discussion
  7. Action Items (readings, videos and tasks)
  8. See also:
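
To give a flavor of the hands-on exercise, here is a minimal, self-contained Shiny app sketch (assuming the shiny package is installed; it reuses the built-in USArrests data from Example 2 above):

library(shiny)

# Make the state names (row names) an ordinary column
arrests <- data.frame(State = rownames(USArrests), USArrests)

ui <- fluidPage(
   titlePanel("USArrests Explorer"),
   sliderInput("n", "Number of states to show:", min = 1, max = 50, value = 10),
   tableOutput("top")
)

server <- function(input, output) {
   output$top <- renderTable({
      # The n states with the highest murder rates
      head(arrests[order(-arrests$Murder), ], input$n)
   })
}

shinyApp(ui = ui, server = server)  # or save as app.R and launch with runApp()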

Additional Materials: eBooks

Some of the eBooks we will be using (a page here, a section there) are:

If you prefer paper books, purchase any of the above, or consider:

... both by Jan L. Harrington, who really does "clearly explain" things. The used prices for these are very affordable - $8 to $12 each.

Participant Profiles

Participant 1:

1. Primarily a Windows user.
2. I have some experience with Stata, very little with R (not super comfortable without being able to search Google and reuse my past do-files).
3. I have my data in Stata and Excel.
4. I would like to learn more about what I am doing while managing data (not just how to do it).
5. I have to do 5 more sampling "runs" by June 22nd, 2014, and I have to defend my proposal by December 12th, 2014. Otherwise I do not have any long-term goals.

Participant 2:

1. Primarily a Mac user. However, I use Windows on a daily basis and for all programming tasks (although I am curious about AppleScript and Linux).
2. Most experience with SAS and Stata programming. Very little R experience. I also use ArcGIS, but only via menus/GUI. I'd really like to learn Python for scripting in ArcGIS (and everywhere else, it seems).
3. My data are in .xls, .xlsx, .csv, .mdb, .dta, and .sas7bdat formats, with both character and numeric data. These data originated from previous projects that I was not involved with, and their content and structure vary. I am currently creating data dictionaries in an effort to keep myself organized.
4. While I am in survival mode for some tasks (and thus need to know just enough to finish them on time), I prefer to have a more complete understanding because that is (usually) more edifying. I suppose this is a case-by-case decision for me.
5. For one project, I already have the data described in #3, which I'd like to have at least partially analyzed by the end of spring quarter. For another project, I will be collecting interview data in May and June. I would like to have those data analyzed by August 30 if at all possible.