Syllabus
This is a tentative schedule for a collaborative, hands-on course in data management for academic scientific research projects.
Please see the rationale page for why we are running this course.
We will use a casual, study group approach for Guerrilla Education: No tuition, no tests, no grades, no credits - just fun and learning!
We will meet once a week for a presentation and workshop, followed by a quick summary discussion. Before parting, we will agree on action items (i.e., "homework") to prepare for the next meeting.
5 min. - Review of last meeting and "homework"
15 min. - Presentation of new material (see outline below)
20 min. - HoE: A guided "hands-on exercise" (laptop or pen/paper)
10 min. - Discussion: share exercise results and choose action items
-------
50 min.
We will be using the following as a textbook for our workshop sessions:
Practical Computing for Biologists
- http://www.sinauer.com/catalog/biology/practical-computing-for-biologists.html
- http://www.amazon.com/dp/0878933913
The handy reference tables from the appendices can be downloaded freely here:
We will not have time to review much of this material during our workshops. Instead, we will assign readings from this text and will refer to (and use) the information and techniques it describes. Ideally, this material would have already been covered in a previous course, as it lays a foundation in the computer skills needed for data management and analysis. These skills include navigating filesystems, using a command-line interface (CLI) known as the "shell" (Terminal), matching text with regular expressions, creating data pipelines, shell scripting, and installing software. We will have some time in our meetings to answer questions about these topics. The chapter on relational databases, however, will be covered in our workshops and expanded upon with material from other sources.
Most other course materials will be available freely over the Internet. Some resources, however, will be accessed as eBooks* through the Seattle Public Library. If you do not already have a SPL card, you can register to get one here:
http://www.spl.org/using-the-library/get-started/get-a-library-card
Participants in this course should expect to learn:
- When to consider the use of a database system for scientific research projects
- How to determine project requirements and anticipate disk, memory and processing needs
- The basics of data security in networked environments
- Practical skills in managing, converting, and processing data files
- A working knowledge of command-line interface (CLI) skills
- Basic database programming skills using the SQL language
- How to design and implement a relational database
- How to connect to and use a database from various statistical applications
- How websites are built on (and from) database systems (and other web technologies)
- Basic systems administration skills such as installing software and configuring services
- Familiarity with virtual machine (VM) technology and how to use it for data system development
- How to use collaborative project management applications and revision control systems
Exact topics, exercises, dates and times TBD.
- Types of data systems: what they are, their pros and cons, and examples
- Sidebar: Most interactive websites are "database-driven" (Google, Facebook, Twitter, etc.)
- Databases (manageable, scalable) versus spreadsheets (convenient)
- Sidebar: File types, line-endings, human/computer readable formats, and text editors
- Data System Requirements: Capacity Planning, Modes of Access, Security
- HoE: Cloud DBs: eScience SQLShare, Google Fusion Tables, YQL Data Tables
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- DailyViz Fusion Tables Examples
- Topics in Data Management
- That information, These data?
- Is "Data" Singular or Plural?
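As a small illustration of the file-types and line-endings sidebar above, here is a minimal Python sketch (the sample filename and helper function are invented for illustration) that reports whether a text file uses Windows (CRLF), Unix (LF), or classic Mac (CR) line endings:

```python
# Detect the line-ending convention used in a text file.
# The sample file written below is just for demonstration.

def detect_line_endings(data: bytes) -> str:
    """Return 'CRLF', 'LF', 'CR', or 'none' for a byte string."""
    if b"\r\n" in data:
        return "CRLF"   # Windows style
    if b"\n" in data:
        return "LF"     # Unix/Linux/OS X style
    if b"\r" in data:
        return "CR"     # Classic Mac OS style
    return "none"

# Write a small Windows-style sample file, then inspect it.
with open("sample.txt", "wb") as f:
    f.write(b"id,value\r\n1,42\r\n")

with open("sample.txt", "rb") as f:
    print(detect_line_endings(f.read()))  # CRLF
```

Knowing which convention a file uses matters when moving data files between operating systems, since some tools silently misread the "wrong" line endings.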
- Database Analysis: Investigating what you have
- Database Design: Inventing what you need
- Case Study: A real example from a research project
- Data Flow Diagrams (DFDs)
- HoE: Create a DFD: using Creately, Gliffy, Visio, or Dia
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- The Basics of Good Database Design
- DFD over Flowcharts PDF
- DFD Slideshow
- Creately DFD
- DFDs - and follow link to "Article"
- How to Draw a DFD
- Creating Sturdy Databases in SQL
- 10 Steps to SQL Success
- Graphical Data Flow Programming in LabVIEW Video
- Identifying entities, relationships, and keys
- Entity-Relationship Diagrams (ERDs), Schemas, and Data Dictionaries (DDs)
- Sidebar: Firefox history is a relational database
- Normalization (for ease of maintenance and performance)
- Example: Given a DFD, now create an ERD
- HoE: Design a relational database (ERD) (Creately, Gliffy, MySQL Workbench)
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- Gliffy
- ERD Tutorial
- Databases and SQL
- ERD Training
- DB Designer
- Sqlite Installation
- Firefox History Sqlite DB
- SANS Google Chrome Forensics
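To make the normalization bullet above concrete, here is a minimal sketch using SQLite from Python (the table and column names are invented for illustration). The flat design repeats researcher details on every sample row; the normalized design stores them once and links them with a key, and a JOIN reassembles the original view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Unnormalized: researcher details repeated on every sample row.
cur.execute("""CREATE TABLE samples_flat (
    sample_id INTEGER, result REAL,
    researcher_name TEXT, researcher_email TEXT)""")

# Normalized: researcher facts stored once, referenced by key.
cur.execute("""CREATE TABLE researchers (
    researcher_id INTEGER PRIMARY KEY,
    name TEXT, email TEXT)""")
cur.execute("""CREATE TABLE samples (
    sample_id INTEGER PRIMARY KEY,
    result REAL,
    researcher_id INTEGER REFERENCES researchers(researcher_id))""")

cur.execute("INSERT INTO researchers VALUES (1, 'Ada', 'ada@example.org')")
cur.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                [(1, 0.42, 1), (2, 0.37, 1)])

# A JOIN reassembles the flat view without storing anything twice.
rows = cur.execute("""SELECT s.sample_id, s.result, r.name
                      FROM samples s
                      JOIN researchers r USING (researcher_id)
                      ORDER BY s.sample_id""").fetchall()
print(rows)  # [(1, 0.42, 'Ada'), (2, 0.37, 'Ada')]
```

With the normalized design, correcting a researcher's email means updating one row, not hunting down every sample that mentions it.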
- Defining data types and constraints
- Creating a Schema given an ERD
- Database design tools (Examples: MySQL Workbench, pgAdmin)
- Example: Create some tables (and relations) from ERD
- HoE: Creating more tables from ERD
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- Create New Table (in MySQL Workbench)
- Installing MySQL Video
- MySQL WB Intro Video
- Install PgSQL Video
- pgAdmin Video
- SQL Integrity Constraints
- Database Schema
- Schema
- SQLite vs MySQL vs PostgreSQL: A Comparison Of Relational Database Management Systems
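As a sketch of the data types and constraints topic above (again using SQLite from Python, with invented table and column names), declared constraints let the database itself reject invalid rows rather than relying on every data-entry script to check:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per-connection

con.execute("""CREATE TABLE subjects (
    subject_id INTEGER PRIMARY KEY,
    enrolled   TEXT NOT NULL)""")
con.execute("""CREATE TABLE measurements (
    meas_id    INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES subjects(subject_id),
    weight_kg  REAL CHECK (weight_kg > 0))""")

con.execute("INSERT INTO subjects VALUES (1, '2014-04-01')")
con.execute("INSERT INTO measurements VALUES (1, 1, 72.5)")  # accepted

# A negative weight violates the CHECK constraint ...
try:
    con.execute("INSERT INTO measurements VALUES (2, 1, -5)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# ... and an unknown subject violates the FOREIGN KEY constraint.
try:
    con.execute("INSERT INTO measurements VALUES (3, 99, 70)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the valid row survives; the two bad inserts raise errors instead of silently corrupting the table.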
- Data entry forms, "business logic", and reports
- Connecting to databases with ODBC
- Database queries with Structured Query Language (SQL)
- Example: Entering/importing data and running queries
- HoE: Connect to a database, enter data, and run queries
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- CRUD: Create, Read, Update and Delete
- Creating tables with SQL (CREATE TABLE)
- SELECT, WHERE, and GROUP BY
- Various kinds of JOINs
- HoE: Try some SQL on your tables
- Discussion
- Action Items (readings, videos and tasks)
- See also:
Here is one way to query multiple tables using the WHERE clause of a SELECT statement:
SELECT SUBSTR(moz_places.url,0,50) AS `URL`,
datetime(moz_historyvisits.visit_date/1000000,"unixepoch") AS `TimeStamp`
FROM moz_places, moz_historyvisits
WHERE URL LIKE "%youtube.com%"
AND moz_places.id = moz_historyvisits.place_id
ORDER BY TimeStamp DESC
LIMIT 3;
This will find the date and time of the most recent visits to youtube.com.
The same result can be obtained using the INNER JOIN syntax:
SELECT SUBSTR(moz_places.url,0,50) AS `URL`,
datetime(moz_historyvisits.visit_date/1000000,"unixepoch") AS `TimeStamp`
FROM moz_places INNER JOIN moz_historyvisits
ON moz_places.id = moz_historyvisits.place_id
WHERE URL LIKE "%youtube.com%"
ORDER BY TimeStamp DESC
LIMIT 3;
In both cases, you will see output similar to this in sqlite3 (with `.header on`, `.mode column`, and `.width 50`):
URL TimeStamp
-------------------------------------------------- -------------------
https://www.youtube.com/watch?v=z2kbsG8zsLM 2014-03-23 17:02:38
https://www.youtube.com/watch?v=zoXLU86ohmw 2014-03-23 17:02:33
https://www.youtube.com/watch?v=KA4rRnihLII 2014-03-23 17:02:27
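The same two-table join can also be run from Python using the standard library's sqlite3 module. Since a live places.sqlite may not be at hand, this sketch builds a tiny stand-in with just the columns the query needs (the real Firefox schema has many more), and it filters on the base column name rather than the alias for portability:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Minimal stand-ins for Firefox's moz_places and moz_historyvisits tables.
cur.execute("CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT)")
cur.execute("""CREATE TABLE moz_historyvisits (
    id INTEGER PRIMARY KEY, place_id INTEGER, visit_date INTEGER)""")

cur.execute("INSERT INTO moz_places VALUES (1, 'https://www.youtube.com/watch?v=xyz')")
cur.execute("INSERT INTO moz_places VALUES (2, 'http://example.org/')")
# visit_date is stored in microseconds since the Unix epoch.
cur.execute("INSERT INTO moz_historyvisits VALUES (1, 1, 1395594158000000)")
cur.execute("INSERT INTO moz_historyvisits VALUES (2, 2, 1395594100000000)")

# The same INNER JOIN as above.
rows = cur.execute("""
    SELECT SUBSTR(moz_places.url, 0, 50) AS URL,
           datetime(moz_historyvisits.visit_date/1000000, 'unixepoch') AS TimeStamp
    FROM moz_places INNER JOIN moz_historyvisits
      ON moz_places.id = moz_historyvisits.place_id
    WHERE moz_places.url LIKE '%youtube.com%'
    ORDER BY TimeStamp DESC
    LIMIT 3""").fetchall()
for url, ts in rows:
    print(url, ts)  # https://www.youtube.com/watch?v=xyz 2014-03-23 17:02:38
```

The only change from the sqlite3 shell version is the surrounding Python plumbing; the SQL itself carries over nearly verbatim.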
See also:
- Embedding SQL in other environments (Stata, R, etc.)
- Using a SQL query to populate a data structure
- Examples with RStudio and Stata
- HoE: Try embedded SQL with R, Stata, SAS, SPSS, Python, etc.
- Discussion
- Action Items (readings, videos and tasks)
- See also:
#### Example 1: Running a SQL query of Firefox history from R
> install.packages("RSQLite")
> library(RSQLite)
> hist <- dbConnect(SQLite(), 'places.sqlite')
> dbListTables(hist)
[1] "moz_anno_attributes" "moz_annos" "moz_bookmarks"
[4] "moz_bookmarks_roots" "moz_favicons" "moz_historyvisits"
[7] "moz_hosts" "moz_inputhistory" "moz_items_annos"
[10] "moz_keywords" "moz_places" "sqlite_sequence"
[13] "sqlite_stat1"
> dbListFields(hist,'moz_places')
[1] "id" "url" "title" "rev_host"
[5] "visit_count" "hidden" "typed" "favicon_id"
[9] "frecency" "last_visit_date" "guid"
> dbGetQuery(hist,'SELECT substr(url,0,26) as link,frecency from moz_places where link like "http%" order by frecency desc limit 3')
link frecency
1 https://www.google.com/ 2100
2 http://www.washington.edu/ 2000
3 http://www.slashdot.org/ 150
See also:
We will be running this SQL command:
SELECT row_names AS State, Murder, Assault,
ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio
FROM arrests
ORDER BY MurderAssaultRatio DESC
LIMIT 10
Here is how we will create the database and run the query in R.
> str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> library("RSQLite")
Loading required package: DBI
> drv <- dbDriver("SQLite")
> sqlfile <- tempfile(tmpdir="~", fileext=".sqlite")
> sqlfile
[1] "~/file1ea87682fba.sqlite"
> con <- dbConnect(drv, dbname = sqlfile)
> data(USArrests)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> system("file ~/file1ea87682fba.sqlite")
/home/brianhigh/file1ea87682fba.sqlite: SQLite 3.x database
> system("sqlite3 ~/file1ea87682fba.sqlite .dump | grep -A6 '^CREATE TABLE arrests'")
CREATE TABLE arrests
( row_names TEXT,
Murder REAL,
Assault INTEGER,
UrbanPop INTEGER,
Rape REAL
);
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder" "Assault" "UrbanPop" "Rape"
> dbGetQuery(con, "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10")
State Murder Assault MurderAssaultRatio
1 Hawaii 5.3 46 11.5
2 Kentucky 9.7 109 8.9
3 Georgia 17.4 211 8.2
4 Tennessee 13.2 188 7.0
5 West Virginia 5.7 81 7.0
6 Indiana 7.2 113 6.4
7 Texas 12.7 201 6.3
8 Louisiana 15.4 249 6.2
9 Mississippi 16.1 259 6.2
10 Ohio 7.3 120 6.1
And we can see that this same query result can be produced from the Bash shell using the `sqlite3` command with the same SQL SELECT statement. First we will set `.header on`, `.mode column`, and `.width 15` so that the formatting of the output will be similar. We do this with an "init file", which we will create with `echo`, a shell command that prints a text string (here redirected into a file). The commands look like this:
brianhigh@twisty:~$ echo -e ".header on\n.mode column\n.width 15" > sqlite.init
brianhigh@twisty:~$ sqlite3 -init sqlite.init ~/file1ea87682fba.sqlite "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10"
-- Loading resources from sqlite.init
State Murder Assault MurderAssaultRatio
--------------- ---------- ---------- ------------------
Hawaii 5.3 46 11.5
Kentucky 9.7 109 8.9
Georgia 17.4 211 8.2
Tennessee 13.2 188 7.0
West Virginia 5.7 81 7.0
Indiana 7.2 113 6.4
Texas 12.7 201 6.3
Louisiana 15.4 249 6.2
Mississippi 16.1 259 6.2
Ohio 7.3 120 6.1
See also:
- ODK Collect (Android App) and Aggregate (Server)
- How to connect to ODK and create forms
- Examples using DEOHS ODK Server
- HoE: Create some ODK forms
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- Different ways to run ODK Aggregate
- Example: Creating a ODK virtual machine (VM)
- Security: accounts, passwords, services, firewalls, updates
- HoE: Create an ODK (on Linux) VM with a VirtualBox "appliance"
- Discussion
- Action Items (readings, videos and tasks)
- See also:
- Project Management and Collaboration Tools
- Version Control Systems (Git, Subversion, Mercurial)
- Cloud-based PM and VCS: Redmine, GitHub, GoogleCode
- Example: Exploring Redmine and Git
- HoE: Using Redmine and Git to manage your code, docs, and project
- See also:
- From CGI to Frameworks: A brief history of web apps
- Web languages and their associated frameworks (RoR, Cake, Django)
- Content Management Systems (CMS): Drupal, Joomla, WordPress
- Example: R and Shiny web app, "showcase", demo
- HoE: Build a simple Shiny web site
- Discussion
- Action Items (readings, videos and tasks)
- See also:
Some of the eBooks we will be using (a page here, a section there) are:
- Statistics In A Nutshell, 2nd Edition
- System Analysis and Design, Fifth Edition
- Practical Data Analysis
- MySQL Workbench Data Modeling and Development
- PostgreSQL Up And Running
- Sams Teach Yourself SQL In 10 Minutes, Fourth Edition
- Downloads for SAMS SQL book
- Big Data Analytics With R and Hadoop
- Python For Data Analysis
- Data Manipulation with R
- Learning R
- Bad Data Handbook
- SQL For Dummies, 8th Edition
- Mastering Redmine
- Version Control With Git, 2nd Edition
- Git - Version Control for Everyone
- LabVIEW Graphical Programming Cookbook
- MATLAB for Neuroscientists, 2nd Edition
If you prefer paper books, purchase any of the above, or consider:
... both by Jan L. Harrington, who really does "clearly explain" things. The used prices for these are very affordable - $8 to $12 each.
1.) Primarily a Windows user
2.) I have some experience with STATA, very little with R (not super
comfortable w/o being able to search google and have my past do files)
3.) I have my data in STATA and excel
4.) I would like to learn more about what I am doing while managing data
(not just how to do it)
5.) I have to do 5 more sampling "runs" by June 22nd, 2014, and I have to
defend my proposal by December 12th,
2014. Otherwise I do not have any long term goals.
1. Primarily a Mac user. However, I use Windows on a daily basis and for
all programming tasks (although curious about AppleScript and Linux).
2. Most experience with SAS and Stata programming. Very little R
experience. I also use ArcGIS, but only via menuing/GUI. I'd really like to
learn Python for scripting in ArcGIS (and everywhere else, it seems).
3. My data are in .xls, .xlsx, .csv, .mdb, .dta, and .sas7bdat. Both
character and numeric data. These data originated from previous projects
that I was not involved with. Content and structure varies. I currently am
in the process of creating data dictionaries in an effort to keep myself
organized.
4. While I am in survival mode for some tasks (and thus need to know just
enough to finish them on time), I prefer to have a more complete
understanding because that is (usually) more edifying. I suppose this is on
a case-by-case basis for me.
5. For one project, I already have data described in #3 that I'd like to
have at least partially analyzed by the end of spring quarter. For another
project, I will be collecting interview data in May and June. I would like
to have those data analyzed by August 30 if at all possible.