-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathData_System_Essentials.Rpres
414 lines (255 loc) · 21.7 KB
/
Data_System_Essentials.Rpres
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
Data Management
========================================================
For Scientific Research
[//]: # (author: Brian High, UW DEOHS)
[//]: # (date: 2014-14-03)
[//]: # (license: CC0 1.0 Universal, linked-content/images)
[//]: # (note: License does not apply to external content such as quoted material, linked web pages, images, or videos. These are licensed separately by their authors, publishers or other copyright holders. See attribution links for details.)
[//]: # (note: Any of the trademarks, service marks, collective marks, design rights, personality rights, or similar rights that are mentioned, used, or cited in the presentations and wiki of the Data Management For Scientific Research workshop/course are the property of their respective owners.)
[//]: # (homepage: https://github.com/brianhigh/data-workshop)
<p style="width: 600px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://www.stanza.co.uk/body/stanza_+BODY-copy.jpg" alt="stanza body copy" style="padding-bottom:0.5em;" />Photo: © <a href="http://www.stanza.co.uk/body/index.html">Stanza</a>. Used with permission.</p>
Course Introduction
========================================================
Welcome to a course in data management for scientific research projects.
## Course Structure
* Casual "guided" study-group approach
* Presentations, demos, hands-on exercises, discussions and "homework"
* Materials: A textbook, eBooks, websites, and online videos
<p style="width: 171px; float: left; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://practicalcomputing.org/images/s_CoverShadow.png" alt="practical computing for biologists" style="padding-bottom:0.5em;" /><a href="http://practicalcomputing.org/">PracticalComputing.org</a></p>
Why Take This Course?
========================================================
* Researchers work with increasing amounts of data.
* Many students do not have training in data management.
* Science degree programs generally do not address this gap.
* It is difficult for "non-majors" to get into IT courses.
* This leaves students and research teams struggling to cope.
* And therefore places a heavy burden on IT support.
* Our data management course provides the needed skills to address these issues.
* Exciting new discoveries await those who can effectively sift through mounds of data!
Participant Introductions
========================================================
## Please introduce yourself and share your:
* Degree program and emphasis
* Research area (general topic)
* Your current research project (specific topic)
* The types of data or data systems you use in this project
* What you hope to get out of this course
![be friendly](images/friendly.jpg)
[//]: # (From: http://iconsforlife.com/post/28922487778)
Session 1: Data System Essentials
========================================================
<p style="width: 600px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://grin.hq.nasa.gov/IMAGES/SMALL/GPN-2000-000353.jpg" alt="nasa data analysis" style="padding-bottom:0.5em;" />Photo: <a href="http://grin.hq.nasa.gov/ABSTRACTS/GPN-2000-000353.html">NASA</a></p>
How will you *manage* your data?
You need a data *system*.
There are many *choices*.
To pick the best one, you need to state your *requirements*.
Today's Learning Objectives
========================================================
In this session, you will ...
* Become familiar with common types of data systems
* Learn to differentiate between flat files and relational databases
* Learn to differentiate between spreadsheets and databases
* Learn how to model system functions and interactions
* Learn how to create system diagrams
* Learn how to state system requirements
Ultimately, this knowledge will help you select or design the best data system for your needs.
Types of Data Systems
========================================================
### Unlinked
* [Flat file](http://en.wikipedia.org/wiki/Flat_file_database)
---
### Linked
* [Network](http://en.wikipedia.org/wiki/Network_model)
* [Distributed](http://en.wikipedia.org/wiki/Distributed_database)
* [Hierarchical](http://en.wikipedia.org/wiki/Hierarchical_database_model)
* [Relational](http://en.wikipedia.org/wiki/Relational_database)
* [Object relational](http://en.wikipedia.org/wiki/Object-relational_database)
* [NoSQL](http://en.wikipedia.org/wiki/NoSQL)
========================================================
## Flat Files
* MS Office Documents
* Plain Text Files (CSV, TXT)
* Instrument Output
* Stats. Program Output
---
## Relational Databases
* [MS Access](http://en.wikipedia.org/wiki/Microsoft_Access)
* [FileMaker Pro](http://en.wikipedia.org/wiki/FileMaker)
* [SQLite](http://en.wikipedia.org/wiki/SQLite)
* [MS SQL Server](http://en.wikipedia.org/wiki/MS_SQL_Server)
* [Oracle](http://en.wikipedia.org/wiki/Oracle_Database)
* [MySQL](http://en.wikipedia.org/wiki/MySQL)
* [MariaDB](http://en.wikipedia.org/wiki/MariaDB)
* [PostgreSQL](http://en.wikipedia.org/wiki/PostgreSQL)
Spreadsheets and Databases
========================================================
An excellent short video presentation explaining the differences between databases and spreadsheets can be found on YouTube:
* Video: [What are databases?](https://www.youtube.com/watch?v=Ls_LzOZ7x0c) - lynda.com
Watching this video is a "homework" assignment.
So for now, we will just summarize the differences.
---
<p style="width: 490px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="images/spreadsheet_vs_database.jpg" alt="spreadsheet and database" style="padding-bottom:0.5em;" />Source: <a href="http://www.who.int/healthinfo/statistics/bodgbddeathdalyestimates.xls">WHO</a> and <a href="https://developer.mozilla.org/en-US/docs/The_Places_database">Mozilla/dietrich</a></p>
Spreadsheets
========================================================
* Convenient
* Interactive
* Visual
* Flexible
* Portable
---
<p style="width: 500px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="images/spreadsheet.jpg" alt="spreadsheet" style="padding-bottom:0.5em;" />Source: <a href="http://www.who.int/healthinfo/statistics/bodgbddeathdalyestimates.xls">WHO</a></p>
Databases
========================================================
* Manageable
* Structured
* Standardized
* Scalable
* Accessible
---
<p style="width: 500px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="images/mozilla_places-erd.jpg" alt="relational database" style="padding-bottom:0.5em;" />Graphic: <a href="https://developer.mozilla.org/en-US/docs/The_Places_database">Mozilla/dietrich</a></p>
Designing a Data System
========================================================
To design a data system, we need to identify requirements and map out interactions and components. In this course you will learn how to create:
* Use Case Diagrams
* Data Flow Diagrams
* Entity Relationship Diagrams
So let's get started!
---
<p style="width: 500px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="images/data_modeling.png" alt="data modeling" style="padding-bottom:0.5em;" />Graphic: <a href="http://en.wikipedia.org/wiki/File:4-3_Data_Modelling_Today.svg">EPISTLE and its successors / Matthew West, Julian Fowler, Razorbliss / Wikimedia</a></p>
Get the Picture: Use Case Diagrams
========================================================
Let's visualize a model of a "system" ...
[Use Case Diagrams](http://en.wikipedia.org/wiki/Use_Case_Diagram) focus on the "what" and not the "how".
They model what people want to do with a system. A use case describes a "goal", expressed as an "action". People and other external entities are modeled as "actors" that "interact" with the system.
---
<p style="width: 400px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/1/1d/Use_case_restaurant_model.svg" alt="Use Case Diagram" style="padding-bottom:0.5em;" />Graphic: <a href="http://upload.wikimedia.org/wikipedia/commons/1/1d/Use_case_restaurant_model.svg">Kishorekumar 62 (redrawn by Marcel Douwe Dekker) / Wikimedia</a></p>
Example System Interactions
========================================================
Imagine a system called "research project." Some interactions that might appear in a model of this system are:
1. Researcher proposes experimental design.
2. Principal investigator approves experimental design.
3. Researcher creates survey.
4. Subject takes survey.
5. Subject provides survey results.
6. Researcher analyses results.
7. Researcher produces manuscript.
8. Principal investigator reviews manuscript.
Let's visualize these interactions in a use case diagram...
Research Project Use Case Diagram
========================================================
![Example Use Case Diagram](images/use_case_diagram.png)
If we were only modeling the data system, we would probably remove the goals (and some actors) which were "out of scope" with repect to the data system...
Survey Data System Use Case Diagram
========================================================
Here, the scope only encompasses the goals of conducting the survey and returning results.
![Example Use Case Diagram](images/research_survey_data_system.png)
---
1. Researcher uploads survey.
2. Subject takes survey.
3. Subject uploads results.
4. Researcher downloads results.
This is the basic operation of the [Open Data Kit](http://opendatakit.org/) system which we will learn about next week.
But what good are they, really?
========================================================
Modeling diagrams help you:
* Clarify your own understanding
* Explore possibilities
* Communicate with others
* Prepare for more detailed design steps
As a researcher, you can use these to clarify your project scope and requirements. They will help you present your project needs to others, such as your collaborators and support staff.
Use case diagrams identify what a system must do and how people will interact with it. A complete use case model includes use case diagrams and textual descriptions of each use case.
Hands-on Group Exercise
========================================================
<p style="width: 640px; float: left; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Group_visioning_session_-_group_one_-_Stierch.jpg/640px-Group_visioning_session_-_group_one_-_Stierch.jpg" alt="group" style="padding-bottom:0.5em;" />Photo: <a href="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Group_visioning_session_-_group_one_-_Stierch.jpg/640px-Group_visioning_session_-_group_one_-_Stierch.jpg">SarahStierch / Wikimedia</a></p>
Create a Use Case Diagram
========================================================
As a group, list the goals (actions, use cases) for your research data system in "verb noun" form. Then figure out who (actors) will interact to perform those actions.
Draw a simple use case diagram with stick figures (actors) and elipses (goals, use cases). Use pen and paper or [software](http://alexdp.free.fr/violetumleditor/page.php).
All of the elipses should be enclosed in a "system boundary" box (if the software supports that), with the stick figures outside of the box.
Lines (interactions) should connect the actors to their goals. Label the lines with what the actor does (action) to achieve the goal.
Discussion
========================================================
We will display your diagrams on the screen and discuss them.
<p style="width: 275px; float: left; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/e/eb/User_journey_discussion.png" alt="discussion" style="padding-bottom:0.5em;" />Graphic: <a href="http://upload.wikimedia.org/wikipedia/commons/e/eb/User_journey_discussion.png">Jagbirlehl / Wikimedia</a></p>
In the Coming Sessions...
========================================================
We will continue the [Needs Analysis](http://en.wikipedia.org/wiki/Needs_analysis) of your data system with:
* Detailed [Use Cases](http://en.wikipedia.org/wiki/Use_Cases) - including text descriptions
* More [Systems Analysis](http://en.wikipedia.org/wiki/Systems_analysis) - including [Data Flow Diagrams](http://en.wikipedia.org/wiki/Data_flow_diagram)
* A [Requirements Document](http://en.wikipedia.org/wiki/Product_requirements_document) - compiling the above
* A [Feasibility Study](http://en.wikipedia.org/wiki/Feasibility_study)
Which will present us with some options, typically:
* Do nothing (business as usual)
* Get something "off the shelf" (free or commercial)
* Build something ourselves
... or some combination of the above.
Action Items
========================================================
### ![videos - public domain CC0 1.0 icon](images/watching.jpg) Videos
[//]: # (From: http://www.iconsdb.com/black-icons/video-play-3-icon.html)
[//]: # (This icon is provided as CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.)
### ![readings icon - iconsforlife.com](images/reading.jpg) Readings
[//]: # (From: http://iconsforlife.com/post/27700442282)
[//]: # (A series of icons to represent things in everyday life – available for download)
### ![tasks](images/tasks.jpg) Tasks
Watch Videos
========================================================
Watch these videos in the order listed.
* [What are databases?](https://www.youtube.com/watch?v=Ls_LzOZ7x0c)
* [Discover to Deliver](https://www.youtube.com/watch?v=x9oIpZaXTDs)
* [Structured Conversation](https://www.youtube.com/watch?v=fVLNmXq0beE)
* [Use Case Diagram Tutorial](https://www.youtube.com/playlist?list=PL05DE2D2EDDEA8D68) (watch first two or more)
* [ODK](https://www.youtube.com/user/odkteam/videos?shelf_id=1&sort=dd&view=0) (watch one or two)
![watching](images/watching.jpg)
Readings
========================================================
* Read: In the PCfB textbook: "Before You Begin", pp. 1-6; Chapters 1-3, pp. 9-43; and Appendix 1, pp. 451-453
* Read: [Use Case Tips](http://www.gatherspace.com/static/use_case_example.html)
* Skim Wikipedia articles: [Data Management](http://en.wikipedia.org/wiki/Data_management), [Data System](http://en.wikipedia.org/wiki/Data_system), [Data Modeling](http://en.wikipedia.org/wiki/Data_modeling), [Needs Analysis](http://en.wikipedia.org/wiki/Needs_analysis), [Agile Modeling](http://en.wikipedia.org/wiki/Agile_Modeling), [SDLC](http://en.wikipedia.org/wiki/Systems_Development_Life_Cycle)
* Skim eBook chapter: [Beginning Database Design](http://seattle.bibliocommons.com/item/show/2912387030_beginning_database_design) Chapter 3: Initial Requirements and Use Cases
* Explore websites: [Agile Modeling](http://www.agilemodeling.com/), [ODK](http://opendatakit.org/)
![readings](images/reading.jpg)
Tasks
========================================================
We have several tasks to perform as "homework" before our next session. They should be fairly quick to complete. You might do one task per day, spending maybe 15-30 minutes on each task.
![tasks](images/tasks.jpg)
Task 1: Your favorite website's database
========================================================
Find out through Internet research what database system (product name, database type, etc.) underlies your favorite or most-visited website. Examples might be a webmail, search, social, video/movie/music/store, blog, forum, or news website. (Since there are links to information about this on Facebook below, pick another site if that was your favorite.)
If the site is popular, you will likely find a blog, news article or conference presentation mentioning the technology that the site uses, including it's back-end database system. Look up the database system product name in Wikipedia. Try to determine why that product was chosen over the other alternatives. Be ready to share this information in the next class session in a one-minute verbal presentation.
Task 2: Limits to Excel as a "database"
========================================================
Find out the actual limits on MS Excel (max. file size, number of rows, etc.) that would make it unusable as a database if those limits were exceeded. (These may vary depending on the software version.)
How about for OpenOffice (LibreOffice) "Calc"? (bonus points)
For the Excel experts (bonus points): How do you link spreadsheets by matching columns headings, control the allowed values which can be entered in a column, protect cells (say, those containing formulas or constants) from being changed, restrict who can modify or view certain spreadsheets, and access the linked spreadsheets from other applications (like websites or statistics programs) over a network? If you know how to do these things, please demonstrate in class for us.
Task 3: Data Sources and Needs Analysis
========================================================
Use your wiki in Redmine (or GitHub, etc.) to document the list of the data sources you will be working with in your project. Note the file types/applications, organizations/persons/processes they came from, and what you will do to/with them.
The wiki language supports tables, which might be a good way to format the information in the wiki. Later you will use this wiki to further elucidate your "data dictionary".
Perform a [needs analysis](http://en.wikipedia.org/wiki/Needs_analysis). For example, how will you access your data (from campus, remotely, from a mobile device, using what software?) and what sorts of security protections you will need (encryption, access controls)? What other goals and requirements do you have? Store the detailed list in your wiki.
Task 4: Use Case Diagram for your project
========================================================
Based on your needs analysis, produce a Use Case Diagram for your research study data system. Make it more detailed than the one we made in class today. Break out complicated actions into separate, more detailed, diagrams if you need to.
Include all of the data-related goals and tasks associated with your research project from beginning to end. Go into a level of detail which would communicate your data system needs clearly to a IT professional (analyst, designer, developer, or administrator).
You can use pen and paper to make the diagram or you can use software tools such as Creately, Gliffy, Dia, or [Violet](http://alexdp.free.fr/violetumleditor/page.php). You will present this diagram (for two minutes) in the next class session.
Task 5: Get files for textbook exercises
========================================================
Download [Examples from the textbook](http://practicalcomputing.org/files/pcfb_examples.zip) and extract the example files from the "pcfb_examples.zip" file to the folder "pcfb". Put that folder in whichever environment you will be working. For now, this will probably be your "Documents" folder on your own computer or in your "home directory" on a Unix or Linux server.
See also
========================================================
* [Google and Facebook Team Up to Modernize Old-School Databases](http://www.wired.com/wiredenterprise/2014/03/webscalesql/)
* [WebScaleSQL: MySQL for Facebook-sized databases](http://www.zdnet.com/webscalesql-mysql-for-facebook-sized-databases-7000027814/)
* [What database actually FACEBOOK uses?](http://www.techworm.net/2013/05/what-database-actually-facebook-uses.html)
* [What database does Facebook use?](http://www.prodromus.com/2011/01/27/what-database-does-facebook-use)
* [NYT: Healthcare.gov Project Chaos Due Partly To Unorthodox Database Choice (Slashdot)](http://developers.slashdot.org/story/13/11/24/1437203/nyt-healthcaregov-project-chaos-due-partly-to-unorthodox-database-choice)
* [Topics in Data Management](http://en.wikipedia.org/wiki/Data_management#Topics_in_Data_Management)
* [Is "Data" Singular or Plural?](http://www.quickanddirtytips.com/education/grammar/is-data-singular-or-plural?page=all)
Questions and Comments
========================================================
<p style="width: 380px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/23/Happy_Question.svg/380px-Happy_Question.svg.png" alt="question" style="padding-bottom:0.5em;" />Image: <a href=http://commons.wikimedia.org/wiki/File:Happy_Question.svg">© Nevit Dilmen</a> / Wikimedia</p>
Some Parting Words
========================================================
> It is estimated that 40% of the defects that make it into the testing phase of enterprise software have their root cause in errors in the original requirements documents.
From: [Obamacare's Website Is Crashing Because Backend Was Doomed In The Requirements Stage (Forbes)](http://www.forbes.com/sites/anthonykosner/2013/10/21/obamacares-website-is-crashing-because-backend-was-doomed-in-the-requirements-stage/)
---
<p style="width: 500px; float: right; clear: right; margin-bottom: 5px; margin-left: 10px; text-align: right; font-weight: bold; font-size: 14pt;"><img src="http://4.bp.blogspot.com/_YzKCMr-tcMM/S9bMZVLyG9I/AAAAAAAAALY/CdAJufVzYzM/s640/32largebugs.jpg" alt="bug comic" style="padding-bottom:0.5em;" />Image: <a href="http://cartoontester.blogspot.com/2010/01/big-bugs.html">Andy Glover</a>. Used with permission.</p>