For Scientific Research
Photo: © Stanza. Used with permission.
Welcome to a course in data management for scientific research projects.
- Casual "guided" study-group approach
- Presentations, demos, hands-on exercises, discussions and "homework"
- Materials: A textbook, eBooks, websites, and online videos
- Researchers work with increasing amounts of data.
- Many students do not have training in data management.
- Science degree programs generally do not address this gap.
- It is difficult for "non-majors" to get into IT courses.
- This leaves students and research teams struggling to cope.
- This, in turn, places a heavy burden on IT support.
- Our data management course provides the needed skills to address these issues.
- Exciting new discoveries await those who can effectively sift through mounds of data!
- Degree program and emphasis
- Research area (general topic)
- Your current research project (specific topic)
- The types of data or data systems you use in this project
- What you hope to get out of this course
[//]: # (From: http://iconsforlife.com/post/28922487778)
Photo: NASA
How will you manage your data?
You need a data system.
There are many choices.
To pick the best one, you need to state your requirements.
In this session, you will ...
- Become familiar with common types of data systems
- Learn to differentiate between flat files and relational databases
- Learn to differentiate between spreadsheets and databases
- Learn how to model system functions and interactions
- Learn how to create system diagrams
- Learn how to state system requirements
Ultimately, this knowledge will help you select or design the best data system for your needs.
- MS Office Documents
- Plain Text Files (CSV, TXT)
- Instrument Output
- Statistics Program Output
An excellent short video presentation explaining the differences between databases and spreadsheets can be found on YouTube:
- Video: What are databases? - lynda.com
Watching this video is a "homework" assignment.
So for now, we will just summarize the differences.
Source: WHO and Mozilla/dietrich
- Convenient
- Interactive
- Visual
- Flexible
- Portable
Source: WHO
- Manageable
- Structured
- Standardized
- Scalable
- Accessible
Graphic: Mozilla/dietrich
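To make the contrast a little more concrete, here is a minimal sketch in Python using only the standard library (`csv` and `sqlite3`). The survey data, the table name `responses`, and the column names are hypothetical and chosen only for illustration. It loads a small flat file into a database table, lets the database enforce a rule about allowed values, and answers a question with a query.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file data (CSV text); in practice this would be a file on disk.
CSV_TEXT = """subject_id,site,score
1,north,4
2,south,5
3,north,3
"""


def load_rows(csv_text):
    """Parse CSV text into a list of (subject_id, site, score) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(r["subject_id"]), r["site"], int(r["score"])) for r in reader]


def build_database(rows):
    """Create an in-memory SQLite database with a typed, constrained table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE responses (
               subject_id INTEGER PRIMARY KEY,
               site       TEXT    NOT NULL,
               score      INTEGER NOT NULL CHECK (score BETWEEN 1 AND 5)
           )"""
    )
    conn.executemany("INSERT INTO responses VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_score_by_site(conn):
    """Answer a question with one SQL query instead of a hand-written loop."""
    cur = conn.execute(
        "SELECT site, AVG(score) FROM responses GROUP BY site ORDER BY site"
    )
    return cur.fetchall()


def rejects_invalid_score(conn):
    """Return True if the database refuses a score outside the allowed range."""
    try:
        conn.execute("INSERT INTO responses VALUES (?, ?, ?)", (99, "north", 11))
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    conn = build_database(load_rows(CSV_TEXT))
    print(mean_score_by_site(conn))
    print(rejects_invalid_score(conn))
```

The flat file accepts whatever is typed into it. The table definition carries the structure and the rules, which is one sense in which a database is "structured" and "standardized".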
To design a data system, we need to identify requirements and map out interactions and components. In this course you will learn how to create:
- Use Case Diagrams
- Data Flow Diagrams
- Entity Relationship Diagrams
So let's get started!
Graphic: EPISTLE and its successors / Matthew West, Julian Fowler, Razorbliss / Wikimedia
Let's visualize a model of a "system" ...
Use Case Diagrams focus on the "what" and not the "how".
They model what people want to do with a system. A use case describes a "goal", expressed as an "action". People and other external entities are modeled as "actors" that "interact" with the system.
Graphic: Kishorekumar 62 (redrawn by Marcel Douwe Dekker) / Wikimedia
Imagine a system called "research project." Some interactions that might appear in a model of this system are:
- Researcher proposes experimental design.
- Principal investigator approves experimental design.
- Researcher creates survey.
- Subject takes survey.
- Subject provides survey results.
- Researcher analyzes results.
- Researcher produces manuscript.
- Principal investigator reviews manuscript.
Let's visualize these interactions in a use case diagram...
If we were modeling only the data system, we would probably remove the goals (and some actors) that are "out of scope" with respect to the data system...
Here, the scope only encompasses the goals of conducting the survey and returning results.
- Researcher uploads survey.
- Subject takes survey.
- Subject uploads results.
- Researcher downloads results.
This is the basic operation of the Open Data Kit system, which we will learn about next week.
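The scoped interactions above can also be written down as plain data, which is sometimes a handy step before drawing. The following is a minimal Python sketch, not part of any modeling tool: the triple layout (actor, verb, noun) and the function names are assumptions made for illustration. It groups each goal under the actor who performs it.

```python
from collections import defaultdict

# (actor, verb, noun) triples copied from the scoped list of interactions.
INTERACTIONS = [
    ("Researcher", "uploads", "survey"),
    ("Subject", "takes", "survey"),
    ("Subject", "uploads", "results"),
    ("Researcher", "downloads", "results"),
]


def goals_by_actor(interactions):
    """Group each goal (verb plus noun) under the actor who performs it."""
    grouped = defaultdict(list)
    for actor, verb, noun in interactions:
        grouped[actor].append(f"{verb} {noun}")
    return dict(grouped)


def describe(interactions):
    """Return one line per actor, listing that actor's goals in order."""
    lines = []
    for actor, goals in goals_by_actor(interactions).items():
        lines.append(f"{actor}: " + ", ".join(goals))
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe(INTERACTIONS))
```

Each key in the result corresponds to a stick figure in the diagram, and each goal in its list corresponds to an ellipse connected to that figure.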
Modeling diagrams help you:
- Clarify your own understanding
- Explore possibilities
- Communicate with others
- Prepare for more detailed design steps
As a researcher, you can use these to clarify your project scope and requirements. They will help you present your project needs to others, such as your collaborators and support staff.
Use case diagrams identify what a system must do and how people will interact with it. A complete use case model includes use case diagrams and textual descriptions of each use case.
Photo: SarahStierch / Wikimedia
As a group, list the goals (actions, use cases) for your research data system in "verb noun" form. Then figure out who (actors) will interact to perform those actions.
Draw a simple use case diagram with stick figures (actors) and ellipses (goals, use cases). Use pen and paper or software.
All of the ellipses should be enclosed in a "system boundary" box (if the software supports that), with the stick figures outside of the box.
Lines (interactions) should connect the actors to their goals. Label the lines with what the actor does (action) to achieve the goal.
We will display your diagrams on the screen and discuss them.
Graphic: Jagbirlehl / Wikimedia
We will continue the Needs Analysis of your data system with:
- Detailed Use Cases - including text descriptions
- More Systems Analysis - including Data Flow Diagrams
- A Requirements Document - compiling the above
- A Feasibility Study
The feasibility study will present us with some options, typically:
- Do nothing (business as usual)
- Get something "off the shelf" (free or commercial)
- Build something ourselves
... or some combination of the above.
[//]: # (This icon is provided as CC0 1.0 Universal Public Domain Dedication.)
Watch these videos in the order listed.
- What are databases?
- Discover to Deliver
- Structured Conversation
- Use Case Diagram Tutorial (watch first two or more)
- ODK (watch one or two)
- Read: In the PCfB textbook: "Before You Begin", pp. 1-6; Chapters 1-3, pp. 9-43; and Appendix 1, pp. 451-453
- Read: Use Case Tips
- Skim Wikipedia articles: Data Management, Data System, Data Modeling, Needs Analysis, Agile Modeling, SDLC
- Skim eBook chapter: Beginning Database Design Chapter 3: Initial Requirements and Use Cases
- Explore websites: Agile Modeling, ODK
We have several tasks to perform as "homework" before our next session. They should be fairly quick to complete. You might do one task per day, spending maybe 15-30 minutes on each task.
Find out through Internet research what database system (product name, database type, etc.) underlies your favorite or most-visited website. Examples might be a webmail, search, social, video/movie/music/store, blog, forum, or news website. (Since there are links to information about this on Facebook below, pick another site if that was your favorite.)
If the site is popular, you will likely find a blog, news article, or conference presentation mentioning the technology that the site uses, including its back-end database system. Look up the database system product name in Wikipedia. Try to determine why that product was chosen over the alternatives. Be ready to share this information in the next class session in a one-minute verbal presentation.
Find out the actual limits on MS Excel (max. file size, number of rows, etc.) that would make it unusable as a database if those limits were exceeded. (These may vary depending on the software version.)
How about for OpenOffice (LibreOffice) "Calc"? (bonus points)
For the Excel experts (bonus points): How do you link spreadsheets by matching column headings, control the allowed values that can be entered in a column, protect cells (say, those containing formulas or constants) from being changed, restrict who can modify or view certain spreadsheets, and access the linked spreadsheets from other applications (like websites or statistics programs) over a network? If you know how to do these things, please demonstrate them for us in class.
Use your wiki in Redmine (or GitHub, etc.) to list the data sources you will be working with in your project. Note the file types/applications, the organizations/persons/processes they came from, and what you will do to/with them.
The wiki language supports tables, which might be a good way to format the information in the wiki. Later you will use this wiki to further elucidate your "data dictionary".
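If you prefer to generate the table rather than type it by hand, a short script can help. The sketch below assumes your wiki accepts Markdown-style pipe tables (GitHub does; Redmine's syntax depends on how it is configured, so check before pasting). The column headings follow the items listed above, and the two example rows are hypothetical placeholders, not real project data.

```python
COLUMNS = ["Data source", "File type / application", "Origin", "Planned use"]

# Hypothetical entries for illustration only; replace them with your own sources.
SOURCES = [
    ["Field survey", "CSV", "Study subjects", "Clean, then analyze"],
    ["Lab instrument", "TXT", "Instrument output", "Parse and archive"],
]


def markdown_table(columns, rows):
    """Format a header row, a separator row, and data rows as a pipe table."""

    def fmt(cells):
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(columns), fmt(["---"] * len(columns))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    print(markdown_table(COLUMNS, SOURCES))
```

Keeping the list of sources as data in a script also makes it easy to regenerate the table as the list grows.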
Perform a needs analysis. For example, how will you access your data (from campus, remotely, from a mobile device, using what software?) and what sorts of security protections will you need (encryption, access controls)? What other goals and requirements do you have? Store the detailed list in your wiki.
Based on your needs analysis, produce a Use Case Diagram for your research study data system. Make it more detailed than the one we made in class today. Break out complicated actions into separate, more detailed, diagrams if you need to.
Include all of the data-related goals and tasks associated with your research project from beginning to end. Go into a level of detail that would communicate your data system needs clearly to an IT professional (analyst, designer, developer, or administrator).
You can use pen and paper to make the diagram or you can use software tools such as Creately, Gliffy, Dia, or Violet. You will present this diagram (for two minutes) in the next class session.
Download Examples from the textbook and extract the example files from the "pcfb_examples.zip" file to the folder "pcfb". Put that folder in whichever environment you will be working in. For now, this will probably be your "Documents" folder on your own computer or your "home directory" on a Unix or Linux server.
- Google and Facebook Team Up to Modernize Old-School Databases
- WebScaleSQL: MySQL for Facebook-sized databases
- What database actually FACEBOOK uses?
- What database does Facebook use?
- NYT: Healthcare.gov Project Chaos Due Partly To Unorthodox Database Choice (Slashdot)
- Topics in Data Management
- Is "Data" Singular or Plural?
Image: © Nevit Dilmen / Wikimedia
It is estimated that 40% of the defects that make it into the testing phase of enterprise software have their root cause in errors in the original requirements documents.
From: Obamacare's Website Is Crashing Because Backend Was Doomed In The Requirements Stage (Forbes)
Image: Andy Glover. Used with permission.