Data Management

For Scientific Research

[Image: stanza body copy] Photo: © Stanza. Used with permission.

Course Introduction

Welcome to a course in data management for scientific research projects.

Course Structure

  • Casual "guided" study-group approach
  • Presentations, demos, hands-on exercises, discussions and "homework"
  • Materials: A textbook, eBooks, websites, and online videos

[Image: Practical Computing for Biologists] PracticalComputing.org

Why Take This Course?

  • Researchers work with increasing amounts of data.
  • Many students do not have training in data management.
  • Science degree programs generally do not address this gap.
  • It is difficult for "non-majors" to get into IT courses.
  • This leaves students and research teams struggling to cope.
  • This, in turn, places a heavy burden on IT support.
  • Our data management course provides the needed skills to address these issues.
  • Exciting new discoveries await those who can effectively sift through mounds of data!

Participant Introductions

Please introduce yourself and share your:

  • Degree program and emphasis
  • Research area (general topic)
  • Current research project (specific topic)
  • The types of data or data systems you use in this project
  • What you hope to get out of this course

[Image: be friendly] From: http://iconsforlife.com/post/28922487778

Session 1: Data System Essentials

[Image: NASA data analysis] Photo: NASA

How will you manage your data?

You need a data system.

There are many choices.

To pick the best one, you need to state your requirements.

Today's Learning Objectives

In this session, you will ...

  • Become familiar with common types of data systems
  • Learn to differentiate between flat files and relational databases
  • Learn to differentiate between spreadsheets and databases
  • Learn how to model system functions and interactions
  • Learn how to create system diagrams
  • Learn how to state system requirements

Ultimately, this knowledge will help you select or design the best data system for your needs.

Types of Data Systems

Unlinked: Flat Files

  • MS Office Documents
  • Plain Text Files (CSV, TXT)
  • Instrument Output
  • Statistics Program Output

Linked: Relational Databases
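
To make the flat file versus relational database distinction a little more concrete, here is a minimal Python sketch. It assumes two small, made-up tables (subjects and survey results); the table names, column names, and values are invented for illustration, and only the standard-library `csv`, `io`, and `sqlite3` modules are used. With flat files, your script has to match the records itself. With a relational database, the link is declared once and the database does the matching.

```python
import csv
import io
import sqlite3

# Two hypothetical flat files, held in memory for the demo.
SUBJECTS_CSV = "subject_id,name\n1,Ana\n2,Ben\n"
RESULTS_CSV = "subject_id,score\n1,7.5\n2,6.0\n1,8.0\n"


def read_rows(text):
    """Parse CSV text into a list of dicts, as a script would for a flat file."""
    return list(csv.DictReader(io.StringIO(text)))


subjects = read_rows(SUBJECTS_CSV)
results = read_rows(RESULTS_CSV)

# Flat files: nothing ties the two files together, so the script
# must do the matching on subject_id itself.
names = {row["subject_id"]: row["name"] for row in subjects}
flat_joined = [(names[r["subject_id"]], float(r["score"])) for r in results]

# Relational database: the link is declared once as a foreign key,
# and the database does the matching when asked with a JOIN.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE subject (subject_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE result ("
    " subject_id INTEGER REFERENCES subject(subject_id),"
    " score REAL)"
)
conn.executemany(
    "INSERT INTO subject VALUES (?, ?)",
    [(int(s["subject_id"]), s["name"]) for s in subjects],
)
conn.executemany(
    "INSERT INTO result VALUES (?, ?)",
    [(int(r["subject_id"]), float(r["score"])) for r in results],
)

db_joined = conn.execute(
    "SELECT name, score FROM subject JOIN result USING (subject_id)"
    " ORDER BY result.rowid"
).fetchall()

print(flat_joined)
print(db_joined)
```

Both approaches give the same pairs here. The difference shows up as the data grows: the database keeps enforcing the link, while the flat files rely on every script doing the matching correctly.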

Spreadsheets and Databases

An excellent short video presentation explaining the differences between databases and spreadsheets can be found on YouTube:

Watching this video is a "homework" assignment.

So for now, we will just summarize the differences.


[Image: spreadsheet and database] Source: WHO and Mozilla/dietrich

Spreadsheets

  • Convenient
  • Interactive
  • Visual
  • Flexible
  • Portable

[Image: spreadsheet] Source: WHO

Databases

  • Manageable
  • Structured
  • Standardized
  • Scalable
  • Accessible

[Image: relational database] Graphic: Mozilla/dietrich
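
As a rough illustration of "Structured" and "Standardized", the sketch below shows a database column that carries its own rules. A spreadsheet cell will, by default, accept whatever is typed into it. The `sample` table, its columns, and the allowed values are hypothetical, chosen only for the demo, which uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample ("
    " sample_id INTEGER PRIMARY KEY,"
    " site TEXT NOT NULL CHECK (site IN ('north', 'south')),"
    " ph REAL CHECK (ph BETWEEN 0 AND 14))"
)


def try_insert(site, ph):
    """Return True if the row is accepted, False if a rule rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sample (site, ph) VALUES (?, ?)", (site, ph)
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("north", 6.8)  # valid row
bad_site = try_insert("nrth", 6.8)   # typo in a controlled vocabulary
bad_ph = try_insert("south", 70)     # value outside the allowed pH range
row_count = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]

print(accepted, bad_site, bad_ph, row_count)
```

Only the valid row is stored. The two bad rows are rejected at entry time rather than discovered later during analysis.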

Designing a Data System

To design a data system, we need to identify requirements and map out interactions and components. In this course you will learn how to create:

  • Use Case Diagrams
  • Data Flow Diagrams
  • Entity Relationship Diagrams

So let's get started!


[Image: data modeling] Graphic: EPISTLE and its successors / Matthew West, Julian Fowler, Razorbliss / Wikimedia

Get the Picture: Use Case Diagrams

Let's visualize a model of a "system" ...

Use Case Diagrams focus on the "what" and not the "how".

They model what people want to do with a system. A use case describes a "goal", expressed as an "action". People and other external entities are modeled as "actors" that "interact" with the system.


[Image: Use Case Diagram] Graphic: Kishorekumar 62 (redrawn by Marcel Douwe Dekker) / Wikimedia

Example System Interactions

Imagine a system called "research project." Some interactions that might appear in a model of this system are:

  1. Researcher proposes experimental design.
  2. Principal investigator approves experimental design.
  3. Researcher creates survey.
  4. Subject takes survey.
  5. Subject provides survey results.
  6. Researcher analyses results.
  7. Researcher produces manuscript.
  8. Principal investigator reviews manuscript.
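
Before drawing anything, it can help to write the interactions down as data. The sketch below records each of the eight interactions above as an (actor, goal) pair and groups the goals by actor, which is roughly the information a use case diagram displays. The names `INTERACTIONS` and `goals_by_actor` are illustrative only, not part of any tool.

```python
from collections import defaultdict

# Each interaction from the list, in "verb noun" form.
INTERACTIONS = [
    ("Researcher", "propose experimental design"),
    ("Principal investigator", "approve experimental design"),
    ("Researcher", "create survey"),
    ("Subject", "take survey"),
    ("Subject", "provide survey results"),
    ("Researcher", "analyse results"),
    ("Researcher", "produce manuscript"),
    ("Principal investigator", "review manuscript"),
]


def goals_by_actor(interactions):
    """Group goals under the actor who pursues them, keeping list order."""
    grouped = defaultdict(list)
    for actor, goal in interactions:
        grouped[actor].append(goal)
    return dict(grouped)


for actor, goals in goals_by_actor(INTERACTIONS).items():
    print(f"{actor}: {', '.join(goals)}")
```

Each key becomes a stick figure and each goal becomes an ellipse when you draw the diagram.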

Let's visualize these interactions in a use case diagram...

Research Project Use Case Diagram

Example Use Case Diagram

If we were only modeling the data system, we would probably remove the goals (and some actors) that are "out of scope" with respect to the data system...

Survey Data System Use Case Diagram

Here, the scope only encompasses the goals of conducting the survey and returning results.

Example Use Case Diagram


  1. Researcher uploads survey.
  2. Subject takes survey.
  3. Subject uploads results.
  4. Researcher downloads results.

This is the basic operation of the Open Data Kit system, which we will learn about next week.

But what good are they, really?

Modeling diagrams help you:

  • Clarify your own understanding
  • Explore possibilities
  • Communicate with others
  • Prepare for more detailed design steps

As a researcher, you can use these to clarify your project scope and requirements. They will help you present your project needs to others, such as your collaborators and support staff.

Use case diagrams identify what a system must do and how people will interact with it. A complete use case model includes use case diagrams and textual descriptions of each use case.

Hands-on Group Exercise

[Image: group] Photo: SarahStierch / Wikimedia

Create a Use Case Diagram

As a group, list the goals (actions, use cases) for your research data system in "verb noun" form. Then figure out who (actors) will interact to perform those actions.

Draw a simple use case diagram with stick figures (actors) and ellipses (goals, use cases). Use pen and paper or software.

All of the ellipses should be enclosed in a "system boundary" box (if the software supports that), with the stick figures outside of the box.

Lines (interactions) should connect the actors to their goals. Label the lines with what the actor does (action) to achieve the goal.
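
If you would rather generate the diagram from text, the sketch below is one possible approach. It writes a description in Graphviz DOT format; Graphviz is not one of the tools named in this course, so treat it as an assumption and use whatever software you like. The function `to_dot` and the sample data (taken from the survey data system example) are illustrative only.

```python
def to_dot(system_name, interactions):
    """Build Graphviz DOT text from (actor, action, goal) triples.

    Names are inserted as-is, so they must not contain double quotes.
    """
    actors = sorted({actor for actor, _, _ in interactions})
    goals = sorted({goal for _, _, goal in interactions})
    lines = ["graph use_cases {", "  rankdir=LR;"]
    # Actors sit outside the system boundary. DOT has no stick-figure
    # shape, so plain text labels stand in for them.
    for actor in actors:
        lines.append(f'  "{actor}" [shape=plaintext];')
    # A subgraph whose name starts with "cluster" is drawn as a box,
    # which serves as the system boundary.
    lines.append("  subgraph cluster_system {")
    lines.append(f'    label="{system_name}";')
    for goal in goals:
        lines.append(f'    "{goal}" [shape=ellipse];')
    lines.append("  }")
    # One labeled line per interaction, connecting actor to goal.
    for actor, action, goal in interactions:
        lines.append(f'  "{actor}" -- "{goal}" [label="{action}"];')
    lines.append("}")
    return "\n".join(lines)


SURVEY_SYSTEM = [
    ("Researcher", "uploads", "Upload survey"),
    ("Subject", "takes", "Take survey"),
    ("Subject", "uploads", "Upload results"),
    ("Researcher", "downloads", "Download results"),
]

dot_text = to_dot("Survey Data System", SURVEY_SYSTEM)
print(dot_text)
```

Assuming Graphviz is installed, saving the output as `survey.dot` and running `dot -Tpng survey.dot -o survey.png` should render the picture.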

Discussion

We will display your diagrams on the screen and discuss them.

[Image: discussion] Graphic: Jagbirlehl / Wikimedia

In the Coming Sessions...

We will continue the Needs Analysis of your data system with:

Which will present us with some options, typically:

  • Do nothing (business as usual)
  • Get something "off the shelf" (free or commercial)
  • Build something ourselves

... or some combination of the above.

Action Items

  • Videos
  • Readings
  • Tasks

Watch Videos

Watch these videos in the order listed.


Readings


Tasks

We have several tasks to perform as "homework" before our next session. They should be fairly quick to complete. You might do one task per day, spending maybe 15-30 minutes on each task.


Task 1: Your favorite website's database

Find out through Internet research what database system (product name, database type, etc.) underlies your favorite or most-visited website. Examples might be a webmail, search, social, video/movie/music/store, blog, forum, or news website. (Since there are links to information about this on Facebook below, pick another site if that was your favorite.)

If the site is popular, you will likely find a blog, news article or conference presentation mentioning the technology that the site uses, including its back-end database system. Look up the database system product name in Wikipedia. Try to determine why that product was chosen over the alternatives. Be ready to share this information in the next class session in a one-minute verbal presentation.

Task 2: Limits to Excel as a "database"

Find out the actual limits on MS Excel (max. file size, number of rows, etc.) that would make it unusable as a database if those limits were exceeded. (These may vary depending on the software version.)

How about for OpenOffice (LibreOffice) "Calc"? (bonus points)
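
Once you have found the limits, a small script can tell you whether one of your own data files would fit. The sketch below is a minimal example using Python's standard `csv` module. The limits are deliberately left as parameters (`max_rows`, `max_cols`) for you to fill in from your own research; the tiny values in the demo are made up.

```python
import csv
import io


def table_size(csv_file):
    """Count rows (header included) and the widest row in an open CSV file."""
    n_rows, n_cols = 0, 0
    for row in csv.reader(csv_file):
        n_rows += 1
        n_cols = max(n_cols, len(row))
    return n_rows, n_cols


def exceeds_limits(csv_file, max_rows, max_cols):
    """Return True if the file has more rows or columns than allowed."""
    n_rows, n_cols = table_size(csv_file)
    return n_rows > max_rows or n_cols > max_cols


# Demo on an in-memory "file" with made-up data and made-up limits.
demo = io.StringIO("id,site,ph\n1,north,6.8\n2,south,7.1\n")
print(table_size(demo))

demo.seek(0)  # rewind before reading the same file again
print(exceeds_limits(demo, max_rows=2, max_cols=10))
```

For a real file, open it with `open(path, newline="")` and pass the file object in place of `demo`.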

For the Excel experts (bonus points): How do you link spreadsheets by matching column headings, control the allowed values which can be entered in a column, protect cells (say, those containing formulas or constants) from being changed, restrict who can modify or view certain spreadsheets, and access the linked spreadsheets from other applications (like websites or statistics programs) over a network? If you know how to do these things, please demonstrate in class for us.

Task 3: Data Sources and Needs Analysis

Use your wiki in Redmine (or GitHub, etc.) to document the list of the data sources you will be working with in your project. Note the file types/applications, organizations/persons/processes they came from, and what you will do to/with them.

The wiki language supports tables, which might be a good way to format the information in the wiki. Later you will use this wiki to further elucidate your "data dictionary".
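
If typing table markup by hand gets tedious, a few lines of Python can produce it for you. The sketch below renders a list of data sources as a GitHub-flavored Markdown table. The column names and the two example sources are hypothetical, and a Redmine wiki that uses Textile would need different table syntax.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a GitHub-flavored Markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider] + body)


COLUMNS = ["Source", "File type", "Origin", "Planned use"]

# Hypothetical entries; replace them with your own data sources.
DATA_SOURCES = [
    {
        "Source": "Field survey",
        "File type": "CSV",
        "Origin": "Survey app export",
        "Planned use": "Clean, then analyze",
    },
    {
        "Source": "Lab instrument",
        "File type": "TXT",
        "Origin": "Instrument output",
        "Planned use": "Parse and load into database",
    },
]

table = to_markdown_table(DATA_SOURCES, COLUMNS)
print(table)
```

Paste the printed text into a wiki page to get a formatted table.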

Perform a needs analysis. For example, how will you access your data (from campus, remotely, from a mobile device, with what software) and what sorts of security protections will you need (encryption, access controls)? What other goals and requirements do you have? Store the detailed list in your wiki.

Task 4: Use Case Diagram for your project

Based on your needs analysis, produce a Use Case Diagram for your research study data system. Make it more detailed than the one we made in class today. Break out complicated actions into separate, more detailed, diagrams if you need to.

Include all of the data-related goals and tasks associated with your research project from beginning to end. Go into a level of detail which would communicate your data system needs clearly to an IT professional (analyst, designer, developer, or administrator).

You can use pen and paper to make the diagram or you can use software tools such as Creately, Gliffy, Dia, or Violet. You will present this diagram (for two minutes) in the next class session.

Task 5: Get files for textbook exercises

Download Examples from the textbook and extract the example files from the "pcfb_examples.zip" file to the folder "pcfb". Put that folder in whichever environment you will be working in. For now, this will probably be your "Documents" folder on your own computer or your "home directory" on a Unix or Linux server.
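
Your file manager or an unzip tool will do the extraction, but if you want to script it, here is a minimal sketch using Python's standard `zipfile` module. The download location (`~/Downloads`) and the target (`~/Documents/pcfb`) are assumptions, so adjust both to match your setup. If the archive already contains its own top-level folder, you may end up with one folder nested inside another.

```python
import zipfile
from pathlib import Path


def extract_examples(zip_path, dest_dir):
    """Extract every entry in zip_path into dest_dir and return the entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Assumed locations; change these to wherever your files really are.
    zip_path = Path.home() / "Downloads" / "pcfb_examples.zip"
    target = Path.home() / "Documents" / "pcfb"
    if zip_path.exists():
        names = extract_examples(zip_path, target)
        print(f"Extracted {len(names)} entries to {target}")
    else:
        print(f"Not found: {zip_path}")
```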

See also

Questions and Comments

[Image: question] Image: © Nevit Dilmen / Wikimedia

Some Parting Words

It is estimated that 40% of the defects that make it into the testing phase of enterprise software have their root cause in errors in the original requirements documents.

From: Obamacare's Website Is Crashing Because Backend Was Doomed In The Requirements Stage (Forbes)


[Image: bug comic] Image: Andy Glover. Used with permission.