The research data management lifecycle illustrates the research process as it relates to data management. At each stage, data should be organized, annotated, and stored in ways that facilitate data sharing, reuse, and/or research validation.
At the start of any research project, you should think ahead about what data you will need to use (if any) during your research processes. You may need to pay to access, compute against, or store data, so knowing about these costs upfront can inform grant budgets.
Data might be collected from vendors, databases, or other researchers. If the data you are searching for does not exist, you may need to collect it yourself from multiple sources, or you may need to create the data.
Support at Yale:
- Reusing data from data repositories - strategies for finding appropriate data repositories
- Reusing data associated with publications - strategies for finding research data for reuse
- Collecting data from EHRs (Electronic Health Records) - YNHH Epic EHR data pull requests at Yale are placed through JDAT (the Joint Data Analytics Team)
After data is collected or created, you will most likely need to process or clean the data in some way. Data processing and cleaning can involve merging multiple datasets, selecting or filtering out specific portions of a dataset, standardizing categories found within a dataset, reorganizing spreadsheets that contain data, and more. When you create data groups that result in aggregation, data processing can start to bleed into data analysis.
Support at Yale:
- Data Support @ the Medical Library - email [email protected]
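The cleaning steps described above can be sketched in a few lines of Python. This is a minimal, hypothetical example (the dataset, column names, and values are all invented for illustration) showing category standardization, filtering, and a merge on a shared key:

```python
# Hypothetical "datasets" as lists of dicts; column names are invented.
visits = [
    {"patient_id": "p1", "site": "New Haven", "status": "Complete"},
    {"patient_id": "p2", "site": "new haven ", "status": "complete"},
    {"patient_id": "p3", "site": "Bridgeport", "status": "PENDING"},
]
labs = [
    {"patient_id": "p1", "a1c": 5.6},
    {"patient_id": "p3", "a1c": 7.1},
]

# Standardize inconsistent category labels (case and stray whitespace).
for row in visits:
    row["site"] = row["site"].strip().title()
    row["status"] = row["status"].strip().lower()

# Filter: keep only completed visits.
complete = [row for row in visits if row["status"] == "complete"]

# Merge the two datasets on the shared patient_id key.
labs_by_id = {row["patient_id"]: row for row in labs}
merged = [{**row, **labs_by_id.get(row["patient_id"], {})} for row in complete]
```

For larger datasets the same operations are usually done with a library such as pandas, but the logic is the same: standardize, filter, merge.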
Data analysis = generating findings from your data.
An important part of data analysis includes data visualization (i.e., graphs).
Support at Yale:
- Statistical help - StatLab
- Bioinformatics help - Bioinformatics Hub
- High Performance Computing & Parallel Computing - YCRC (Yale Center for Research Computing)
- Research & Analytics Clinics - YCAS (Yale Center for Analytical Sciences)
- Help designing and creating data visualizations - email [email protected]
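As a small illustration of generating findings from data, here is a sketch using Python's standard-library statistics module. The measurements are invented for the example:

```python
import statistics

# Hypothetical reaction-time measurements (ms); values are invented.
reaction_times = [412, 389, 450, 401, 398, 433, 420]

mean_rt = statistics.mean(reaction_times)
stdev_rt = statistics.stdev(reaction_times)  # sample standard deviation
print(f"mean = {mean_rt:.1f} ms, sd = {stdev_rt:.1f} ms")
```

Real analyses will usually involve more specialized tooling (R, Python scientific libraries, or the services above), but summary statistics like these are often the first step.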
Generally, research data and materials that are commonly accepted in the scientific community as necessary to validate research findings must be retained by Yale researchers for three (3) years after publication of the findings or after all required final reports (e.g., progress and financial) for the project have been submitted to the sponsor (Yale Policy 6001: Research Data & Materials Policy).
Data sharing refers to the process of making data public, typically via a data repository. Data retention refers to storing data so it remains usable, though not necessarily available to the public.
In addition to the (sometimes iterative) stages you will progress through during the research data lifecycle, there are also themes that you will need to consider during multiple, if not all, of these phases.
Version control allows you to see the change history of a file and to restore a file to a previous iteration. You can apply manual version control by adding dates or v1/v2/vfinal notations to a file name, or by writing a change log within a ReadMe file. Cloud data storage systems like Box and Google Drive have version control capabilities (Note: whenever you are exploring a data or content management system, check whether the system supports version control and how versions are retained).
The most robust and independent way to maintain control over your file versioning is to apply a Version Control System like Git.
Support at Yale:
- If you have questions about Git, or would like help getting started with Git or GitHub, email [email protected].
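The manual file-naming convention described above can itself be scripted. Here is a small Python sketch (the function name and naming pattern are one choice among many, not a standard):

```python
from datetime import date
from pathlib import Path

def versioned_name(path, when=None):
    """Return a date-stamped variant of a file name.

    A sketch of the manual versioning convention described above,
    e.g. results.csv -> results_2024-01-15.csv.
    """
    p = Path(path)
    stamp = (when or date.today()).isoformat()
    return p.with_name(f"{p.stem}_{stamp}{p.suffix}")

print(versioned_name("results.csv", date(2024, 1, 15)))  # results_2024-01-15.csv
```

A Version Control System like Git makes this bookkeeping automatic and records who changed what, which is why it is the more robust option.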
Documentation can include any notes and annotations related to your research data that make your data understandable to others (as well as your future self). Maintaining accurate and useful documentation can make the difference between your data being reusable in future research scenarios or not.
Support at Yale:
- Take a look at this additional information about Codebooks, Data Dictionaries & ReadMe Files and email [email protected] with any questions.
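One practical form of documentation is a codebook (or data dictionary) listing each variable in a dataset. The sketch below generates a codebook stub from a CSV header, leaving descriptions for the researcher to fill in; the file contents and column names are invented for illustration:

```python
import csv, io

# Hypothetical CSV data; in practice you would open your real file.
raw = io.StringIO("subject_id,age,group\ns01,34,control\ns02,29,treatment\n")

reader = csv.DictReader(raw)
rows = list(reader)

# One codebook entry per column, with an example value and a
# placeholder description for the researcher to complete.
codebook = [
    {"variable": col, "example": rows[0][col], "description": "TODO"}
    for col in reader.fieldnames
]
for entry in codebook:
    print(f"{entry['variable']}: example={entry['example']} ({entry['description']})")
```

Even a stub like this, saved alongside the data, gives future users a starting point for understanding each field.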
When choosing a data storage solution, you should think about how often you will be using this data, if others will need to access this data too, how much a data storage solution will cost, the level of risk associated with your data, the size of your data, and more.
Support at Yale:
- This interactive storage finder tool can help you navigate the various options available to you through Yale.
How can you know which software is cleared for moderate or high risk data? (And what are the classifications of moderate or high risk data?) Check with Yale Information Security.
When you start to think about how you would actually engage with any of these steps, different technologies come into play, along with cross-cutting themes such as version control, documentation, and operational data storage.
- DMPTool - access and store templated data management plans. Quick start guide
- To fill out a practice form that contains research data planning considerations, visit this Google Form
- Core Research Facilities - Yale’s Core Research Facilities provide Yale researchers access to state-of-the-art scientific instrumentation with the intent to keep Yale’s scientific research at the cutting edge. Each Core employs highly trained staff who can provide training and assistance with use of instrumentation as well as aid in experimental design.
- APIs (Application Programming Interfaces)
- Qualtrics - create and deploy
- Microsoft Excel
- Databases - databases are more robust for storing and organizing interrelated data structures than spreadsheets or tables. Email [email protected] with questions about relational database design and set-up.
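To make the spreadsheet-versus-database point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are invented for illustration, and the database is in-memory only:

```python
import sqlite3

# In-memory database for illustration; schema is hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE subjects (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE samples (
    id INTEGER PRIMARY KEY,
    subject_id INTEGER REFERENCES subjects(id),
    value REAL)""")
con.execute("INSERT INTO subjects VALUES (1, 'subject-01'), (2, 'subject-02')")
con.execute("INSERT INTO samples VALUES (1, 1, 0.42), (2, 1, 0.57), (3, 2, 0.31)")

# A join expresses interrelated data in a way spreadsheets handle poorly.
rows = con.execute("""
    SELECT s.name, COUNT(*) AS n_samples
    FROM subjects s JOIN samples ON samples.subject_id = s.id
    GROUP BY s.name ORDER BY s.name
""").fetchall()
print(rows)  # [('subject-01', 2), ('subject-02', 1)]
con.close()
```

The one-to-many relationship between subjects and samples lives in the schema itself, so it cannot silently drift out of sync the way duplicated spreadsheet rows can.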
- Microsoft Excel - Excel functions | Data processing/analysis in Excel
- Python - email [email protected] for a workshop or tutorial based on your research needs
- R - email [email protected] for a workshop or tutorial based on your research needs
- OpenRefine - A powerful tool for working with messy data, cleaning it, and transforming it from one format or structure to another
- There are many proprietary analysis tools; this document will focus on tools you can access for free and/or through Yale
- Find software through the Yale Software IT Library
- Microsoft Excel - Excel functions | Data processing/analysis in Excel
- Python - email [email protected] for a workshop or tutorial based on your research needs
- R - email [email protected] for a workshop or tutorial based on your research needs
- Learn about different types and categories of graphs via the Data Viz Catalogue
- Jump start your ability to create data visualizations in R with The R Graph Gallery
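Before building a polished figure in R or a Python plotting library, it can help to get a first look at a distribution directly in the terminal. This sketch prints a quick text bar chart; the frequency counts are invented for illustration:

```python
# Hypothetical frequency table; values are invented for illustration.
counts = {"control": 14, "treatment A": 22, "treatment B": 9}

# Build a quick text bar chart -- a rough first look at the data
# before creating a real figure with a plotting library.
width = max(len(label) for label in counts)
lines = [f"{label:<{width}} | {'#' * n} {n}" for label, n in counts.items()]
print("\n".join(lines))
```

The catalogues above can then help you choose an appropriate chart type for the final, publication-quality visualization.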
- Interactive storage finder tool
- Dryad - deposit research data for free through Yale. More about Dryad
- Zenodo - deposit research code and data. More about Zenodo
- Where can I find additional training opportunities?
- What are file and folder naming best practices?
- What if I want to make bulk changes to how my files are currently named?
- What are best practices related to file organization?
- Where can I connect with peers engaging in biomedical data science research?
- What about specific topics (training, software, hardware, and collaborative opportunities) in bioinformatics research?
- How can you know which software is cleared for moderate or high risk data? (And what are the classifications of moderate or high risk data?) Check with Yale Information Security
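On the bulk-renaming question above, a small script is often all you need. This Python sketch replaces spaces with underscores in every .csv file name in a folder; it runs on a throwaway temporary folder here for safety, and the file names are invented for illustration:

```python
from pathlib import Path
import tempfile

# Demonstration setup: a temporary folder with hypothetical file names.
# Point `folder` at your own directory to use the rename loop for real.
folder = Path(tempfile.mkdtemp())
(folder / "pilot data 1.csv").touch()
(folder / "pilot data 2.csv").touch()

# Bulk rename: replace spaces with underscores in each .csv file name.
for path in sorted(folder.glob("*.csv")):
    path.rename(path.with_name(path.name.replace(" ", "_")))

print(sorted(p.name for p in folder.iterdir()))
# ['pilot_data_1.csv', 'pilot_data_2.csv']
```

Always test a renaming script on a copy of your files first, since renames are hard to undo in bulk.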
Have other questions? Email [email protected]