Managing Research Data: Data Management
Every discipline will have its own best practices for file naming conventions, versioning, metadata, and archiving. The information in this guide can serve as a starting point to help you learn more about these standards of data management in your discipline.
Below are some excellent resources to get you started with managing your data!
- The University of Wisconsin - Madison: Research Data Microcourse
- The University of Minnesota: Managing Your Research Data Tutorial Series
- UK Data Archive: Research Data Management
- See also their Data Management Checklist
- Data Management in Large-Scale Education Research by Crystal Lewis
View a workshop recording on "Better Data Management" at this link
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Sharing your research data effectively requires proper data documentation. Metadata is data about data - structured information that describes the content of a resource and makes it easier to find and use. Metadata can be embedded within the data itself or stored separately, and it can accompany data in any file format.
While it is generally best to follow discipline-specific metadata formats (more on that later!), this generic README-style metadata template from Cornell University is an excellent starting point. The Dublin Core metadata standard is considered relatively simple and is widely used across many disciplines.
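As a concrete illustration, a minimal machine-readable record using a small subset of Dublin Core elements might look like the following Python sketch. All field values here are hypothetical placeholders, not a real dataset:

```python
import json

# A small, illustrative subset of Dublin Core elements.
# All values below are hypothetical placeholders.
record = {
    "title": "Lion Population Survey, 2022",
    "creator": "Surname, Given Name",
    "subject": ["ecology", "population survey"],
    "description": "Counts of lions observed along fixed transects.",
    "date": "2022-12-31",          # ISO 8601
    "type": "Dataset",
    "format": "text/csv",
    "rights": "CC BY 4.0",
}

# Save the record alongside the data files it describes.
with open("metadata.json", "w") as f:
    json.dump(record, f, indent=2)
```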
Some good practices to follow include:
- Providing sufficient detail so that context and content of data are clear to future users
- Clearly stating licenses/restrictions placed on the data
- Reporting bibliographic information about the dataset, including citations to relevant publications
- Summarizing key methodological information
- Sampling methods (e.g., geography, dates, protocols)
- Software (including versions, where applicable)
- Processing or transformation of files and/or data
- Describing file formats (e.g., csv, txt, tiff), contents, and hierarchies
- Following FAIR principles when creating metadata
- Asking a colleague to review your metadata and data files and suggest improvements or point out concerns
The following resources provide more information on discipline-specific metadata:
- Research Data Alliance's Metadata Standards Catalog - browse by subject to discover metadata schemas
- A few common metadata standards:
- Darwin Core (Biology)
- Data Documentation Initiative (Social/Behavioral Sciences)
- Ecological Metadata Language (Ecology)
- The R package EML facilitates creation of metadata in this schema
- There are also metadata standards designed for specific types of data
- Recommended Metadata for Biological Images (Bioimaging Data)
- Astronomy Visualization Metadata (Astronomical Imagery from Telescopic Observations)
- PDBx / Macromolecular Crystallographic Information Framework (3-D Structure of Macromolecules)
- Text Encoding Initiative (Digital Textual Resources)
If you just need a simpler system to keep track of data within your lab, it helps to know the three main types of metadata addressed by most standards (a small example follows this list):
- Descriptive: describes the resource for identification and discovery
- Structural: describes how objects are related to one another
- Administrative: records creation date, file type, rights management, etc.
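As a quick sketch of how a lab-level record might group fields by these three types (all names and values below are hypothetical examples):

```python
# Grouping hypothetical metadata fields by the three main types.
lab_metadata = {
    "descriptive": {       # identification and discovery
        "title": "Transect Counts, Site A",
        "keywords": ["ecology", "survey"],
    },
    "structural": {        # how objects relate to one another
        "derived_from": {"counts_clean.csv": "counts_raw.csv"},
    },
    "administrative": {    # creation date, file type, rights, etc.
        "created": "2022-12-31",
        "format": "text/csv",
        "rights": "internal use only",
    },
}
```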
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Content adapted from Cornell University's guides on metadata and describing data and on writing README-style metadata.
Some considerations for FILE NAMING:
- Create a file naming schema to document your file naming conventions for different file types or projects
- Kristin Briney from Caltech Library created a worksheet that walks researchers through this process
- Use unique file names so that files are not deleted or replaced by files of the same name
- Limit file name lengths to 25-50 characters
- Abbreviations can save space (be sure to document them in your file naming schema!)
- 2nd Quarter ➔ Q2
- Avoid using spaces, special characters, or periods in file names
- Instead, use only alphanumeric characters, hyphens, and underscores
- Combine words with hyphens or CamelCase
- Lion Genome Project ➔ LionGenomeProj
- Use ISO 8601 formatting for dates (YYYY-MM-DD or YYYYMMDD)
- 12-31-22 ➔ 2022-12-31 or 20221231
- Include leading 0s for sequences
- run1, run10, run100 ➔ run001, run010, run100
- Choose components to include in the file name that will make each file distinct and be informative about its contents. Possible components include:
- Date of creation
- Initials/surname of person who created or transformed file data
- Project ID or abbreviation
- Sample ID
- Location where data were collected
- Subject matter
- State of data (raw, transformed, final)
- Version numbers can be included as part of file names, but in many cases it is better to use platforms with built-in version control, such as OSF or GitHub (a file-naming sketch follows this list)
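Putting several of these conventions together, here is a minimal Python sketch of a file-naming helper. The chosen components and example values are assumptions for illustration, not a required standard:

```python
from datetime import date

def make_filename(project: str, sample: str, run: int,
                  state: str, when: date, ext: str = "csv") -> str:
    """Build a file name from documented components: project
    abbreviation, sample ID, zero-padded run number, state of
    the data, and an ISO 8601 (YYYYMMDD) date."""
    return (f"{project}_{sample}_run{run:03d}_{state}_"
            f"{when.strftime('%Y%m%d')}.{ext}")

print(make_filename("LionGenomeProj", "S042", 7, "raw", date(2022, 12, 31)))
# LionGenomeProj_S042_run007_raw_20221231.csv
```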
Some considerations for FILE ORGANIZATION:
- Like file names, folder names should indicate contents
- Avoid long folder names or complex hierarchies
- Descriptive file names can reduce the need for long names or complex hierarchies
- Divide subfolders using common themes (e.g., types of output ➔ data, code, docs)
- Document file directory structure
- Can be included as part of the project's README file (see the folder-structure sketch after this list)
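For instance, a small Python script can both create and document a conventional project layout. The folder names below are a common convention, not a required standard:

```python
from pathlib import Path

project = Path("LionGenomeProj")

# Subfolders divided by type of output.
for sub in ["data/raw", "data/processed", "code", "docs"]:
    (project / sub).mkdir(parents=True, exist_ok=True)

# Document the directory structure in the project README.
(project / "README.md").write_text(
    "# LionGenomeProj\n\n"
    "- data/raw: unmodified source data\n"
    "- data/processed: cleaned and transformed data\n"
    "- code: analysis scripts\n"
    "- docs: protocols and other documentation\n"
)
```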
View a workshop recording on "File Naming, Folder Organization, and FAIR Data" at this link
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Content adapted from guides on naming and organizing files and data from Brown University Library, Carnegie Mellon University Libraries, Oregon Health & Science University Library, University of Puerto Rico - Mayaguez Campus, Oregon State University Libraries, and Princeton University Library.
Electronic lab notebooks (ELNs), sometimes called electronic research notebooks, perform the same functions as hard-copy lab notebooks and are an important element of laboratory data management.
This section covers the following topics:
- Elements of laboratory record keeping
- Benefits (and drawbacks) of ELNs over traditional hard-copy lab notebooks
- Necessary considerations when choosing ELNs
- ELN Assessments & Resources
Important functions of laboratory notebooks (both hard-copy and electronic):
- Record daily research workflows
- Document details of experiments which are later used for
- Publication and reproducibility
- Preparation of reports and presentations
- Allow project transfer between students or researchers
- Allow supervisor review and help prevent research misconduct
- Defend patents
- Meet contractual obligations of funders
- Validate research
- Facilitate good research products through clear communication
Guidelines for keeping laboratory notebooks:
- Use a permanently bound book with pre-numbered pages
- Make entries legibly in permanent black ink with no erasures
- Record entries chronologically
- Date each page and include descriptive information about the experiment or investigation being conducted
- All entries should be in English
- Printouts, graphs, and tables should be secured with permanent glue, then signed and dated
- Explain acronyms and include units for all data entries
- Enter observations immediately
- Entries should include all relevant details that would allow someone else to repeat the experiment including:
- Instrument type, manufacturer, serial number, software version, calibration information
- Reagents and specimens including name, manufacturer, registry (CAS) number, lot number
- Diluted reagents should include any appropriate info about stock preparation, method of dilution and storage
- Methods or protocols should be detailed in full
- Summarize findings on a regular basis
Some of the major benefits provided by ELNs are:
- Often a cloud-based system, which facilitates safe storage and preservation of information
- Remote access to lab notebook contents
- Maintain legibility, since entries are typed rather than handwritten
- Enable sharing among collaborators
- Allow for inclusion of images from instruments and programs
- Automatically include timestamps
Some of the major drawbacks of ELNs are:
- Cost
- Large storage or user requirements may exceed the limits of free or low-cost versions
- Dependent upon network connection
- Time and resource commitment to finding and implementing an appropriate ELN for your lab
Considerations when Choosing ELNs
General Considerations:
- Cost - ELNs range from free and open source to commercial implementations that can be charged on a per month per user basis.
- Most paid versions charge academic users ~$10-20 per user per month.
- Functionality - not surprisingly, free options provide fewer functions and may restrict the number of users. More expensive commercial platforms will generally provide a higher degree of customization and more security features.
- Other considerations include
- Version control
- Whether pages can be locked and electronically signed
- What kind of data storage is needed to meet government or funder requirements
- Access and collaboration vs. security - most ELN platforms are cloud-based, giving researchers flexibility and mobility for entering data outside of the lab.
- Collaboration and sharing of data can be done with PDFs or reports created from within the ELN, or through the ELN's own sharing features
- A hierarchy of roles (such as contributor, reviewer, and editor) determines who can see and approve project records
- After the work has been reviewed, the record can be locked to prevent unauthorized changes
Special Considerations:
Some types of research may dictate that an ELN have specific features.
- Human subject research may require data security measures that are HIPAA compliant
- Similar data privacy measures may be necessary to meet the requirements of the Family Educational Rights and Privacy Act (FERPA), which governs educational records
- Some other features that might be necessary include the ability to meet audit, reporting, and electronic signature guidelines. See the US Food and Drug Administration's guidance on 21 CFR Part 11 (Electronic Records; Electronic Signatures) as an example
The following are some assessments and comparisons of current top ELNs:
- ELN Vendor Wiki with links to pricing and demonstration videos
- Splice 2023 Best Electronic Lab Notebook Review
- The Electronic Lab Notebook in 2023: A comprehensive guide
Below are some commonly used ELNs:
- Microsoft OneNote
- Open Science Framework (OSF)
- Google Forms - some labs may find this a useful way of recording details from a particular trial run or experiment (e.g., what was the temperature? did I calibrate the instrument? who ran the test?)
- LabArchives - available for researchers at OSU Center for Health Sciences & OSU College of Osteopathic Medicine at the Cherokee Nation
Content adapted from the following guides: Cold Spring Harbor Laboratory - ELNs.
Data Documentation comes in many forms and serves a dual purpose: 1) it enables the researcher(s) to have a record of important details from their project, and 2) it provides future users of the data with enough content and context to accurately utilize and interpret the data.
This section covers the following topics:
- Codebooks
- Data Dictionaries
- Data Validation & Quality Control
- Documentation of Data Cleaning and Transformation
Codebooks provide comprehensive information for all the variables in a data file and are most commonly found accompanying survey data. ICPSR has an excellent introductory guide to codebooks. The general components for a particular variable in a codebook are:
- Variable name
- Variable label
- Question text
- Values
- Value labels
- Summary statistics - frequency counts, minimum-median-maximum values
- Missing data
- Universe information (e.g., skip patterns)
- Notes
In programs such as Stata, there are ways to export a codebook from your analysis (see this YouTube video for how to run the Stata codebook command).
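If you work in Python rather than Stata, a comparable starting point is sketched below using pandas. It covers only the mechanical parts of a codebook (types, summary statistics, missing-data counts) on hypothetical variables; question text and value labels still need to be added by hand:

```python
import pandas as pd

# Hypothetical survey data; in practice, read your own file.
df = pd.DataFrame({
    "age": [34, 51, None, 29],
    "q1_satisfaction": [1, 5, 3, 4],
})

# One row per variable: type, missing count, and min/median/max.
codebook = pd.DataFrame({
    "variable": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "n_missing": df.isna().sum().values,
    "min": df.min().values,
    "median": df.median().values,
    "max": df.max().values,
})
print(codebook)
```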
Data dictionaries provide metadata for databases, systems, and datasets. They enable researchers to maintain consistency throughout a project and contribute to the use of Data Standards. USGS has an excellent introductory guide to data dictionaries and USDA has a blank template for creating a data dictionary from scratch. The general components of a data dictionary are listed below, followed by a minimal example:
- List of data objects - names & definitions
- Detailed properties of data elements - data type, size, nullability, etc.
- Reference data
- Missing data & quality indicator codes
- Business rules - validation of schema/data quality, etc.
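A data dictionary can be maintained as a simple CSV file. The sketch below shows one way to generate such a file in Python; the elements and their definitions are hypothetical examples:

```python
import csv

# Each row documents one data element.
elements = [
    {"name": "site_id", "type": "string", "nullable": "no",
     "definition": "Unique site identifier"},
    {"name": "temp_c", "type": "float", "nullable": "yes",
     "definition": "Air temperature in degrees Celsius; -999 = missing"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=elements[0].keys())
    writer.writeheader()
    writer.writerows(elements)
```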
For researchers creating data dictionaries, it is important to identify any existing data standards in the research field. Resources such as the Environmental Information Exchange Network maintain lists of data standards relevant for particular disciplines.
Common Data Elements (CDEs) are a type of data standard used in the health sciences and NIH maintains a repository of CDEs.
Data Validation & Quality Control
It is essential for researchers to incorporate data validation and quality control procedures into their research workflow. Adopting best practices for data management - including organization and documentation - is a good start, and integrating the following methods will further enhance data quality:
- Document any data inconsistencies
- Check datasets for duplicates and errors - may require data review by a peer
- Use data validation tools
- Routinely inspect small subsets of data - most useful for those working with large volumes of data
There are a variety of tools available to help with data validation and quality control; a scripted example follows this list.
- Excel (and open-source equivalents such as LibreOffice's Calc) allows users to create rules to limit the data types or values that can be entered in certain cells. See this brief overview of Excel Data Validation or Data Validity in Calc.
- This Data Carpentry lesson on Quality Control also demonstrates some useful tools for performing quality control on your data
- Google Forms enables users to set rules for specific questions in a form so that the data entered meet necessary formatting and content criteria.
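The same kinds of checks - duplicates, out-of-range values, missing data - can also be scripted. Here is a minimal pandas sketch; the dataset and the acceptable temperature bounds are assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset; in practice, read your own file.
df = pd.DataFrame({
    "sample_id": ["S001", "S002", "S002", "S004"],
    "temp_c": [21.5, 19.8, 19.8, None],
})

duplicates = df[df.duplicated()]                               # repeated rows
out_of_range = df[(df["temp_c"] < -40) | (df["temp_c"] > 60)]  # assumed bounds
missing = df[df.isna().any(axis=1)]                            # incomplete rows

print(f"{len(duplicates)} duplicate rows, "
      f"{len(out_of_range)} out-of-range rows, "
      f"{len(missing)} rows with missing values")
```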
Documentation of Data Cleaning and Transformation
Data cleaning is focused primarily on ensuring that the final dataset includes the appropriate data. In other words, it accomplishes tasks such as eliminating duplicate entries, fixing structural errors, removing outliers, and addressing missing data. Documenting this process enables researchers to provide details and justification for edits they made in producing the final dataset used in their analyses.
Data transformation is less about "fixing" the dataset and more about converting data into other formats or arrangements to meet the needs of analysis and/or storage. Documenting this process enables researchers to have a clear set of steps they followed while modifying their data, which allows them and other future users to smoothly replicate the data transformation(s) that produced the final dataset and results.
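One lightweight way to document both processes is a timestamped log that your scripts append to as they run. The sketch below, including its example entries, is hypothetical:

```python
from datetime import datetime, timezone

LOG_FILE = "transformation_log.txt"

def log_step(description: str) -> None:
    """Append one timestamped cleaning/transformation step to the log."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(LOG_FILE, "a") as f:
        f.write(f"{stamp}  {description}\n")

log_step("Removed 12 duplicate rows from survey_raw.csv")
log_step("Recoded -999 to missing in temp_c")
log_step("Reshaped wide to long; wrote survey_long.csv")
```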
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Content adapted from the following sources: ICPSR - What is a Codebook?, USGS - Data Dictionaries, UC Merced Library - What Is a Data Dictionary?, EPA - Learn About Data Standards, NIH - Common Data Elements (CDEs), Yale Library - Validate Data, Tableau - Guide to Data Cleaning: Definition, Benefits, Components, and How to Clean Your Data