Managing Research Data: Data Management

Every discipline will have its own best practices for file naming conventions, versioning, metadata, and archiving. The information in this guide can serve as a starting point to help you learn more about these standards of data management in your discipline.


Below are some excellent resources to get you started with managing your data!

View a workshop recording on "Better Data Management" at this link

Schedule a Consultation with Dani Kirsch, Research Data Specialist.

Research data cannot be shared effectively without proper documentation. Metadata is data about data: structured information that describes the content and makes it easier to find or use. Metadata can be embedded within the data itself or stored separately, and it can accompany data in any file format.

While it is generally best to follow discipline-specific metadata formats (more on that later!), this generic README-style metadata template from Cornell University is an excellent starting point. The Dublin Core metadata standard is considered relatively simple and is widely used across many disciplines.


Some good practices to follow include:

  • Providing sufficient detail so that context and content of data are clear to future users
  • Clearly stating licenses/restrictions placed on the data
  • Reporting bibliographic information about the dataset, including citations to relevant publications
  • Summarizing key methodological information
    • Sampling methods (e.g., geography, dates, protocols)
    • Software (including versions, where applicable)
    • Processing or transformation of files and/or data
  • Describing file formats (e.g., csv, txt, tiff), contents, and hierarchies
  • Following FAIR principles when creating metadata
  • Asking a colleague to review your metadata and data files and suggest improvements or point out concerns
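The practices listed above can be turned into a simple completeness check for a README-style record. The sketch below is only an illustration: the section and field names (`general`, `methods`, `files`, and so on) are hypothetical labels chosen to mirror the list, not the fields of the Cornell template or of any metadata standard, and all values are placeholders.

```python
# Hypothetical section names that mirror the practices listed above;
# they are not the fields of any official template or standard.
REQUIRED_SECTIONS = {
    "general": ["title", "creators", "license", "related_publications"],
    "methods": ["sampling", "software", "processing"],
    "files": ["formats", "contents", "hierarchy"],
}


def missing_entries(readme):
    """Return 'section.field' names that are absent or left empty."""
    missing = []
    for section, fields in REQUIRED_SECTIONS.items():
        for field in fields:
            if not readme.get(section, {}).get(field):
                missing.append(f"{section}.{field}")
    return missing


def render_readme(readme):
    """Format the record as plain text suitable for a README file."""
    lines = []
    for section, fields in readme.items():
        lines.append(section.upper())
        for field, value in fields.items():
            lines.append(f"  {field}: {value}")
        lines.append("")
    return "\n".join(lines)


# Placeholder values only; replace them with details of your own dataset.
example = {
    "general": {
        "title": "Example dataset title",
        "creators": "Example Researcher",
        "license": "",  # left empty on purpose so the check reports it
        "related_publications": "None yet",
    },
    "methods": {
        "sampling": "Where, when, and how samples were collected",
        "software": "Name and version of each program used",
        "processing": "Steps applied to the raw files",
    },
    "files": {
        "formats": "csv, txt",
        "contents": "One row per sample in samples.csv",
        # "hierarchy" is omitted on purpose so the check reports it
    },
}

print(missing_entries(example))
print(render_readme(example))
```

A check like this may be useful before asking a colleague to review the documentation, since it catches empty fields but cannot judge whether the descriptions are clear.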

The following resources provide more information on discipline-specific metadata:


If you just need a simpler system to keep track of data within your lab, there are three main types of metadata addressed by most standards:

  1. Descriptive: describes the resource for identification and discovery
  2. Structural: describes how objects are related to one another
  3. Administrative: records creation date, file type, rights management, etc.


Content adapted from Cornell University guides on metadata and describing data and writing README style metadata.

Some considerations for FILE NAMING:

  • Create a file naming schema to document your file naming conventions for different file types or projects
    • Kristin Briney from Caltech Library created a worksheet that walks researchers through this process
  • Use unique file names so that files are not deleted or replaced by files of the same name
  • Limit file name lengths to 25-50 characters
    • Abbreviations can save space (be sure to document them in your file naming schema!)
      • 2nd Quarter ➔ Q2
  • Avoid using spaces, special characters, or periods in file names
    • Instead, use only alphanumeric characters, hyphens, and underscores
    • Combine words with hyphens or CamelCase
      • Lion Genome Project ➔ LionGenomeProj
  • Use ISO 8601 formatting for dates (YYYY-MM-DD or YYYYMMDD)
    • 12-31-22 ➔ 2022-12-31 or 20221231
  • Include leading 0s for sequences
    • run1, run10, run100 ➔ run001, run010, run100
  • Choose components to include in the file name that will make each file distinct and be informative about its contents. Possible components include:
    • Date of creation
    • Initials/surname of person who created or transformed file data
    • Project ID or abbreviation
    • Sample ID
    • Location where data were collected
    • Subject matter
    • State of data (raw, transformed, final)
  • Version numbers can be included as part of file names, but in many cases it is better to use a platform with built-in version control, such as OSF or GitHub
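As an illustration of the naming considerations above, the following sketch assembles a file name from a handful of components. The function names, the component order, and the `run` prefix are assumptions made for this example; a real schema should follow whatever your lab documents.

```python
import re
from datetime import date


def to_camel_case(text):
    """Join words without spaces or special characters, e.g. 'liver tissue' -> 'LiverTissue'."""
    words = re.split(r"[^A-Za-z0-9]+", text)
    return "".join(word[:1].upper() + word[1:] for word in words if word)


def build_file_name(project, description, created, sequence, state, extension):
    """Assemble a file name from documented components (hypothetical schema)."""
    parts = [
        created.strftime("%Y%m%d"),                   # ISO 8601 basic date format
        to_camel_case(project),
        to_camel_case(description),
        f"run{sequence:03d}",                         # leading zeros keep names in order
        re.sub(r"[^A-Za-z0-9]", "", state).lower(),   # e.g. raw, transformed, final
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


name = build_file_name(
    "Lion Genome Proj", "liver tissue", date(2022, 12, 31), 7, "raw", "csv"
)
print(name, len(name))
```

Underscores separate the components and CamelCase joins words within a component, so each part can still be picked out when files are sorted or searched.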

Some considerations for FILE ORGANIZATION:

  • Like file names, folder names should indicate contents
  • Avoid long folder names or complex hierarchies
    • Descriptive file names can reduce the need for long names or complex hierarchies
  • Divide subfolders using common themes (e.g., types of output ➔ data, code, docs)
  • Document file directory structure
    • Can include as part of README file for project
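One way to document a directory structure is to generate a plain-text listing that can be pasted into the project README. The sketch below assumes a hypothetical project folder with `data`, `code`, and `docs` subfolders, built in a temporary directory purely for demonstration.

```python
import tempfile
from pathlib import Path


def describe_tree(root, indent=0):
    """Return the folder layout under root as indented text lines."""
    lines = []
    # Sort by name so the listing is the same every time it is generated.
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name):
        suffix = "/" if entry.is_dir() else ""
        lines.append("  " * indent + entry.name + suffix)
        if entry.is_dir():
            lines.extend(describe_tree(entry, indent + 1))
    return lines


# Build a small example project in a temporary folder (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "LionGenomeProj"
    for folder in ("data/raw", "data/final", "code", "docs"):
        (project / folder).mkdir(parents=True)
    (project / "README.txt").write_text("Project overview and folder layout.\n")
    print("\n".join(describe_tree(project)))
```

Regenerating the listing whenever folders are added keeps the README in step with the actual layout.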

View a workshop recording on "File Naming, Folder Organization, and FAIR Data" at this link


Content adapted from guides on naming and organizing files and data from Brown University Library, Carnegie Mellon University Libraries, Oregon Health & Science University Library, University of Puerto Rico - Mayaguez Campus, Oregon State University Libraries, and Princeton University Library.

I am a work in progress. Please check back later for updates and thank you for your patience!

Electronic lab notebooks (ELNs), sometimes called electronic research notebooks, perform the same functions as hard-copy lab notebooks and are an important element of laboratory data management. 

Laboratory Record-Keeping

Important functions of laboratory notebooks (both hard-copy and electronic):

  • Record daily research workflows
  • Document details of experiments that are later used for:
    • Publication and reproducibility
    • Preparation of reports and presentations
  • Allow project transfer between students or researchers
  • Allow supervisor review and help prevent research misconduct
  • Defend patents
  • Meet contractual obligations of funders
  • Validate research
  • Facilitate good research products through clear communication

Guidelines for keeping laboratory notebooks:

  • Use a permanently bound book with pre-numbered pages
  • Make entries legibly in permanent black ink with no erasures
  • Record entries chronologically
  • Date each page and include descriptive information about the experiment or investigation being conducted
  • Write all entries in English
  • Secure printouts, graphs, and tables with permanent glue, then sign and date them
  • Explain acronyms and include units for all data entries
  • Enter observations immediately 
  • Entries should include all relevant details that would allow someone else to repeat the experiment including:
    • Instrument type, manufacturer, serial number, software version, calibration information
    • Reagents and specimens including name, manufacturer, registry (CAS) number, lot number
    • Diluted reagents should include any appropriate info about stock preparation, method of dilution and storage
    • Methods or protocols should be detailed in full
  • Summarize findings on a regular basis

Benefits & Drawbacks of ELNs

Some of the major benefits provided by ELNs are:

  • Often a cloud-based system, which facilitates safe storage and preservation of information
  • Remote access to lab notebook contents
  • Maintain legibility because entries are not hand-written
  • Enable sharing among collaborators
  • Allow for inclusion of images from instruments and programs
  • Automatically include timestamps

Some of the major drawbacks of ELNs are:

  • Cost
    • Large storage or user requirements may exceed limits of free or low-cost version
  • Dependent upon network connection
  • Time and resource commitment to finding and implementing an appropriate ELN for your lab

Considerations when Choosing ELNs

General Considerations:

  • Cost - ELNs range from free, open-source tools to commercial products billed per user per month.
    • Most paid versions charge academic users ~$10-20 per user per month.
  • Functionality - not surprisingly, free options provide fewer functions and may restrict the number of users. More expensive commercial platforms will generally provide a higher degree of customization and more security features. 
    • Other considerations include
      • Version control
      • Whether pages can be locked and electronically signed
      • What kind of data storage is needed to meet government or funder requirements
  • Access and collaboration vs security - most ELN platforms are cloud-based, providing researchers flexibility and mobility for entering data outside of the lab. 
    • Collaboration and sharing of data can be done with PDFs or reports created from within the ELN, or as a built-in feature of the ELN
    • A hierarchy of roles such as contributor, reviewer, and editor determines who can see and approve project records
    • After the work has been reviewed, the record is locked to prevent unauthorized changes

Special Considerations:

Some types of research may dictate that an ELN have specific features.


ELN Assessments & Resources

The following are some assessments and comparisons of current top ELNs:

Below are some commonly used ELNs:

  • Microsoft OneNote
  • Open Science Framework (OSF)
  • Google Forms - some labs may find this a useful way of recording details from a particular trial run or experiment (e.g., what was the temperature? did I calibrate the instrument? who ran the test?)
  • LabArchives - available for researchers at OSU Center for Health Sciences & OSU College of Osteopathic Medicine at the Cherokee Nation


Content adapted from the following guides: Cold Spring Harbor Laboratory - ELNs,


Data Documentation comes in many forms and serves a dual purpose: 1) it enables the researcher(s) to have a record of important details from their project, and 2) it provides future users of the data with enough content and context to accurately utilize and interpret the data.

Codebooks

Codebooks provide comprehensive information for all the variables in a data file and are most commonly found accompanying survey data. ICPSR has an excellent introductory guide to codebooks. The general components for a particular variable in a codebook are:

  • Variable name
  • Variable label
  • Question text
  • Values
  • Value labels
  • Summary statistics - frequency counts, minimum-median-maximum values
  • Missing data
  • Universe and skip patterns
  • Notes
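To make the components above concrete, here is a minimal sketch that builds a codebook entry for one variable from a list of responses. The function name, the dictionary keys, and the `-9` missing-data code are assumptions for this example and do not follow any particular codebook standard.

```python
from collections import Counter
from statistics import median


def codebook_entry(name, label, values, value_labels=None, missing_codes=()):
    """Summarize one variable: frequencies, min/median/max, and missing count."""
    # Treat None and any listed missing-data codes as missing.
    observed = [v for v in values if v is not None and v not in missing_codes]
    entry = {
        "variable_name": name,
        "variable_label": label,
        "value_labels": value_labels or {},
        "frequencies": dict(Counter(observed)),
        "missing": len(values) - len(observed),
    }
    if observed:
        entry["minimum"] = min(observed)
        entry["median"] = median(observed)
        entry["maximum"] = max(observed)
    return entry


# Hypothetical survey responses; -9 stands for "refused to answer".
responses = [1, 2, 2, 3, -9, None, 3, 3]
entry = codebook_entry(
    "q1_satisfaction",
    "Overall satisfaction",
    responses,
    value_labels={1: "Low", 2: "Medium", 3: "High"},
    missing_codes=(-9,),
)
for key, value in entry.items():
    print(f"{key}: {value}")
```

Question text, universe, skip patterns, and notes still have to be written by hand, since they cannot be derived from the values alone.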

In programs such as Stata, there are ways to export a codebook from your analysis (see this YouTube video for how to run the Stata codebook command).


Data Dictionaries

Data dictionaries provide metadata for databases, systems, and datasets. They help researchers maintain consistency throughout a project and support the use of data standards. USGS has an excellent introductory guide to data dictionaries, and USDA has a blank template for creating a data dictionary from scratch. The general components of a data dictionary are:

  • List of data objects - names & definitions
  • Detailed properties of data elements - data type, size, nullability, etc.
  • Reference data
  • Missing data & quality indicator codes
  • Business rules - validation of schema/data quality, etc.
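The components above can also be written in a form that a script can check records against. The following is a sketch under stated assumptions: the element names, the temperature range, and the quality codes are invented for illustration and are not drawn from the USGS guide, the USDA template, or any data standard.

```python
# Hypothetical data dictionary for a small field dataset. Element names,
# ranges, and codes are illustrative assumptions, not a published standard.
DATA_DICTIONARY = {
    "sample_id": {
        "definition": "Unique identifier assigned to each sample",
        "type": str,
        "nullable": False,
    },
    "temperature_c": {
        "definition": "Air temperature in degrees Celsius",
        "type": float,
        "nullable": True,
        "range": (-90.0, 60.0),
    },
    "quality_flag": {
        "definition": "Quality indicator code",
        "type": str,
        "nullable": False,
        "allowed": {"good", "suspect", "bad"},
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when comparing a record to the dictionary."""
    problems = []
    for name, rules in dictionary.items():
        value = record.get(name)
        if value is None:
            if not rules["nullable"]:
                problems.append(f"{name}: missing but not nullable")
            continue
        if not isinstance(value, rules["type"]):
            problems.append(f"{name}: expected {rules['type'].__name__}")
            continue
        if "allowed" in rules and value not in rules["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed code")
        if "range" in rules:
            low, high = rules["range"]
            if not low <= value <= high:
                problems.append(f"{name}: {value} is outside {low} to {high}")
    # Fields that the dictionary does not define are also worth flagging.
    for name in record:
        if name not in dictionary:
            problems.append(f"{name}: not defined in the data dictionary")
    return problems


print(check_record({"sample_id": "S001", "temperature_c": 21.5, "quality_flag": "good"}))
print(check_record({"sample_id": None, "temperature_c": 120.0, "quality_flag": "ok"}))
```

Keeping the rules in one structure means the same dictionary can serve as documentation for people and as a validation schema for scripts.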

For researchers creating data dictionaries, it is important to identify any existing data standards in the research field. Resources such as the Environmental Information Exchange Network maintain lists of data standards relevant for particular disciplines.

Common Data Elements (CDEs) are a type of data standard used in the health sciences, and NIH maintains a repository of CDEs.


Data Validation & Quality Control

It is essential for researchers to incorporate data validation and quality control procedures into their research workflow. Adopting best practices for data management - including organization and documentation - is a good start, and integrating the following methods will further enhance data quality:

  • Document any data inconsistencies
  • Check datasets for duplicates and errors - may require data review by a peer
  • Use data validation tools
  • Routinely inspect small subsets of data - most useful for those working with large volumes of data
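Two of the methods above, checking for duplicates and inspecting small subsets, can be sketched in a few lines. The helper names, the choice of key fields, and the sample rows are hypothetical; real datasets will need their own definition of what counts as a duplicate.

```python
import random
from collections import Counter


def find_duplicates(rows, key_fields):
    """Return the key values that appear in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, count in counts.items() if count > 1]


def sample_for_review(rows, size, seed=0):
    """Draw a small random subset for manual inspection.

    A fixed seed makes the same subset reproducible for a second reviewer.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


# Hypothetical rows; the second S001 entry repeats the same sample and date.
rows = [
    {"sample_id": "S001", "date": "2022-12-31", "value": 4.2},
    {"sample_id": "S002", "date": "2022-12-31", "value": 3.9},
    {"sample_id": "S001", "date": "2022-12-31", "value": 4.2},
    {"sample_id": "S003", "date": "2023-01-01", "value": 5.1},
]

print(find_duplicates(rows, ["sample_id", "date"]))
print(sample_for_review(rows, 2))
```

Any duplicates or inconsistencies found this way should be recorded rather than silently corrected, in line with the first item in the list.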

There are a variety of tools available to help with data validation and quality control.


Data Cleaning & Transformation - Documentation

Data cleaning is focused primarily on ensuring that the final dataset includes the appropriate data. In other words, it accomplishes tasks such as eliminating duplicate entries, fixing structural errors, removing outliers, and addressing missing data. Documenting this process enables researchers to provide details and justification for edits they made in producing the final dataset used in their analyses.

Data transformation is less about "fixing" the dataset and more about converting data into other formats or arrangements to meet the needs of analysis and/or storage. Documenting this process enables researchers to have a clear set of steps they followed while modifying their data, which allows them and other future users to smoothly replicate the data transformation(s) that produced the final dataset and results.
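One lightweight way to document cleaning and transformation is to run every step through a small wrapper that records what was done and how many rows it affected. This is a sketch, not a prescribed workflow: the step functions, field names (`temp_f`, `temp_c`), and sample rows are all hypothetical.

```python
def apply_and_log(rows, steps):
    """Apply each (description, function) step in order and log row counts."""
    log = []
    for description, step in steps:
        before = len(rows)
        rows = step(rows)
        log.append({"step": description, "rows_before": before, "rows_after": len(rows)})
    return rows, log


def drop_duplicates(rows):
    """Cleaning: keep the first copy of each identical row."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def drop_missing_temperature(rows):
    """Cleaning: remove rows with no temperature reading."""
    return [row for row in rows if row["temp_f"] is not None]


def fahrenheit_to_celsius(rows):
    """Transformation: convert units without changing the number of rows."""
    return [
        {"sample_id": row["sample_id"], "temp_c": round((row["temp_f"] - 32) * 5 / 9, 1)}
        for row in rows
    ]


# Hypothetical raw data with one duplicate row and one missing value.
raw = [
    {"sample_id": "S001", "temp_f": 68.0},
    {"sample_id": "S001", "temp_f": 68.0},
    {"sample_id": "S002", "temp_f": None},
    {"sample_id": "S003", "temp_f": 212.0},
]

final, log = apply_and_log(raw, [
    ("Remove exact duplicate rows", drop_duplicates),
    ("Remove rows with missing temperature", drop_missing_temperature),
    ("Convert Fahrenheit to Celsius", fahrenheit_to_celsius),
])

for entry in log:
    print(entry)
print(final)
```

Saving the log alongside the final dataset gives future users the sequence of steps and a justification trail for each edit; the raw data are left untouched so the steps can be rerun.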




Content adapted from the following sources: ICPSR - What is a Codebook?, USGS - Data Dictionaries, UC Merced Library - What Is a Data Dictionary?, EPA - Learn About Data Standards, NIH - Common Data Elements (CDEs), Yale Library - Validate Data, Tableau - Guide to Data Cleaning: Definition, Benefits, Components, and How to Clean Your Data