Managing Research Data: Data Management
Every discipline will have its own best practices for file naming conventions, versioning, metadata, and archiving. The information in this guide can serve as a starting point to help you learn more about these standards of data management in your discipline.
Below are some excellent resources to get you started with managing your data!
- The University of Wisconsin - Madison: Research Data Microcourse
- The University of Minnesota: Managing Your Research Data Tutorial Series
- UK Data Archive: Research Data Management
- See also their Data Management Checklist
- Data Management in Large-Scale Education Research by Crystal Lewis
View a workshop recording on "Better Data Management" at this link
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Sharing your research data effectively requires proper data documentation. Metadata is data about data - structured information that describes the content of a resource and makes it easier to find and use. Metadata can be embedded within the data itself or stored separately, and it can accompany data in any file format.
While it is generally best to follow discipline-specific metadata formats (more on that later!), this generic README-style metadata template from Cornell University is an excellent starting point. The Dublin Core metadata standard is considered relatively simple and is widely used across many disciplines.
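As a concrete illustration, a minimal machine-readable record using a small subset of Dublin Core elements might look like the following Python sketch. All field values here are hypothetical placeholders, not a real dataset:

```python
import json

# A small, illustrative subset of Dublin Core elements.
# All values below are hypothetical placeholders.
record = {
    "title": "Lion Population Survey, 2022",
    "creator": "Surname, Given Name",
    "subject": ["ecology", "population survey"],
    "description": "Counts of lions observed along fixed transects.",
    "date": "2022-12-31",          # ISO 8601
    "type": "Dataset",
    "format": "text/csv",
    "rights": "CC BY 4.0",
}

# Save the record alongside the data files it describes.
with open("metadata.json", "w") as f:
    json.dump(record, f, indent=2)
```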
Some good practices to follow include:
- Providing sufficient detail so that context and content of data are clear to future users
- Clearly stating licenses/restrictions placed on the data
- Reporting bibliographic information about the dataset, including citations to relevant publications
- Summarizing key methodological information
- Sampling methods (e.g., geography, dates, protocols)
- Software (including versions, where applicable)
- Processing or transformation of files and/or data
- Describing file formats (e.g., csv, txt, tiff), contents, and hierarchies
- Following FAIR principles when creating metadata
- Asking a colleague to review your metadata and data files and suggest improvements or point out concerns
The following resources provide more information on discipline-specific metadata:
- Research Data Alliance's Metadata Standards Catalog - browse by subject to discover metadata schemas
- A few common metadata standards:
- Darwin Core (Biology)
- Data Documentation Initiative (Social/Behavioral Sciences)
- Ecological Metadata Language (Ecology)
- The R package EML facilitates creation of metadata in this schema
- There are also metadata standards designed for specific types of data
- Recommended Metadata for Biological Images (Bioimaging Data)
- Astronomy Visualization Metadata (Astronomical Imagery from Telescopic Observations)
- PDBx / Macromolecular Crystallographic Information Framework (3-D Structure of Macromolecules)
- Text Encoding Initiative (Digital Textual Resources)
If you just need a simpler system to keep track of data within your lab, it helps to know the three main types of metadata addressed by most standards (a small example follows this list):
- Descriptive: describes the resource for identification and discovery
- Structural: describes how objects are related to one another
- Administrative: records creation date, file type, rights management, etc.
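As a quick sketch of how a lab-level record might group fields by these three types (all names and values below are hypothetical examples):

```python
# Grouping hypothetical metadata fields by the three main types.
lab_metadata = {
    "descriptive": {       # identification and discovery
        "title": "Transect Counts, Site A",
        "keywords": ["ecology", "survey"],
    },
    "structural": {        # how objects relate to one another
        "derived_from": {"counts_clean.csv": "counts_raw.csv"},
    },
    "administrative": {    # creation date, file type, rights, etc.
        "created": "2022-12-31",
        "format": "text/csv",
        "rights": "internal use only",
    },
}
```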
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Content adapted from Cornell University's guides on metadata and describing data and on writing README-style metadata.
Some considerations for FILE NAMING:
- Create a file naming schema to document your file naming conventions for different file types or projects
- Kristin Briney from Caltech Library created a worksheet that walks researchers through this process
- Use unique file names so that files are not deleted or replaced by files of the same name
- Limit file name lengths to 25-50 characters
- Abbreviations can save space (be sure to document them in your file naming schema!)
- 2nd Quarter ➔ Q2
- Avoid using spaces, special characters, or periods in file names
- Instead, use only alphanumeric characters, hyphens, and underscores
- Combine words with hyphens or CamelCase
- Lion Genome Project ➔ LionGenomeProj
- Use ISO 8601 formatting for dates (YYYY-MM-DD or YYYYMMDD)
- 12-31-22 ➔ 2022-12-31 or 20221231
- Include leading 0s for sequences
- run1, run10, run100 ➔ run001, run010, run100
- Choose components to include in the file name that will make each file distinct and be informative about its contents. Possible components include:
- Date of creation
- Initials/surname of person who created or transformed file data
- Project ID or abbreviation
- Sample ID
- Location where data were collected
- Subject matter
- State of data (raw, transformed, final)
- Version numbers can be included as part of file names, but in many cases it is better to use platforms with built-in version control, such as OSF or GitHub (a file-naming sketch follows this list)
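Putting several of these conventions together, here is a minimal Python sketch of a file-naming helper. The chosen components and example values are assumptions for illustration, not a required standard:

```python
from datetime import date

def make_filename(project: str, sample: str, run: int,
                  state: str, when: date, ext: str = "csv") -> str:
    """Build a file name from documented components: project
    abbreviation, sample ID, zero-padded run number, state of
    the data, and an ISO 8601 (YYYYMMDD) date."""
    return (f"{project}_{sample}_run{run:03d}_{state}_"
            f"{when.strftime('%Y%m%d')}.{ext}")

print(make_filename("LionGenomeProj", "S042", 7, "raw", date(2022, 12, 31)))
# LionGenomeProj_S042_run007_raw_20221231.csv
```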
Some considerations for FILE ORGANIZATION:
- Like file names, folder names should indicate contents
- Avoid long folder names or complex hierarchies
- Descriptive file names can reduce the need for long names or complex hierarchies
- Divide subfolders using common themes (e.g., types of output ➔ data, code, docs)
- Document file directory structure
- Can be included as part of the project's README file (see the folder-structure sketch after this list)
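For instance, a small Python script can both create and document a conventional project layout. The folder names below are a common convention, not a required standard:

```python
from pathlib import Path

project = Path("LionGenomeProj")

# Subfolders divided by type of output.
for sub in ["data/raw", "data/processed", "code", "docs"]:
    (project / sub).mkdir(parents=True, exist_ok=True)

# Document the directory structure in the project README.
(project / "README.md").write_text(
    "# LionGenomeProj\n\n"
    "- data/raw: unmodified source data\n"
    "- data/processed: cleaned and transformed data\n"
    "- code: analysis scripts\n"
    "- docs: protocols and other documentation\n"
)
```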
View a workshop recording on "File Naming, Folder Organization, and FAIR Data" at this link
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Content adapted from guides on naming and organizing files and data from Brown University Library, Carnegie Mellon University Libraries, Oregon Health & Science University Library, University of Puerto Rico - Mayaguez Campus, Oregon State University Libraries, and Princeton University Library.
Electronic lab notebooks (ELNs), sometimes called electronic research notebooks, perform the same functions as hard-copy lab notebooks and are an important element of laboratory data management.
This section covers the following topics:
- Elements of laboratory record keeping
- Benefits (and drawbacks) of ELNs over traditional hard-copy lab notebooks
- Necessary considerations when choosing ELNs
- ELN Assessments & Resources
Important functions of laboratory notebooks (both hard-copy and electronic):
- Record daily research workflows
- Document details of experiments which are later used for
- Publication and reproducibility
- Preparation of reports and presentations
- Allow project transfer between students or researchers
- Allow supervisor review and help prevent research misconduct
- Defend patents
- Meet contractual obligations of funders
- Validate research
- Facilitate good research products through clear communication
Guidelines for keeping laboratory notebooks:
- Use a permanently bound book with pre-numbered pages
- Make entries legibly in permanent black ink with no erasures
- Record entries chronologically
- Date each page and include descriptive information about the experiment or investigation being conducted
- All entries should be in English
- Printouts, graphs, and tables should be secured with permanent glue, then signed and dated
- Explain acronyms and include units for all data entries
- Enter observations immediately
- Entries should include all relevant details that would allow someone else to repeat the experiment including:
- Instrument type, manufacturer, serial number, software version, calibration information
- Reagents and specimens including name, manufacturer, registry (CAS) number, lot number
- Diluted reagents should include any appropriate info about stock preparation, method of dilution and storage
- Methods or protocols should be detailed in full
- Summarize findings on a regular basis
Some of the major benefits provided by ELNs are:
- Often a cloud-based system, which facilitates safe storage and preservation of information
- Remote access to lab notebook contents
- Maintain legibility, since entries are typed rather than handwritten
- Enable sharing among collaborators
- Allow for inclusion of images from instruments and programs
- Automatically include timestamps
Some of the major drawbacks of ELNs are:
- Cost
- Large storage or user requirements may exceed the limits of free or low-cost versions
- Dependent upon network connection
- Time and resource commitment to finding and implementing an appropriate ELN for your lab
Considerations when Choosing ELNs
General Considerations:
- Cost - ELNs range from free and open source to commercial implementations that can be charged on a per month per user basis.
- Most paid versions charge academic users ~$10-20 per user per month.
- Functionality - not surprisingly, free options provide fewer functions and may restrict the number of users. More expensive commercial platforms will generally provide a higher degree of customization and more security features.
- Other considerations include
- Version control
- Whether pages can be locked and electronically signed
- What kind of data storage is needed to meet government or funder requirements
- Access and collaboration vs. security - most ELN platforms are cloud-based, giving researchers flexibility and mobility for entering data outside of the lab.
- Collaboration and sharing of data can be done with PDFs or reports created from within the ELN, or through the ELN's own sharing features
- A hierarchy of roles (such as contributor, reviewer, and editor) determines who can see and approve project records
- After the work has been reviewed, the record can be locked to prevent unauthorized changes
Special Considerations:
Some types of research may dictate that an ELN have specific features.
- Human subject research may require data security measures that are HIPAA compliant
- Similar data privacy measures may be necessary to meet the requirements of the Family Educational Rights and Privacy Act (FERPA), which governs educational records
- Some other features that might be necessary include the ability to meet audit, reporting, and electronic signature guidelines. See the US Food and Drug Administration's guidance on 21 CFR Part 11 (Electronic Records; Electronic Signatures) as an example
The following are some assessments and comparisons of current top ELNs:
- ELN Vendor Wiki with links to pricing and demonstration videos
- Splice 2023 Best Electronic Lab Notebook Review
- The Electronic Lab Notebook in 2023: A comprehensive guide
Below are some commonly used ELNs:
- Microsoft OneNote
- Open Science Framework (OSF)
- Google Forms - some labs may find this a useful way of recording details from a particular trial run or experiment (e.g., what was the temperature? did I calibrate the instrument? who ran the test?)
- LabArchives - available for researchers at OSU Center for Health Sciences & OSU College of Osteopathic Medicine at the Cherokee Nation
Content adapted from the following guides: Cold Spring Harbor Laboratory - ELNs.
Data Documentation comes in many forms and serves a dual purpose: 1) it enables the researcher(s) to have a record of important details from their project, and 2) it provides future users of the data with enough content and context to accurately utilize and interpret the data.
This section covers the following topics:
- Codebooks
- Data Dictionaries
- Data Validation & Quality Control
- Documentation of Data Cleaning and Transformation
Codebooks provide comprehensive information for all the variables in a data file and are most commonly found accompanying survey data. ICPSR has an excellent introductory guide to codebooks. The general components for a particular variable in a codebook are:
- Variable name
- Variable label
- Question text
- Values
- Value labels
- Summary statistics - frequency counts, minimum-median-maximum values
- Missing data
- Universe information (e.g., skip patterns)
- Notes
In programs such as Stata, there are ways to export a codebook from your analysis (see this YouTube video for how to run the Stata codebook command).
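If you work in Python rather than Stata, a comparable starting point is sketched below using pandas. It covers only the mechanical parts of a codebook (types, summary statistics, missing-data counts) on hypothetical variables; question text and value labels still need to be added by hand:

```python
import pandas as pd

# Hypothetical survey data; in practice, read your own file.
df = pd.DataFrame({
    "age": [34, 51, None, 29],
    "q1_satisfaction": [1, 5, 3, 4],
})

# One row per variable: type, missing count, and min/median/max.
codebook = pd.DataFrame({
    "variable": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "n_missing": df.isna().sum().values,
    "min": df.min().values,
    "median": df.median().values,
    "max": df.max().values,
})
print(codebook)
```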
Data dictionaries provide metadata for databases, systems, and datasets. They enable researchers to maintain consistency throughout a project and contribute to the use of Data Standards. USGS has an excellent introductory guide to data dictionaries and USDA has a blank template for creating a data dictionary from scratch. The general components of a data dictionary are listed below, followed by a minimal example:
- List of data objects - names & definitions
- Detailed properties of data elements - data type, size, nullability, etc.
- Reference data
- Missing data & quality indicator codes
- Business rules - validation of schema/data quality, etc.
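A data dictionary can be maintained as a simple CSV file. The sketch below shows one way to generate such a file in Python; the elements and their definitions are hypothetical examples:

```python
import csv

# Each row documents one data element.
elements = [
    {"name": "site_id", "type": "string", "nullable": "no",
     "definition": "Unique site identifier"},
    {"name": "temp_c", "type": "float", "nullable": "yes",
     "definition": "Air temperature in degrees Celsius; -999 = missing"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=elements[0].keys())
    writer.writeheader()
    writer.writerows(elements)
```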
For researchers creating data dictionaries, it is important to identify any existing data standards in the research field. Resources such as the Environmental Information Exchange Network maintain lists of data standards relevant for particular disciplines.
Common Data Elements (CDEs) are a type of data standard used in the health sciences and NIH maintains a repository of CDEs.
Data Validation & Quality Control
It is essential for researchers to incorporate data validation and quality control procedures into their research workflow. Adopting best practices for data management - including organization and documentation - is a good start, and integrating the following methods will further enhance data quality:
- Document any data inconsistencies
- Check datasets for duplicates and errors - may require data review by a peer
- Use data validation tools
- Routinely inspect small subsets of data - most useful for those working with large volumes of data
There are a variety of tools available to help with data validation and quality control; a scripted example follows this list.
- Excel (and open-source equivalents such as LibreOffice's Calc) allows users to create rules to limit the data types or values that can be entered in certain cells. See this brief overview of Excel Data Validation or Data Validity in Calc.
- This Data Carpentry lesson on Quality Control also demonstrates some useful tools for performing quality control on your data
- Google Forms enables users to set rules for specific questions in a form so that the data entered meet necessary formatting and content criteria.
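The same kinds of checks - duplicates, out-of-range values, missing data - can also be scripted. Here is a minimal pandas sketch; the dataset and the acceptable temperature bounds are assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset; in practice, read your own file.
df = pd.DataFrame({
    "sample_id": ["S001", "S002", "S002", "S004"],
    "temp_c": [21.5, 19.8, 19.8, None],
})

duplicates = df[df.duplicated()]                               # repeated rows
out_of_range = df[(df["temp_c"] < -40) | (df["temp_c"] > 60)]  # assumed bounds
missing = df[df.isna().any(axis=1)]                            # incomplete rows

print(f"{len(duplicates)} duplicate rows, "
      f"{len(out_of_range)} out-of-range rows, "
      f"{len(missing)} rows with missing values")
```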
Documentation of Data Cleaning and Transformation
Data cleaning is focused primarily on ensuring that the final dataset includes the appropriate data. In other words, it accomplishes tasks such as eliminating duplicate entries, fixing structural errors, removing outliers, and addressing missing data. Documenting this process enables researchers to provide details and justification for edits they made in producing the final dataset used in their analyses.
Data transformation is less about "fixing" the dataset and more about converting data into other formats or arrangements to meet the needs of analysis and/or storage. Documenting this process enables researchers to have a clear set of steps they followed while modifying their data, which allows them and other future users to smoothly replicate the data transformation(s) that produced the final dataset and results.
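One lightweight way to document both processes is a timestamped log that your scripts append to as they run. The sketch below, including its example entries, is hypothetical:

```python
from datetime import datetime, timezone

LOG_FILE = "transformation_log.txt"

def log_step(description: str) -> None:
    """Append one timestamped cleaning/transformation step to the log."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(LOG_FILE, "a") as f:
        f.write(f"{stamp}  {description}\n")

log_step("Removed 12 duplicate rows from survey_raw.csv")
log_step("Recoded -999 to missing in temp_c")
log_step("Reshaped wide to long; wrote survey_long.csv")
```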
Schedule a Consultation with Dani Kirsch, Research Data Specialist.
Content adapted from the following sources: ICPSR - What is a Codebook?, USGS - Data Dictionaries, UC Merced Library - What Is a Data Dictionary?, EPA - Learn About Data Standards, NIH - Common Data Elements (CDEs), Yale Library - Validate Data, Tableau - Guide to Data Cleaning: Definition, Benefits, Components, and How to Clean Your Data