Managing Research Data: Documentation

Introduction

Ultimately, the purpose of good research documentation is to have a clear record of actions, decisions, problems, and other important details that occurred throughout the project. The Research Notebook guide covers generic research note-taking in more detail. This page focuses on more formalized types of research documentation such as protocols, computational notebooks, data dictionaries, and codebooks.

Protocols

Importance of Protocols

Creating and sharing protocols can help provide clarity and consistency for all parts of a research project, including data collection, data processing, and data analysis. Following protocols helps ensure that important information is recorded (e.g., environmental conditions, date, time) and that all necessary data is collected using the correct methods, equipment, and measurement units.

Protocols are frequently shared with collaborators so that all project members can make sure they are being consistent in how they conduct various aspects of the research. Protocols can also be published and shared more broadly so that other researchers can reuse and/or modify existing protocols for their own work. 

protocols.io

Protocols.io is a platform that allows researchers to author and publish research methods and protocols. Each published protocol receives a DOI, which makes it easy to reference in research publications, and other researchers can reuse and/or modify existing protocols to fit the needs of their research.

Researchers can incorporate a wide variety of components in their protocols, including:

  • Measurements (e.g., temperature, concentration, pH, pressure, thickness, amount)
  • Duration
    • Can be used as a timer when running the protocol
  • Well plate maps
  • Equipment, software, and reagents
    • Can add custom items if they don't exist in protocols.io already
  • Notes, expected results, and safety information
  • Smart components
    • Enable real-time input of information when running the protocol
    • Input is recorded and exportable as a CSV

Published protocols can be easily reused and cited using the DOI. Researchers who want to create modified versions of protocols can fork a protocol, make their own modifications, and publish that version (which remains linked to the original protocol so that credit can be properly attributed). Authors can also create new versions of their own protocols. Older versions remain accessible but are marked as outdated.

Code Documentation

Commenting Code

Some researchers may prefer the simpler approach of adding comments (non-executed portions of code) to their scripts to describe the steps they are taking, without the hassle of switching between markdown and code. In languages such as R and Python, the pound sign or hashtag # is used to denote comments. When executing code, the program ignores everything to the right of a # on that line, unless the # appears inside a quoted string.

For example, a researcher could provide brief notes in their code script to explain what a particular function does. They and any other future users who use the code can then refer to comments to learn more about the code itself.

library(dplyr)  # provides %>% and filter()
# create a new data frame that only contains females
rodent_female <- rodent_data %>% filter(sex == "female")
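
The same convention applies in Python. The sketch below is a hypothetical Python counterpart to the R example; the sample records and the names `rodent_data` and `rodent_female` are illustrative assumptions rather than a real dataset. It also shows that a # inside a quoted string is treated as data, not as a comment.

```python
# Hypothetical sample records standing in for a real rodent dataset
rodent_data = [
    {"id": "R#01", "sex": "female", "weight_g": 41.2},
    {"id": "R#02", "sex": "male", "weight_g": 48.9},
    {"id": "R#03", "sex": "female", "weight_g": 39.5},
]

# Create a new list that only contains females
rodent_female = [row for row in rodent_data if row["sex"] == "female"]

print(len(rodent_female))  # an inline comment can follow code on the same line

# The "#" in "R#01" sits inside a string, so Python keeps it as part of the value
print(rodent_female[0]["id"])
```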

Computational Notebooks

Computational notebooks - such as Jupyter Notebooks and R Markdown files - provide researchers with a way to combine narrative text describing data processing, analysis, and visualization with the code chunks that execute each of those steps. This documentation approach is more flexible than code comments alone, as users can include additional elements alongside their code such as hyperlinks, images, tables, and equations.

For example, the image below demonstrates the appearance of a typical R Markdown file. Narrative text is formatted to include instructions on making a scatter plot, then the code itself is provided, followed by the graph it produces.

R Markdown file demonstrating a mix of narrative text, images, and code in this style of computational notebook.
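
Under the hood, a Jupyter notebook is a JSON file that holds an ordered list of cells. As a rough sketch of how narrative and code sit side by side, the Python below builds a minimal two-cell notebook by hand using only the standard library. The file name `scatter_demo.ipynb`, the cell contents, and the variable names are hypothetical; in practice, notebooks are created in an editor such as JupyterLab rather than written this way.

```python
import json
import tempfile
from pathlib import Path

# Narrative cell: Markdown text explaining the step (hypothetical content)
markdown_cell = {
    "cell_type": "markdown",
    "metadata": {},
    "source": ["## Scatter plot\n", "Plot weight against length for each animal."],
}

# Code cell: the code that carries out the step described in the narrative
code_cell = {
    "cell_type": "code",
    "execution_count": None,
    "metadata": {},
    "outputs": [],
    "source": ["print('plotting code would go here')"],
}

# Minimal notebook structure following version 4 of the notebook format
notebook = {
    "cells": [markdown_cell, code_cell],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Write to a temporary folder so the sketch does not touch real project files
notebook_path = Path(tempfile.mkdtemp()) / "scatter_demo.ipynb"
notebook_path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")

# Read the file back and list the cell types in order
loaded = json.loads(notebook_path.read_text(encoding="utf-8"))
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because the narrative and the code are stored in the same file, they travel together whenever the notebook is shared.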


Change Logs and Version Control

Change Logs

Change logs are one approach to recording (and potentially sharing) changes made to a project. For example, the "View History" tab on a Wikipedia page provides a list of revisions made to that page, including the user and timestamp. Other platforms such as OpenRefine also automatically track and record changes, enabling users to reconstruct their work and even revert to an earlier state of the dataset.

Although these kinds of change logs are generated automatically, you can also create manual change logs for portions of your research project. For example, when you start cleaning your data, you can create a new file for recording notes about data cleaning and save that file in the same folder as your data file(s). Information that may be helpful to record while cleaning data includes:

  • Current date
  • Researcher name (if multiple researchers are cleaning the data)
  • Data file being cleaned (if multiple different data files exist)
  • Errors and/or inconsistencies identified in the data
  • Corrections made to specified errors and/or inconsistencies
  • Decisions made to transform variables, add new variables, and/or remove data

Ultimately, a change log - whether generated manually or automatically - should enable you to reconstruct the actions you took when working on a particular file. Having this information clearly documented will help with preparation of presentations, articles, and other outputs because you will be able to easily reference past decisions rather than relying on your own memory.
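
A manual change log does not have to be free-form notes. As one possible sketch, the Python below appends entries to a CSV file whose columns mirror the items suggested in this section. The function name `log_change`, the column names, the file name `cleaning_log.csv`, and the example entries are all hypothetical choices made for illustration.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

# Column names mirror the items suggested for a data-cleaning change log
FIELDS = ["date", "researcher", "data_file", "issue", "correction", "decision"]


def log_change(log_path, researcher, data_file, issue, correction, decision=""):
    """Append one data-cleaning entry to a CSV change log, creating it if needed."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header row only once
        writer.writerow({
            "date": date.today().isoformat(),
            "researcher": researcher,
            "data_file": data_file,
            "issue": issue,
            "correction": correction,
            "decision": decision,
        })


# Hypothetical usage: a temporary folder stands in for the data folder
data_folder = Path(tempfile.mkdtemp())
log_file = data_folder / "cleaning_log.csv"
log_change(log_file, "A. Researcher", "rodent_data.csv",
           "sex recorded as both 'F' and 'female'",
           "recoded 'F' to 'female'")
log_change(log_file, "A. Researcher", "rodent_data.csv",
           "3 rows with missing weight",
           "rows removed",
           decision="exclude animals without a recorded weight")

# Read the log back to confirm both entries were recorded
with log_file.open(newline="", encoding="utf-8") as handle:
    entries = list(csv.DictReader(handle))
print(len(entries))
```

Keeping the log as a CSV means it can be opened in a spreadsheet, sorted by date or file, and shared alongside the data.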

Version Control

Like change logs, version control can be either automated or manual. Manual version control is often done by changing the names of files to reflect a new version through the use of numbers (e.g., thesis_draft_v04.docx) or dates (e.g., dissertation_data_20251107.csv). Researchers may also maintain a record of major changes in a separate file (similar to the manual change log approach described above) to report important changes that were made between one version of a file and the next.
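
Naming files by hand is easy to get wrong, so some researchers script it. The following is a minimal sketch, assuming file names end in a `_vNN` counter or a `YYYYMMDD` date as in the examples above; the helper names `next_version_name` and `dated_name` are hypothetical.

```python
import re
from datetime import date


def next_version_name(filename):
    """Return the file name with its _vNN counter increased by one."""
    match = re.search(r"_v(\d+)(\.\w+)$", filename)
    if match is None:
        raise ValueError(f"no _vNN version suffix found in {filename!r}")
    number, extension = match.groups()
    width = len(number)  # keep the zero padding, e.g. 04 becomes 05
    new_number = str(int(number) + 1).zfill(width)
    return filename[:match.start()] + f"_v{new_number}{extension}"


def dated_name(stem, extension, on=None):
    """Return a file name stamped with a YYYYMMDD date (today by default)."""
    on = on or date.today()
    return f"{stem}_{on.strftime('%Y%m%d')}.{extension}"


print(next_version_name("thesis_draft_v04.docx"))  # thesis_draft_v05.docx
print(dated_name("dissertation_data", "csv", date(2025, 11, 7)))  # dissertation_data_20251107.csv
```

Zero-padded numbers and year-first dates both sort correctly in a file browser, which is why these two patterns are common.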

However, in some instances researchers may prefer automatic version control, as it does not rely on the user to remember to record changes and can often provide greater detail about the changes. The "Track Changes" option in Microsoft Word is one such example: it records the types of changes made and the individuals (accounts) that made them, and it enables users to accept or reject changes as desired. Although this approach provides an automated record of changes made to a file, it can be difficult to restore previous versions of the file or to clearly articulate what changes were made and why.

Git and GitHub

Version control systems such as Git provide additional functionality beyond just recording the who, what, and when of changes made. Individuals on a collaborative project can independently make changes to the same file, and those changes can be compared and merged back to the main copy of the file. As changes are made, users are prompted to provide brief descriptions of the changes they have made (known in Git as commit messages), which are recorded along with the rest of the version history (e.g., user, timestamp).

Note: Git and GitHub work best at tracking plain text files such as TXT, JSON, and XML, as well as code scripts such as R and Python files. They can still track other types of files, but may not be able to provide the same level of detail regarding line-by-line changes to file contents.
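
To illustrate why plain text files suit this kind of tracking, the sketch below uses Python's standard `difflib` module (not Git itself) to compare two versions of a small text file line by line, similar in spirit to the comparison a version control system reports. The file name and the file contents are hypothetical.

```python
import difflib

# Two hypothetical versions of a short plain text analysis script
old_version = [
    "data = load('rodents.csv')",
    "females = keep(data, sex='F')",
    "summarise(females)",
]
new_version = [
    "data = load('rodents.csv')",
    "females = keep(data, sex='female')",
    "summarise(females)",
]

# unified_diff marks removed lines with "-" and added lines with "+"
diff_lines = list(difflib.unified_diff(
    old_version,
    new_version,
    fromfile="analysis.txt (before)",
    tofile="analysis.txt (after)",
    lineterm="",
))
print("\n".join(diff_lines))
```

The output pinpoints the single line that changed. A binary format such as a DOCX or image file has no meaningful lines to compare, so a tool can usually report only that the file differs.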
