Data Storage, Access, and Preservation: Metadata and Data Documentation


Documenting your research project and its data in a standardized form is an effective way to help others find, understand, use, and cite it in data centers and repositories. Information created according to a data standard is known as metadata. In other words, metadata is data about "something," and data documentation makes the research "be all that it can be." Understanding Metadata, a basic guide published by the National Information Standards Organization (NISO), is a good source for learning about metadata in more detail.

Metadata standards, or schemas, provide structure and help format the information being described. Dublin Core and DataCite are two general, cross-disciplinary metadata schemas and are good starting points for working with a standard. The Digital Curation Centre (DCC) lists discipline-specific metadata standards for Biology, Earth Sciences, Physical Sciences, Social Sciences & Humanities, and General Research Data. The Data Documentation Initiative (DDI) is a standard for describing statistical and social science data. Before you start documenting the data, consult the repository where it will be deposited to determine which metadata schema it follows.
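As a rough illustration of what a schema-based record can look like, the sketch below builds a small Dublin Core style record with Python's standard library. The namespace URI is the one used for the Dublin Core element set; the wrapper element name `metadata`, the function name `build_dc_record`, and every field value are hypothetical and only stand in for a real project. A repository may require a different schema or a different serialization, so treat this as a starting point rather than a required format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return an XML string with one dc: element per (name, value) pair."""
    # "metadata" is a hypothetical wrapper element chosen for this sketch.
    root = ET.Element("metadata")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# All values below are made up for illustration.
example_fields = [
    ("title", "Stream temperature observations (hypothetical dataset)"),
    ("creator", "Doe, Jane"),
    ("subject", "hydrology"),
    ("description", "Hourly water temperature readings from three sites."),
    ("date", "2020-06-30"),
    ("rights", "CC BY 4.0"),
]

record_xml = build_dc_record(example_fields)
print(record_xml)
```

The same list of (name, value) pairs could be mapped to another schema, such as DataCite, by changing the element names and namespace to match that schema's documentation.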

Successful metadata creation starts with having a Data Management Plan (DMP). It is important to have the plan in place at the beginning of the project to ensure proper management of the data. The plan will also help provide the content needed for the metadata.

As part of the plan, record the project data in a spreadsheet, CSV file, or tab-delimited file that describes the: Data (e.g. type of data, how much data); Standards (e.g. file formats, identifiers, and software); and Access (e.g. creators, storage and backup, sharing requirements, password protection, and intellectual property rights).
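One possible way to keep such a record is a three-column CSV file, sketched below with Python's standard `csv` module. The column names (`category`, `item`, `value`), the function name `write_inventory`, and all of the example values are assumptions made for illustration; the guide does not prescribe a particular layout.

```python
import csv
import io

# Hypothetical inventory rows grouped by the three areas named above.
ROWS = [
    {"category": "Data", "item": "type of data", "value": "tabular sensor readings"},
    {"category": "Data", "item": "how much data", "value": "about 2 GB per year"},
    {"category": "Standards", "item": "file formats", "value": "CSV; NetCDF"},
    {"category": "Standards", "item": "identifiers", "value": "DOI for each release"},
    {"category": "Access", "item": "creators", "value": "Jane Doe; John Roe"},
    {"category": "Access", "item": "storage and backup", "value": "departmental server"},
]


def write_inventory(rows, handle):
    """Write the inventory rows as CSV to any open text file or buffer."""
    writer = csv.DictWriter(handle, fieldnames=["category", "item", "value"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open("inventory.csv", "w", newline="") instead.
buffer = io.StringIO()
write_inventory(ROWS, buffer)
inventory_csv = buffer.getvalue()
print(inventory_csv)
```

For a tab-delimited file, passing `delimiter="\t"` to `csv.DictWriter` produces the same table with tabs instead of commas.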

Consider two levels of metadata to document: project level and dataset level.

Project-level metadata describes the “who, what, where, when, how and why” of the dataset, which provides context for understanding why the data was collected and how it is used. Examples of project-level metadata:

Name of the project
Dataset title
Project description
Dataset abstract
Principal investigator and collaborators
Contact information
Dataset handle (DOI or URL)
Dataset citation
Data publication date
Geographic description
Time period of data collection
Subject/keywords
Project sponsor
Dataset usage rights

Best practices: Include the names of the people and/or organizations who created the data. Include key dates associated with the data, including project start date, data modification, data release, and the time period covered by the data. For subject access, assign terms from disciplinary vocabularies along with keywords.

Provide citations for material or data derived from other sources, including details of where the source data is held and how it was accessed. Include any known intellectual property rights held for the data, along with access information such as where and how other researchers can access the data. Note the name(s) of the organizations or agencies that funded the research. Include spatial coverage where the data relates to a physical location. Provide the identifiers used for the data and researchers, such as a Digital Object Identifier (DOI), project number, and Open Researcher & Contributor ID (ORCID).
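The project-level fields listed above can be kept as a simple structured record and checked for gaps before deposit. The sketch below is one way to do that in Python. The key names in `EXPECTED_FIELDS` are assumptions derived from the example list, not part of any standard, and the record values are invented.

```python
import json

# Hypothetical key names, one per example of project-level metadata.
EXPECTED_FIELDS = [
    "project_name",
    "dataset_title",
    "project_description",
    "dataset_abstract",
    "principal_investigator",
    "contact",
    "dataset_handle",
    "dataset_citation",
    "publication_date",
    "geographic_description",
    "collection_period",
    "keywords",
    "project_sponsor",
    "usage_rights",
]


def missing_fields(record):
    """Return the expected fields that are absent or empty in the record."""
    return [name for name in EXPECTED_FIELDS if not record.get(name)]


# Invented example record; the handle and citation are not yet assigned.
record = {
    "project_name": "Hypothetical Stream Monitoring Project",
    "dataset_title": "Stream temperature observations",
    "project_description": "Long-term monitoring of three streams.",
    "dataset_abstract": "Hourly water temperature readings.",
    "principal_investigator": "Jane Doe",
    "contact": "jane.doe@example.org",
    "publication_date": "2020-06-30",
    "geographic_description": "Three stream sites in one watershed",
    "collection_period": "2018-01-01 to 2019-12-31",
    "keywords": ["hydrology", "water temperature"],
    "project_sponsor": "Example Funding Agency",
    "usage_rights": "CC BY 4.0",
}

print("Still to fill in:", missing_fields(record))
print(json.dumps(record, indent=2))
```

Keeping the record in a plain format such as JSON makes it straightforward to copy the values into whatever submission form or schema the repository uses.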

Dataset-level metadata describes the data and dataset in detail. Examples of dataset-level metadata:

Data origin: experimental, observational, raw or derived, physical collections, models, images, etc.
Data type: integer, Boolean, character, floating point, etc.
Instrument(s) used
Data acquisition details: sensor deployment methods, experimental design, sensor calibration methods, etc.
File type: CSV, mat, xlsx, tiff, HDF, NetCDF, etc.
Data processing methods, software used
Data processing scripts or codes
Dataset parameter list, including:
- Variables
- Description of each variable
- Units

Best practices: Describe how the data was generated, including equipment or software used, experimental protocol, and other materials. List all data files associated with the project, with their names and file extensions. List all formats of the data and any software required to read the data. Record any information on how and when the data has been altered or processed. Provide a glossary or code list that explains the codes, terms, abbreviations, or variables used in the data or in the file naming structure.
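A dataset parameter list can be kept as a small data dictionary and compared against the columns that actually appear in a data file. The sketch below shows one possible check in Python. The variable names, descriptions, units, sample data, and the function name `undocumented_columns` are all hypothetical.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> description and units.
DATA_DICTIONARY = {
    "site_id": {"description": "Code for the sampling site", "units": "none"},
    "timestamp": {"description": "Time of the reading", "units": "ISO 8601, UTC"},
    "water_temp": {"description": "Water temperature", "units": "degrees Celsius"},
}

# Invented sample data; the "flag" column is deliberately undocumented.
SAMPLE_CSV = (
    "site_id,timestamp,water_temp,flag\n"
    "A1,2019-01-01T00:00:00Z,4.2,0\n"
    "A1,2019-01-01T01:00:00Z,4.1,1\n"
)


def undocumented_columns(csv_text, dictionary):
    """Return header columns that have no entry in the data dictionary."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [column for column in header if column not in dictionary]


missing = undocumented_columns(SAMPLE_CSV, DATA_DICTIONARY)
print("Columns without documentation:", missing)
```

Running a check like this before sharing the data helps catch variables or codes that still need an entry in the glossary or code list.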

Sources:
The following sites and materials were consulted in the development of this web page:

University of Oregon Library Research Data Management --Brian Westra

CalTech Library Research Data Management --Gail Clement