Metadata

Good documentation in research projects, describing how the datasets were created, how they are structured, and what they mean, is essential for making your data understandable. Metadata provides such ‘data about data’, and may include information on the methodology used to collect the data, analytical and procedural information, definitions of variables, units of measurement, any assumptions made, the format and file type of the data, and the software used to collect and/or process the data.

Researchers are strongly encouraged to use community metadata standards where these are in place (see further down). Data repositories may also provide guidance about appropriate metadata standards and requirements, e.g. ENA sample checklists. It is highly recommended to structure metadata (e.g. sample metadata) from the very beginning of the project in a way that enables sequence data submission without having to reformat the metadata.
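
As a minimal sketch of what checking sample metadata early could look like, the Python example below tests a tab-separated sample sheet for missing columns and empty values. The column names in REQUIRED_FIELDS are hypothetical placeholders, not taken from any actual repository checklist; in practice the required fields would come from the checklist of the repository you plan to submit to.

```python
import csv
import io

# Hypothetical required columns, for illustration only. A real list would be
# taken from the chosen repository checklist.
REQUIRED_FIELDS = ["sample_alias", "taxon_id", "scientific_name", "collection_date"]


def find_metadata_problems(tsv_text, required=REQUIRED_FIELDS):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in required if name not in header]
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for name in required:
            if name in header and not (row.get(name) or "").strip():
                problems.append(f"row {row_number}: empty value for {name}")
    return problems


# Made-up example sheet: the second sample lacks a collection date.
SAMPLE_SHEET = (
    "sample_alias\ttaxon_id\tscientific_name\tcollection_date\n"
    "S1\t9606\tHomo sapiens\t2021-03-04\n"
    "S2\t9606\tHomo sapiens\t\n"
)

for problem in find_metadata_problems(SAMPLE_SHEET):
    print(problem)
```

Running a check like this each time the sample sheet changes is one way to avoid discovering gaps only at submission time.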

Ontologies

Ontologies, controlled vocabularies and data dictionaries are used to standardize the language used to describe the metadata. Think of the many ways to write that the organism is human (human, Human, homo sapiens, H. sapiens, Homo Sapiens, man, etc.). Using an ontology such as NCBI Taxonomy unifies the language and makes it easier for both humans and machines to interpret and work with the data. While an ontology has a hierarchical structure, a controlled vocabulary is an unstructured set of terms. A data dictionary is a user-defined way of describing what all the variable names and values in your data really mean.
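
The idea can be illustrated with a short Python sketch that maps free-text organism names onto a single NCBI Taxonomy identifier (9606 is the NCBI Taxonomy ID for Homo sapiens) and keeps a small data dictionary next to the data. The synonym table, the function name and the variable names in DATA_DICTIONARY are assumptions made up for this example; a real project would take its terms from the ontology itself.

```python
# Illustrative synonym table only, not an official mapping.
ORGANISM_SYNONYMS = {
    "human": "NCBITaxon:9606",
    "homo sapiens": "NCBITaxon:9606",
    "h. sapiens": "NCBITaxon:9606",
    "man": "NCBITaxon:9606",
}

# Hypothetical data dictionary: what each variable means and how it is encoded.
DATA_DICTIONARY = {
    "organism": "Species of the sampled individual, as an NCBI Taxonomy ID",
    "age_years": "Age at sampling, in whole years",
    "sex": "Reported sex, one of: female, male, unknown",
}


def normalise_organism(free_text):
    """Map a free-text organism name to an ontology term, or None if unknown."""
    key = " ".join(free_text.strip().lower().split())  # trim and collapse spaces
    return ORGANISM_SYNONYMS.get(key)


for raw in ["human", "Human", "homo sapiens", "H. sapiens", "Homo Sapiens", "man"]:
    print(raw, "->", normalise_organism(raw))
```

Returning None for unknown names, rather than guessing, leaves unmapped values visible so they can be curated by hand.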

For a suggested list of ontologies appropriate for the life science community, please see FAIRsharing.org and filter on e.g. Domain.

Below are ontology resources, adapted from Table 2 in Griffin PC, Khadake J, LeMay KS et al. Best practice data life cycle approaches for the life sciences. F1000Research 2018, 6:1618. doi: 10.12688/f1000research.12344.2

  • Ontology Lookup Service - Discover different ontologies and their contents.
  • OBO Foundry - Table of open biomedical ontologies with information on development status, license and content.
  • Zooma - Assign ontology terms using curated mapping.
  • Webulous - Create new ontology terms easily.
  • Ontobee - A linked data server that facilitates ontology data sharing, visualization, and use.

Data and metadata standards Genomics data

A list of relevant data and metadata standards can be found in FAIRsharing; some specific examples are below.

Gene expression

Transcriptomics:

  • Preferred minimal metadata standard: MINSEQE
  • Preferred file formats (sequencing-based):
    • Raw sequences: .fastq (compression can be added with gzip)
    • Mapped sequences: .sam (compression with .bam or .cram). Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes that reference sequence.
    • Transcripts per million (TPM): .csv
  • Gene Structure: .gtf
  • Gene Features: .gff
  • Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header line.
  • Also see FAIRsharing using the query ‘transcriptomics’
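
The reference-sequence recommendations for .sam and .vcf files above can be checked mechanically. The Python sketch below scans header lines that have already been read from a file (for example the output of a header-only dump) and reports sequences whose reference is not pinned down. Two things here are assumptions of this sketch rather than rules from any standard: an @SQ line is treated as sufficiently described if it carries at least one of the optional AS (assembly), M5 (checksum) or UR (URI) tags, and a ##contig line is treated as sufficiently described if it carries a URL key. The function names and example headers are made up.

```python
import re

ID_PATTERN = re.compile(r"[<,]ID=([^,>]+)")
URL_PATTERN = re.compile(r"[<,]URL=")


def sam_sq_without_reference(header_lines):
    """Return names of @SQ records that carry none of the AS, M5 or UR tags."""
    incomplete = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields if ":" in field)
        if not ({"AS", "M5", "UR"} & tags.keys()):
            incomplete.append(tags.get("SN", "?"))
    return incomplete


def vcf_contigs_without_url(header_lines):
    """Return IDs of ##contig header lines that have no URL key."""
    missing = []
    for line in header_lines:
        if not line.startswith("##contig=<"):
            continue
        # Simplistic pattern matching; it does not handle quoted values
        # that themselves contain commas.
        if not URL_PATTERN.search(line):
            match = ID_PATTERN.search(line)
            missing.append(match.group(1) if match else "?")
    return missing


# Made-up example headers.
SAM_HEADER = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422\tAS:GRCh38",
    "@SQ\tSN:chr2\tLN:242193529",
]
VCF_HEADER = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422,URL=https://example.org/ref.fa>",
    "##contig=<ID=chr2,length=242193529>",
]

print(sam_sq_without_reference(SAM_HEADER))
print(vcf_contigs_without_url(VCF_HEADER))
```

In both example headers the chr2 record is the one flagged, since it names the sequence and its length but gives no pointer to the reference it came from.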

Microarray-based gene expression data:

Genome-wide association studies (GWAS):

  • Preferred minimal metadata standard: MIxS
  • Preferred file formats

Metagenomics

  • MIxS - MIGS/MIMS Minimum Information about a (Meta)Genome Sequence. The MIMS extension includes key environmental metadata. Developed by the Genomic Standards Consortium. Numerous adopters including NCBI/EBI/DDBJ databases.
  • MIMARKS Minimum Information about a MARKer gene Sequence. This is an extension of MIGS/MIMS for environmental sequences. Developed by the Genomic Standards Consortium. Numerous adopters including NCBI/EBI/DDBJ databases.

Functional Annotation of Animal Genomes Consortium (FAANG) standards

Data and metadata standards Proteomics

For a curated list of relevant standards see FAIRsharing using the query ‘proteomics’.

  • Use the minimal information model specified in MIAPE by the HUPO Proteomics Standards Initiative (HUPO PSI), and fill the model using the controlled vocabularies specified by the Proteomics Standards Initiative: PSI CVs
  • Recommended formats:
    • For gel electrophoresis gelML
    • For transition lists TraML
    • For raw spectrometer output mzML
    • For reporting mzTab
    • For protein quantification data mzQuantML
    • For protein identification data mzIdentML
    • For metadata ISA-TAB with conversion to PRIDE format

Data and metadata standards Metabolomics:

For a curated list of relevant standards see FAIRsharing using the query ‘metabolomics’.

  • Core Information for Metabolomics Reporting CIMR standard
  • For identifying chemical compounds use SMILES or InChI
  • To document Investigation/Study/Assay data, use the ISA Abstract Model, also implemented as a tabular format, ISA-Tab in MetaboLights. For an introduction to ISA, see (Sansone S-A et al., 2012)
  • Recommended formats for LC-MS data: the ANDI-MS specification (an analytical data interchange protocol for chromatographic data representation) and/or mzML
  • Recommended formats for NMR data: nmrCV, nmrML

Data and metadata standards Lipidomics:

  • Metadata should follow recommendations from the CIMR standard by the Metabolomics Standards Initiative. It should be made available as tab-separated or comma-separated files (.tsv or .csv).
  • Data can be stored as LC-MS files, or in tab-separated (.tsv) or comma-separated (.csv) formats.
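
As a small illustration of the tab-separated option, the Python sketch below writes a metadata table to .tsv text and reads it back using only the standard library. The column names and values are hypothetical and are not prescribed by the CIMR standard.

```python
import csv
import io

# Hypothetical metadata rows; column names are placeholders for illustration.
ROWS = [
    {"sample_id": "L001", "organism": "Homo sapiens", "extraction_method": "Folch"},
    {"sample_id": "L002", "organism": "Homo sapiens", "extraction_method": "Folch"},
]


def to_tsv(rows):
    """Serialise a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


tsv_text = to_tsv(ROWS)
print(tsv_text)
```

Using a CSV library rather than joining strings by hand means that values containing tabs, quotes or line breaks are quoted consistently.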

Data and metadata standards Structural data / Imaging

X-ray diffraction

Electron microscopy

  • Data archiving and validation standards for cryo-EM maps and models are coordinated internationally by EMDataResource (EMDR).
  • Cryo-EM structures (map, experimental metadata, and optionally coordinate model) are deposited and processed through the wwPDB OneDep system, following the same annotation and validation workflow also used for X-ray crystallography and nuclear magnetic resonance (NMR) structures. EMDB holds all workflow metadata while PDB holds a subset of the metadata.
  • Most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.).
  • Processed structural information is submitted to structural resources as PDBx/mmCIF.
  • Experimental metadata include information about the sample, specimen preparation, imaging, image processing, symmetry, reconstruction method, resolution and resolution method, as well as a description of the modeling/fitting procedures used. They are described in EMDR; see also Lawson et al., 2020.

NMR

  • There are no widely accepted standards for NMR raw data files. Generally these are stored and archived in single FID/SER files.
  • The NMReDATA format is one effort to standardize how NMR parameters extracted from 1D and 2D spectra of organic compounds are associated with the proposed chemical structure.
  • There is no universally accepted format, especially for crucial FID-associated metadata. NMR-STAR and its NMR-STAR Dictionary is the archival format used by the Biological Nuclear Magnetic Resonance data Bank (BMRB), the international repository of biomolecular NMR data and an archive of the Worldwide Protein Data Bank (wwPDB).
  • The nmrML format specification, consisting of an XML Schema Definition (XSD) and an accompanying controlled vocabulary called nmrCV, is an open markup language and ontology for NMR data.
  • Processed structural information is submitted in the PDBx/mmCIF format.

Neutron scattering

  • ENDF/B-VI of the Cross-Section Evaluation Working Group (CSEWG) and JEFF of the OECD/NEA have been widely utilized in the nuclear community. The latest versions of the two nuclear reaction data libraries are JEFF-3.3 and ENDF/B-VIII.0 (Brown et al., 2018), with a significant upgrade in data for a number of nuclides (Carlson et al., 2018).
  • Neutron scattering data are stored in the internationally-adopted ENDF-6 format maintained by CSEWG.
  • Processed structural information is submitted in the PDBx/mmCIF format.