Recommendations for Structural data
Repositories
Several different types of structural data are being collected for COVID-19 research. Some suitable repositories for these are:
- Structural data on proteins acquired using any experimental technique (e.g. X-ray crystallography, nuclear magnetic resonance) should be deposited in the wwPDB: Worldwide Protein Data Bank via EBI PDBe.
Locating existing data
The COVID-19 Molecular Structure and Therapeutics Hub is a community data repository and curation service for structures, models, therapeutics, simulations and related computations for research into the COVID-19 pandemic. It is maintained by The Molecular Sciences Software Institute (MolSSI) and BioExcel.
Data and metadata standards
X-ray diffraction
- There are no widely accepted standards for X-ray raw data files. Generally these are stored and archived in the vendors' native formats. Metadata is stored in CBF/imgCIF format (see also Catalogue of Metadata Resources for Crystallographic Applications).
- Processed structural information is submitted to structural databases in the PDBx/mmCIF format.
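To give a feel for what PDBx/mmCIF looks like on disk, the following is a minimal sketch that collects simple `_category.item value` pairs from mmCIF text. It deliberately ignores `loop_` tables and multi-line values, the sample entry is hypothetical, and the helper name `parse_mmcif_items` is an illustration rather than part of any library. For real work a dedicated mmCIF parser should be used instead.

```python
def parse_mmcif_items(text):
    """Collect single-line `_category.item value` pairs from mmCIF text.

    Sketch only: loop_ tables and multi-line (semicolon-delimited)
    values are skipped, not parsed.
    """
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue  # data_ header, comment, loop_ keyword or loop row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # loop_ column header, or value given on a later line
        key, value = parts
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        items[key] = value
    return items


# Hypothetical sample entry, not taken from a real deposition.
SAMPLE = """\
data_EXAMPLE
#
_entry.id   EXAMPLE
#
_cell.length_a   50.000
_cell.length_b   60.000
_struct.title 'Hypothetical example entry'
#
loop_
_atom_type.symbol
C
N
"""

items = parse_mmcif_items(SAMPLE)
print(items["_entry.id"], items["_cell.length_a"])
```

The dictionary keys keep the leading underscore and the `category.item` form, so the same names can be looked up in the PDBx/mmCIF dictionary.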
Electron microscopy
- Data archiving and validation standards for cryo-EM maps and models are coordinated internationally by EMDataResource (EMDR).
- Cryo-EM structures (map, experimental metadata, and optionally coordinate model) are deposited and processed through the wwPDB OneDep system, following the same annotation and validation workflow that is used for X-ray crystallography and nuclear magnetic resonance (NMR) structures. EMDB holds all workflow metadata, while PDB holds a subset of the metadata.
- Most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.).
- Processed structural information is submitted to structural resources as PDBx/mmCIF.
- Experimental metadata are described in EMDR; see also Lawson et al., 2020.
NMR
- There are no widely accepted standards for NMR raw data files. Generally these are stored and archived in single FID/SER files.
- The NMReDATA initiative is one effort to standardize how NMR parameters extracted from 1D and 2D spectra of organic compounds are associated with the proposed chemical structure.
- There is no universally accepted format, especially for crucial FID-associated metadata. NMR-STAR, together with its NMR-STAR Dictionary, is the archival format used by the Biological Magnetic Resonance Data Bank (BMRB), the international repository of biomolecular NMR data and an archive of the Worldwide Protein Data Bank (wwPDB).
- The nmrML format specification (an XML Schema Definition (XSD) and an accompanying controlled vocabulary called nmrCV) is an open markup language and ontology for NMR data (PhenoMeNal H2020 project, 2019).
- Processed structural information is submitted in the PDBx/mmCIF format.
Neutron scattering
- ENDF/B-VI, from the Cross-Section Evaluation Working Group (CSEWG), and JEFF, from the OECD/NEA, have been widely used in the nuclear community. The latest versions of the two nuclear reaction data libraries are JEFF-3.3 and ENDF/B-VIII.0 (Brown et al., 2018), with significantly upgraded data for a number of nuclides (Carlson et al., 2018).
- Neutron scattering data are stored in the internationally-adopted ENDF-6 format maintained by CSEWG.
- Processed structural information is submitted in the PDBx/mmCIF format.
Molecular Dynamics (MD) simulations
- Raw trajectory files containing all the coordinates, velocities, forces and energies of the simulation are stored as binary files: .trr, .dcd, .xtc and .netCDF.
- Structural models refined from experimental structural data using MD simulations are stored in .pdb format.
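Because .pdb is a fixed-column text format, coordinates can be read from a refined model by slicing each ATOM/HETATM line at known column positions. The sketch below illustrates this for a single record. The helper name `parse_atom_record` and the sample line are illustrative assumptions, and only a subset of the fields is extracted.

```python
def parse_atom_record(line):
    """Extract selected fields from one fixed-column PDB ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "element": line[76:78].strip(),    # columns 77-78
    }


# Hypothetical record, split into pieces so the column layout is visible.
SAMPLE_LINE = (
    "ATOM      1  N   MET A   1    "  # columns 1-30: record, serial, atom, residue
    "  27.340  24.430   2.614"        # columns 31-54: x, y, z
    "  1.00  9.67"                    # columns 55-66: occupancy, B-factor
    "           N"                    # columns 67-78: padding, element symbol
)

atom = parse_atom_record(SAMPLE_LINE)
print(atom["atom_name"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters here, since adjacent numeric fields in a .pdb file can run together without a separating space.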
Computer-aided drug design data
- Virtual screening results are stored in 3D chemical data formats such as .pdb.
- Structural formulas are stored either as SMILES or as IUPAC International Chemical Identifiers (InChI), and are identified through the InChIKey, a non-proprietary identifier for chemical substances (Heller et al., 2015).
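When collecting screening results from several sources, a quick sanity check on identifiers can catch truncated or mistyped keys. A standard InChIKey is 27 characters long: blocks of 14, 10 and 1 uppercase letters separated by hyphens. The sketch below checks that layout only. It does not verify that the key corresponds to any particular structure, and the function name `looks_like_inchikey` is an illustrative assumption.

```python
import re

# 14 uppercase letters, hyphen, 10 uppercase letters, hyphen, 1 uppercase letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(candidate):
    """Return True if `candidate` has the 14-10-1 layout of an InChIKey.

    Format check only: the hash blocks themselves are not validated.
    """
    return INCHIKEY_PATTERN.fullmatch(candidate) is not None


ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # InChIKey for ethanol (SMILES: CCO)
print(looks_like_inchikey(ethanol_key))
print(looks_like_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA"))  # last block missing
```

Generating an InChIKey from a structure requires a cheminformatics toolkit, so that step is left out of this sketch.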