Recommendations for Host Genomics Data

Host genomics data is often linked to human subjects. This brings many ethical and legal obligations, such as applying for ethics approval and reporting the data processing to your Data protection officer. See the page on Sensitive personal data for more information.

General Recommendations

  • Sharing not only summary statistics (or significant data) but also raw, individual-level data fosters the build-up of larger datasets. This will eventually allow the determinants of phenotype to be identified more accurately.
  • Especially for raw sequencing data, make sure to include Quality Control (QC) results and details of the sequencing platform used.
  • Common terminologies for reporting statistical tests (e.g. with StatO) enable reuse and reproducibility.
  • Researchers interested in human leukocyte antigen (HLA) genomics are referred to the HLA COVID-19 consortium.
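
As an illustration of the recommendation to ship QC results and platform details with raw sequencing data, the following minimal Python sketch checks a dataset description for missing entries before deposition. The field names (sequencing_platform, instrument_model, qc_report) and the example record are assumptions made for this sketch; they are not taken from any repository's submission schema, so adapt them to the checklist of the repository you deposit to.

```python
# Hypothetical field names, chosen for illustration only; real repositories
# define their own required metadata.
REQUIRED_FIELDS = ("sequencing_platform", "instrument_model", "qc_report")


def missing_fields(record):
    """Return the required fields that are absent or empty in a dataset record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Example record (made-up values) with the QC report not yet attached.
record = {
    "dataset": "example-host-genomics-run",
    "sequencing_platform": "Illumina",
    "instrument_model": "NovaSeq 6000",
    "qc_report": "",
}

print(missing_fields(record))  # prints ['qc_report']
```

A check like this can be run before submission so that incomplete records are caught early rather than after deposition.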

Repositories

Several different types of host genomics data are being collected for COVID-19 research. Some suitable repositories for these are:

  • Gene expression: A curated list can be found in FAIRsharing; some specific examples are listed below.

    • Transcriptomics of human subjects (i.e. requiring authorized access): NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. Bianca at UPPMAX). We suggest creating a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/scilifelab.12292778 for an example.
    • Transcriptomics (cell lines/animals): ArrayExpress. The submission portal gives information on what can be submitted and how. Underlying reads will automatically be deposited to the European Nucleotide Archive (ENA).
    • Microarray-based gene expression data: ArrayExpress. The submission portal gives information on what can be submitted and how. Data on the originating sample will automatically be deposited to BioSamples.
  • Genome-wide association studies (GWAS): GWAS Catalog; GWAS Central. For human data requiring restricted access, please see the entry on Transcriptomics of human subjects above.

  • Adaptive Immune Receptor Repertoire sequencing (AIRR-seq): Samples the diversity of the immunoglobulins/antibodies and T cell receptors present in a host. The respective gene loci undergo random and irreversible rearrangement during lymphocyte development; this data is therefore fundamentally distinct from conventional genome sequencing.

    It is recommended that data be deposited using AIRR Community compliant processes and standards, in either of the following repositories:

Data and metadata standards

Gene expression

Transcriptomics:

Microarray-based gene expression data:

Genome-wide association studies (GWAS):

  • Preferred minimal metadata standard: MIxS
  • Preferred file formats

Adaptive Immune Receptor Repertoire sequencing (AIRR-seq):