Recommendations for Host Genomics Data¶
Host genomics data is often coupled to human subjects. This comes with many ethical and legal obligations, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive personal data for more information.
General Recommendations¶
- Sharing not only summary statistics (or significant data) but also raw, individual-level data fosters the build-up of larger datasets, which will eventually allow the determinants of phenotype to be identified more accurately.
- Especially for raw sequencing data, make sure to include Quality Control (QC) results and details of the sequencing platform used.
- Common terminologies for reporting statistical tests (e.g. with StatO) enable reuse and reproducibility.
- Researchers interested in human leukocyte antigen (HLA) genomics are referred to the HLA COVID-19 consortium.
Repositories¶
Several different types of host genomics data are being collected for COVID-19 research. Some suitable repositories for these are:
Gene expression: A curated list can be found in FAIRsharing; some specific examples are listed below.
- Transcriptomics of human subjects (i.e. requiring authorized access): NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. at Bianca on Uppmax). We suggest making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/scilifelab.12292778 for an example.
- Transcriptomics (cell lines/animals): ArrayExpress. The submission portal gives information on what can be submitted and how. Underlying reads will automatically be deposited to the European Nucleotide Archive (ENA).
- Microarray-based gene expression data: ArrayExpress. The submission portal gives information on what can be submitted and how. Data on the originating sample will automatically be deposited to BioSamples.
Genome-wide association studies (GWAS): GWAS Catalog; GWAS Central; for human data requiring restricted access, please see the section on Transcriptomics of human subjects above.
Adaptive Immune Receptor Repertoire sequencing (AIRR-seq): Samples the diversity of the immunoglobulins/antibodies and T cell receptors present in a host. The respective gene loci undergo random and irreversible rearrangement during lymphocyte development, so this data is fundamentally distinct from conventional genome sequencing.
It is recommended that data be deposited using AIRR Community compliant processes and standards, in either of the following repositories:
- AIRR-seq specific repositories that are part of the AIRR Data Commons, for example the iReceptor Public Archive or VDJServer.
- INSDC repositories via NCBI SRA/GenBank, following the AIRR Community recommended NCBI submission processes.
Data and metadata standards¶
Gene expression¶
Transcriptomics:¶
Microarray-based gene expression data:¶
- Preferred minimal metadata standard: MIAME
- Preferred file formats: tab-delimited text, e.g. MAGE-TAB and ISA-TAB, and raw data file formats from commercial microarray platforms (Annotare accepted formats)
Genome-wide association studies (GWAS):¶
Adaptive Immune Receptor Repertoire sequencing (AIRR-seq):¶
- Preferred minimal metadata standards: MiAIRR
- Preferred file formats: AIRR repertoire metadata (formatted as .json or .yaml), AIRR rearrangements (formatted as .tsv)
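Before depositing, it can help to confirm that a rearrangements file is tab-separated and carries the expected columns. The following is a minimal sketch using only the Python standard library. The column list is an assumption (a subset of what the AIRR Rearrangement schema is understood to require) and should be checked against the current AIRR Community schema; the function name `missing_columns` and the example data are hypothetical.

```python
import csv
import io

# ASSUMPTION: a subset of the columns the AIRR Rearrangement schema marks as
# required. Verify against the current AIRR Community schema before use.
ASSUMED_REQUIRED_COLUMNS = (
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
)


def missing_columns(tsv_handle, required=ASSUMED_REQUIRED_COLUMNS):
    """Return the required column names absent from a TSV header row."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required if name not in header]


# Hypothetical two-column file, used only to illustrate the check.
example = io.StringIO("sequence_id\tsequence\nseq1\tACGT\n")
print(missing_columns(example))
```

For a real file, the same function can be called with an open file handle, for instance `open(path, newline="")`. An empty result means every column in the assumed list is present; it does not validate the values themselves.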