Recommendations for Virus Genomics Data
Repositories
We suggest that raw virus sequence data, as well as assembled and annotated genomes, be submitted to the European Nucleotide Archive (ENA).
- There are several ways to submit data to ENA, including a SARS-CoV-2-specific submission process. Extensive documentation on programmatic submission is also available.
- Before submitting raw sequence data (e.g. from shotgun sequencing), contaminating human reads must be removed. This can be done with a tool such as Metagen-FastQC; for assistance, contact virus-dataflow@ebi.ac.uk.
Data and metadata standards
A list of relevant data and metadata standards can be found in FAIRsharing. Some specific examples are given below.
Data standards
We suggest storing data preferentially in the following formats, to maximize interoperability between datasets and with standard analysis pipelines:
- Raw sequences: .fastq, optionally add compression with gzip
- Genome contigs: .fastq if the assembler's per-base uncertainties can be captured as quality scores, otherwise .fasta; optionally add compression with gzip
- De novo aligned sequences: .afa
- Gene Structure: .gtf
- Gene Features: .gff
- Sequences mapped to a genome: .sam or the compressed formats .bam or .cram. Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes that reference sequence.
- Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header line.
- Genome browser tracks: .bed
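
The reference-sequence requirements for .sam and .vcf files can be checked before submission. The following is a minimal sketch in Python, not an official validator: the function names, the example contig name `ref1`, and the `example.org` URL are placeholders chosen for illustration. It assumes the header lines have already been read into a list of strings, and it treats an `@SQ` line as unambiguous only if it carries an `M5` checksum or a `UR` location, which is one possible reading of the recommendation above.

```python
import re

# Matches a VCF "##contig=<...>" header line and captures the key=value body.
CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def sam_reference_issues(header_lines):
    """Return a list of problems found in the @SQ lines of a SAM header."""
    issues = []
    sq_lines = [line for line in header_lines if line.startswith("@SQ")]
    if not sq_lines:
        issues.append("no @SQ line found")
    for line in sq_lines:
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = {}
        for item in line.rstrip("\n").split("\t")[1:]:
            tag, _, value = item.partition(":")
            fields[tag] = value
        name = fields.get("SN", "<unnamed>")
        if "SN" not in fields or "LN" not in fields:
            issues.append(f"{name}: SN and LN are mandatory")
        # Assumption: M5 (MD5 checksum) or UR (location) pins down the sequence.
        if "M5" not in fields and "UR" not in fields:
            issues.append(f"{name}: neither M5 nor UR given")
    return issues


def vcf_reference_issues(header_lines):
    """Return a list of problems found in the ##contig lines of a VCF header."""
    issues = []
    bodies = [m.group(1) for m in map(CONTIG_RE.match, header_lines) if m]
    if not bodies:
        issues.append("no ##contig line found")
    for body in bodies:
        # Naive split on commas: quoted values containing commas are not handled.
        fields = dict(item.partition("=")[::2] for item in body.split(","))
        if "URL" not in fields:
            issues.append(f"{fields.get('ID', '<unnamed>')}: no URL key")
    return issues


# Placeholder headers for illustration only.
sam_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref1\tLN:1000\tUR:https://example.org/ref1.fasta",
]
vcf_header = [
    "##fileformat=VCFv4.3",
    "##contig=<ID=ref1,length=1000>",
]
print(sam_reference_issues(sam_header))
print(vcf_reference_issues(vcf_header))
```

An empty list means that no problem was detected by these particular checks. It does not confirm that the referenced sequence is actually publicly reachable, which still has to be verified separately.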
Metadata standards
Consider annotating virus genomes using the ENA virus pathogen reporting standard checklist, a minimal information standard that is still under development, and the more general Viral Genome Annotation System (VGAS) (Zhang et al. 2019).
Phylogenetic analysis
For submitting data and metadata relating to phylogenetic relationships (including topology, branch lengths, and support values), consider using widely accepted formats such as Newick, NEXUS and PhyloXML. The Minimum Information About a Phylogenetic Analysis (MIAPA) checklist provides a reference list of useful tree annotations.
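
To illustrate how topology, branch lengths, and support values fit together in one of these formats, here is a small Python sketch that serialises a tree to Newick. The nested-dictionary layout, the `to_newick` helper, and the taxon names are assumptions made for this example only. Writing the support value in the internal node label position is a widespread convention rather than a formal requirement, so check what the receiving tool expects.

```python
def to_newick(node):
    """Serialise a nested-dict tree into a Newick string (without the final ';')."""
    children = node.get("children", [])
    if children:
        # Internal node: children in parentheses, separated by commas.
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
        # Convention assumed here: the support value goes in the label position.
        if "support" in node:
            text += str(node["support"])
    else:
        # Leaf node: just the taxon name.
        text = node["name"]
    # The branch length, if known, follows a colon.
    if "length" in node:
        text += f":{node['length']}"
    return text


# Hypothetical three-taxon tree with one supported internal branch.
tree = {
    "children": [
        {"name": "virus_A", "length": 0.1},
        {
            "children": [
                {"name": "virus_B", "length": 0.2},
                {"name": "virus_C", "length": 0.3},
            ],
            "support": 95,
            "length": 0.05,
        },
    ]
}

newick = to_newick(tree) + ";"
print(newick)  # (virus_A:0.1,(virus_B:0.2,virus_C:0.3)95:0.05);
```

For real analyses, an established phylogenetics library is a safer choice than a hand-written serialiser, since it will also handle quoting of unusual taxon names.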