Recommendations for Virus Genomics Data

Repositories

We suggest that raw virus sequence data, as well as assembled and annotated genomes, be submitted to the European Nucleotide Archive (ENA).

Data and metadata standards

A list of relevant data and metadata standards can be found in FAIRsharing; some specific examples are given below.

Data standards

We suggest storing data preferentially in the following formats, in order to maximize interoperability between datasets and with standard analysis pipelines:

  • Raw sequences: .fastq, optionally compressed with gzip
  • Genome contigs: .fastq if the assembler's uncertainties can be captured, otherwise .fasta; optionally compressed with gzip
  • De novo aligned sequences: .afa
  • Gene structure: .gtf
  • Gene features: .gff
  • Sequences mapped to a genome: .sam or the compressed formats .bam or .cram. Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes that reference sequence.
  • Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header line.
  • Genome browser tracks: .bed
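
The two header recommendations above (for .sam and .vcf files) can be checked before submission. The following is a minimal sketch in Python that works on plain-text header lines only. The function names are hypothetical, and the rule that an @SQ line should carry either a UR (URI) or an M5 (MD5 checksum) tag is our assumption about what "unambiguously describes" could mean in practice, not a requirement stated by ENA.

```python
import re


def sam_reference_issues(header_lines):
    """Return a list of problems found in the @SQ lines of a SAM header."""
    issues = []
    sq_lines = [ln for ln in header_lines if ln.startswith("@SQ")]
    if not sq_lines:
        issues.append("no @SQ line found")
    for ln in sq_lines:
        # SAM header fields are tab-separated TAG:VALUE pairs.
        fields = ln.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        # SN (name) and LN (length) are mandatory on @SQ lines.
        for required in ("SN", "LN"):
            if required not in tags:
                issues.append(f"@SQ line lacks {required}")
        # Assumption: ask for a URI or a checksum to pin down the reference.
        if not ({"UR", "M5"} & tags.keys()):
            issues.append(f"@SQ {tags.get('SN', '?')} has neither UR nor M5")
    return issues


def vcf_contigs_without_url(header_lines):
    """Return the IDs of ##contig header lines that have no URL key."""
    missing = []
    for ln in header_lines:
        match = re.match(r"##contig=<(.*)>\s*$", ln)
        if not match:
            continue
        # Keys are separated by commas; values may be double-quoted.
        pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1))
        keys = dict(pairs)
        if "URL" not in keys:
            missing.append(keys.get("ID", "?"))
    return missing


# Illustrative headers; the sequence names and URLs are placeholders.
sam_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref1\tLN:1000\tUR:https://example.org/ref1.fasta",
    "@SQ\tSN:ref2\tLN:500",
]
vcf_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=ref1,length=1000,URL=https://example.org/ref1.fasta>",
    "##contig=<ID=ref2,length=500>",
]

print(sam_reference_issues(sam_header))
print(vcf_contigs_without_url(vcf_header))
```

For .bam and .cram files the header first has to be extracted as text (for example with a SAM/BAM toolkit) before these checks can be applied.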

Metadata standards

Consider annotating virus genomes using the ENA virus pathogen reporting standard checklist, a minimal information standard that is currently under development, and the more general Viral Genome Annotation System (VGAS) (Zhang et al. 2019).

Phylogenetic analysis

For submitting data and metadata relating to phylogenetic relationships (including topology, branch lengths, and support values), consider using widely accepted formats such as Newick, NEXUS, and PhyloXML. The Minimum Information About a Phylogenetic Analysis (MIAPA) checklist provides a reference list of useful tree annotations.
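
As an illustration of how topology, branch lengths, and support values fit together in Newick, the following minimal Python sketch serializes a small tree. The nested-dictionary layout and the function name to_newick are hypothetical and chosen only for this example. Writing the support value as the label of an internal node is a common convention rather than part of the Newick format itself, so check how your downstream tools interpret internal labels.

```python
def to_newick(node):
    """Serialize a nested-dict tree to Newick (without the trailing ';')."""
    children = node.get("children", [])
    if children:
        inner = ",".join(to_newick(child) for child in children)
        # Assumed convention: the internal node label holds the support value.
        label = str(node.get("support", node.get("name", "")))
        text = f"({inner}){label}"
    else:
        # Leaves are written as their name only.
        text = node.get("name", "")
    if "length" in node:
        # Branch length leading to this node, appended after a colon.
        text += f":{node['length']}"
    return text


# Illustrative tree with three placeholder taxa A, B and C.
tree = {
    "children": [
        {"name": "A", "length": 0.1},
        {
            "support": 95,
            "length": 0.05,
            "children": [
                {"name": "B", "length": 0.2},
                {"name": "C", "length": 0.3},
            ],
        },
    ]
}

print(to_newick(tree) + ";")  # prints (A:0.1,(B:0.2,C:0.3)95:0.05);
```

NEXUS and PhyloXML can carry the same information with richer annotations; PhyloXML in particular has dedicated elements for support values, which avoids the ambiguity of internal labels.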