Welcome to SciLifeLab Data Guidelines!

SciLifeLab is committed to the principles of FAIR (Findable, Accessible, Interoperable and Reusable) research data, i.e. that data should be easily accessed, understood, exchanged and reused. We work actively to ensure that the investments made by society in research infrastructure resources achieve the highest possible impact.

Research data management concerns the organization, storage, preservation, and sharing of data that is collected or analysed during a research project. Proper planning and management of research data will make project management easier and more efficient while projects are being performed. It also facilitates sharing and allows others to validate as well as reuse the data.

The purpose of these guidelines is to serve as an information resource to researchers regarding research data management. Click on any of the data types for guidance on good data management practices during the data life cycle, including available infrastructures for data generation and analysis and appropriate data repositories for sharing. There is also overarching guidance, applicable to all data types, on e.g. metadata standards and managing sensitive data under General information.

Data types: Generic guidance:

COVID-19

General information

Please see the Swedish COVID-19 Data Portal for the latest information regarding Swedish efforts in COVID-19 research, including data generating facilities. Also see the European COVID-19 Data Portal and the Horizon 2020 guidelines regarding COVID-19 for useful information at the European level.

Data Life Cycle

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about standards and infrastructure resources available during these phases.

[Figure: the data life cycle]

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

SciLifeLab National Genomics Infrastructure (NGI) provides a wide range of sequencing technologies and can offer state-of-the-art solutions for many different types of COVID-19 sequencing projects. Chemical proteomics & proteogenomics and BioMS offer mass spectrometry support. For a complete list please visit the Swedish COVID-19 Data Portal.

Data analysis

  • NBIS (National Bioinformatics Infrastructure Sweden) national research infrastructure offers bioinformatic support in various forms for a wide range of areas including NGS, proteomics, metabolomics and biostatistics.
  • SNIC (Swedish National Infrastructure for Computing) national research infrastructure makes available large scale high performance computing resources. Apply for Small, Medium, Large or Sensitive data allocation, depending on size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backups for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university Research Data Office for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

The guidelines in all subsections regarding COVID-19 have been adapted from the Research Data Alliance 5th release of the COVID-19 Data Sharing Recommendations & Guidelines.

In general:

  • Think early about systematic naming of filenames. Not thinking about it early enough is often the cause of a lot of extra work when the data is not stored in a database and researchers have to rename a large number of files manually at a later stage.
  • Document the computing time and resources required for data processing. This helps other researchers to assess the time and resources required for the pipeline, and thereby decide whether it is feasible to proceed with the local resources available.
  • When selecting a repository for submission of the data, priority should be given to domain-specific repositories over generic (e.g. institutional) repositories. Domain-specific repositories are easier to find, and often have better visualization and selection facilities for re-users of the data.
  • The repositories listed for deposition are also prime locations for locating existing data. Many now have dedicated sections for new as well as pre-existing data relevant to COVID-19 research.
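
As an illustration of the first point, the following minimal Python sketch composes and parses file names from a fixed set of fields. The field order (project, sample, assay, date), the separators and the function names are assumptions made for this example, not a SciLifeLab convention; adapt them to your own project.

```python
import re
from datetime import date


def build_filename(project, sample, assay, run_date, ext):
    """Compose a file name from fixed, ordered fields (hypothetical convention)."""
    fields = [project, sample, assay]
    # Lowercase each field and replace anything outside a-z/0-9 with a hyphen,
    # so that the underscore can safely be used as the field separator.
    cleaned = [re.sub(r"[^a-z0-9]+", "-", f.lower()).strip("-") for f in fields]
    # A YYYYMMDD date sorts chronologically when file names are sorted as text.
    stem = "_".join(cleaned + [run_date.strftime("%Y%m%d")])
    return stem + "." + ext.lstrip(".")


def parse_filename(name):
    """Split a name produced by build_filename back into its fields."""
    stem, ext = name.split(".", 1)
    project, sample, assay, run_date = stem.split("_")
    return {
        "project": project,
        "sample": sample,
        "assay": assay,
        "date": run_date,
        "extension": ext,
    }


example = build_filename("CoV Study", "Patient 01", "RNAseq", date(2020, 5, 4), ".fastq.gz")
print(example)
print(parse_filename(example))
```

Because every name follows the same pattern, files can later be parsed, filtered or renamed by a script rather than by hand.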

The following subsections contain guidelines addressing specific COVID-19 data types and resources:

Recommendations for Virus Genomics Data
Repositories

We suggest that raw virus sequence data as well as assembled and annotated genomes are submitted to ENA.

Data and metadata standards

A list of relevant data and metadata standards can be found in FAIRsharing; some specific examples are given below.

Data standards

We suggest that data is preferentially stored in the following formats, in order to maximize the interoperability with each other and with standard analysis pipelines:

  • Raw sequences: .fastq, optionally add compression with gzip
  • Genome contigs: .fastq if uncertainties of the assembler can be captured, otherwise use .fasta; optionally add compression with gzip
  • De novo aligned sequences: .afa
  • Gene Structure: .gtf
  • Gene Features: .gff
  • Sequences mapped to a genome: .sam or the compressed formats .bam or .cram. Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes that reference sequence.
  • Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header line.
  • Browser: .bed
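
The two header requirements above can be checked with a few lines of code. The following is a minimal sketch in plain Python that works on header text only; the function names, the example header fragments and the placeholder URL are assumptions made for illustration. For real files a dedicated SAM/VCF library is preferable.

```python
def sam_sq_entries(header_text):
    """Return one dict of TAG -> VALUE for every @SQ line in a SAM header."""
    entries = []
    for line in header_text.splitlines():
        if line.startswith("@SQ\t"):
            # SAM header fields are tab-separated TAG:VALUE pairs.
            fields = line.split("\t")[1:]
            entries.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return entries


def vcf_contigs_without_url(header_text):
    """Return the IDs of ##contig header lines that lack a URL key."""
    missing = []
    for line in header_text.splitlines():
        if line.startswith("##contig=<") and line.endswith(">"):
            body = line[len("##contig=<"):-1]
            # Simplification: assumes there are no commas inside quoted values.
            pairs = dict(p.split("=", 1) for p in body.split(",") if "=" in p)
            if "URL" not in pairs:
                missing.append(pairs.get("ID", "<no ID>"))
    return missing


# Hypothetical header fragments; the URL is a placeholder.
sam_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:MN908947.3\tLN:29903\tUR:https://example.org/reference.fasta\n"
)
vcf_header = (
    "##fileformat=VCFv4.3\n"
    "##contig=<ID=MN908947.3,length=29903>\n"
)

references = sam_sq_entries(sam_header)
contigs_to_fix = vcf_contigs_without_url(vcf_header)
print(references)
print(contigs_to_fix)
```

An empty result from `sam_sq_entries` means the @SQ header is missing altogether, while any ID returned by `vcf_contigs_without_url` marks a contig line that does not yet point to its reference sequence.
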
Metadata standards

Consider annotating virus genomes using the ENA virus pathogen reporting standard checklist, a minimal information standard that is currently under development, and the more general Viral Genome Annotation System (VGAS) (Zhang et al. 2019).

Phylogenetic analysis

For submitting data and metadata relating to phylogenetic relationships (including topology, branch lengths, and support values) consider using widely accepted formats such as Newick, NEXUS and PhyloXML. The Minimum Information About a Phylogenetic Analysis checklist provides a reference list of useful tree annotations.
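
To show how topology, branch lengths and support values fit together in one of these formats, here is a minimal Python sketch that serializes a small tree to Newick. The nested-dictionary layout and the function name are assumptions made for this example, and support values are written as internal node labels, which is a common convention rather than a formal part of the format.

```python
def to_newick(node):
    """Serialize a nested dict (hypothetical layout) to a Newick subtree string."""
    children = node.get("children", [])
    if children:
        inner = ",".join(to_newick(child) for child in children)
        # Internal nodes: the label position carries the support value, if any.
        text = "(" + inner + ")" + str(node.get("support", ""))
    else:
        # Leaves: the label is the taxon or sequence name.
        text = node.get("name", "")
    if "length" in node:
        text += ":" + str(node["length"])
    return text


tree = {
    "children": [
        {"name": "A", "length": 0.1},
        {"name": "B", "length": 0.2},
        {
            "support": 95,
            "length": 0.05,
            "children": [
                {"name": "C", "length": 0.3},
                {"name": "D", "length": 0.4},
            ],
        },
    ]
}

# A Newick tree is terminated by a semicolon.
newick = to_newick(tree) + ";"
print(newick)
```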

Recommendations for Host Genomics Data

Host genomics data is often coupled to human subjects. This comes with many ethical and legal obligations, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive personal data for more information.

General Recommendations
  • Data sharing of not only summary statistics (or significant data) but also raw data (individual-level data) will foster a build-up of larger datasets. This will eventually allow identifying the determinants of phenotype more accurately.
  • Especially for raw sequencing data make sure to include Quality Control (QC) results and details of the sequencing platform used.
  • Common terminologies for reporting statistical tests (e.g. with StatO) enable reuse and reproducibility.
  • Researchers interested in human leukocyte antigen (HLA) genomics are referred to the HLA COVID-19 consortium
Repositories

Several different types of host genomics data are being collected for COVID-19 research. Some suitable repositories for these are:

  • Gene expression: A curated list can be found in FAIRsharing; some specific examples are listed below.

    • Transcriptomics of human subjects (i.e. requiring authorized access): NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. Bianca at UPPMAX). We suggest creating a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.
    • Transcriptomics (cell lines/animals): ArrayExpress. The submission portal gives information on what can be submitted and how. Underlying reads will automatically be deposited to the European Nucleotide Archive (ENA).
    • Microarray-based gene expression data: ArrayExpress. The submission portal gives information on what can be submitted and how. Data on the originating sample will automatically be deposited to BioSamples.
  • Genome-wide association studies (GWAS): GWAS Catalog; GWAS Central; for human data requiring restricted access please see section on Transcriptomics of human subjects above.

  • Adaptive Immune Receptor Repertoire sequencing (AIRR-seq): Samples the diversity of the immunoglobulins/antibodies and T cell receptors present in a host. The respective gene loci undergo random and irreversible rearrangement during lymphocyte development, so this data is fundamentally distinct from conventional genome sequencing.

    It is recommended that data be deposited using AIRR Community compliant processes and standards, in either of the following repositories:

Data and metadata standards
Gene expression
Transcriptomics:
Microarray-based gene expression data:
Genome-wide association studies (GWAS):
  • Preferred minimal metadata standard: MIxS
  • Preferred file formats
Adaptive Immune Receptor Repertoire sequencing (AIRR-seq).
Recommendations for Structural data
Repositories

Several different types of structural data are being collected for COVID-19 research. Some suitable repositories for these are:

  • Structural data on proteins acquired using any experimental technique (e.g. X-ray crystallography, nuclear magnetic resonance) should be deposited in the wwPDB: Worldwide Protein Data Bank via EBI PDBe.
Locating existing data

The COVID-19 Molecular Structure and Therapeutics Hub, a community data repository and curation service for structures, models, therapeutics, simulations and related computations for research into the COVID-19 pandemic, is maintained by The Molecular Sciences Software Institute (MolSSI) and BioExcel.

Data and metadata standards

X-ray diffraction

Electron microscopy

  • Data archiving and validation standards for cryo-EM maps and models are coordinated internationally by EMDataResource (EMDR).
  • Cryo-EM structures (map, experimental metadata, and optionally coordinate model) are deposited and processed through the wwPDB OneDep system, following the same annotation and validation workflow also used for X-ray crystallography and nuclear magnetic resonance (NMR) structures. EMDB holds all workflow metadata while PDB holds a subset of the metadata.
  • Most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.).
  • Processed structural information is submitted to structural resources as PDBx/mmCIF.
  • Experimental metadata are described in EMDR; see also Lawson et al. 2020.

NMR

Neutron scattering

  • ENDF/B-VI of Cross-Section Evaluation Working Group (CSEWG) and JEFF of OECD/NEA have been widely utilized in the nuclear community. The latest versions of the two nuclear reaction data libraries are JEFF-3.3 and ENDF/B-VIII.0 (Brown et al., 2018) with a significant upgrade in data for a number of nuclides (Carlson et al., 2018).
  • Neutron scattering data are stored in the internationally-adopted ENDF-6 format maintained by CSEWG.
  • Processed structural information is submitted in the PDBx/mmCIF format.

Molecular Dynamics (MD) simulations

  • Raw trajectory files containing all the coordinates, velocities, forces and energies of the simulation are stored as binary files: .trr, .dcd, .xtc and .netCDF
  • Refined structural models from experimental structural data using MD simulations are stored in .pdb format

Computer-aided drug design data

Recommendations for Proteomics

Proteomics studies are used to find biomarkers for disease and susceptibility.

Repositories

For a curated list of relevant repositories see FAIRsharing using the query ’proteomics’. The ProteomeXchange Consortium enables searches across the following deposition databases, following common standards.

Data and metadata standards

For a curated list of relevant standards see FAIRsharing using the query ’proteomics’.

  • Use the minimal information model specified in MIAPE by the HUPO Proteomics Standards Initiative (HUPO PSI), and fill the model using the controlled vocabularies specified by the Proteomics Standards Initiative, PSI CVs
  • Recommended formats:
    • For gel electrophoresis gelML
    • For transition lists TraML
    • For raw spectrometer output mzML
    • For reporting mzTab
    • For protein quantification data mzQuantML
    • For protein identification data mzIdentML
    • For metadata ISA-TAB with conversion to PRIDE format
Recommendations for Metabolomics

Metabolomics studies are used to find biomarkers for disease and susceptibility. Lipidomics is a special form of metabolomics, but is also described in more detail in a separate section because of its special relevance to COVID-19 research.

Repositories

For a curated list of relevant repositories see FAIRsharing using the query ’metabolomics’.

Data and metadata standards

For a curated list of relevant standards see FAIRsharing using the query ’metabolomics’.

  • Core Information for Metabolomics Reporting (CIMR) standard
  • For identifying chemical compounds use SMILES or InChI
  • To document Investigation/Study/Assay data, use the ISA Abstract Model, also implemented as a tabular format, ISA-Tab in MetaboLights. For an introduction to ISA, see (Sansone S-A et al., 2012)
  • Recommended formats for LC-MS data: ANDI-MS specification, an analytical data interchange protocol for chromatographic data representation and/or mzML
  • Recommended formats for NMR data: nmrCV, nmrML
Recommendations for Lipidomics

Lipidomics revealed an altered lipid composition in infected cells and serum lipid levels in patients with preexisting conditions. Lipid rafts (lipid microdomains) play a critical role in viral infections facilitating virus entry, replication, assembly and budding. Lipid rafts are enriched in glycosphingolipids, sphingomyelin and cholesterol. It is likely that SARS-CoV-2 enters the cell via angiotensin-converting enzyme-2 (ACE2) that depends on the integrity of lipid rafts in the infected cell membrane.

General Recommendations for Researchers

Lipidomics analysis should follow the guidelines of the Lipidomics Standards Initiative.

Repositories

The largest repository for lipidomics data is MetaboLights.

Data and metadata standards
  • Metadata should follow recommendations from the CIMR standard by the Metabolomics Standards Initiative. It should be made available as tab or comma separated files (.tsv or .csv).
  • Data can be stored in LC-MS files, or in tab-separated (.tsv) or comma-separated (.csv) formats.
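
As a minimal sketch of the tabular option, the following Python example writes sample metadata as tab-separated text using only the standard library. The column names and sample values are hypothetical and are not taken from the CIMR standard; use the fields that the standard and your repository require.

```python
import csv
import io


def metadata_to_tsv(records, columns):
    """Render a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    # Keys missing from a record are written as empty cells;
    # keys that are not listed in `columns` raise a ValueError.
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical columns and values, for illustration only.
columns = ["sample_id", "organism", "sample_type"]
records = [
    {"sample_id": "S1", "organism": "Homo sapiens", "sample_type": "serum"},
    {"sample_id": "S2", "organism": "Homo sapiens", "sample_type": "plasma"},
]

tsv_text = metadata_to_tsv(records, columns)
print(tsv_text)
```

Fixing the column list in one place keeps every exported file consistent, and passing `delimiter=","` instead produces the .csv variant.
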
Data analysis
  • Most of the analysis is usually performed using the software delivered by the suppliers of the instrumentation. In line with generic software recommendations, make sure that the process and parameters are well described, and that the output is converted to a standard format.
  • Workflow for Metabolomics (W4M) is a collaborative portal dedicated to metabolomics data processing, analysis and annotation for the metabolomics community.
  • Data processing using R software and associated packages from Bioconductor (xcms, camera, mixOmics) is a flexible and reproducible way for lipidomic data analysis.
Compound identification

Genomics

The following sections contain guidelines for different genomics data types. Click on any of them for guidance on good data management practices during the data life cycle, including available infrastructures for data generation and analysis and appropriate data repositories for sharing.

Data types:

Whole-genome sequencing (human)

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

[Figure: the data life cycle]

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Whole-genome sequencing (human):

  • NGI (National Genomics Infrastructure) offers an infrastructure equipped with a comprehensive range of technology platforms for next generation sequencing (NGS) and genotyping.

Data analysis

Facilities which offer data analysis services for Whole-genome sequencing (human):

  • NBIS support (National Bioinformatics Infrastructure Sweden) national research infrastructure offers bioinformatic support in various forms for a wide range of areas including NGS, proteomics, metabolomics and biostatistics.
Computational resources

SNIC (Swedish National Infrastructure for Computing) national research infrastructure makes available large scale high performance computing resources. The SNIC-SENS Bianca system is designed to be used for processing sensitive human data. See the instructions for applying for a Bianca project.

  • Note that in case you work at a different institute than Uppsala University, you need a data processing agreement between your institute and UPPMAX/Uppsala University for using the Bianca system - see instructions at UPPMAX.

Data storage and archiving

After the project is finished, the data needs to be stored with backups for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Whole-genome sequencing data (human)

NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. Bianca at UPPMAX). We suggest creating a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.
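
The idea of a metadata-only record can be sketched as a small data structure. The Python example below is purely illustrative: the field names, the contact address and the EGA ID are placeholders invented for this sketch and do not reflect the actual schema of the SciLifeLab Data Repository or of EGA.

```python
import json


def make_metadata_only_record(title, description, contact_email):
    """Build a metadata-only record; all field names are hypothetical."""
    return {
        "title": title,
        "description": description,
        # No data files: the dataset itself stays in the secure environment.
        "files": [],
        "access": {
            "type": "controlled",
            "contact": contact_email,
            "location": None,
        },
    }


def point_to_ega(record, ega_id):
    """Return a copy of the record whose access information points to an EGA ID."""
    updated = json.loads(json.dumps(record))  # simple deep copy of JSON-like data
    updated["access"]["location"] = ega_id
    return updated


record = make_metadata_only_record(
    "Human whole-genome sequencing dataset (metadata only)",
    "Sensitive personal data; access is granted on request.",
    "data-owner@example.org",
)
# Placeholder ID, used once the dataset has been deposited in EGA.
published = point_to_ega(record, "EGAD00000000000")
print(json.dumps(published, indent=2))
```

The DOI issued for the record stays the same in both states; only the access information changes.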

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

RNA sequencing

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

[Figure: the data life cycle]

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for RNA sequencing:

  • NGI (National Genomics Infrastructure) offers an infrastructure equipped with a comprehensive range of technology platforms for next generation sequencing (NGS) and genotyping.

Data analysis

Facilities which offer data analysis services for RNA-seq datasets:

Standard analyses (pipelines):

NGI nf-core RNA-seq pipeline

Project-specific analyses:

NBIS support (National Bioinformatics Infrastructure Sweden) national research infrastructure offers bioinformatic support in various forms for a wide range of areas including NGS, proteomics, metabolomics and biostatistics.

Computational resources

SNIC (Swedish National Infrastructure for Computing) national research infrastructure makes available large scale high performance computing resources. Apply for Small, Medium, Large or Sensitive data allocation, depending on size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backups for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for RNA sequencing data (non-human)
  • European Nucleotide Archive (ENA) for raw read data
  • ArrayExpress for experiment descriptions and processed expression data

Samples are linked between databases to make sure each part of the dataset is findable. Submitted data can be kept private until the associated research article is published (embargo).

ENA

The ENA hosts an instance of the Sequence Read Archive (SRA), the same archive that exists on NCBI. SRA accepts raw sequence data from any sequencing platform, generated in any research project.

There are several ways to submit data to ENA, including extensive documentation on programmatic submissions.
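
Programmatic submissions typically require a checksum for each uploaded file so that the archive can verify the transfer. The following is a minimal Python sketch, assuming MD5 is the checksum requested (check the current ENA documentation); the function names and the file names in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for large sequence files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(paths):
    """Map each file name to its MD5 digest."""
    return {Path(p).name: md5_of_file(p) for p in paths}


# Usage with hypothetical file names:
# manifest = checksum_manifest(["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"])
# for name, checksum in manifest.items():
#     print(checksum, name)
```

Computing the checksums when the raw data are first received also makes it possible to detect corruption later, before the files are submitted.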

ArrayExpress

ArrayExpress is tightly integrated with ENA and, similar to NCBI’s Gene Expression Omnibus database, can be used to archive experimental designs and analysis files based on the raw sequence reads.

ArrayExpress has its own submission portal where information is available on what can be submitted and how.

Repositories for RNA sequencing data (human)

NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. Bianca at UPPMAX). We suggest creating a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

Single-cell genomics

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

[Figure: the data life cycle]

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Single-cell genomics:

Data analysis

Facilities which offer data analysis services for Single-cell genomics:

  • NBIS support (National Bioinformatics Infrastructure Sweden) national research infrastructure offers bioinformatic support in various forms for a wide range of areas including NGS, proteomics, metabolomics and biostatistics.
Computational resources

SNIC (Swedish National Infrastructure for Computing) national research infrastructure makes available large scale high performance computing resources. Apply for Small, Medium, Large or Sensitive data allocation, depending on size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backups for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Single-cell genomics data (non-human)
  • European Nucleotide Archive (ENA) for raw read data
  • ArrayExpress for experiment descriptions and processed expression data

Samples are linked between databases to make sure each part of the dataset is findable. Submitted data can be kept private until the associated research article is published (embargo).

ENA

The ENA hosts an instance of the Sequence Read Archive (SRA), the same archive that exists on NCBI. SRA accepts raw sequence data from any sequencing platform, generated in any research project.

There are several ways to submit data to ENA, including extensive documentation on programmatic submissions.

ArrayExpress

ArrayExpress is tightly integrated with ENA and, similar to NCBI’s Gene Expression Omnibus database, can be used to archive experimental designs and analysis files based on the raw sequence reads.

ArrayExpress has its own submission portal where information is available on what can be submitted and how.

Repositories for Single-cell genomics data (human)

NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. Bianca at UPPMAX). We suggest creating a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

Functional genomics & Epigenomics

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

../../_images/data_life_cycle_circle_logos2.png

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project onwards. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

For information regarding data collections, experimental guidelines and data standards, we recommend the Encode Project.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Functional genomics & Epigenomics:

  • NGI (National Genomics Infrastructure) offers an infrastructure equipped with a comprehensive range of technology platforms for next generation sequencing (NGS) and genotyping.

Data analysis

Facilities which offer data analysis services for Functional genomics & Epigenomics:

  • NBIS support (National Bioinformatics Infrastructure Sweden), a national research infrastructure, offers bioinformatics support in various forms for a wide range of areas, including NGS, proteomics, metabolomics and biostatistics.

Computational resources

SNIC (Swedish National Infrastructure for Computing), a national research infrastructure, makes large-scale high-performance computing resources available. Apply for a Small, Medium, Large or Sensitive data allocation, depending on the size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backup for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Functional genomics & Epigenomics data (non-human)

ENA

The ENA hosts an instance of the Sequence Read Archive (SRA), the same archive that is hosted at NCBI. SRA accepts raw sequence data from any sequencing platform, generated in any research project.

There are several ways to submit data to ENA, and extensive documentation is available on programmatic submissions.

Repositories for Functional genomics & Epigenomics data (human)

NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. at Bianca on Uppmax). We suggest making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

De novo genome sequencing

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

../../_images/data_life_cycle_circle_logos2.png

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project onwards. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for De novo genome sequencing:

  • NGI (National Genomics Infrastructure) offers an infrastructure equipped with a comprehensive range of technology platforms for next generation sequencing (NGS) and genotyping.

Data analysis

Facilities which offer data analysis services for De novo genome sequencing:

  • NBIS support (National Bioinformatics Infrastructure Sweden), a national research infrastructure, offers bioinformatics support in various forms for a wide range of areas, including NGS, proteomics, metabolomics and biostatistics.

Computational resources

SNIC (Swedish National Infrastructure for Computing), a national research infrastructure, makes large-scale high-performance computing resources available. Apply for a Small, Medium, Large or Sensitive data allocation, depending on the size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backup for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for de novo genome sequencing data (non-human)

ENA

The ENA hosts an instance of the Sequence Read Archive (SRA), the same archive that is hosted at NCBI. SRA accepts raw sequence data from any sequencing platform, generated in any research project.

There are several ways to submit data to ENA, and extensive documentation is available on programmatic submissions.

Repositories for de novo genome sequencing data (human)

NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. at Bianca on Uppmax). We suggest making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

Metagenomics

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

../../_images/data_life_cycle_circle_logos2.png

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project onwards. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Metagenomics:

  • NGI (National Genomics Infrastructure) offers an infrastructure equipped with a comprehensive range of technology platforms for next generation sequencing (NGS) and genotyping.

Data analysis

Facilities which offer data analysis services for Metagenomics:

  • NBIS support (National Bioinformatics Infrastructure Sweden), a national research infrastructure, offers bioinformatics support in various forms for a wide range of areas, including NGS, proteomics, metabolomics and biostatistics.

Computational resources

SNIC (Swedish National Infrastructure for Computing), a national research infrastructure, makes large-scale high-performance computing resources available. Apply for a Small, Medium, Large or Sensitive data allocation, depending on the size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backup for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Metagenomics data (non-human)

ENA

The ENA hosts an instance of the Sequence Read Archive (SRA), the same archive that is hosted at NCBI. SRA accepts raw sequence data from any sequencing platform, generated in any research project.

There are several ways to submit data to ENA, and extensive documentation is available on programmatic submissions.

Repositories for Metagenomics data (human)

NBIS is building a local federated version of the European Genome-phenome Archive (EGA) in Sweden (EGA-SE), allowing for the publication of sensitive personal data within a legal framework. Until the local EGA is available, the dataset should remain in the secure analysis environment (e.g. at Bianca on Uppmax). We suggest making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset. Once the Swedish EGA is operational and the dataset has been deposited there, the access information can be changed to point to the EGA ID. See https://doi.org/10.17044/NBIS/G000014 for an example.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

Imaging

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

../../_images/data_life_cycle_circle_logos3.png

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project onwards. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Imaging:

  • ALM (Advanced Light Microscopy) facility gives support with advanced fluorescence microscopy for nanoscale biological visualization using SIM, STED and STORM/PALM superresolution imaging. The facility also supports single-molecule spectroscopy measurement and analysis with fluorescence correlation spectroscopy (FCS), also in combination with superresolution dynamical studies (STED-FCS). Moreover, light-sheet fluorescence microscopy support allows users to image live and/or optically cleared larger samples. Submit your application at the NMI (National Microscopy Infrastructure) project portal.
  • Cryo-EM offers access to state-of-the-art equipment and expertise in single particle cryo-EM and cryo-tomography (cryo-ET).

Data analysis

Facilities which offer data analysis services for imaging:

Computational resources

SNIC (Swedish National Infrastructure for Computing), a national research infrastructure, makes large-scale high-performance computing resources available. Apply for a Small, Medium, Large or Sensitive data allocation, depending on the size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backup for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Imaging data:

Depending on the type of image data you have, different public repositories are available; please see the table at BioImage Archive.

If you have data that requires controlled access because of personal privacy issues, informed consent, and/or ethical approvals etc., we suggest storing the data locally in a secure environment and making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

Metabolomics

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

../../_images/data_life_cycle_circle_logos4.png

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project onwards. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Metabolomics:

Data analysis

Facilities which offer data analysis services for Metabolomics:

  • NBIS support (National Bioinformatics Infrastructure Sweden), a national research infrastructure, offers bioinformatics support in various forms for a wide range of areas, including NGS, proteomics, metabolomics and biostatistics.

Computational resources

SNIC (Swedish National Infrastructure for Computing), a national research infrastructure, makes large-scale high-performance computing resources available. Apply for a Small, Medium, Large or Sensitive data allocation, depending on the size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backup for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Metabolomics data:

MetaboLights is a database for Metabolomics experiments and derived information. The database is cross-species, cross-technique and covers metabolite structures and their reference spectra as well as their biological roles, locations and concentrations, and experimental data from metabolic experiments.

If you have data that requires controlled access because of personal privacy issues, informed consent, and/or ethical approvals etc., we suggest storing the data locally in a secure environment and making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

Proteomics

The data life cycle is typically divided into design, generation, analysis, storage & archiving, and sharing. Below you will find information about infrastructure resources available during these phases.

../../_images/data_life_cycle_circle_logos5.png

Data design

During this phase you plan which data is needed to answer your research question. High-quality science is often only possible if the resource facilities you intend to use are involved from the planning phase of a project onwards. Consultation and advice regarding data management planning, data generation and data analysis are offered by NBIS and SciLifeLab.

It is wise to write a data management plan, using either a tool provided by your university or DS wizard.

Also, some resources have specific application periods and thus need to be contacted well in advance. If your project includes sensitive human data, note that there are ethical and legal issues that you have to consider, such as applying for ethics approval and reporting the data processing to your Data Protection Officer. See the page on Sensitive data for more information.

Data generation

Consider uploading the raw data to a repository, under an embargo, as soon as you receive them. This way you always have an off-site backup, with the added benefit of making the Data sharing phase more efficient.

Facilities which offer data generation services for Proteomics:

  • BioMS (Swedish National Infrastructure for Biological Mass Spectrometry), a national infrastructure, enables cutting-edge mass spectrometry and related advanced technology platforms to be part of your research projects.
  • Chemical proteomics & proteogenomics, a national facility, offers state-of-the-art mass spectrometry (MS)-based proteomics support, including experimental planning, MS analysis and data analysis related to proteogenomics and chemical proteomics.

Data analysis

Facilities which offer data analysis services for Proteomics:

  • NBIS support (National Bioinformatics Infrastructure Sweden), a national research infrastructure, offers bioinformatics support in various forms for a wide range of areas, including NGS, proteomics, metabolomics and biostatistics.

Computational resources

SNIC (Swedish National Infrastructure for Computing), a national research infrastructure, makes large-scale high-performance computing resources available. Apply for a Small, Medium, Large or Sensitive data allocation, depending on the size and type of project.

Data storage and archiving

After the project is finished, the data needs to be stored with backup for at least 10 years, and for as long as the data is of scientific value. After this time, some of the data should be archived and some can be disposed of. It is best to contact your university for information about the procedures for this.

SNIC offers storage for small and medium-sized datasets. Storage for large datasets will also be offered in the future.

Data sharing

In the era of FAIR (Findable, Accessible, Interoperable and Reusable) and Open science, datasets should be made available to the public.

Repositories for Proteomics data:

The ProteomeXchange Consortium provides globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories:

  • PRIDE - admits protein and peptide identification/quantification data with the accompanying mass spectra evidence and any other related data types. Submission is done using the PX Submission Tool, see tutorial.
  • PeptideAtlas - for SRM/MRM data that does not fit into PRIDE (targeted datasets). Submission is done via PASSEL.

If you have data that requires controlled access because of personal privacy issues, informed consent, and/or ethical approvals etc., we suggest storing the data locally in a secure environment and making a metadata-only record in the SciLifeLab Data Repository with contact details on how to get access, for which a DOI (i.e. a persistent identifier) can be issued. The DOI can then be used in the article to refer to the dataset.

Other repositories

For other domain-specific repositories, see e.g. ELIXIR Deposition databases, Scientific Data recommended repositories, EBI archive wizard (helps to find the right repository depending on data type), or FAIRsharing (the latter can also assist in finding metadata standards suitable for describing your datasets). For datasets that do not fit into domain-specific repositories, use an institutional repository when available (e.g. SciLifeLab Data Repository) or a general repository such as Figshare or Zenodo.

Metadata

For information about metadata and how to find appropriate standards please see the page Metadata.

Feedback

Any comments or questions? Please don’t hesitate to send an email to data-management@scilifelab.se

General information

The following sections contain general guidelines, independent of data type. Metadata contains information about appropriate standards for (meta)data formats. If sensitive data is part of your project, we recommend reading the Sensitive data page. There is also a collection of Data protection officers (for sensitive data processing) and Research data offices (for data management guidance) at the different universities who can assist you further.

FAIR principles

FAIR stands for Findable, Accessible, Interoperable and Reusable:

  • To be Findable: Data and metadata should be easy to find by both humans and computer systems. Basic machine readable descriptive metadata allows the discovery of interesting data sets and services.
  • To be Accessible: Data and metadata should be stored for the long term such that they can be easily accessed and downloaded or locally used by machines and humans using standard communication protocols.
  • To be Interoperable: Data should be ready to be exchanged, interpreted and combined in a (semi)automated way with other data sets by humans as well as computer systems.
  • To be Reusable: Data and metadata are sufficiently well-described to allow data to be reused in future research, allowing for integration with other compatible data sources. Proper citation must be facilitated, and the conditions under which the data can be used should be clear to machines and humans.

In Wilkinson et al. 2016, a set of principles was defined for each of these properties. Below, each of the principles is explained further, as adapted from FAIR principles translation.

F1. (meta)data are assigned a globally unique and persistent identifier

Explanation: Each data set is assigned a globally unique and persistent identifier (PID), for example a DOI. These identifiers make it possible to find, cite and track (meta)data.

Action: Ensure that each data set is assigned a globally unique and persistent identifier. Certain repositories automatically assign identifiers to data sets as a service. If not, researchers must obtain a PID via a PID registration service.
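As an illustration of how a PID is used in practice, the following minimal sketch normalizes a DOI and builds its resolver URL, using the example DOI mentioned elsewhere in these guidelines. The regular expression is a simplifying assumption that covers common modern DOIs rather than the full DOI syntax, and the function names are hypothetical.

```python
import re

# Assumed pattern: "10.", a registrant code of 4 to 9 digits, "/", then a suffix.
# This covers common modern DOIs; it is not the full DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(value: str) -> str:
    """Strip a leading resolver URL or 'doi:' prefix and return the bare DOI."""
    value = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    if not DOI_PATTERN.match(value):
        raise ValueError(f"not a recognizable DOI: {value!r}")
    return value


def doi_to_url(value: str) -> str:
    """Return the resolver URL that can be cited for the data set."""
    return "https://doi.org/" + normalize_doi(value)


print(doi_to_url("doi:10.17044/NBIS/G000014"))
```

Citing the resolver URL rather than a repository-specific landing page keeps the reference stable even if the data set later moves.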

F2. data are described with rich metadata (defined by R1 below)

Explanation: Each data set is thoroughly (see below, in R1) described: these metadata document how the data was generated, under what terms (license) and how it can be (re)used, and provide the necessary context for proper interpretation. This information needs to be machine-readable.

Action: Fully document each data set in the metadata, which may include descriptive information about the context, quality and condition, or characteristics of the data. Another researcher in any field, or their computer, should be able to properly understand the nature of your dataset. Be as generous as possible with your metadata (see R1).

F3. metadata clearly and explicitly include the identifier of the data it describes

Explanation: The metadata and the data set they describe are separate files. The association between a metadata file and the data set is obvious thanks to the mention of the data set’s PID in the metadata.

Action: Make sure that the metadata contains the data set’s PID.
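A simple automated check can catch metadata files that lack the PID before they are deposited. The sketch below assumes the metadata is available as a dictionary with the PID stored under an `identifier` key; both the key name and the record contents are hypothetical, and only the DOI is taken from these guidelines.

```python
def has_pid(metadata: dict, field: str = "identifier") -> bool:
    """Return True if the metadata record names the PID of the data it describes."""
    value = metadata.get(field)
    return isinstance(value, str) and value.strip() != ""


# Hypothetical metadata record; only the DOI is taken from these guidelines.
record = {
    "title": "Example data set",
    "identifier": "https://doi.org/10.17044/NBIS/G000014",
}

print(has_pid(record))                         # record that carries its PID
print(has_pid({"title": "Example data set"}))  # record where the PID is missing
```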

F4. (meta)data are registered or indexed in a searchable resource

Explanation: Metadata are used to build easily searchable indexes of data sets. These resources make it possible to search for existing data sets, much like searching for a book in a library.

Action: Provide detailed and complete metadata for each data set (see F2).

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

Explanation: If one knows a data set’s identifier and the location where it is archived, one can access at least the metadata. Furthermore, the user knows how to proceed to get access to the data.

Action: Clearly define who can access the actual data, and specify how. It is possible that data will actually not be downloaded, but rather reused in situ. If so, the metadata must specify the conditions under which this is allowed (which sometimes differ from the conditions that must be fulfilled for external usage/“download”).
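To make "retrievable by their identifier using a standardized communications protocol" concrete, the sketch below prepares a plain HTTPS request for the metadata behind a DOI. It relies on DOI content negotiation, where the `Accept` header names the desired metadata format. Whether a particular DOI's registration agency returns the format requested here is an assumption, and the function name is hypothetical. The request is only constructed, not sent.

```python
import urllib.request

# Media type for CSL JSON, a format commonly offered via DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str, media_type: str = CSL_JSON) -> urllib.request.Request:
    """Build a GET request asking the DOI resolver for machine-readable metadata."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": media_type},
    )


req = metadata_request("10.17044/NBIS/G000014")
print(req.get_method(), req.full_url, req.get_header("Accept"))

# Sending the request needs network access, for example:
# import json
# with urllib.request.urlopen(req, timeout=30) as response:
#     metadata = json.load(response)
```

Because the request uses only HTTPS and a standard header, any client can issue it without repository-specific software.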

A1.1 the protocol is open, free, and universally implementable

Explanation: Anyone with a computer and an internet connection can access at least the metadata.

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

Explanation: It often makes sense to ask users to create a user account on a repository. This makes it possible to authenticate the owner (or contributor) of each data set, and potentially to set user-specific rights.

A2. metadata are accessible, even when the data are no longer available

Explanation: Maintaining all data sets in a readily usable state indefinitely would require an enormous amount of curation work (adapting to new format standards, converting to a different format if the required software is discontinued, etc.). Keeping the metadata describing each data set accessible, however, requires far fewer resources. This makes it possible to build comprehensive data indexes that include all current, past and potentially future data sets.

Action: Provide detailed and complete metadata for each data set (see below in R1).

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

Explanation: Interoperability typically means that each computer system has at least some knowledge of the formats in which the other system exchanges data. If (meta)data are to be searchable, and if compatible data sources are to be combinable in a (semi)automatic way, computer systems need to be able to decide whether the contents of data sets are comparable. Obvious issues arise when different languages are used to describe the data or when spelling errors make it harder to compare descriptions and variable names. It is critical to use controlled vocabularies and a well-defined framework to describe and structure (meta)data in order to ensure findability and interoperability of datasets.

Action: Provide machine readable data and metadata in an accessible language, using a well-established formalism. In particular, data and metadata are annotated with resolvable vocabularies/ontologies/thesauri that are commonly used in the field. The RDF extensible knowledge representation model is a way to describe and structure datasets. You can refer to the Dublin Core Schema as an example.
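
A minimal sketch of what such a machine-readable record could look like, written as JSON-LD using a few Dublin Core terms (`title`, `creator`, `identifier`, `license`). The function name and all example values (title, creator, DOI) are invented for illustration; which terms are required in practice depends on the repository and the community standard.

```python
import json


def build_record(title, creator, identifier, license_url):
    """Build a small JSON-LD record that uses Dublin Core terms."""
    return {
        # The context maps the "dcterms" prefix to the Dublin Core terms namespace.
        "@context": {"dcterms": "http://purl.org/dc/terms/"},
        "@id": identifier,
        "dcterms:title": title,
        "dcterms:creator": creator,
        "dcterms:identifier": identifier,
        # Referring to the license by URL keeps it resolvable for machines.
        "dcterms:license": {"@id": license_url},
    }


# Made-up example values.
record = build_record(
    title="Example RNA-seq data set",
    creator="Example Researcher",
    identifier="https://doi.org/10.1234/example.5678",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(record, indent=2))
```

Because every key is tied to a published vocabulary, another system can interpret `dcterms:title` or `dcterms:license` without knowing anything about the project that produced the record.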

I2. (meta)data use vocabularies that follow FAIR principles

Explanation: The controlled vocabulary used to describe data sets needs to be documented. This documentation needs to be easily findable and accessible by anyone who uses the data set.

Action: The vocabularies/ontologies/thesauri are themselves findable, accessible, interoperable and thoroughly documented, hence FAIR. Researchers can refer to metrics assessing the FAIRness of a digital resource (if available).

I3. (meta)data include qualified references to other (meta)data

Explanation: If the data set builds on another data set, if additional data sets are needed to complete the data, or if complementary information is stored in a different data set, this needs to be specified. In particular, the scientific link between the data sets needs to be described. Furthermore, all data sets need to be properly cited (i.e. including their persistent identifiers).

Action: Properly cite relevant/associated data sets, in particular by providing their persistent identifiers, in the metadata, and describe the scientific link/relation to your data set.

R1. (meta)data are richly described with a plurality of accurate and relevant attributes

Explanation: Description of a data set is required at two different levels:

  • metadata describing the data set (intrinsic): what does the data set contain, how was the data generated, how has it been processed, how can it be reused …
  • metadata describing the data (submitter-defined): any needed information to properly use the data, such as definitions of the variable names

Action: Provide complete metadata for each data file. Some points to take into consideration (non-exhaustive list):

  • Scope of your data: for what purpose was it generated/collected?
  • Particularities or limitations about the data that other users should be aware of.
  • Date of the data set generation, lab conditions, who prepared the data, parameter settings, name and version of the software used.
  • Is it raw or processed data?
  • Variable names are explained or self-explanatory (i.e. defined in the research field’s controlled vocabulary).
  • Version of the archived and/or reused data is clearly specified and documented.

R1.1. (meta)data are released with a clear and accessible data usage license

Explanation: The conditions under which the data can be used should be clear to machines and humans. This has to be specified in the metadata describing a data set.

Action: Include information about the license in the metadata. If a particular license is needed, you have to provide it along with the data set. Where possible it is suggested to use common licenses, such as CC 0, CC BY, etc., which can be referred to by URL.

R1.2. (meta)data are associated with detailed provenance

Explanation: Detailed information about the provenance of data is necessary for reuse: this will, for example, allow researchers to understand how the data was generated, in which context it can be reused, and how reliable it is. Provenance is a central issue in scientific databases to validate data.

Action: The metadata should thoroughly describe the workflow that led to your data: Who generated or collected it? How has it been processed? Has it been published before? Does it contain data from someone else, potentially transformed or completed? Ideally the workflow is described in a machine-readable format. Criterion I3 is closely linked to this issue when reusing published data sets.

R1.3. (meta)data meet domain-relevant community standards

Explanation: It is easier to reuse data sets if they are similar: same type of data, data organized in a standardized way, well-established and sustainable file formats, documentation (metadata) following a common template and using common vocabulary. If community standards or best practices for data archiving and sharing exist, they should be followed. Note that quality issues are not addressed by the FAIR principles. How reliable data is lies in the eye of the beholder and depends on the foreseen application.

Action: Prepare your (meta)data according to community standards and best practices for data archiving and sharing in your research field. There might be situations where good practices exist for the type of data to be submitted but the submitter has valid and specified reasons to deviate from the standard practice. This needs to be addressed in the metadata.

Metadata

Good documentation in research projects, describing how the datasets were created, how they are structured, and what they mean, is essential for making your data understandable. Metadata provides such ‘data about data’, and may include information on the methodology used to collect the data, analytical and procedural information, definitions of variables, units of measurement, any assumptions made, the format and file type of the data, and the software used to collect and/or process the data.

Researchers are strongly encouraged to use community metadata standards where these are in place (see further down). Data repositories may also provide guidance about appropriate metadata standards and requirements, e.g. ENA sample checklists. It is highly recommended to structure metadata, e.g. sample metadata, from the very beginning of the project in a way that enables sequence data submission without having to reformat the metadata.

Ontologies

Ontologies, controlled vocabularies and data dictionaries are used to standardize the language used to describe the metadata. Think of the many ways to write that the organism is human (human, Human, homo sapiens, H. sapiens, Homo Sapiens, man, etc.). Using an ontology such as NCBI Taxonomy unifies the language and makes it easier for both humans and machines to interpret and work with the data. While an ontology has a hierarchical structure, a controlled vocabulary is an unstructured set of terms. A data dictionary is a user-defined way of describing what all the variable names and values in your data really mean.
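
The effect of unifying such labels can be sketched in a few lines. The synonym set and the function name below are assumptions made for this example; a real project would look terms up in the ontology itself, for instance through one of the lookup services listed further down. The NCBI Taxonomy identifier for Homo sapiens is 9606.

```python
# Hand-written synonym set for illustration only; a real project would query
# the ontology rather than maintain such a list by hand.
HUMAN_SYNONYMS = {"human", "homo sapiens", "h. sapiens", "man"}


def normalise_organism(label):
    """Map a free-text organism label to (scientific name, NCBI Taxonomy ID)."""
    # Lower-case the label and collapse repeated whitespace before matching.
    key = " ".join(label.lower().split())
    if key in HUMAN_SYNONYMS:
        return ("Homo sapiens", 9606)
    return None  # Unknown label: leave it for manual curation.


for raw in ["human", "Human", "homo sapiens", "H. sapiens", "Homo Sapiens", "man"]:
    print(raw, "->", normalise_organism(raw))
```

All six spellings end up as the same name and identifier, which is what lets a machine recognise that two data sets describe the same organism.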

For a suggested list of ontologies appropriate for the life science community, please see FAIRsharing.org and filter on e.g. Domain.

Below are ontology resources, adapted from Table 2 in Griffin PC, Khadake J, LeMay KS et al. Best practice data life cycle approaches for the life sciences. F1000Research 2018, 6:1618. doi: 10.12688/f1000research.12344.2

  • Ontology Lookup Service - Discover different ontologies and their contents.
  • OBO Foundry - Table of open biomedical ontologies with information on development status, license and content.
  • Zooma - Assign ontology terms using curated mapping.
  • Webulous - Create new ontology terms easily.
  • Ontobee - A linked data server that facilitates ontology data sharing, visualization, and use.

Data and metadata standards Genomics data

A list of relevant data and metadata standards can be found in FAIRsharing, some specific examples are below.

Gene expression
Transcriptomics:
  • Preferred minimal metadata standard: MINSEQE
  • Preferred file formats (sequencing-based):
    • Raw sequences: .fastq (compression can be added with gzip)
    • Mapped sequences: .sam (compressed as .bam or .cram). Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes the reference sequence used.
    • Transcripts per million (TPM): .csv
  • Gene Structure: .gtf
  • Gene Features: .gff
  • Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header lines.
  • Also see FAIRsharing using the query ‘transcriptomics’
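
The header requirements mentioned above for .sam and .vcf files can be checked with plain text parsing. This is a minimal sketch under stated assumptions: the function names and the `example.org` reference URL are made up, and the parser does not handle quoted values that contain commas. In SAM headers the `@SQ` lines carry tab-separated `TAG:value` pairs (`SN` for the sequence name, `LN` for its length, optionally `UR` for a URI of the sequence); in VCF headers the `##contig` lines carry comma-separated `key=value` pairs.

```python
def parse_sam_sq(header_lines):
    """Map each reference name (SN) to the tag/value pairs of its @SQ line."""
    references = {}
    for line in header_lines:
        if not line.startswith("@SQ\t"):
            continue
        tags = dict(
            field.split(":", 1)  # split once, so URLs keep their colons
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        if "SN" in tags:
            references[tags["SN"]] = tags
    return references


def vcf_contigs_without_url(header_lines):
    """Return the IDs of ##contig header lines that lack a URL key."""
    missing = []
    for line in header_lines:
        line = line.strip()
        if not (line.startswith("##contig=<") and line.endswith(">")):
            continue
        body = line[len("##contig=<"):-1]
        # Simplification: assumes no quoted values containing commas.
        fields = dict(item.split("=", 1) for item in body.split(",") if "=" in item)
        if "URL" not in fields:
            missing.append(fields.get("ID", "?"))
    return missing


# Illustrative headers; the reference URL is hypothetical.
sam_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422\tUR:https://example.org/reference.fa",
]
vcf_header = [
    "##fileformat=VCFv4.3",
    "##contig=<ID=chr1,length=248956422,URL=https://example.org/reference.fa>",
    "##contig=<ID=chr2,length=242193529>",
]
print(parse_sam_sq(sam_header))
print(vcf_contigs_without_url(vcf_header))
```

A file whose `@SQ` lines are absent, or whose `##contig` lines lack a reference to the sequence, can then be flagged before submission.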
Microarray-based gene expression data:
Genome-wide association studies (GWAS):
  • Preferred minimal metadata standard: MIxS
  • Preferred file formats
Metagenomics
  • MIxS - MIGS/MIMS Minimum Information about a (Meta)Genome Sequence. The MIMS extension includes key environmental metadata. Developed by the Genomic Standards Consortium. Numerous adopters including NCBI/EBI/DDBJ databases.
  • MIMARKS Minimum Information about a MARKer gene Sequence. This is an extension of MIGS/MIMS for environmental sequences. Developed by the Genomic Standards Consortium. Numerous adopters including NCBI/EBI/DDBJ databases.
Functional Annotation of Animal Genomes Consortium (FAANG) standards

Data and metadata standards Proteomics

For a curated list of relevant standards see FAIRsharing using the query ‘proteomics’.

  • Use the minimal information model specified in MIAPE by the HUPO Proteomics Standards Initiative (HUPO PSI), and fill the model using the controlled vocabularies specified by the Proteomics Standards Initiative: PSI CVs
  • Recommended formats:
    • For gel electrophoresis: GelML
    • For transition lists: TraML
    • For raw spectrometer output: mzML
    • For reporting: mzTab
    • For protein quantification data: mzQuantML
    • For protein identification data: mzIdentML
    • For metadata: ISA-TAB with conversion to PRIDE format

Data and metadata standards Metabolomics:

For a curated list of relevant standards see FAIRsharing using the query ‘metabolomics’.

  • Core Information for Metabolomics Reporting CIMR standard
  • For identifying chemical compounds use SMILES or InChI
  • To document Investigation/Study/Assay data, use the ISA Abstract Model, also implemented as a tabular format, ISA-Tab in MetaboLights. For an introduction to ISA, see (Sansone S-A et al., 2012)
  • Recommended formats for LC-MS data: ANDI-MS specification, an analytical data interchange protocol for chromatographic data representation and/or mzML
  • Recommended formats for NMR data: nmrCV, nmrML

Data and metadata standards Lipidomics:

  • Metadata should follow recommendations from the CIMR standard by the Metabolomics Standards Initiative. It should be made available as tab or comma separated files (.tsv or .csv).
  • Data can be stored as LC-MS files, or in tab-separated (.tsv) or comma-separated (.csv) formats.

Data and metadata standards Structural data / Imaging

X-ray diffraction

Electron microscopy

  • Data archiving and validation standards for cryo-EM maps and models are coordinated internationally by EMDataResource (EMDR).
  • Cryo-EM structures (map, experimental metadata, and optionally coordinate model) are deposited and processed through the wwPDB OneDep system, following the same annotation and validation workflow also used for X-ray crystallography and nuclear magnetic resonance (NMR) structures. EMDB holds all workflow metadata while PDB holds a subset of the metadata.
  • Most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.).
  • Processed structural information is submitted to structural resources as PDBx/mmCIF.
  • Experimental metadata include information about the sample, specimen preparation, imaging, image processing, symmetry, reconstruction method, resolution and resolution method, as well as a description of the modeling/fitting procedures used and are described in EMDR, see also Lawson et al 2020.

NMR

  • There are no widely accepted standards for NMR raw data files. Generally these are stored and archived in single FID/SER files.
  • One effort to standardize how NMR parameters extracted from 1D and 2D spectra of organic compounds are associated with the proposed chemical structure is the NMReDATA format.
  • There is no universally accepted format, especially for crucial FID-associated metadata. NMR-STAR and its NMR-STAR Dictionary is the archival format used by the Biological Nuclear Magnetic Resonance data Bank (BMRB), the international repository of biomolecular NMR data and an archive of the Worldwide Protein Data Bank (wwPDB).
  • The nmrML format specification (an XML Schema Definition (XSD) and an accompanying controlled vocabulary called nmrCV) is an open markup language and ontology for NMR data.
  • Processed structural information is submitted in the PDBx/mmCIF format.

Neutron scattering

  • ENDF/B-VI of Cross-Section Evaluation Working Group (CSEWG) and JEFF of OECD/NEA have been widely utilized in the nuclear community. The latest versions of the two nuclear reaction data libraries are JEFF-3.3 and ENDF/B-VIII.0 (Brown et al., 2018) with a significant upgrade in data for a number of nuclides (Carlson et al., 2018).
  • Neutron scattering data are stored in the internationally-adopted ENDF-6 format maintained by CSEWG.
  • Processed structural information is submitted in the PDBx/mmCIF format.

Sensitive personal data

The following is a list of Ethical, Legal and Social Implications (ELSI) that should be considered when working with human data. The content on this page is based on a checklist that has been developed in the Tryggve project. It is intended to be used as a tool to document these considerations, and is available as:

  1. An MS Word file that can be downloaded from the Tryggve project pages.
  2. In the SciLifeLab Data Stewardship Wizard (SciLifeLab DSW):
    • Log in to the SciLifeLab DSW using your university credentials
    • Select Questionnaires in the left sidebar, and click the Create button
    • Choose Tryggve checklist… from the Knowledge Model drop-down menu

Note that the checklist was created with cross-border collaborative projects in mind, but it should be useful for other research projects as well.

Before the collection of personal data has begun you should always consult with the Data Protection Officer of your organisation.

GDPR (more info)

  • What is the purpose of processing of the personal data?
  • Who is the Controller(s) of the personal data?
  • What is the legal basis for processing the personal data?
  • Which exemptions from the prohibition on processing special categories of data (such as health and genetic data) under Art. 9 GDPR are used?
  • Have data processing agreements been established between the data controller(s) and any data processors?
  • Has a Data Protection Impact Assessment (DPIA) been performed for the personal data?
  • What happens with the data after project completion?
  • What technical and procedural safeguards will be established for processing the data?

Other considerations (more info)

  • Are there other relevant national legislation considerations that have to be taken into account?
    • e.g. regarding public access to information, biobank acts, etc.
  • Are there other Terms & conditions for data access (in particular if presenting obstacles for cross-border processing of health data)?
    • e.g. register data access policies
  • Are there other legal agreements between project parties that should be considered?
    • e.g. conditions regarding data reuse and intellectual property

Clarifications and comments

Ethical reviews and informed consents

The purpose of these questions is to spell out what uses the subjects have consented to, and/or for what uses ethical approvals have been given. The question is then whether, given the stated research purpose of this project, the consents and ethical approvals for the datasets are compatible with that purpose.

GDPR
State the purpose of processing the personal data

The GDPR stipulates that the controller must process personal data only for stated purposes, and must not further process the data in a manner that is incompatible with those purposes (Article 5 - Principles relating to processing of personal data).

Who is the data controller of the personal data processed in the project?

Article 4 (7): “‘controller’ means the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data; […].” The Controller is typically the university employer of the PI, and the PI should act as a representative of her university employer and is responsible for ensuring that personal data is handled correctly in her projects. If the project involves more than one legal entity, and joint controllership is considered, make sure that all parties understand their obligations. It is advisable to define the terms for this in an agreement between the parties.

Which exemptions from the prohibition on processing special categories of data (such as health and genetic data) under Art. 9 GDPR are used?

Processing of certain categories of personal data is not allowed unless there are exemptions in law to allow this. Among these categories (“sensitive data”) are “‘[…] data revealing racial or ethnic origin, […] genetic data, […] data concerning health’”. Most types of personal data collected in biomedical research will fall under these categories. Article 9 (2) lists a number of exemptions that apply, of which consent and scientific research are most likely to be relevant for research. Please consult the Data Protection Officer of your organisation.

Have data processing agreements been established between the data controller(s) and any data processors?

Article 4 (8): “‘processor’ means a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller.” One example, among several other scenarios, is using a secure computing environment provided by another organisation to analyse or store the data. In that case, there needs to be a legal agreement established between the controller(s) and processor(s) as defined in Article 28 (3): “Processing by a processor shall be governed by a contract or other legal act under Union or Member State law, that is binding on the processor with regard to the controller and that sets out the subject-matter and duration of the processing, the nature and purpose of the processing, the type of personal data and categories of data subjects and the obligations and rights of the controller. […]” Article 28 also lists the required contents of such an agreement. Your organisation and/or the processor organisation will probably have agreement templates that you can use.

Have Data Protection Impact Assessments (DPIA) been performed for the personal data?

Where a type of processing is likely to result in a high risk to the rights and freedoms of natural persons, the controller shall, prior to the processing, carry out an assessment of the impact of the envisaged processing operations on the protection of personal data, a so-called Data Protection Impact Assessment (DPIA) - Article 35. To clarify when this is necessary, the Swedish Data Protection Authority (DPA) “Datainspektionen” has issued guidance on when an impact assessment is required. Large-scale processing of sensitive data such as genetic or other health-related data is listed as requiring DPIAs. The French DPA has made a PIA tool (endorsed by several other DPAs) available that can help in performing these impact assessments. Please also consult the Data Protection Officer of your organisation.

What technical and procedural safeguards have been established for processing the data?

To ensure that the personal data that you process in the project is protected at an appropriate level, you should apply technical and procedural safeguards to ensure that the rights of the data subjects are not violated. Examples of such measures include, but are not limited to, pseudonymisation and encryption of data, the use of computing and storage environments with heightened security, and clear and documented procedures for project members to follow.
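
As one illustration of a technical safeguard, the sketch below replaces subject identifiers with keyed hashes (HMAC-SHA256 from the Python standard library). This is only one possible pseudonymisation technique, and the function name, the identifier format and the 16-character truncation are assumptions made for this example. Pseudonymised data are still personal data under the GDPR, and the secret key must be stored separately from the data under strict access control. Always agree on the actual procedure with your Data Protection Officer.

```python
import hashlib
import hmac
import secrets


def pseudonymise(subject_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a subject identifier using a keyed hash."""
    digest = hmac.new(secret_key, subject_id.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; keep the full digest if collisions are a concern.
    return digest.hexdigest()[:16]


# The key is generated once per project and kept outside the data set.
project_key = secrets.token_bytes(32)

# Hypothetical subject identifiers.
for subject in ["patient-001", "patient-002"]:
    print(subject, "->", pseudonymise(subject, project_key))
```

The same identifier always maps to the same pseudonym under a given key, so records can still be linked within the project, while someone without the key cannot recompute the mapping.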

What happens with the data after project completion?

The GDPR states that the processing (including storing) of personal data should stop when the intended purpose of the processing is done. There are, however, exemptions to this e.g. when the processing is done for research purposes. Also, from a research ethics point of view, research data should be kept to make it possible for others to validate published research findings and reuse data for new discoveries. This is also governed by what the data subjects have been informed about regarding how you will treat the data after project completion. The recommendation is to deposit the sensitive data in the appropriate controlled access repositories if such are available, but this requires that the data subjects are informed and have agreed to this.

Other considerations

There might also exist other national legal or procedural considerations for cross-border research collaborations. Other laws might affect how and if data can or cannot be made available outside the country of origin. The operating procedures of government authorities or other organisations might create obstacles for sharing data across borders. To make sure that it is clear how original and derived data, as well as results, can be used by the parties after project completion, consider establishing legal agreements that define this. This can include e.g. reuse of data for other projects or intellectual property rights derived from the research project.

Data Protection Officer (dataskyddsombud)

This is the person that is responsible for ensuring that the data processing of sensitive data adheres to the GDPR. You should report personal data processing to this person.

University | Email | Link
Gothenburg University | dataskydd@gu.se | Routines for processing personal data
Karolinska Institutet | dataskyddsombud@ki.se | Personal data in research
KTH Royal Institute of Technology | dataskyddsombud@kth.se | General Data Protection Regulation (GDPR)
Linköping University | dataskyddsombud@liu.se | Inventory of personal data processing in research projects
Lund University | dataskyddsombud@lu.se | Registration of personal data processing
Stockholm University | gdpr@su.se | GDPR at SU
Umeå University | pulo@umu.se | Anmäl din personuppgiftsbehandling
Uppsala University | dataskyddsombud@uu.se | General Data Protection Regulation (GDPR) – how it works

Research Data Office (RDO)

Some of the universities have established a Research Data Office (RDO) or a Data Access Unit (DAU) to help with data management questions. The university libraries can also usually give advice or redirect you to the right local unit.

University | Link
Gothenburg University | Management of research data
Karolinska Institutet | Research Data Office; Research data management
KTH Royal Institute of Technology | Research data and data management plan; Manage research data
Linköping University | Five advice for strategic management of research data
Lund University | Research data
Stockholm University | Forskningsdata
Umeå University | Scholarly publishing
Uppsala University | Research data

For other sites, the Swedish National Data Service (SND) network is listed here.

These pages are provided to you by NBIS data management team and SciLifeLab Data Centre. You can reach us by sending an email to data-management@scilifelab.se.