Genome Standards

From WormBaseWiki
Revision as of 10:54, 12 December 2011 by Khowe (talk | contribs)


Overview

This is the list of criteria that a new genome assembly or project must satisfy before we can accept it for integration into WormBase.

Requirements

Keep us informed of your progress

Contact us at help@wormbase.org early in your project, ideally as soon as a genome submission is planned. We can then advise you on details and be prepared for your data when you publish it.

Submission to INSDC

Our minimal requirement is that the genome assembly (bottom-level genomic sequences and higher-level structure) has been deposited with one of the collaborating partners of the International Nucleotide Sequence Database Collaboration (www.insdc.org). We can then easily extract the data from the public archives and have confidence that the authors regard the assembly as reasonably stable.

Gene models

Ideally, you should provide a canonical gene set with your genome sequence. Without a gene set, we cannot include your species in our orthology inference pipelines, which results in only limited integration with the other species in WormBase.

If the gene models and other annotation have been deposited with an INSDC partner along with the genome sequence, there is no need to submit any additional files directly to WormBase. However, we appreciate that it is sometimes not possible to deposit the annotation in this way (either for technical reasons, or because it must remain under embargo until publication).

We therefore consider direct submissions of annotation in GFF2 or GFF3.

GFF allows gene models to be represented in a variety of ways. We are in the process of solidifying a document that lays out the precise form of GFF3 that will enable the cleanest and most rapid integration into WormBase. In the meantime, we encourage authors to contact us in advance of the preparation of the files so that we can advise on content and form.
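For orientation only, a GFF3 feature line consists of nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below parses such lines into named fields. The sample records, the IDs, and the names `Gff3Feature` and `parse_gff3_line` are invented for illustration, and the gene/mRNA/CDS layout shown is just one common way to represent a gene model, not the precise form WormBase will ask for.

```python
from typing import NamedTuple


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Gff3Feature:
    """Split one GFF3 feature line into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    # Column 9 holds semicolon-separated tag=value pairs.
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attrs)


# Invented example records: one gene, one mRNA, two CDS segments.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "scaffold1\texample\tgene\t100\t898\t.\t+\t.\tID=gene1",
    "scaffold1\texample\tmRNA\t100\t898\t.\t+\t.\tID=gene1.t1;Parent=gene1",
    "scaffold1\texample\tCDS\t100\t300\t.\t+\t0\tID=gene1.t1.cds;Parent=gene1.t1",
    "scaffold1\texample\tCDS\t500\t898\t.\t+\t0\tID=gene1.t1.cds;Parent=gene1.t1",
])

# Skip blank lines and "#" directive/comment lines.
features = [parse_gff3_line(line) for line in EXAMPLE.splitlines()
            if line and not line.startswith("#")]
```

Running a check like this before contacting us can catch column-count and coordinate-format problems early, but it is no substitute for agreeing the content and form of the files in advance.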

Sanity checks

We apply the following sanity checks to gene models; any model failing these checks is excluded.

Translation

Protein-coding models should translate into orthodox proteins (no stop codons or partial codons).

Minimum length

Protein-coding gene models should be at least 3 amino acids in length. This restriction is mainly for BLAST analysis (as this is the minimum word size), but it prevents other problems too.
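The two checks above can be approximated before submission with a short script. This is a minimal sketch, not the code WormBase runs: the function name `passes_sanity_checks` is hypothetical, and it assumes the standard genetic code, an already spliced coding sequence on the coding strand, and that a single terminal stop codon is acceptable and not counted towards the protein length.

```python
# Standard genetic code (NCBI translation table 1) in the conventional
# TCAG ordering; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def passes_sanity_checks(cds: str, min_aa: int = 3) -> bool:
    """Return True if a spliced CDS passes the translation and length checks."""
    cds = cds.upper()
    # Partial codon: the CDS length must be a multiple of three.
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Assumption: one terminal stop codon is tolerated and not counted.
    if codons and CODON_TABLE.get(codons[-1]) == "*":
        codons = codons[:-1]
    # Codons with ambiguous bases (e.g. N) are translated as "X".
    protein = [CODON_TABLE.get(codon, "X") for codon in codons]
    # Any remaining stop codon is internal, so the model fails.
    if "*" in protein:
        return False
    return len(protein) >= min_aa
```

For example, `passes_sanity_checks("ATGGCTAAATGA")` accepts a model encoding three amino acids followed by a stop, while a sequence whose length is not a multiple of three is rejected as containing a partial codon.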

Minimum N50

We do not require a minimum N50. If you are happy for your data to be publicly available from the ENA/GenBank database, then we will consider including it in WormBase, but we reserve the right to reject genomes that are too fragmented for us to be able to load into our data structures.
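Although no threshold is enforced, N50 is a convenient way to describe how fragmented an assembly is: it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of the calculation follows; the function name `n50` and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


# Illustrative lengths (arbitrary units), total 290.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Here the two longest sequences (80 + 70 = 150) already cover more than half of the total of 290, so the N50 is 70.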

Meta data

When writing to help@wormbase.org to tell us about the new genome assembly that you would like included in WormBase, it would be helpful to include the following information:

Handle to assembly in public repositories

This could be one (or both) of:

Assembly statistics

As much information as you have available, but minimally the stage which your assembly project has reached - one of:

  • Standard Draft
  • High-Quality Draft
  • Improved High-Quality Draft
  • Annotation-Directed Improvement
  • Noncontiguous Finished
  • Finished

These stages are defined in the paper "Genome Project Standards in a New Era of Sequencing".

Provenance

  • Species name
  • Strain ID (preferably the CGC strain ID)
  • Origin of the strain
  • Some references describing the species

Attribution

  • Primary Data Contact:
  • Primary Data Contact Email:
  • Project URL:
  • FTP Site:
  • Citation:
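If you keep this metadata in a script or spreadsheet export, a small check like the one below can confirm that nothing has been left out before you write to us. It is only a sketch: the field names are taken from the lists above, while the dictionary layout, the "Assembly stage" and "References" keys, and the function name `check_metadata` are assumptions made for illustration rather than a format WormBase prescribes.

```python
ASSEMBLY_STAGES = (
    "Standard Draft",
    "High-Quality Draft",
    "Improved High-Quality Draft",
    "Annotation-Directed Improvement",
    "Noncontiguous Finished",
    "Finished",
)

REQUESTED_FIELDS = (
    "Assembly stage",
    "Species name",
    "Strain ID",
    "Origin of the strain",
    "References",
    "Primary Data Contact",
    "Primary Data Contact Email",
    "Project URL",
    "FTP Site",
    "Citation",
)


def check_metadata(metadata: dict) -> list:
    """Return a list of problems found; an empty list means none."""
    # A field counts as missing if it is absent, None, or blank.
    problems = [f"missing: {name}" for name in REQUESTED_FIELDS
                if not str(metadata.get(name) or "").strip()]
    stage = metadata.get("Assembly stage")
    if stage and stage not in ASSEMBLY_STAGES:
        problems.append(f"unrecognised assembly stage: {stage}")
    return problems
```

Calling `check_metadata` on an empty dictionary reports every requested field as missing, which gives a convenient checklist to work through.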