Difference between revisions of "Genome Standards"

From WormBaseWiki
 
(27 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
__TOC__
 
  
<center>
 
= WormBase Genome Integration Standards =
 
  
 +
= Overview =
  
</center>
+
'''This is the list of criteria that must be satisfied by a new genome assembly/project in order for us to accept it for integration into WormBase'''
  
== Overview: ==
+
= Requirements =
  
'''This is the list of criteria that a genome assembly/project should attain before WormBase can agree to integrate the Organism into the database.'''
+
== Keep us informed of your progress ==
  
Submission of genomes to the WormBase database is a common requirement. This document is a guide for those who wish their nematode genomes to be included in WormBase.
+
Contact us at '''help@wormbase.org''' early on in your project. We can then advise you about details and be prepared for your data when you publish it. We like to be informed of upcoming genome submissions as early as possible.
  
== Requirements ==
+
== Submission to INSDC ==
  
=== Keep us informed of your progress ===
+
Our minimal requirement is that the genome assembly (bottom-level genomic sequences, and higher-level structure) should have been deposited with one of the collaborating partners of the International Nucleotide Sequence Database Collaboration [http://www.insdc.org www.insdc.org]. We can then easily extract the data from the public archives and have confidence that the assembly is regarded as reasonably stable by the authors.
  
Contact us at '''help@wormbase.org''' early on in your project. We can then advise you about details and be prepared for your data when you publish it.
+
WormBase can offer advice/help on how to go about doing this.
  
We like to be informed of upcoming genome submissions as early as possible, for example a typical lab that has contributed data is the [http://www.sanger.ac.uk/research/projects/parasitegenomics/ Sanger Helminth group]
+
== Assembly quality ==
  
They run a pipeline with four phases; we would normally integrate the data after the fourth phase.
+
We do not have an official threshold on assembly quality, e.g. a minimum N50. If you are happy for your data to be publicly available from the ENA/GenBank database, then we will consider including it in WormBase, but we reserve the right to reject genomes that are too fragmented for us to be able to load into our data structures.
  
1) Production (X months) - No Interest for Wormbase
+
== Gene models ==
      |
 
2) Finishing  (3 Months) - No Interest for Wormbase
 
      |
 
3) Analysis  (3 months) - Wormbase are interested from this stage on - contact us at help@wormbase.org
 
      |
 
2) Repeat - Finishing
 
      |
 
3) Repeat - Analysis
 
      |
 
      x n (This means that it can go through several rounds before it is published)
 
      |
 
4) Publish and submit to ENA/GenBank when the genome/gene set meets their standards
 
      |
 
    Confirm it meets the Wormbase standards for integration, add to Wormbase.
 
  
=== Submission to ENA/GenBank ===
+
You should provide a canonical gene set with your genome sequence. Without a gene set, we cannot include your species in our orthology inference pipelines, resulting in only limited integration with the other species in WormBase.
  
Normally we expect the genome to be submitted to the public nucleotide databases.
+
If the gene models and other annotation have been deposited with an INSDC partner along with the genome sequence, there will be no need to submit any additional files direct to WormBase. However, we do appreciate that it is sometimes not possible to deposit the annotation in this way (either for technical reasons, or due to the requirement for embargo until publication). In those cases, we will therefore consider direct submissions of annotation in [http://www.sequenceontology.org/gff3.shtml GFF3]. We encourage authors to contact us in advance of the preparation of the files so that we can advise on content and form.  
  
We can then easily extract the data from the public databases and the assembly will be regarded as reasonably stable by the authors.
+
=== Sanity checks ===
  
=== Gene model prediction ===
+
We apply the following sanity checks on gene models; any models failing these checks are excluded.
  
We prefer you to do some prediction of the gene models in your genome.
+
==== Translation ====
  
All gene models should be at least 3 amino acids in length, mainly for BLAST analysis, as this is the minimum word size.
+
Protein-coding models should translate into orthodox proteins (no stop codons or partial codons)
  
If prediction of gene models is not possible, we would like to have your estimate of the number of genes in the genome.
+
==== Minimum length ====
  
To get an estimate of how many genes could be expected in an assembly, take the average gene length and see how many genes can be found based on the available contigs.
+
Protein-coding gene models should be at least 3 amino acids in length. This restriction is mainly for BLAST analysis (as this is the minimum word size), but prevents other problems too.
  
-----
+
== Identifiers ==
  
<center>
+
If your genome assembly and annotation have been deposited with INSDC, we will use the INSDC identifiers, as these are guaranteed to be globally unique. If the submission is pending, and accessions are not yet available, WormBase may take the liberty of renaming your objects on initial submission, to ensure that they are globally unique within our database.
  
 +
If re-submitting a new/updated version of data for a species that we already have in WormBase, care must be taken with identifiers, particularly those of the genes, transcripts and proteins. If possible, you should ensure that identifiers are propagated forward to new versions of the annotation, such that a specific gene (for example) retains the same identifier. Without this, users may have trouble migrating their work on a given species to the new data.
  
=== Assembly statistics ===
+
== Meta data ==
  
Standard Draft:
+
When writing to help@wormbase.org to tell us about the new genome assembly that you would like included in WormBase, it would be helpful to include the following information:
  
High-Quality Draft:
+
=== Handle to assembly in public repositories ===
  
Improved High-Quality Draft:
+
This could be one (or both) of:
  
Annotation-Directed Improvement:
+
* An INSDC accession number;
 +
* A direct link to the data under [ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates the NCBI genome FTP portal]
  
Noncontiguous Finished:
+
=== Assembly statistics ===
  
Finished:
+
As much information as you have available, but minimally the stage which your assembly project has reached - one of:
  
=== Attribution ===
+
* Standard Draft
 +
* High-Quality Draft
 +
* Improved High-Quality Draft
 +
* Annotation-Directed Improvement
 +
* Noncontiguous Finished
 +
* Finished
  
Primary Data Contact:  
+
These stages are defined in the paper [http://www.sciencemag.org/cgi/content/full/326/5950/236#R2 "Genome Project Standards in a New Era of Sequencing"]
  
Primary Data Contact Email:
+
=== Provenance ===
  
Project URL:
+
* Species name
 +
* Strain ID (preferably the CGC strain ID)
 +
* Origin of the Strain
 +
* Some references describing the species
  
FTP Site:
+
=== Attribution ===
 
 
Citation:
 
 
 
==Minimum Standards==
 
 
 
* Submission to a public Nucleotide Repository
 
 
 
* Wiki description of the Species, submitted by data producer if possible.
 
 
 
* N50 of ?
 
 
 
*
 
 
 
== Submitted Files ==
 
 
 
''If the data cannot be submitted to ENA/GenBank, the data should be provided to Wormbase in standardized formats using the following conventions''
 
 
 
'''g_species''' refers to the Genus name then the species name joined by a '_' character, for example '''c_elegans'''.
 
 
 
* Genomic Sequence
 
 
 
File format : FASTA
 
File name  : g_species.genome.fa
 
 
 
* Conceptual transcripts (spliced)
 
 
 
File format : FASTA
 
File name  : g_species.mrna.fa
 
 
 
* Conceptual transcripts (unspliced) - the gene models
 
 
 
File format : GFF2 or GFF3
 
File name  : g_species.gff2 or g_species.gff3
 
 
 
* Conceptual translations
 
 
File format : FASTA
 
File name  : g_species.pep
 
 
 
* Genomic Features
 
 
 
File format : GFF2 or GFF3
 
File name  : g_species.gff2 or g_species.gff3 (these should be in the same file as the gene models)
 
  
== See Also ==
+
* Primary Data Contact
 +
* Primary Data Contact Email
 +
* Project URL
 +
* Citation
  
The Helminth Co-ordinator discusses some requirements for submitting genomes in the paper [http://www.sciencemag.org/cgi/content/full/326/5950/236#R2 "Genome Project Standards in a New Era of Sequencing"]
 
  
 
[[Category:User Guide]]
 
 
[[Category:Curation]]
 

Latest revision as of 09:02, 22 May 2012


Overview

This is the list of criteria that must be satisfied by a new genome assembly/project in order for us to accept it for integration into WormBase

Requirements

Keep us informed of your progress

Contact us at help@wormbase.org early on in your project. We can then advise you about details and be prepared for your data when you publish it. We like to be informed of upcoming genome submissions as early as possible.

Submission to INSDC

Our minimal requirement is that the genome assembly (bottom-level genomic sequences, and higher-level structure) should have been deposited with one of the collaborating partners of the International Nucleotide Sequence Database Collaboration www.insdc.org. We can then easily extract the data from the public archives and have confidence that the assembly is regarded as reasonably stable by the authors.

WormBase can offer advice/help on how to go about doing this.

Assembly quality

We do not have an official threshold on assembly quality, e.g. a minimum N50. If you are happy for your data to be publicly available from the ENA/GenBank database, then we will consider including it in WormBase, but we reserve the right to reject genomes that are too fragmented for us to be able to load into our data structures.
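For readers unfamiliar with the N50 statistic mentioned above, the following is a minimal Python sketch of how it is usually computed: the sequence length at which the running total, taken from the longest sequence down, first reaches half of the total assembly size. The function name `n50` and the example lengths are illustrative only and are not part of any WormBase tooling.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down; the N50 is the length at which
    # the running total first reaches half of the assembly size.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

A highly fragmented assembly will have a small N50 relative to its total size, which is one quick way to judge fragmentation before getting in touch with us.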

Gene models

You should provide a canonical gene set with your genome sequence. Without a gene set, we cannot include your species in our orthology inference pipelines, resulting in only limited integration with the other species in WormBase.

If the gene models and other annotation have been deposited with an INSDC partner along with the genome sequence, there will be no need to submit any additional files direct to WormBase. However, we do appreciate that it is sometimes not possible to deposit the annotation in this way (either for technical reasons, or due to the requirement for embargo until publication). In those cases, we will therefore consider direct submissions of annotation in GFF3. We encourage authors to contact us in advance of the preparation of the files so that we can advise on content and form.

Sanity checks

We apply the following sanity checks on gene models; any models failing these checks are excluded.

Translation

Protein-coding models should translate into orthodox proteins (no stop codons or partial codons)

Minimum length

Protein-coding gene models should be at least 3 amino acids in length. This restriction is mainly for BLAST analysis (as this is the minimum word size), but prevents other problems too.
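Authors who want to pre-screen their gene models before submission could run something like the sketch below on each coding sequence. It is not the code WormBase runs. It assumes the standard genetic code, treats a single terminal stop codon as normal (so only internal stops are flagged), and uses a hypothetical function name, `cds_problems`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (an assumption)
MIN_AMINO_ACIDS = 3                  # minimum length stated above


def cds_problems(cds):
    """Return a list of reasons why a coding sequence would fail the checks."""
    seq = cds.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("partial codon")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Assumption: one stop codon at the very end is expected and is dropped.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    if any(codon in STOP_CODONS for codon in codons):
        problems.append("internal stop codon")
    if len(codons) < MIN_AMINO_ACIDS:
        problems.append("shorter than 3 amino acids")
    return problems


# Hypothetical sequences, for illustration only.
print(cds_problems("ATGGCTAAATGA"))  # a clean three-residue model
print(cds_problems("ATGTAAGCTGCT"))  # contains an in-frame stop
```

An empty list means the model passes both checks under these assumptions.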

Identifiers

If your genome assembly and annotation have been deposited with INSDC, we will use the INSDC identifiers, as these are guaranteed to be globally unique. If the submission is pending, and accessions are not yet available, WormBase may take the liberty of renaming your objects on initial submission, to ensure that they are globally unique within our database.

If re-submitting a new/updated version of data for a species that we already have in WormBase, care must be taken with identifiers, particularly those of the genes, transcripts and proteins. If possible, you should ensure that identifiers are propagated forward to new versions of the annotation, such that a specific gene (for example) retains the same identifier. Without this, users may have trouble migrating their work on a given species to the new data.
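One simple way to see how well identifiers have been carried forward is to compare the identifier sets of the old and new annotation. The sketch below does only that; it cannot tell whether a retained identifier still refers to the same gene, which needs a comparison of the models themselves. The function name `identifier_report` and the example identifiers are hypothetical.

```python
def identifier_report(old_ids, new_ids):
    """Summarise which identifiers survive between two annotation versions."""
    old, new = set(old_ids), set(new_ids)
    return {
        "retained": sorted(old & new),    # present in both versions
        "dropped": sorted(old - new),     # only in the old version
        "introduced": sorted(new - old),  # only in the new version
    }


# Hypothetical gene identifiers, for illustration only.
report = identifier_report(["g1", "g2", "g3"], ["g2", "g3", "g4"])
print(report)
```

A long "dropped" list is a sign that users will have trouble mapping their work onto the new data.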

Meta data

When writing to help@wormbase.org to tell us about the new genome assembly that you would like included in WormBase, it would be helpful to include the following information:

Handle to assembly in public repositories

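This could be one (or both) of:

  • An INSDC accession number
  • A direct link to the data under the NCBI genome FTP portal (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates)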

Assembly statistics

As much information as you have available, but minimally the stage which your assembly project has reached - one of:

  • Standard Draft
  • High-Quality Draft
  • Improved High-Quality Draft
  • Annotation-Directed Improvement
  • Noncontiguous Finished
  • Finished

These stages are defined in the paper "Genome Project Standards in a New Era of Sequencing"

Provenance

  • Species name
  • Strain ID (preferably the CGC strain ID)
  • Origin of the Strain
  • Some references describing the species

Attribution

  • Primary Data Contact
  • Primary Data Contact Email
  • Project URL
  • Citation