Difference between revisions of "Genome Standards"

From WormBaseWiki

Revision as of 10:01, 19 May 2011


Overview

This is the list of criteria that a new genome assembly/project must satisfy before we can accept it for integration into WormBase.

We are happy to accept data pre-publication so that we can prepare it for addition to the database when you publish.

Requirements

Keep us informed of your progress

Contact us at help@wormbase.org early on in your project. We can then advise you about details and be prepared for your data when you publish it.

We like to be informed of upcoming genome submissions as early as possible. A typical lab that has contributed data is the Sanger Helminth group.

They run a pipeline with four phases, and we would normally integrate the data after their fourth phase:

1) Production (X months) - No interest for WormBase
      |
2) Finishing  (3 months) - No interest for WormBase
      |
3) Analysis   (3 months) - WormBase is interested from this stage on - contact us at help@wormbase.org
      |
2') Repeat - Finishing
      |
3') Repeat - Analysis
      |
      x n (Finishing and Analysis can be repeated for several rounds before the genome is published)
      |
4) Publish and submit to ENA/GenBank when the genome/gene set meets their standards
      |
   Confirm it meets the WormBase standards for integration, add to WormBase.

Submission to INSDC

Our minimal requirement is that the genome assembly (bottom-level genomic sequences and higher-level structure) has been deposited with one of the collaborating partners of the International Nucleotide Sequence Database Collaboration (www.insdc.org). We can then easily extract the data from the public archives and have confidence that the authors regard the assembly as reasonably stable.

Gene models

We prefer that you predict gene models for your genome before submission.

All gene models should translate to at least 3 amino acids, mainly because this is the minimum word size for BLAST analysis.

Gene models should translate into orthodox proteins (no stop codons or partial codons).
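
The two gene model rules above can be checked before submission. The following is a minimal Python sketch, not a WormBase tool: the function name `check_cds` is hypothetical, it assumes the standard genetic code, and it assumes that a single stop codon at the very end of the coding sequence is acceptable (the rule above does not say either way).

```python
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_AMINO_ACIDS = 3  # minimum length stated in the requirements above


def check_cds(cds):
    """Return a list of problems found in one spliced coding sequence."""
    seq = cds.upper()
    problems = []

    # A length that is not a multiple of 3 leaves a partial codon at the end.
    remainder = len(seq) % 3
    if remainder:
        problems.append("partial codon")

    # Split the complete part of the sequence into codons.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - remainder, 3)]

    # Assumption: one terminal stop codon is tolerated and not counted.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]

    if any(codon in STOP_CODONS for codon in codons):
        problems.append("internal stop codon")

    if len(codons) < MIN_AMINO_ACIDS:
        problems.append("fewer than 3 amino acids")

    return problems
```

For example, `check_cds("ATGAAACCCTAA")` reports no problems, while `check_cds("ATGTAACCC")` reports an internal stop codon.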

Minimum N50

We do not require a minimum N50. If you are happy for your data to be publicly available from the ENA/GenBank database, then we will consider including it in WormBase, but we reserve the right to reject genomes that are too fragmented for us to be able to load into our data structures.
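
For reference, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch for computing it from a list of contig or scaffold lengths follows; the function name `n50` and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


# Illustrative lengths only, not real assembly data.
print(n50([100, 80, 50, 30, 20, 20]))
```

A lower N50 for the same total length indicates a more fragmented assembly.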

Meta data

When writing to help@wormbase.org to tell us about the new genome assembly that you would like included in WormBase, it would be helpful to include the following information:

Handle to assembly in public repositories

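This could be one (or both) of:

  • An INSDC accession number
  • A direct link to the data under the NCBI genome FTP portal (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates)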

Assembly statistics

Please give us as much information as you have available, and at a minimum the stage that your assembly project has reached, which is one of:

  • Standard Draft
  • High-Quality Draft
  • Improved High-Quality Draft
  • Annotation-Directed Improvement
  • Noncontiguous Finished
  • Finished

These stages are defined in the paper "Genome Project Standards in a New Era of Sequencing"

Provenance

  • Species name
  • Strain ID (preferably the CGC strain ID)
  • Origin of the Strain
  • Some references describing the species

Attribution

  • Primary Data Contact:
  • Primary Data Contact Email:
  • Project URL:
  • FTP Site:
  • Citation:
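
One way to make sure nothing is forgotten before writing to help@wormbase.org is to collect the items listed above in a small record and check it for gaps. The sketch below is only an illustration: the class `GenomeMetadata`, its field names and the function `review` are hypothetical, and WormBase does not prescribe any particular format for this information.

```python
from dataclasses import dataclass, asdict

# Assembly stages as listed in the "Assembly statistics" section.
ASSEMBLY_STAGES = (
    "Standard Draft",
    "High-Quality Draft",
    "Improved High-Quality Draft",
    "Annotation-Directed Improvement",
    "Noncontiguous Finished",
    "Finished",
)


@dataclass
class GenomeMetadata:
    """Hypothetical container for the requested metadata items."""
    insdc_accession: str = ""
    assembly_stage: str = ""
    species_name: str = ""
    strain_id: str = ""
    strain_origin: str = ""
    references: str = ""
    primary_contact: str = ""
    primary_contact_email: str = ""
    project_url: str = ""
    ftp_site: str = ""
    citation: str = ""


def review(meta):
    """Return notes about empty fields and unrecognised assembly stages."""
    notes = [
        f"missing: {name}"
        for name, value in asdict(meta).items()
        if not value.strip()
    ]
    if meta.assembly_stage and meta.assembly_stage not in ASSEMBLY_STAGES:
        notes.append(f"unknown assembly stage: {meta.assembly_stage}")
    return notes
```

Calling `review` on a partly filled record returns one note per empty field, which can serve as a checklist while drafting the email.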

Submission of annotation

If the gene models and other annotation have been deposited with an INSDC partner along with the genome sequence, there is no need to submit any additional files directly to WormBase. However, we appreciate that it is sometimes not possible to deposit the annotation in this way, either for technical reasons or because the data is under embargo until publication.

We are therefore happy to accept direct submissions of annotation in GFF2 or GFF3.

TODO: Specify precise GFF2/GFF3 required (structure, source/type fields etc) for automatic flow into WormBase. Filenames less important.
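
Until the precise requirements are specified, a basic structural check of a GFF3 file may still catch common problems. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, with attributes written as semicolon-separated tag=value pairs); it says nothing about the source or type values WormBase will eventually require, and the function name `parse_gff3_line` is hypothetical.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines.
    Raises ValueError if the line is structurally invalid.
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None

    fields = line.split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")

    record = dict(zip(GFF3_COLUMNS, fields))

    # Coordinates are 1-based integers with start <= end.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    if record["start"] > record["end"]:
        raise ValueError("start must not exceed end")

    if record["strand"] not in {"+", "-", ".", "?"}:
        raise ValueError(f"invalid strand: {record['strand']}")

    # Attributes are tag=value pairs; a lone "." means no attributes.
    raw = record["attributes"]
    if raw == ".":
        record["attributes"] = {}
    else:
        record["attributes"] = dict(
            pair.split("=", 1) for pair in raw.split(";") if pair
        )
    return record
```

The function could be applied to every line of a file, collecting the line numbers that raise `ValueError` so they can be corrected before the file is sent.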