Difference between revisions of "Genome Standards"
Revision as of 10:29, 18 May 2011
WormBase Genome Integration Standards
This is the list of criteria that a genome assembly/project should meet before WormBase can agree to integrate the organism into the database.
Submission of genomes to the WormBase database is a common requirement. This document is a guide for those who wish their nematode genomes to be included in WormBase.
Keep us informed of your progress
Contact us at email@example.com early on in your project. We can then advise you about details and be prepared for your data when you publish it.
We like to be informed of upcoming genome submissions as early as possible. For example, a typical lab that has contributed data is the Sanger Helminth group.
They run a pipeline with four phases, and we would normally integrate the data after the fourth phase:
1) Production (X months) - No interest for WormBase
2) Finishing (3 months) - No interest for WormBase
3) Analysis (3 months) - WormBase is interested from this stage on - contact us at firstname.lastname@example.org
Phases 2 (Finishing) and 3 (Analysis) repeat n times, so the genome can go through several rounds before it is published.
4) Publish and submit to ENA/GenBank when the genome/gene set meets their standards
Finally, confirm that it meets the WormBase standards for integration, then add it to WormBase.
Submission to ENA/GenBank
Normally we expect the genome to be submitted to the public nucleotide databases.
We can then easily extract the data from the public databases and the assembly will be regarded as reasonably stable by the authors.
Gene model prediction
We prefer you to do some prediction of the gene models in your genome.
All gene models should be at least 3 amino acids in length, mainly because this is the minimum word size for BLAST analysis.
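The length requirement can be checked before submission. The following is a minimal sketch, assuming the translations are available as FASTA text; the function names, the sample records, and the choice to ignore a trailing `*` stop symbol are illustrative assumptions, not part of any WormBase tooling.

```python
MIN_PEPTIDE_LENGTH = 3  # minimum length in amino acids, as stated in the text


def parse_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the identifier.
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def too_short(records, min_length=MIN_PEPTIDE_LENGTH):
    """Return the sorted identifiers of peptides shorter than min_length.

    Assumption: a trailing '*' (stop symbol) does not count as an amino acid.
    """
    return sorted(
        name for name, seq in records.items()
        if len(seq.rstrip("*")) < min_length
    )


# Hypothetical records for illustration only.
example = """>gene1
MKT
>gene2
MA*
>gene3
MSTLLKQ
"""

print(too_short(parse_fasta(example)))  # prints ['gene2']
```

Any identifiers reported by `too_short` would need to be reviewed or removed from the gene set before submission.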
If prediction of gene models is not possible, we would like to have your estimate of the number of genes in the genome.
To get an estimate of how many genes could be expected in an assembly, take the average gene length and see how many genes of that length would fit on the available contigs.
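One literal reading of this estimate is sketched below: for each contig, count how many whole genes of average length would fit, then sum over all contigs. The function name and the numbers are hypothetical, and the approach ignores intergenic sequence, so it is better treated as an upper bound than as a prediction.

```python
def estimate_gene_count(contig_lengths, average_gene_length):
    """Count how many genes of average length fit on the given contigs.

    Each contig contributes only whole genes, so contigs shorter than
    the average gene length contribute nothing.
    """
    if average_gene_length <= 0:
        raise ValueError("average_gene_length must be positive")
    return sum(length // average_gene_length for length in contig_lengths)


# Hypothetical contig lengths (in bases) and average gene length.
contigs = [12000, 4500, 900]
print(estimate_gene_count(contigs, 3000))  # prints 5 (4 + 1 + 0)
```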
Improved High-Quality Draft:
Primary Data Contact:
Primary Data Contact Email:
- Submission to a public Nucleotide Repository
- Wiki description of the Species, submitted by data producer if possible.
- N50 of ?
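For reference, N50 is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. The sketch below computes it from a list of contig lengths; the function name and example values are illustrative, and the threshold WormBase expects is left open above, so none is assumed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths: total 290, half 145, and 80 + 70 = 150.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```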
If the data cannot be submitted to ENA/GenBank, the data should be provided to WormBase in standardized formats using the following conventions.
g_species refers to the first letter of the genus name and the species name joined by a '_' character, for example c_elegans.
- Genomic Sequence
  File format: FASTA
  File name: g_species.genome.fa
- Conceptual transcripts (spliced)
  File format: FASTA
  File name: g_species.mrna.fa
- Conceptual transcripts (unspliced) - the gene models
  File format: GFF2 or GFF3
  File name: g_species.gff2 or g_species.gff3
- Conceptual translations
  File format: FASTA
  File name: g_species.pep
- Genomic Features
  File format: GFF2 or GFF3
  File name: g_species.gff2 or g_species.gff3 (these should be in the same file as the gene models)
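The naming convention above can be sketched as a small helper that lists the expected file names for a species. This is an illustration only: the function names are hypothetical, and the prefix is assumed to be the genus initial plus the species name, as in the c_elegans example.

```python
def species_prefix(genus, species):
    """Build the g_species prefix, e.g. 'c_elegans'.

    Assumption: the prefix uses only the first letter of the genus,
    following the c_elegans example in the text.
    """
    return f"{genus[0].lower()}_{species.lower()}"


def expected_files(genus, species, gff_version=3):
    """Return the expected file name for each data type."""
    if gff_version not in (2, 3):
        raise ValueError("gff_version must be 2 or 3")
    prefix = species_prefix(genus, species)
    return {
        "genomic sequence": f"{prefix}.genome.fa",
        "spliced transcripts": f"{prefix}.mrna.fa",
        # Gene models and genomic features share a single GFF file.
        "gene models and genomic features": f"{prefix}.gff{gff_version}",
        "translations": f"{prefix}.pep",
    }


for data_type, file_name in expected_files("Caenorhabditis", "elegans").items():
    print(f"{data_type}: {file_name}")
```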
The Helminth Co-ordinator discusses some requirements for submitting genomes in the paper "Genome Project Standards in a New Era of Sequencing"