Difference between revisions of "Genome Standards"

From WormBaseWiki

Revision as of 11:56, 13 July 2010

WormBase Genome Integration Standards

This is a stub so feel free to add content.

Overview:

At the last ABM, the board asked us to come up with a list of criteria that a genome assembly/project should meet before WormBase would agree to integrate the organism into the database.

This topic has re-surfaced as SangerWB have been in contact with the Sanger Helminth group regarding future integration of their data.

They run a pipeline with four phases:

1) Production (X months) - No Interest for WB 
      |
2) Finishing  (3 Months) - No Interest for WB
      |
3) Analysis   (3 months) - Depending on our criteria we could be interested here
      |
2) Repeat - Finishing
      |
3) Repeat - Analysis
      |
      x n
      |
4) Publish when the genome/gene set meets their standards

The Helminth Co-ordinator is going to send some information about the standards they will be working to. These will be based on the paper "Genome Project Standards in a New Era of Sequencing", so that paper might be a good starting point for WormBase's own list of criteria (although it is not very detailed).


Discussion

Expected number of genes

To estimate how many genes could be expected in an assembly, you could take the average gene length and work out the maximum number of genes that could be predicted on the available contigs (if you can come up with the intergenic percentage, you can include that). Compare this with a rough guess at the gene number (like 20k) and you get a % completeness.

Example (Brugia):

average predicted gene length is 2126bp

27210 contigs total
24074 contigs smaller than one gene
1424 contigs with an expected maximum of one gene
333 contigs with an expected maximum of two genes
98 contigs with an expected maximum of three genes
168 contigs with an expected maximum of four genes
195 contigs with an expected maximum of four genes
...

Basically, assuming no intergenic sequence, you could expect a maximum of 32924 genes in that assembly, and we predicted 18348.
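
The estimate above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script that produced the Brugia numbers: the function names, the `intergenic_fraction` parameter (one possible way to "include" the intergenic percentage) and the toy contig lengths are all assumptions. Only the average gene length (2126bp) and the two totals (32924 and 18348) come from the example.

```python
from collections import Counter


def max_genes_per_contig(contig_lengths, avg_gene_length, intergenic_fraction=0.0):
    """Upper bound on genes per contig: how many average-length genes fit.

    intergenic_fraction is a hypothetical knob: the share of each contig
    assumed to be intergenic and therefore unavailable for genes.
    """
    usable = [int(length * (1.0 - intergenic_fraction)) for length in contig_lengths]
    return [length // avg_gene_length for length in usable]


def summarize(contig_lengths, avg_gene_length, intergenic_fraction=0.0):
    """Return (histogram of max genes per contig, total upper bound)."""
    per_contig = max_genes_per_contig(contig_lengths, avg_gene_length, intergenic_fraction)
    return Counter(per_contig), sum(per_contig)


def fraction_of_upper_bound(predicted_genes, upper_bound):
    """Ratio of predicted genes to the theoretical maximum."""
    return predicted_genes / upper_bound


if __name__ == "__main__":
    # Toy contig lengths (made up), with the Brugia average gene length.
    toy_contigs = [1000, 2126, 4300, 9000]
    histogram, upper_bound = summarize(toy_contigs, avg_gene_length=2126)
    for n_genes in sorted(histogram):
        print(f"{histogram[n_genes]} contigs with an expected maximum of {n_genes} genes")
    print("upper bound:", upper_bound)

    # Totals quoted in the Brugia example.
    print("predicted / upper bound:", round(fraction_of_upper_bound(18348, 32924), 3))
```

In practice the contig lengths would be read from the assembly (for example from a FASTA index), and the ratio would be read alongside the rough gene-number guess rather than as a completeness figure on its own.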


WormBase Standards Document

Assembly stats

Standard Draft:

High-Quality Draft:

Improved High-Quality Draft:

Annotation-Directed Improvement:

Noncontiguous Finished:

Finished:


Gene Models