NGASP
This is the website for nGASP, the nematode genome annotation assessment project. It outlines the methodology and resources to be used for evaluating various gene predictors on the C. elegans genome.
Introduction
Reflecting the growing importance of comparative genomics in biomedical research, the Washington University Genome Sequencing Center is currently sequencing the genomes of the Caenorhabditis nematodes C. remanei, C. japonica and C. brenneri (formerly sp. CB5161). This will bring the number of Caenorhabditis genomes to five when added to the existing C. elegans and C. briggsae genome sequences.
While the C. elegans genome has been extensively annotated, comparatively few analyses have been performed on the remaining nematode genomes. Therefore, the WormBase consortium is launching a community-wide assessment of gene prediction software. The main goal of this nematode genome annotation assessment project (nGASP) is to objectively assess the accuracy of current state-of-the-art protein-coding gene prediction algorithms in C. elegans, and to apply this knowledge to the annotation of the other Caenorhabditis genomes.
The nGASP project parallels recent computational prediction initiatives including CASP, GASP, and EGASP. Participation is open to all academic, private sector, and government researchers and results will be made immediately available through a public database. A summary of the results will be submitted for peer-reviewed publication.
For nGASP, a set of regions representing ~10% (10 Mb) of the C. elegans genome release WS160 has been selected to evaluate the performance of the participating gene predictors. We have selected two sets of regions: the training set (10 Mb) is to be used in training your software (if needed) and includes curated gene models from WormBase. We will use the test set (also 10 Mb) for evaluating your gene prediction software.
Gene finders that have been previously optimized using a large fraction of C. elegans confirmed genes or other data outside the supplied training data sets must be retrained solely on the training set provided by the nGASP project.
We ask that you do not consult WormBase, GenBank or other databases for the curated gene models in the test regions.
Sensitivity and specificity assessment will be based on metrics utilized in GASP and EGASP. Gene predictions will be compared against two sets of high quality gene models: 1) a 'sensitivity/accuracy' set comprising genes supported by full-length cDNAs, and 2) a 'full set' containing all curated genes from the selected genomic regions. This competition differs from the Drosophila GASP and human EGASP in that correct gene structures for C. elegans are already publicly available; they should not, however, be used in making predictions.
nGASP will be held in two phases. The first phase of the competition is open to gene prediction programs that operate on genomic sequence (i.e. ab initio gene predictors), and to those that operate on combinations of protein and/or nucleotide alignments, including genome to genome alignments. After the first phase of the competition is complete, we will post the output of each of the predictors to the nGASP web site and begin phase two of the competition, which will be open to gene combiners. We will announce the deadline for the combiner phase of the project after all entries for phase I are submitted and have had formatting validated.
Recognizing that computational algorithms for gene finding are increasingly sophisticated and often require optimization, the nGASP project requests that all phase I gene predictions be submitted by December 31, 2006. nGASP participants are required to include all parameter files and command line options used to generate each submitted gene set prediction. Assessment of phase I predictions will be made by January 31, 2007 and released following the complete analysis of phase II predictions. All phase II (combiners) gene predictions must be submitted by March 31, 2007.
Methods
We will evaluate various gene predictors on a 10% sampling (10 Mb) of the C. elegans genome, release WS160.
Test & training data sets and nGASP prediction categories
The test and training sets each comprise 10 representative 1 Mb windows, roughly two per chromosome, and both sets were chosen with the same criteria, following the ENCODE methodology. None of the test and training set windows overlap. There is a Training Region Browser for browsing the training set regions. The methods used to choose the nGASP test and training regions are described on a separate webpage: Test and training regions.
To better compare different gene prediction methods, nGASP defines four prediction categories (below). If you want to submit a gene prediction set for the test set regions for a particular category of gene-finder, you must use only the allowable training and test data for that category of gene-finder in making your gene set, as described in the following section.
NOTE: Gene finders that have been previously optimized using a large fraction of C. elegans confirmed genes or other data outside the supplied training data sets must be retrained solely on the training set provided by the nGASP project.
- Category 1 = ab initio gene-finders. Allowed input data for training and test set regions:
- the C. elegans WS160 genome sequence for the training set regions
- the C. elegans WS160 genome sequence for the test set regions
- Repeat sequences in the test set regions and training set regions
- Confirmed Genes (GFF3 format) in the training set regions. Confirmed Genes are those genes that have at least one transcript that has been confirmed end-to-end. Please filter this file on the "Status=Confirmed" to find individual mRNA splice forms that have been fully confirmed. We ask that you do not consult this file, WormBase, GenBank or other databases for the curated gene models in the test regions.
- Unconfirmed Genes (GFF3 format) in the training set regions. Unconfirmed Genes are predicted and/or partially-confirmed genes. We ask that you do not consult this file, WormBase, GenBank or other databases for the curated gene models in the test regions.
- Category 2 = dual/multi-genome gene-finders. Allowed input data for test set regions:
- the C. elegans WS160 genome sequence for the training set regions
- the C. elegans WS160 genome sequence for the test set regions
- the cb1 C. briggsae genome assembly
- the pcap2 C. remanei genome assembly
- you are free to use the MLAGAN multi-genome alignment for the training-set regions (CLUSTAL format), (CLUSTAL format with orientation) and test-set regions (CLUSTAL format), (CLUSTAL format with orientation) that we provide, or to use another program to make a multi-genome alignment. You must state what multi-genome alignment method you used, when you submit your gene prediction set.
- Repeat sequences in the test set regions and training set regions
- Confirmed Genes (GFF3 format) in the training set regions. Confirmed Genes are those genes that have at least one transcript that has been confirmed end-to-end. Please filter this file on the "Status=Confirmed" to find individual mRNA splice forms that have been fully confirmed. We ask that you do not consult this file, WormBase, GenBank or other databases for the curated gene models in the test regions.
- Unconfirmed Genes (GFF3 format) in the training set regions. Unconfirmed Genes are predicted and/or partially-confirmed genes. We ask that you do not consult this file, WormBase, GenBank or other databases for the curated gene models in the test regions.
- Category 3 = gene-finders that use alignments of proteins, ESTs, or mRNAs. Allowed input data:
- the C. elegans WS160 genome sequence for the training set regions
- the C. elegans WS160 genome sequence for the test set regions
- the files of proteins with matches (FASTA format) and of ESTs/cDNAs with matches (FASTA format) that we provide
- you are free to use EST/cDNA-to-genome alignments (GFF3 format) and protein-to-genome alignments (GFF3 format) that we provide, or to use another program to align these ESTs, mRNAs and proteins to the C. elegans test set regions. You must state what alignment method you used, when you submit your gene prediction set. Note that the files we provide contain ESTs and full-length cDNAs aligned to the genome using the BLAT program.
- Repeat sequences in the test set regions and training set regions
- Confirmed Genes (GFF3 format) in the training set regions. Confirmed Genes are those genes that have at least one transcript that has been confirmed end-to-end. Please filter this file on the "Status=Confirmed" to find individual mRNA splice forms that have been fully confirmed. We ask that you do not consult this file, WormBase, GenBank or other databases for the curated gene models in the test regions.
- Unconfirmed Genes (GFF3 format) in the training set regions. Unconfirmed Genes are predicted and/or partially-confirmed genes. We ask that you do not consult this file, WormBase, GenBank or other databases for the curated gene models in the test regions.
- Category 4 = combiners, which use all available evidence. Allowed input data:
- the C. elegans WS160 genome sequence for the training set regions
- the C. elegans WS160 genome sequence for the test set regions
- all training data allowed for Categories 1, 2, and 3
- the gene predictions submitted in Categories 1, 2, and 3 during Phase I of nGASP (see the Phase I list under Submissions)
- confirmed genes in the 5' half of the test regions (Upon gene submission, Phase 2 participants must explicitly state if these test region confirmed genes were used for training)
Several of the files we have provided are in GFF3 format. Information on the GFF3 format can be found at the Sequence Ontology web site.
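The "Status=Confirmed" filter described for the Confirmed Genes file can be sketched as follows. This is a minimal illustration, not an nGASP tool: it assumes that fully confirmed splice forms appear as features of type `mRNA` and that `Status` is a key in the ninth (attributes) column. The function names are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def confirmed_mrnas(lines):
    """Yield GFF3 mRNA lines whose attributes include Status=Confirmed.

    Assumption: confirmed splice forms are 'mRNA' features carrying a
    'Status' attribute; adjust the type or key if the file differs.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        if parse_attributes(cols[8]).get("Status") == "Confirmed":
            yield line.rstrip("\n")
```

To use it on a file, open the GFF3 file and pass the file object to `confirmed_mrnas`; the function reads it line by line.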
Evaluating Accuracy
To evaluate the accuracy, sensitivity, and specificity of the submitted gene predictions, we will utilize GASP 2000 and/or ENCODE EGASP 2006 metrics.
We will develop two sets of high quality gene predictions to evaluate the gene predictors against.
- sensitivity/accuracy set - based on genes supported by full-length cDNAs (~100-200/10 Mb)
- full set - all curated genes from the selected genomic regions
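As a rough illustration of the kind of metric involved, the sketch below computes nucleotide-level sensitivity and specificity under the definitions conventionally used in gene-prediction assessments: Sn = TP / (TP + FN) and Sp = TP / (TP + FP). This page does not spell out the exact nGASP evaluation procedure, so these definitions are an assumption here. The function names are hypothetical, and intervals are taken to be 1-based, inclusive coding coordinates on a single sequence and strand.

```python
def covered_positions(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_sn_sp(reference_cds, predicted_cds):
    """Nucleotide-level sensitivity and specificity of a prediction.

    Assumed definitions: Sn = TP / (TP + FN), Sp = TP / (TP + FP),
    counting coding nucleotides only.
    """
    ref = covered_positions(reference_cds)
    pred = covered_positions(predicted_cds)
    tp = len(ref & pred)   # coding in both reference and prediction
    fn = len(ref - pred)   # coding in reference only
    fp = len(pred - ref)   # coding in prediction only
    sn = tp / (tp + fn) if ref else 0.0
    sp = tp / (tp + fp) if pred else 0.0
    return sn, sp
```

For example, a prediction covering only the first half of a reference coding region has half the sensitivity but full specificity, since everything it predicts is correct.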
A summary of the results will be submitted for peer-reviewed publication.
Target Dates
Dec 31, 2006 - phase I gene predictions submitted
Jan 31, 2007 - evaluation of phase I gene predictions completed, submission deadline of phase II announced
March 31, 2007 - phase II gene predictions submitted
TBA - preliminary results available (dependent on testing and crosschecking)
nGASP rules
To make nGASP a fair competition, participants must adhere to the following rules. If you have any questions, please contact Tristan Fiedler (see contact information).
Conditions for Participation:
- You agree that the submitted predictions will be evaluated by the nGASP team as specified on the wiki site.
- You agree that the nGASP team may publish the results of these evaluations and your prediction sets both in a journal and on the web.
- You certify that your predictions are created using only the sequence and training data specified for your category of gene prediction algorithm. *
- Prediction sets may be updated before the deadline stated below, but after this deadline, the predictions may not be updated for any reason.
- Predictions not submitted in validated GFF3 will not be evaluated.
- nGASP participants are required to specify which of the four nGASP categories of gene-finders they have used.
- If you want to submit a gene prediction set for the test set regions for a particular category of gene-finder, you must use only the allowable training and test data for that category of gene-finder in making your gene set, as described above.
- Gene finders that have been previously optimized using a large fraction of C. elegans confirmed genes or other data outside the supplied training data sets must be retrained solely on the training set provided by the nGASP project.
- For very standard parameters such as poly-A site models, non-worm data may be used. Any parameters not derived from the provided training data must be explicitly stated (with training code and data sets used, as necessary). Participants must justify why these parameters could not be created with the training data provided. Published weight matrices specific for worm genomes may not be used.
- We ask that you do not consult WormBase, GenBank or other databases for the curated gene models in the test regions.
- nGASP participants are required to include all parameter files and command line options used to generate each submitted gene set prediction. This information will be released with the gene predictions at the conclusion of the project.
- All gene predictions must be submitted in GFF3 format.
Gene models may be submitted in either format below:
gene→mRNA→{CDS,*_prime_UTR}
mRNA→{CDS,*_prime_UTR}
Introns and exons need not be annotated since they can be derived from CDS and *_prime_UTR data.
- The nGASP project requests that all phase I gene predictions be submitted by December 31, 2006.
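The rule above that introns and exons can be derived from CDS and *_prime_UTR data can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that all segments belong to one transcript on one strand and use 1-based, inclusive coordinates, and that a UTR segment directly abutting a CDS segment lies in the same exon.

```python
def derive_exons(segments):
    """Merge overlapping or directly adjacent (start, end) segments into exons."""
    exons = []
    for start, end in sorted(segments):
        if exons and start <= exons[-1][1] + 1:
            # Segment touches or overlaps the previous exon: extend it.
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons


def derive_introns(exons):
    """Return the gaps between consecutive exons as (start, end) introns."""
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    ]
```

Given a five_prime_UTR at 100-119, CDS segments at 120-200 and 300-400, and a three_prime_UTR at 401-450, `derive_exons` yields two exons (100-200 and 300-450) and `derive_introns` yields one intron (201-299).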
Submitting a Gene Prediction Set to nGASP
nGASP will be held in two phases. The first phase of the competition is open to gene prediction programs that operate on genomic sequence (i.e. ab initio gene predictors), and to those that operate on combinations of protein and/or nucleotide alignments, including genome to genome alignments. After the first phase of the competition is complete, we will post the output of each of the predictors to the nGASP web site and begin phase two of the competition, which will be open to gene combiners. We will announce the deadline for the combiner phase of the project after all entries for phase I are submitted and have had formatting validated.
The nGASP project requests that all phase I gene predictions be submitted by December 31, 2006 and phase II predictions by March 31, 2007. nGASP participants are required to include all parameter files and command line options used to generate each submitted gene set prediction. All gene predictions must be submitted in GFF3 format. Please note the gene prediction file upload procedure.
GFF3 files should use the genome coordinates rather than the testing/training coordinates as shown below.
Correct
I Coding_transcript gene 2064302 2064690 . + . Alias=Y37E3.3;ID=Gene:WBGene00021347;Name=WBGene00021347
Incorrect
I:2000001..3000000 Coding_transcript gene 64302 64690 . + . Alias=Y37E3.3;ID=Gene:WBGene00021347;Name=WBGene00021347
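The conversion from region-relative to genome coordinates implied by the two examples above can be sketched as follows. This is an illustration only. It assumes that the region-relative sequence name always follows the `chromosome:start..end` pattern shown in the incorrect example, that local coordinates are 1-based relative to the region start, and that columns are tab-separated as GFF3 requires. The function name is hypothetical.

```python
import re

# Matches region-style sequence names such as "I:2000001..3000000".
REGION_SEQID = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def to_genome_coordinates(gff3_line):
    """Rewrite one GFF3 feature line from region to genome coordinates.

    Lines that are not 9-column feature lines, or whose sequence name is
    not in region form, are returned unchanged.
    """
    line = gff3_line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 9:
        return line
    match = REGION_SEQID.match(cols[0])
    if match is None:
        return line  # assumed to be in genome coordinates already
    offset = int(match.group("start")) - 1
    cols[0] = match.group("chrom")
    cols[3] = str(int(cols[3]) + offset)  # feature start
    cols[4] = str(int(cols[4]) + offset)  # feature end
    return "\t".join(cols)
```

Applied to the incorrect line above, the region start 2000001 gives an offset of 2000000, so 64302 and 64690 become 2064302 and 2064690 on chromosome I, matching the correct line.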
Thank you for considering participation in nGASP. Through the participation of software development groups, projects such as this will continue to improve the annotation of forthcoming genome sequences. To enter the competition, please indicate your interest by sending your name and the name of your software package(s) to Tristan Fiedler.
Contact Information
Questions and comments may be sent to <ngasp-help@wormbase.org>. To receive updates concerning the nGASP project, please send an e-mail to <majordomo@wormbase.org> with 'subscribe ngasp-announce' in the body of the message.
Resources and Discussion
Here are some details of Resources created for nGASP, as well as who made them.
Here is some additional community feedback and Discussion for nGASP.
Here is a list of some gene finders. This is only intended as a preliminary and incomplete list of gene-finders, and we hope very much that many other authors of gene-finders will participate in nGASP. Participation is open to all academic, private sector, and government researchers. We will be happy to add the name of your gene-finder to this list, when you contact us.
Public Announcement
- The nGASP announcement may be distributed without restriction to researchers who may be interested in participating.
- List archives for ngasp-announce: http://www.wormbase.org/mailarch/ngasp-announce/ (username: wormbase, password: wormbase)
- Contact methods: direct email, wormbase-analysis listserv, WormBase & WormBook websites, MOD websites & listservs (e.g. eGASP - encode_egpw05@list.nih.gov)
Bioinformatics - Proposed conference paper/Letter to Editor
Genome Research - Insight/Outlook article
NAR - Survey/Summary paper
PLoS One - for community input!
Genome Biology - minireview/open letter
Gene Predictions
Submissions
Gene predictions for the following applications were received by December 31, 2006:
- AGENE
- CRAIG
- EUGENE
- FGENESH
- FGENESH++
- G3A (renamed to mGene)
- GENEMARKHMM
- SNAP
- AUGUSTUS
- ENSEMBL
- EXONHUNTER
- GENEID
- GLIMMERHMM
- MAKER
- NSCAN
- SGP2
Data
- Gene predictions and supporting data are available via the WormBase FTP site.
- nGASP training and test regions, including phase 1 gene predictions, can be browsed at http://dev.wormbase.org/ngasp
Results
Gene prediction evaluations are now available!
These are the nucleotide-level and exon-level results for nGASP. Transcript-level and gene-level results will be announced soon. The nucleotide-level and exon-level results were calculated by using a benchmark consisting of all confirmed and non-confirmed isoforms in all genes in the test set regions.
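For readers unfamiliar with exon-level scoring, the sketch below shows one common convention: a predicted exon counts as correct only if both of its boundaries match a benchmark exon exactly. Whether nGASP used precisely this definition is an assumption here, and the function name and tuple layout are hypothetical. Exons are held in sets, so an exon shared by several isoforms in the benchmark is counted once.

```python
def exon_sn_sp(reference_exons, predicted_exons):
    """Exon-level sensitivity and specificity with exact boundary matching.

    Each exon is a (seqid, strand, start, end) tuple. Assumed definitions:
    Sn = correct / reference exons, Sp = correct / predicted exons.
    """
    ref = set(reference_exons)    # duplicates across isoforms collapse
    pred = set(predicted_exons)
    correct = len(ref & pred)     # exons with both boundaries matching
    sn = correct / len(ref) if ref else 0.0
    sp = correct / len(pred) if pred else 0.0
    return sn, sp
```

Under this convention, a predicted exon that differs from the benchmark at a single splice site scores as entirely wrong at the exon level, even though it may contribute many true positives at the nucleotide level.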