Difference between revisions of "Gene Ontology"

 
====Progress Report 2011====
 
='''WormBase Gene Ontology Progress Report, December 2010'''=

=='''Staff'''==

'''Juancarlos Chan'''

Developer, WormBase, Caltech, Pasadena, CA.

'''Ranjana Kishore'''

Curator, WormBase, Caltech, Pasadena, CA.

'''Paul Sternberg'''

PI, WormBase, Caltech, Pasadena, CA.

'''Kimberly Van Auken'''

Curator, WormBase, Caltech, Pasadena, CA.

'''Jolene Fernandes'''

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

'''Gary Schindelman'''

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

'''Ruihua Fang'''

Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA

'''Additional technical support: '''

'''Anthony Rogers'''

WormBase, Sanger Center, Hinxton, UK

'''Gary Williams'''

WormBase, Sanger Center, Hinxton, UK

'''Textpresso:'''

'''Hans Michael Muller'''

Project Leader, Textpresso, Caltech, Pasadena, CA

'''Arun Rangarajan'''

Developer, Textpresso, Caltech, Pasadena, CA

=='''Annotation Progress'''==

'''Table 1: Number of Genes Annotated'''

{| class="wikitable" style="text-align:center"
|-
! Type of Annotation !! Number of Genes Annotated, Dec 2010 !! % Change from Dec 2009 !! Number of Unique GO Terms !! Total Number of GO Terms
|-
! Manual Annotation
| 2,098 || +19.8 || 1,840 || 10,467
|-
! Phenotype2GO Mappings
| 6,309 || -6.1 || 113 || 42,349
|-
! IEA/Electronic
| 12,954 || +0.83 || 1,476 || 55,091
|-
! Total
| 15,799 || +0.84 || 2,937 || 107,907
|}

=='''Methods and Strategies for Annotation'''==

===Literature Curation===

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation effort. Curators use a GO curation check-out form that gives easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the most recent date on which annotations to any of the three ontologies were made. Curators can query and sort the list according to reference count, gene name, and curation status.

=== Computational Methods ===

InterPro2GO mappings for IEA annotations: C. elegans proteins are annotated to GO terms by electronically matching their protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/) and then applying the InterPro-to-GO term mappings provided in the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). These annotations are not reviewed for accuracy by human curators; accordingly, all of them use the evidence code 'IEA'.

TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term 'integral to membrane'. About 6,710 gene products are annotated to this term via this pipeline.

InterPro2GO and TMHMM2GO annotations are updated at every database release.

=== Semi-automated Methods ===

'''Review and improvement of the Phenotype2GO data pipeline'''

WormBase uses a well-defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. A total of 201 phenotype terms used in annotation have been mapped to a GO term. At every WormBase database build, these mappings are used to automatically generate Biological Process annotations to genes using the IMP evidence code. The complete list of WormBase phenotype to GO term mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology.
We have begun a detailed review of our phenotype to GO term mappings. We are changing this pipeline so that annotations are made with a stricter use of the IMP evidence code, as recently described in GO consortium annotation policies. This will involve removing some high-level phenotype term to GO term mappings and/or excluding certain RNAi experiments or papers from the pipeline, as well as reviewing and revising the scripts.
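
The mapping step described above can be pictured with a minimal sketch. It is not the WormBase pipeline itself: the record types, the function name, and every identifier in the example are hypothetical placeholders, and the curator-approved paper filter is only one assumed way of excluding experiments.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class PhenotypeObservation(NamedTuple):
    """One gene-phenotype observation (hypothetical record layout)."""
    gene_id: str
    phenotype_id: str
    paper_id: str


class GoAnnotation(NamedTuple):
    """One derived GO annotation (hypothetical record layout)."""
    gene_id: str
    go_id: str
    evidence_code: str
    reference: str


def phenotype_to_go(
    observations: Iterable[PhenotypeObservation],
    mapping: Dict[str, str],
    included_papers: Set[str],
) -> List[GoAnnotation]:
    """Derive IMP annotations from phenotype observations.

    Observations from papers outside `included_papers`, or with a
    phenotype term that has no GO mapping, are skipped.
    """
    annotations = set()
    for obs in observations:
        if obs.paper_id not in included_papers:
            continue  # paper not approved for propagation
        go_id = mapping.get(obs.phenotype_id)
        if go_id is None:
            continue  # phenotype term is not mapped to a GO term
        annotations.add(GoAnnotation(obs.gene_id, go_id, "IMP", obs.paper_id))
    return sorted(annotations)


if __name__ == "__main__":
    # All identifiers below are made-up placeholders.
    demo = [
        PhenotypeObservation("WBGene_A", "WBPhenotype:X1", "WBPaper_1"),
        PhenotypeObservation("WBGene_B", "WBPhenotype:X2", "WBPaper_1"),
        PhenotypeObservation("WBGene_C", "WBPhenotype:X1", "WBPaper_2"),
    ]
    for row in phenotype_to_go(demo, {"WBPhenotype:X1": "GO:BP1"}, {"WBPaper_1"}):
        print("\t".join(row))
```

Removing a high-level mapping then amounts to deleting one key from `mapping`, and excluding a paper amounts to dropping it from `included_papers`.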

'''Textpresso-Based Cellular Component Curation'''

As a complementary approach to our manual curation pipeline, we continue to employ the Textpresso information retrieval system to annotate C. elegans gene products to the Cellular Component ontology. Textpresso searches for component annotations use three different categories (Assay Terms, Component Terms, and Verbs) plus a category of C. elegans protein names. Searches using these categories return sentences that contain a match to at least one term in each category. We use this approach to annotate newly published papers as well as papers published prior to 2010. For newly published papers, we prioritize our searches by first searching through papers that, as determined by a Support Vector Machine document classifier, have a relatively high probability of containing expression data.
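
The "at least one term in each category" rule can be illustrated with a small sketch. This is only an approximation under stated assumptions: real Textpresso categories are far larger and are not matched by this code, and the category contents, tokenizer, and function names below are all hypothetical.

```python
import re
from typing import Dict, Set


def tokenize(sentence: str) -> Set[str]:
    """Lower-case the sentence and split it into word-like tokens."""
    return set(re.findall(r"[a-z0-9\-]+", sentence.lower()))


def matches_all_categories(sentence: str, categories: Dict[str, Set[str]]) -> bool:
    """Return True if the sentence contains at least one term from every category.

    Terms are assumed to be single lower-case tokens.
    """
    tokens = tokenize(sentence)
    return all(tokens & terms for terms in categories.values())


# Hypothetical, heavily abbreviated category contents.
CATEGORIES = {
    "assay_terms": {"gfp", "antibody", "immunostaining"},
    "component_terms": {"nucleus", "membrane", "cytoplasm"},
    "verbs": {"localizes", "localized", "expressed"},
    "protein_names": {"vha-6", "egl-9"},
}

if __name__ == "__main__":
    for text in (
        "VHA-6 GFP localizes to the apical membrane.",
        "egl-9 mutants are viable.",
    ):
        print(matches_all_categories(text, CATEGORIES), text)
```

Requiring a hit in every category is what keeps such a search focused on sentences that report a localization experiment rather than any mention of a protein.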

Over the past year, through our manual and Textpresso-based pipelines, we added new cellular component annotations to 292 genes. Of these, 113 genes were annotated from papers published in 2010, with the remainder of the annotations coming from previously published papers. For the 2010 papers, Textpresso's annotation recall was 91.3%. We have not yet measured the recall on papers annotated this year but published prior to 2010.

==='''Priorities for Annotation'''===

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes


=='''Presentations and Publications'''==

[[Publications, Talks, Posters 2010-]]

=='''Ontology Development Contributions'''==

WormBase curators have contributed to ontology discussion and development in the areas of: <br>
Biology of the cilium:
*updates/revisions to terms added in 2005

Biology of the phagosome-lysosome during apoptotic cell clearance, terms added:
*phagosome maturation involved in apoptotic cell clearance
*phagosome acidification involved in apoptotic cell clearance
*phagolysosome assembly involved in apoptotic cell clearance
*phagosome-lysosome docking involved in apoptotic cell clearance
*phagosome-lysosome fusion involved in apoptotic cell clearance

Biology of muscle, terms added:
*striated muscle contraction involved in embryonic body morphogenesis
*striated muscle myosin thick filament assembly
*striated muscle paramyosin thick filament assembly (2010)
*alpha-tubulin acetylation

Biology of nematode larval development, terms added:
*regulation (includes positive and negative regulation child terms) of nematode larval development
*regulation of (includes positive and negative regulation terms) dauer larval development

Other terms added were:
*neuropeptide receptor binding
*determination of left/right asymmetry in the nervous system
*regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior
*detoxification of arsenic
*chondroitin sulfate proteoglycan binding
*chondroitin sulfate binding
*octopamine/tyramine signaling involved in the response to food (and the regulation terms)

=='''Annotation Outreach and User Advocacy Efforts'''==

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.

=='''Other Highlights'''==

===Curation Tools: Ontology Annotator===

We have continued development of our web-based curation tool, the Ontology Annotator, which can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of its more useful features include bulk annotation, autocomplete, and retrieval and filtering of data for editing. We have improved the functionality of existing annotation interfaces and have added the following fully functional curation interfaces: antibody, small molecule and gene regulation. We are currently working on 2 new curation interfaces: gene regulation and expression pattern related pictures.

===Textpresso- and HMM-Based Molecular Function Curation===

In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function (MF) curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our approach has been to divide molecular function curation into two broad categories: '''1) macromolecular interactions''' and '''2) enzymatic and transporter activities'''.

'''Macromolecular Interactions'''

For the former, we employ a two-step curation pipeline. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched for matching sentences using Textpresso categories developed specifically for identifying sentences describing macromolecular interactions. Our initial results indicate that the SVM performs very well on predicting high confidence true positive papers (precision=86.4%) as well as true negative papers (precision=88.9%). The Textpresso categories also have a high recall (89.7%) but a relatively low precision (47.3%). In practice, this means that while curators are able to retrieve and curate nearly all macromolecular interactions, they must still examine a number of false positive sentences, so curation efficiency needs to be improved. We are fine-tuning the Textpresso categories for macromolecular interactions to address this issue. The new Textpresso categories for this data are scheduled to be in place at the end of December 2010.
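
To make the efficiency cost of the reported sentence-level figures concrete, the sketch below derives two summary numbers from them. Neither number is reported in the source; the F1 score and the "sentences read per true positive" figure are standard derived quantities, and the function names are our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def sentences_per_true_positive(precision: float) -> float:
    """Average number of returned sentences a curator reads per correct one."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    return 1.0 / precision


if __name__ == "__main__":
    precision, recall = 0.473, 0.897  # Textpresso category figures quoted above
    print(f"F1 = {f1_score(precision, recall):.3f}")
    print(f"sentences per true positive = {sentences_per_true_positive(precision):.2f}")
```

With the figures above, F1 comes out at roughly 0.62 and a curator reads about 2.1 returned sentences for each curatable one, which is the overhead the category fine-tuning aims to reduce.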

'''Enzymatic and Transporter Activities'''

In parallel with these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. For the former, we have collected 419 sentences from 64 papers. Using these sentences, we have developed two new Textpresso categories for enzymatic and transporter activities. As with the new macromolecular interaction categories, we plan to implement these categories on Textpresso by the end of December 2010.

In collaboration with Hans-Michael Mueller, we are also training a Hidden Markov Model to identify sentences describing enzymatic and transporter activities. At present, we are in the third round of training and evaluation for the model. We hope to complete an initial evaluation of the model by early next year and will report on its performance.

===Textpresso-Based identification of literature for human disease gene orthologs===

Human disease gene orthologs are a high priority annotation list for WormBase and other model organism databases. Sequence-based or similarity searches for human disease gene orthologs provide useful candidates, but the biological information needs to be extracted manually from the literature. We are working on a project that combines Textpresso-based categories and keyword searches to identify ''C. elegans'' papers that describe the study of a human disease gene ortholog. Sentences in which a ''C. elegans'' gene co-occurs with a human disease term were deemed important for this process. A new 'human disease' Textpresso category was formed using the human disease ontology in the OBO foundry, the Neuroscience Information Framework Standardized (NIFSTD) ontology and the existing Textpresso human disease category. Several disease terms that would increase the number of false positives were removed iteratively by a manual process. Because an 'AND' query on just two categories (human disease and ''C. elegans'' gene) returned too many false positives, a third category was formed with the words 'ortholog', 'homolog', 'similar', 'relate' and 'model', and added to the 'AND' query. The query is being fine-tuned to increase precision and recall. Once established, it will automate the flagging of new articles that contain disease gene ortholog data. Subsequently, this method can be used for the extraction of relevant information.

===Textpresso-Based Curation Pipelines for Other MODs===

We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not captured by their existing GO curation pipeline. The results of these searches are being presented in a curation form that will allow TAIR curators to make new annotations and retrieve the annotations in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.

===Collaboration with BioGRID===

In late summer, we began a collaboration with BioGRID (Biological General Repository for Interaction Datasets) to add protein binding annotations (IPI evidence code) curated for GO to BioGRID. In addition, we plan to add curated genetic interactions (IGI evidence code) to BioGRID. Our initial work will focus on adding interactions curated as part of the Reference Genome's Wnt signaling pathway annotation project. Concurrently, we will also add any newly curated protein binding annotations (Wnt pathway or otherwise) to BioGRID.
  
  

Revision as of 16:13, 8 December 2011


Manual Literature Curation

Reference Genome (see also Reference Genome Inferential Annotations)

  1. Lung Development Targets (November 2009 - February 2010)

Transcription-related re-annotation

This summarizes the annotations that may need to be revised due to changes in the GO's representation of transcription.

Seven Molecular Function terms will be obsoleted. They are listed below with the number of associated manual C. elegans annotations:

  • GO:0003704 specific RNA polymerase II transcription factor activity - 8 (all Kimberly)
 ceh-24 - ISS - changed to  GO:0000981 sequence-specific DNA binding RNA polymerase II transcription factor activity 
 ceh-27 - ISS - same as above
 ceh-28 - ISS - same as above
 elt-1 - IDA - same as above
 elt-3 - IMP - for WBPaper00004593, removed MF term (no longer comfortable with IMP MF terms from this type of experiment); 
   also made corresponding BP term less granular, from positive regulation to just gene-specific transcription from pol II promoter
 elt-3 - IMP - same as for elt-3 above
 hlh-3 - IMP - for WBPaper00031977, removed MF term for same reason as above, also made BP term less granular as above
 zip-2 - IMP - for WBPaper00035891, same as above for elt-3 and hlh-3

Semi-Automated Methods of Curation

Textpresso-Based Curation

GO Cellular Component Curation - MOD-Specific Pages
General specifications
dictyBase
FlyBase
TAIR (this is the older page no longer used)
TAIR_CCC
WormBase
GO Cellular Component Curation - General Issues
Processing Gene and Protein Names for Searches and Curation
Specifications for CCC Curation from Textpresso Search Page
  • MFC - GO Molecular Function Curation using Textpresso
mf_hmm tool
in vitro flagging

Phenotype2GO pipeline (Sanger and Caltech)

  • The old Sanger script that generates the gene_association file (from Igor's work in January 2009) was changed. Instead of an exclusion list, an 'include list' of papers (mostly large-scale genome-wide studies) is provided to the script. This list is curator approved and explicitly agreed upon for the propagation of GO terms to genes based on their RNAi phenotypes.
  • A new script is used. To use it, invoke the script with the -includelist option, e.g.: Run parse_go_terms_new.pl -o gene_association.wb -rnai -include includelist.txt (this example only parses RNAi experiments; to generate the full file, you should also give the '-gene -var' options as before).
  • If you invoke it with the '-acefile <filename>' option, the script will also generate Gene-GO_term connections derived from phenotypes. This is currently done by the phenotype procedure of the inherit_GO_terms.pl script.
  • The old script, inherit_GO_terms.pl, does not consult any exclusion/inclusion files. To alter Sanger's version of parse_go_terms_new.pl, a patch file was provided.
  • Current status: From Igor's e-mail, March 2009: I don't think the phenotype option of the inherit_go_terms script has been disabled. The script should be run without the '-variation' option, but the gene_association file still has those. Try this:

grep -i wbpheno gene_association.WS200.wb.ce | grep -v RNAi

This is now resolved.
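
For curators who prefer to run the same check from a script, the grep pipeline above can be mirrored in a few lines of Python. This is only an illustrative equivalent: the function name is ours, and the file name is simply the one used in the grep example.

```python
from typing import Iterable, List


def phenotype_lines_without_rnai(lines: Iterable[str]) -> List[str]:
    """Keep lines that mention 'wbpheno' (any case) but not 'RNAi' (exact case).

    Mirrors: grep -i wbpheno FILE | grep -v RNAi
    """
    return [
        line
        for line in lines
        if "wbpheno" in line.lower() and "RNAi" not in line
    ]


if __name__ == "__main__":
    path = "gene_association.WS200.wb.ce"  # file name taken from the grep example
    try:
        with open(path, encoding="utf-8") as handle:
            for hit in phenotype_lines_without_rnai(handle):
                print(hit, end="")
    except FileNotFoundError:
        print(f"{path} not found; nothing to check")
```

An empty result means that every phenotype-derived line in the file comes from an RNAi experiment, which is the state the bullet above describes as resolved.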

InterPro2GO Mappings for IEA Annotations
Reference Genome Inferential Annotations

Software Development: Tools and Scripts

Reference Genome Reports - Annotation Coverage

Ontology Annotator - The GO annotation interface

Textpresso related forms

Taxon Constraints

From Chris Mungall, 8/19/2011:

The taxon checks are run weekly, and the reports deposited here:

   http://www.geneontology.org/quality_control/annotation_checks/taxon_checks/

Note that this service will be subsumed into a more comprehensive annotation QC service (apologies if you weren't at the USC meeting, where this was demoed). This is, in general, the plan for many of the ad-hoc scripts and cron reports we perform now. I will send an email to the GOC list next week describing the roll-out process for this.

For the QC checks, the idea is to push the checking as far upstream as possible. A weekly report is too reactive. This could be done at the time of submission. Even better, the annotation tool could use the central web service at the time of annotation.

WormBase contributions to Gene Ontology content

2011
  • basement membrane assembly involved in embryonic body morphogenesis
  • parentage of dauer larval development - also include dormancy process
  • regulation, positive, negative of dipeptide transport
  • regulation of phospholipid transport
  • regulation, positive, negative of endocytic recycling
  • suggested change to InterPro2GO mapping for GoLoco motif
  • GABAergic neuron differentiation
  • pre-mRNA binding
  • nitric oxide sensory activity
  • age-dependent behavioral decline
  • regulation, positive, negative of anterograde axon cargo transport and retrograde axon cargo transport
  • aggrephagy
  • germ cell proliferation
  • in progress - centrosome maturation - when, what, how (done - 4420) (2011)
  • defecation motor program (2011)
  • modifications to terms and definitions of cilium assembly and sensory cilium assembly (2011)
  • ciliary transition zone (2011)
  • regulation and pos/neg regulation of microtubule motor activity (2011)
  • nickel ion homeostasis and cellular nickel ion homeostasis (2011)
  • neurotransmitter receptor catabolic process (2011)
2010
  • regulation of defecation, positive and negative children (2010)
  • mitochondrial prohibitin complex (2010)
  • cilium terms (2010, updates/revisions to terms added in 2005)
  • octopamine/tyramine signaling involved in the response to food (and the regulation terms) (2010)
  • alpha-tubulin acetylation (2010)
  • phagosome maturation involved in apoptotic cell clearance (2010)
  • phagosome acidification involved in apoptotic cell clearance (2010)
  • phagolysosome assembly involved in apoptotic cell clearance (2010)
  • phagosome-lysosome docking involved in apoptotic cell clearance (2010)
  • phagosome-lysosome fusion involved in apoptotic cell clearance (2010)
  • neuropeptide receptor binding (2010)
  • striated muscle contraction involved in embryonic body morphogenesis (2010)
  • striated muscle myosin thick filament assembly (2010)
  • striated muscle paramyosin thick filament assembly (2010)
  • determination of left/right asymmetry in the nervous system (2010)
  • regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior (2010)
  • detoxification of arsenic (2010)
  • chondroitin sulfate proteoglycan binding (2010)
  • chondroitin sulfate binding (2010)
  • regulation (includes positive and negative regulation child terms) of nematode larval development (2010)
  • regulation of (includes positive and negative regulation terms) dauer larval development (2010)
2009
  • response to drug withdrawal (2009)
  • phosphatidylserine exposure on apoptotic cell surface (2009)
2008
  • regulation of synaptic vesicle priming (2008)
  • chloride-activated potassium channel activity (2008)
  • transdifferentiation (2008)
  • Regulation of ovulation terms (2008)
  • Process terms for gap junction proteins (2008)
  • piRNA and 21U-RNA terms (2008)
2007
  • dense body (sensu Nematoda) cellular component term (2007)
  • GO:0000775, GO:0000779, GO:0000780
  • D/V and A/P axon guidance terms (2007)
  • palmitoyl-CoA 9-desaturase activity (2007)
  • response to hyperoxia (2007)
  • Cuticle component terms (2007)
  • response to anoxia (2007)
2006
  • dynein light intermediate chain binding (2006)
  • Regulation terms for cell and nuclear division (2006)
  • Several child terms for apoptosis (2006)
2005
  • Cilium terms (2005)
2004
  • Intraflagellar transport particle-component terms (2004)
  • oogenesis (non-species specific term)(2004)
Modifications to the Ontology
  • Revised definition for muscle homeostasis (2010)
  • Added dense core vesicle synonym to dense core granule (2010)
  • Updated definition and moved parentage for intraflagellar transport (2009)
  • Added lethargus as synonym for sleep (2008)
  • Change to the definitions of the component terms: GO:0000775, GO:0000779, GO:0000780 which refer to the centromeres or chromosome, pericentric region (2007)
  • Change to parent of tail tip morphogenesis (sensu Nematoda) (2006)
  • GO:0046536, dosage compensation complex definition (2006)

Annotation Practices

Cellular Component Annotations

If a protein contains a transmembrane domain, but expression experiments are not at sufficient resolution to show membrane localization, what annotation should we make?

Example: WBPaper00036024


WormBase use of Column 16

Column 16 refers to a column in the Gene Ontology's (GO) tab-delimited gene association file (GAF) that WormBase submits to the GO consortium on a regular basis.

Column 16 has been referred to as the Annotation Extension column in that it provides a placeholder for curation details that cannot be captured by a GO term alone, for example the substrate upon which an enzyme acts.

A number of different types of information could conceivably be entered into Column 16. The list below begins to document the potential use of Column 16 by WormBase curators with any additional information or questions that have arisen during the course of curation.

In the GAF, there will be an explicit relationship between the entity in Column 16 and the GO term. The annotation extension relations are viewable here:

http://www.geneontology.org/scratch/xps/go_annotation_extension_relations.obo

Column 16 curation at WormBase is just beginning and will likely be fleshed out more fully over the next few months.

In the Ontology Annotator, Column 16 data is being entered into the 'Xref to' field in the following format: Column 16: Xref ID


Biological Process Examples:


Translational Regulation

Example 1: sup-26 is annotated to GO:0017148, negative regulation of translation. The entry in Column 16 is the target of that regulation, tra-2.

In OA entry: Column 16: WB:WBGene00006605

[Typedef]
id: has_regulation_target
name: has_regulation_target
def: "Identifies a gene or gene product affected by a regulation BP or regulator MF." [GOC:mah]
comment: probably want to add one or two new subtypes that capture something about directness
domain: GO:0065007 ! biological regulation
range: TEMP:0000003 ! gene or gene product
is_a: OBO_REL:has_participant


Defense Response

Example 1: lys-7 is required for defense response to Cryptococcus neoformans

In OA entry: Column 16: NCBI:192011 (a taxon ID)

Response to Terms

Example 1: daf-2 is shown to be involved in response to oxidative stress by treating animals with paraquat. WBPaper00005488

In OA entry: annotate to 'response to oxidative stress' using CHEBI:34905

Cell Fate Specification

Example 1: egl-38 is found to be required for cell fate specification in the male tail. WBPaper00002924

Could add a number of Anatomy Terms to Column 16 (not done yet).

Regulation of Protein Localization

Example 1: hmp-1 and jac-1;hmp-1 double mutants are shown to affect the distribution of HMR-1. WBPaper00005972

Added WBGene ID of HMR-1 to Column 16.


Molecular Function Examples:

Nucleic Acid Binding

Example 1: sup-26 is annotated to GO:0003730, mRNA 3'-UTR binding. The entry in Column 16 is the target of that binding, tra-2.

In OA entry: Column 16: WB:WBGene00006605
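
The step from an Ontology Annotator entry to a Column 16 value can be sketched as follows. This is a sketch under assumptions, not the WormBase export code: it assumes the GAF convention of writing an extension as relation(ID), which should be checked against the current GAF specification, and the function name and the identifier pattern are our own.

```python
import re

OA_PREFIX = "Column 16:"
# Loose pattern for a DB:accession style identifier (an assumption, not a spec).
XREF_PATTERN = re.compile(r"^[A-Za-z_]+:[A-Za-z0-9_]+$")


def oa_entry_to_extension(oa_entry: str, relation: str) -> str:
    """Turn an 'Xref to' entry such as 'Column 16: WB:WBGene00006605'
    into an annotation extension string of the form relation(ID)."""
    if not oa_entry.startswith(OA_PREFIX):
        raise ValueError(f"entry does not start with {OA_PREFIX!r}: {oa_entry!r}")
    xref = oa_entry[len(OA_PREFIX):].strip()
    if not XREF_PATTERN.match(xref):
        raise ValueError(f"not a DB:accession identifier: {xref!r}")
    return f"{relation}({xref})"


if __name__ == "__main__":
    # sup-26 negatively regulates translation of tra-2 (example from the text above).
    print(oa_entry_to_extension("Column 16: WB:WBGene00006605", "has_regulation_target"))
```

The relation must still be chosen by the curator from the annotation extension relations file; the code only assembles the string.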

Plans/Projects in progress

Changes to the GO data model
  • Add tags for accommodating data in WormBase that are already in the gene association file:
    • Qualifying an annotation with the qualifiers 'NOT' 'contributes_to' or 'colocalizes with'
    • Using the generic GO_REF tags for generic references, e.g., for a NOT annotation; need to add the proper database and accession syntax (need to add a field in the curation interface in OA).
    • 'With' or 'From', for the use of additional identifiers with certain evidence codes such as IPI, IGI, etc.
    • Annotation Extension, for containing cross references to other ontologies, one of:
      • DB:gene_id
      • DB:sequence_id
      • CHEBI:CHEBI_id
      • Cell Type Ontology:CL_id
      • GO:GO_id
    • Gene Product Form ID, a canonical entry for specific variants of gene products.
      • When the gene product form ID (column 17 of the gene association file) is filled with a protein identifier, the value in DB object type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers.
      • When the gene product form ID (column 17 of the gene association file) is filled with a functional RNA identifier, the DB object type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.
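
The two consistency rules for the Gene Product Form ID could be checked with something like the sketch below. It is an illustration only: the function name is hypothetical, the caller is assumed to know whether the column 17 identifier is a protein or a functional RNA (the sketch does not try to infer this from the identifier), and the accession in the example is a placeholder.

```python
from typing import List, Optional

# Allowed DB object types (column 12) for each kind of column 17 identifier.
ALLOWED_OBJECT_TYPES = {
    "protein": {"protein"},
    "functional_rna": {"ncRNA", "rRNA", "tRNA", "snRNA", "snoRNA"},
}


def check_gene_product_form(fields: List[str], form_kind: str) -> Optional[str]:
    """Return an error message if columns 12 and 17 disagree, else None.

    `fields` is one gene association row split on tabs (17 columns);
    `form_kind` is 'protein' or 'functional_rna', supplied by the caller.
    """
    if len(fields) < 17:
        return f"expected 17 columns, found {len(fields)}"
    object_type = fields[11]  # column 12, DB object type
    form_id = fields[16]      # column 17, gene product form ID
    if not form_id:
        return None  # nothing to check when column 17 is empty
    allowed = ALLOWED_OBJECT_TYPES[form_kind]
    if object_type not in allowed:
        return (
            f"column 17 holds a {form_kind} identifier ({form_id}) but "
            f"column 12 is {object_type!r}; expected one of {sorted(allowed)}"
        )
    return None


if __name__ == "__main__":
    row = [""] * 17
    row[11] = "gene"
    row[16] = "UniProtKB:P00000"  # placeholder accession
    print(check_gene_product_form(row, "protein"))
```

Running such a check before submission is in the spirit of pushing quality control upstream, as suggested in the Taxon Constraints note above.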
Changes to the GO_term model and updating the ontology in WormBase

Progress Report 2011

WormBase Gene Ontology Progress Report, December 2010

Staff

Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA.

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA.

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA.

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA.


Jolene Fernandes

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Gary Schindelman

Phenotype Curation, WormBase, Caltech, Pasadena, CA.


Ruihua Fang

Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA


Additional technical support:

Anthony Rogers

WormBase, Sanger Center, Hinxton, UK

Gary Williams

WormBase, Sanger Center, Hinxton, UK


Textpresso:

Hans Michael Muller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA


Annotation Progress

Table 1: Number of Genes Annotated

Type of Annotation Number of Genes Annotated, Dec 2010 % Change from Dec 2009 Number of Unique GO Terms Total Number of GO Terms
Manual Annotation 2,098 +19.8 1,840 10,467
Phenotype2GO Mappings 6309 -6.1 113 42,349
IEA/Electronic 12,954 +0.83 1,476 55,091
Total 15,799 +0.84 2,937 107,907


Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, contributing to ~90% of our total curation efforts. Curators use a GO curation check-out form that affords curators easy visual access to the curation status of all named C. elegans gene, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to that gene and the last date for which annotations to any of the three ontologies were made. Curators can query and sort the list according to reference count, gene name, and curation status.

Computational Methods

InterPro2GO mappings for IEA annotations: These annotations are annotations of C. elegans proteins to GO terms based on electronic matching of protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/), and their mapping to GO terms provided by the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). Note that the 'IEA' annotations are not reviewed for accuracy by human curators. As such, all of these annotations use the evidence code 'IEA'.

TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term, integral to membrane. About 6,710 gene products are annotated to the term 'integral to membrane' via this pipeline.

InterPro2GO and TMHMM2GO annotations are updated at every database release.

Semi-automated Methods

Review and improvement of the Phenotype2GO data pipeline

WormBase uses a well defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. A total of 201 phenotype terms used in annotation have been mapped to a GO term. These mappings are used to automatically generate Biological Process annotations to genes using the IMP evidence code, at every WormBase database build . The complete list of WormBase phenotype to GO term mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology. We have begun a detailed review of our phenotype to GO term mappings. We are in the process of making changes to this pipeline so that annotations are made with a stricter use of the IMP evidence code, as recently described in GO consortium annotation policies. This process will involve removing some high-level phenotype term to GO term mappings and/or removal of certain RNAi experiments/papers from being included in this pipeline and the review and changing of scripts.

Textpresso-Based Cellular Component Curation

As a complimentary approach to our manual curation pipeline, we continue to employ the Textpresso information retrieval system to annotate C. elegans gene products to the Cellular Component ontology. Textpresso searches for component annotations use three different categories (Assay Terms, Component Terms, and Verbs) plus a category of C. elegans protein names. Searches using these categories return sentences that contain a match to at least one term in each category. We use this approach to annotate newly published papers as well as papers published prior to 2010. For newly published papers, we prioritize our searches by first searching through papers that, as determined by a Support Vector Machine document classifier, have a relatively high probability of containing expression data.

Over the past year, through our manual and Textpresso-based pipelines, we added new cellular component annotations to 292 genes. Of these, 113 genes were annotated from papers published in 2010, with the remainder of the annotations coming from previously published papers. For the 2010 papers, Textpresso’s annotation recall was 91.3%. We have not yet measured the recall on papers annotated this year but published prior to 2010.

Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

Presentations and Publications

Publications, Talks, Posters 2010-

Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the following areas:

Biology of the cilium:

  • updates/revisions to terms added in 2005

Biology of the phagosome-lysosome during apoptotic cell clearance, terms added:

  • phagosome maturation involved in apoptotic cell clearance
  • phagosome acidification involved in apoptotic cell clearance
  • phagolysosome assembly involved in apoptotic cell clearance
  • phagosome-lysosome docking involved in apoptotic cell clearance
  • phagosome-lysosome fusion involved in apoptotic cell clearance

Biology of muscle, terms added:

  • striated muscle contraction involved in embryonic body morphogenesis
  • striated muscle myosin thick filament assembly
  • striated muscle paramyosin thick filament assembly (2010)
  • alpha-tubulin acetylation

Biology of nematode larval development, terms added:

  • regulation of nematode larval development (includes positive and negative regulation child terms)
  • regulation of dauer larval development (includes positive and negative regulation child terms)

Other terms added were:

  • neuropeptide receptor binding
  • determination of left/right asymmetry in the nervous system
  • regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior
  • detoxification of arsenic
  • chondroitin sulfate proteoglycan binding
  • chondroitin sulfate binding
  • octopamine/tyramine signaling involved in the response to food (and the regulation terms)

Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.


Other Highlights

Curation Tools: Ontology Annotator

We have continued development of our web-based curation tool, the Ontology Annotator, which can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of the more useful features of the tool include bulk annotation capabilities, autocomplete functions, data retrieval, and filtering of the retrieved data for editing purposes. We have improved the functionality of existing annotation interfaces and have added the following fully functional curation interfaces: antibody, small molecule and gene regulation. We are currently working on two new curation interfaces: gene regulation and expression pattern related pictures.

Textpresso- and HMM-Based Molecular Function Curation

In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function (MF) curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our approach has been to divide molecular function curation into two broad categories: 1) macromolecular interactions and 2) enzymatic and transporter activities.

Macromolecular Interactions

For the former, we employ a two-step curation pipeline. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched for matching sentences using Textpresso categories developed specifically for identifying sentences describing macromolecular interactions. Our initial results indicate that the SVM performs very well at predicting high-confidence true positive papers (precision=86.4%) as well as true negative papers (precision=88.9%). The Textpresso categories also have a high recall (89.7%) but a relatively low precision (47.3%). In practice, this means that while curators are able to retrieve and curate nearly all macromolecular interactions, they must still examine a number of false positive sentences, so curation efficiency needs to be improved. We are fine-tuning the Textpresso categories for macromolecular interactions to address this issue. The new Textpresso categories for this data type are scheduled to be in place at the end of December 2010.
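
To make the precision and recall figures above concrete, the following Python sketch shows how the two measures are computed and what a precision near 47% means for curator workload. The sentence counts are invented to give values close to those reported; they are not the actual evaluation data.

```python
def precision(true_pos, false_pos):
    """Fraction of returned sentences that are truly curatable."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of truly curatable sentences that were returned."""
    return true_pos / (true_pos + false_neg)


def sentences_per_true_positive(p):
    """Average number of returned sentences read per curatable sentence."""
    return 1.0 / p


# Invented counts, chosen only to give values near those in the text.
tp, fp, fn = 90, 100, 10
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")

# At the reported precision of 47.3%, a curator reads roughly two
# returned sentences for every one that can actually be curated.
print(f"sentences per hit = {sentences_per_true_positive(0.473):.2f}")
```

High recall with modest precision means little is missed, but about half of the curator's reading time is spent on sentences that are then discarded, which is the inefficiency the category fine-tuning aims to reduce.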

Enzymatic and Transporter Activities

In parallel to these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. For the former, we have collected 419 sentences from 64 papers. Using these sentences, we have developed two new Textpresso categories for enzymatic and transporter activities. As with the new macromolecular interaction categories, we plan to implement these categories in Textpresso by the end of December 2010.

In collaboration with Hans-Michael Mueller, we are also training a Hidden Markov Model to identify sentences describing enzymatic and transporter activities. At present, we are in the third round of training and evaluation for the model. We hope to complete an initial evaluation of the model by early next year and will report on its performance.

Textpresso-Based identification of literature for human disease gene orthologs

Human disease gene orthologs are a high-priority annotation list for WormBase and other model organism databases. Sequence-based or similarity searches for human disease gene orthologs provide useful candidates, but the biological information needs to be extracted manually from the literature. We are working on a project that combines Textpresso-based categories and keyword searches to identify C. elegans papers that describe the study of a human disease gene ortholog. Sentences in which a C. elegans gene co-occurs with a human disease term were deemed important for this process. A new 'human disease' Textpresso category was formed using the human disease ontology in the OBO Foundry, the Neuroscience Information Framework Standardized (NIFSTD) ontology, and the existing Textpresso human disease category. Several disease terms that would increase the number of false positives were removed iteratively by a manual process. Because an 'AND' query on just two categories (human disease and C. elegans gene) returned too many false positives, a third category was formed with the words 'ortholog', 'homolog', 'similar', 'relate' and 'model', and added to the 'AND' query. The query is being fine-tuned to increase precision and recall. Once established, it will automate the flagging of new articles that contain disease gene ortholog data. Subsequently, this method can be used for the extraction of relevant information.

Textpresso-Based Curation Pipelines for Other MODs

We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not captured by their existing GO curation pipeline. The results of these searches are being presented in a curation form that will allow TAIR curators to make new annotations and retrieve the annotations in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.

Collaboration with BioGRID

In late summer, we began a collaboration with BioGRID (Biological General Repository for Interaction Datasets) to add protein binding annotations (IPI evidence code) curated for GO into BioGRID. In addition, we plan to add curated genetic interactions (IGI evidence code) to BioGRID. Our initial work will focus on adding interactions curated as part of the Reference Genome's Wnt signaling pathway annotation project. Concurrently, we will also add any newly curated protein binding annotations (Wnt pathway or otherwise) to BioGRID.








