WormBase Gene Ontology Progress Report, December 2011

Staff

Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA.

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA.

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA.

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA.

Ruihua Fang

Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA

Additional technical support:

Gary Williams

WormBase, Sanger Center, Hinxton, UK

Textpresso:

Hans Michael Muller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA

Yuling Li

Developer, Textpresso, Caltech, Pasadena, CA

Annotation Progress

Table 1: Number of Genes Annotated

Type of Annotation	Number of Genes Annotated, Dec 2011	% Change from Dec 2010	Number of Unique GO Terms	% Change from Dec 2010	Total Number of GO Terms	% Change from Dec 2010
Manual Annotation	2,366	+12.8	2,167	+6.6	12,239	+9.4
Phenotype2GO Mappings	6,229	-1.3	115	+1.8	42,986	+1.5
IEA/Electronic	13,597	+5.0	1,548	+4.9	55,091	-6.7
Total	16,269	+3.0	3,262	+11.1	106,604	-1.2

Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of curator effort. Curators use a GO curation check-out form that gives them easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the date on which annotations to any of the three ontologies were last made. Curators can query and sort the list according to reference count, gene name, and curation status.

Computational Methods

InterPro2GO mappings for IEA annotations: This method provides annotations of C. elegans proteins to GO terms based on computational identification of protein motifs or domains contained within the InterPro database (http://www.ebi.ac.uk/interpro/), and their mapping to GO terms provided by the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). IEA annotations are spot-checked by curators for accuracy (see below).

TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term 'integral to membrane'. About 6,710 gene products are annotated to this term via the pipeline.

InterPro2GO and TMHMM2GO annotations are updated with every database release.
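The two IEA pipelines described above can be sketched as follows. This is a minimal illustration, not the WormBase implementation: the function names, the protein identifiers, and the rule "one or more predicted helices" are assumptions, and the regular expression assumes mapping lines of the form `InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677`.

```python
import re

# Assumed line layout of the InterPro2GO mapping file:
#   InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677
MAPPING_LINE = re.compile(r"^InterPro:(IPR\d+)\s.*>\s*GO:.*;\s*(GO:\d{7})\s*$")

# GO Cellular Component term 'integral to membrane' used by the TMHMM rule.
INTEGRAL_TO_MEMBRANE = "GO:0016021"


def parse_interpro2go(lines):
    """Return {InterPro accession: set of GO IDs}, skipping '!' comment lines."""
    mapping = {}
    for line in lines:
        if line.startswith("!"):
            continue
        match = MAPPING_LINE.match(line)
        if match:
            mapping.setdefault(match.group(1), set()).add(match.group(2))
    return mapping


def iea_annotations(protein, interpro_hits, tm_helix_count, mapping):
    """Collect sorted (protein, GO ID, evidence code) tuples for one protein.

    Assumption: any protein with at least one predicted transmembrane
    helix receives the 'integral to membrane' annotation.
    """
    go_ids = set()
    for accession in interpro_hits:
        go_ids |= mapping.get(accession, set())
    if tm_helix_count > 0:
        go_ids.add(INTEGRAL_TO_MEMBRANE)
    return sorted((protein, go_id, "IEA") for go_id in go_ids)
```

Because both mappings are pure lookups, rerunning them at every database release, as described above, simply regenerates the full IEA set from the current protein predictions.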

Semi-automated Methods

Phenotype2GO Data Pipeline

WormBase uses a well-defined phenotype ontology to annotate variation- and RNAi-based phenotypes. A total of 201 phenotype terms used in annotation have been mapped to a GO term. At every WormBase database build, these mappings are used to automatically generate Biological Process annotations to genes using the IMP evidence code. The complete list of WormBase phenotype to GO term mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology. Mappings are typically made to high-level GO terms and have generally been used to annotate large-scale experiments.
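A minimal sketch of how such a mapping table could drive annotation generation at build time is shown below. The identifiers in `PHENOTYPE_TO_GO` are placeholders rather than real mappings, and the function name and output fields are assumptions made for illustration.

```python
# Placeholder identifiers for illustration only; they are not real mappings.
PHENOTYPE_TO_GO = {
    "WBPhenotype:EXAMPLE1": "GO:EXAMPLE_A",
    "WBPhenotype:EXAMPLE2": "GO:EXAMPLE_B",
}


def phenotype2go(gene_phenotypes, mapping=PHENOTYPE_TO_GO):
    """Turn (gene, phenotype) observations into Biological Process annotations.

    Unmapped phenotypes are skipped; duplicate gene/GO pairs are emitted once.
    """
    seen = set()
    annotations = []
    for gene, phenotype in gene_phenotypes:
        go_id = mapping.get(phenotype)
        if go_id is None or (gene, go_id) in seen:
            continue
        seen.add((gene, go_id))
        annotations.append(
            {"gene": gene, "go_id": go_id, "aspect": "P", "evidence": "IMP"}
        )
    return annotations
```

Keeping the mapping in a single table means that adding or retiring a phenotype-to-GO mapping changes the generated annotations at the next build without touching the phenotype data.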

Textpresso-Based Curation, Support Vector Machines (SVMs) for Document Classification

Cellular Component Curation

As a complementary approach to our manual curation pipeline, we use the Textpresso information retrieval system to annotate C. elegans gene products to both the Cellular Component and Molecular Function ontologies. Textpresso searches for component annotations use three different categories (Assay Terms, Component Terms, and Verbs) plus a category of C. elegans protein names. Searches using these categories return sentences that contain a match to at least one term in each category. We use this approach primarily to annotate newly published papers. New papers are first classified by an SVM algorithm as containing expression data or not, and positive papers are then subjected to Textpresso searches. The SVM acts as an effective filtering step that helps to remove false positive papers from the Textpresso searches.
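The "at least one term in each category" rule can be sketched as below. This is not Textpresso code: the category contents are hypothetical and heavily abbreviated, and the simple tokenizer is an assumption made for illustration.

```python
import re

# Hypothetical, heavily abbreviated category contents.
CATEGORIES = {
    "assay": {"gfp", "antibody", "immunostaining"},
    "component": {"nucleus", "mitochondria", "membrane"},
    "verb": {"localizes", "expressed", "detected"},
    "protein": {"vha-6", "egl-9"},
}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens, keeping hyphens."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower()))


def matching_sentences(sentences, categories=CATEGORIES):
    """Return sentences that hit at least one term in every category."""
    hits = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if all(tokens & terms for terms in categories.values()):
            hits.append(sentence)
    return hits
```

Requiring a hit in every category trades some recall for precision: a sentence naming a protein and a component but no assay term is not returned.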

Molecular Function Curation

In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our initial approach has been to focus on two areas of molecular function curation: 1) macromolecular interactions and 2) enzymatic and transporter activities.

Macromolecular Interactions

Similar to our approach to Cellular Component curation, we employ a two-step curation pipeline for annotating macromolecular interactions. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched for matching sentences using Textpresso categories developed specifically for identifying sentences describing macromolecular interactions. Our initial results indicate that the SVM identifies high-confidence positive papers with a precision of 72% and a recall of 89%. The Textpresso categories also have a high recall (89.7%) but a relatively low precision (47.3%). In practice, this means that curators can retrieve and curate nearly all macromolecular interactions but must still examine a substantial number of false positive sentences, which limits curation efficiency. We are fine-tuning the Textpresso categories for macromolecular interactions to reduce the number of false positives.
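To make the efficiency cost of low precision concrete, the short sketch below derives two quantities from the reported figures. The F1 score and the "items reviewed per true positive" figure are not reported in the source; they are computed here only as an illustration.

```python
def precision(true_pos, false_pos):
    """Fraction of returned items that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of correct items that were returned."""
    return true_pos / (true_pos + false_neg)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def reviewed_per_true_positive(p):
    """Average number of returned items a curator reads per correct one."""
    return 1 / p


# Figures reported above, as fractions.
svm_f1 = f1(0.72, 0.89)                                 # about 0.80
textpresso_f1 = f1(0.473, 0.897)                        # about 0.62
sentences_per_hit = reviewed_per_true_positive(0.473)   # about 2.1
```

At 47.3% precision a curator reads roughly two sentences for every curatable one, which is the efficiency gap the category tuning aims to close.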

Enzymatic and Transporter Activities

In parallel with these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. To build the Textpresso categories, we collected 419 sentences from 64 papers. Using these sentences, we have developed two new Textpresso categories for enzymatic and transporter activities. These categories are currently available for use with the C. elegans Textpresso implementation.

In collaboration with Hans-Michael Mueller, we have also developed a hidden Markov model (HMM) to identify sentences describing enzymatic and transporter activities. Using this HMM to scan the entire C. elegans corpus, we find that it identifies potentially curatable papers with a recall of 76.6% (n=111 papers). Using high-scoring sentences from these papers, curators are able to make 75 GO annotations (recall = 49%). We are currently evaluating the precision (i.e., granularity) of the HMM-derived annotations compared to fully manual curation, determining what additional information is needed to increase the number of annotations that can be made, and assessing how best to incorporate this HMM into a curation pipeline.

Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

Presentations and Publications

Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the areas of:

protein binding
regulation of neuron migration
basement membrane assembly involved in embryonic body morphogenesis
parentage of dauer larval development - also include dormancy process
regulation of ATP biosynthetic process
regulation, positive, negative of dipeptide transport
regulation of phospholipid transport
regulation, positive, negative of endocytic recycling
suggested change to InterPro2GO mapping for GoLoco motif
GABAergic neuron differentiation
pre-mRNA binding
nitric oxide sensory activity
age-dependent behavioral decline
regulation, positive, negative of anterograde axon cargo transport and retrograde axon cargo transport
aggrephagy
germ cell proliferation
in progress - centrosome maturation - when, what, how
defecation motor program
modifications to terms and definitions of cilium assembly and sensory cilium assembly
ciliary transition zone
regulation and pos/neg regulation of microtubule motor activity
nickel ion homeostasis and cellular nickel ion homeostasis
neurotransmitter receptor catabolic process

Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.

Other Highlights

Curation Tools: Ontology Annotator

We have continued development of our web-based curation tool, the Ontology Annotator, which can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of the more useful features of the tool include bulk annotation, autocomplete, data retrieval, and filtering of the retrieved data for editing. We have improved the functionality of existing annotation interfaces and have added the following curation interfaces that are fully functional: antibody, small molecule and gene regulation. We are currently working on two new curation interfaces: gene regulation and expression pattern related pictures.

Textpresso-Based identification of literature for human disease gene orthologs

Human disease gene orthologs are a high priority annotation list for WormBase and other model organism databases. Sequence-based or similarity searches for human disease gene orthologs provide useful candidates, but the biological information needs to be extracted manually from the literature. We are working on a project that combines Textpresso-based categories and keyword searches to identify C. elegans papers that describe the study of a human disease gene ortholog. Sentences in which a C. elegans gene co-occurs with a human disease term were deemed important for this process. A new 'human disease' Textpresso category was formed using the human disease ontology in the OBO Foundry, the Neuroscience Information Framework Standardized (NIFSTD) ontology and the existing Textpresso human disease category. Several disease terms that would increase the number of false positives were removed iteratively by a manual process. Because an 'AND' query on just two categories, human disease and C. elegans gene, returned too many false positives, we formed a third category containing the words 'ortholog', 'homolog', 'similar', 'relate' and 'model' and added it to the 'AND' query. The query is being fine-tuned to increase precision and recall. Once established, it will automate the flagging of new articles that contain disease gene ortholog data. Subsequently, this method can be used for the extraction of relevant information.

Textpresso-Based Curation Pipelines for Other MODs

We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not captured by their existing GO curation pipeline. The results of these searches are being presented in a curation form that will allow TAIR curators to make new annotations and retrieve them in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.
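As an illustration of the final export step, the sketch below serializes curated annotations as gene_association rows. It assumes the 17-column, tab-separated GAF 2.0 layout; the field names in `GAF_COLUMNS` and both function names are hypothetical, and this is not the curation form's actual code.

```python
# Column order assumed to follow the 17-column GAF 2.0 layout.
GAF_COLUMNS = (
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
)


def to_gaf_line(annotation):
    """Serialize one annotation dict as a tab-separated GAF row.

    Columns missing from the dict are written as empty fields.
    """
    return "\t".join(str(annotation.get(column, "")) for column in GAF_COLUMNS)


def write_gaf(annotations):
    """Return the file body: version header followed by one row per annotation."""
    lines = ["!gaf-version: 2.0"]
    lines.extend(to_gaf_line(a) for a in annotations)
    return "\n".join(lines) + "\n"
```

Driving the output from a single column list keeps the exporter independent of any one database, which is the property a generalized pipeline for other groups would need.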

Collaboration with BioGRID

In late summer, we began a collaboration with BioGRID (Biological General Repository for Interaction Datasets) to add protein binding annotations (IPI evidence code) curated for GO into BioGRID. We also plan to add curated genetic interactions (IGI evidence code) to BioGRID. Our initial work will focus on adding interactions curated as part of the Reference Genome's Wnt signaling pathway annotation project. Concurrently, we will add any newly curated protein binding annotations (Wnt pathway or otherwise) to BioGRID.
