ISB2014 Session3 Functional Annotations
The Seventh International Biocuration Conference
Functional Annotations
April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
link to all ISB2014 notes: http://etherpad.wikimedia.org/p/isb2014

Editors
name / affiliation / twitter
Abigail Cabunoc / OICR / @abbycabs
Marc Perry / OICR / @mdperry

10:00 - 12:00 Session 3 - Functional Annotations, Chaired by Iddo Friedberg
The Great Hall, Hart House

announcements:

----------------------------------------------------------------------------------------------------------
10:00 - 10:15
A Robust Data-Driven Approach for Gene Ontology Annotation
Hong Yu
University of Massachusetts Medical School, USA

Abstract
The development of gene ontology (GO) has greatly benefited biologists for information seeking, but the annotation of GO is a very time-consuming and labor-intensive procedure, which has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. We present our work in this task as well as the experimental results and analysis after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning.
In both development and test sets, the RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g., SVM and logistic regression. For the GO term prediction subtask, we developed an information retrieval (IR) based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. In addition, the experimental analysis showed our approaches were robust in both tasks.

Notes
Text-mining
Biocuration & text mining
  >16 years (speaker experience)
a lot of annotation
  >35K GO terms
  >1M gene products
  >6M annotations
however: from text-mining perspective, not enough for us to use
  not detailed enough for us to learn
bridge the gap: rebuild the system to assist with biocuration
BioCreative - GO annotation task
  A. retrieve GO evidence sentences for relevant genes
  B. predict GO terms - retrieve evidence
build system: UMASS GO System
Challenges:
  35K GO terms available
  data provided to us - only 100 annotated full-text articles (training data)
    not enough training data
  50 articles - development data
  50 articles - test data
  try to build expert system that can be as accurate as a human
    building one is not possible - data too small
unsupervised and semi-supervised learning
  since annotated data not sufficient - use unlabelled data
  if annotated data has 'expression of clec-67 is controlled by ELT-2'
    use term - identify association with other terms
  unsupervised learning
    use particular matrix to do this task
Result:
  best performance - 36% F1 score
  recall highest - 66% (gene names) - not as robust
  highest precision - 39%
  may not be usable as it is
  but remember - only 100 annotated articles
  if we have more data we can substantially improve performance
Future: identify not just the sentence, but what GO terms are retrieved - much harder task
Task B)
  >35K GO terms - not possible to do supervised classification (not enough data)
    types of class - huge
  unsupervised - collaborative filtering (new approach)
    promising - recommender system (Amazon books)
    making automatic predictions about the interests of a user by collecting preferences from many users (collaborating)
  distribution of GO annotations - not linear - power-law distribution
    some GO terms heavily annotated to lots of gene products
    majority of GO terms not associated with many gene products
  Hierarchy - predict parent nodes - better luck doing this
Result: 21% F1 score - remarkable from a computational view
  if we increase # of annotated data, can substantially improve this
any text mining task/system - we recommend
  no machine can achieve 100% precision
  the technology might be there
  need more collaborations
  more annotated data
Future: Integrating existing larger GO annotations - work together!
  curating community and text-mining community

Q & A
Q: Has your team participated in the 1st BioCreative? Dataset much larger, but missing text that was evidence for annotation. Is there really an improvement with annotation provided - it's a lot of work! For this task - curators have to go through every sentence that supports this annotation. Not regular. Very time consuming.
A: Did not participate in the first. This is an investment. Do we want to invest something first so we can get something usable later? Will build publicly available system - hope people use it. We automatically learn when they use it. Mostly painless process. We would like to collaborate with the curators to improve the system.

----------------------------------------------------------------------------------------------------------
10:15 - 10:30
Enzyme Prediction and the Metabolic Reconstruction of Probiotic Bacteria
Cedoljub Bundalovic-Torma
The Hospital for Sick Children, Canada

Abstract
The gut-microbiome plays a crucial role in human health, and it has been shown that disruptions of this important niche underlie several prevalent diseases, such as Type-I Diabetes, Celiac Disease, and Inflammatory Bowel Disease. Having the potential to ameliorate disrupted gut-microbiomes, probiotic bacteria may hold promise for treating the growing incidence of these diseases. Unfortunately, the mechanisms and genetic determinants that lend probiotic bacteria towards therapeutic use are poorly understood at present but will be greatly aided by comparative genomic analyses. In order to do so, the reliable annotation of metabolic enzymes will be essential towards reconstructing and identifying the relevant metabolic pathways utilized by probiotics. We will present our efforts towards this goal through the development of a bioinformatics pipeline to computationally predict enzymes using an integration of a variety of sequence-based approaches, and its application toward the metabolic reconstruction of two commercially utilized probiotic bacterial species.
Aided by metabolic network models we will discuss initial results of genomic comparisons in identifying unique metabolic pathways in probiotics, as well as future work applying our findings through metabolic modeling.

Notes
Problems in enzyme prediction
Probiotic bacteria - known for a while
  benefit to host health when administered
NGS - wealth of novel bacterial genomes
  how do pathways differ
systems bio approach - genes not in isolation
  genes/proteins - diverse array of processes - variety of interactions
  highly interconnected
accuracy decreases when you use BLAST-based approach
challenge - novel sequence to annotate
  want to accurately predict metabolic
DETECT - enzyme prediction
  BLAST results - get a lot of different hits, typically seen in enzymes
  want to tell which one of these enzymatic functions we can assign to protein
  poster #13
  greater accuracy than BLAST-based approaches
    cross-validation of Swiss-Prot EC annotated proteins
draft sequence of B. subtilis R0179
  gone through some computational annotation
  re-predicting genes
  DETECT, PRIAM+BLAST
  generate list of likely enzymes in this genome, reactions they catalyze, metabolic pathways
  high confidence enzyme predictions
    different methods don't all overlap
  put enzymes into metabolic network reconstruction
  reference strain - curated metabolic model
  useful contribution: we can infer potentially missing reactions
    found in reference but not predicted
State of the model we have so far:
  missing enzymes in glycolysis - seems to be a missing step
  terpenoid backbone biosynthesis - significant portion predicted
Summary:
  need accurate functional annotation of genomes - crucial
  some enzyme categories cannot be predicted using BLAST
  probabilistic approach of DETECT can accurately assign enzymatic functions to genes
  combining approaches = can create reliable metabolic network
  reference metabolic pathways can help find missing reactions - pathway holes
Future:
  MetaCyc - tool to fill pathway holes
  construct metabolic model of B. subtilis R0179
  metabolic reconstruction of Lactobacillus helveticus R0052
http://compsysbio.org/projects/DETECT/

Q & A
Q: In the gut - that organism should live as part of a different cross-feeding community. Clue in how to factor in other organisms?
A: Pathways reconstruction - create a meta. There are uniquely contributed pathways from particular species where you have a shift, healthy distribution. Disease state - certain overrepresentation - metabolic byproducts implicated in autism and more. Much to be learned. Microbiome not tapped out.

----------------------------------------------------------------------------------------------------------
11:00 - 11:20
Expert curation in UniProtKB: a case study in dealing with conflicting and erroneous data
Sylvain Poux
SIB, Switzerland

Abstract
UniProtKB/Swiss-Prot provides expert curation with information extracted from literature and curator-evaluated computational analysis. As knowledgebases continue to play an increasingly important role in scientific research, a number of studies have evaluated their accuracy and revealed various errors. While some are curation errors, others are the result of incorrect information published in the scientific literature. By taking the example of a complex annotation case, we will describe the curation procedure of UniProtKB/Swiss-Prot and detail how we report conflicting information in the database. We will demonstrate the importance of collaboration between resources to ensure curation consistency and the value of contributions from the user community in helping to maintain error-free resources.

Notes
Address erroneous data in both literature and databases
Erroneous data
  databases
    study - misannotation level in 37 enzyme SF in protein
    erroneous annotation much higher in automatically annotated dbs
  The Economist - growing number of errors - alarming level, that science is not possible anymore
  Nature - could only reproduce 6/53 studies
  scientists from only 25% could be reproduced
e.g. SIRT5 - complex annotation case. How we deal with conflicting information
  member of Sir2 family (S. cerevisiae)
  most members - protein deacetylases
  in an initial report - SIRT5 - protein deacetylase activity in vitro
  later, regulator of urea cycle, activates Cps1 enzyme in vivo
  Sirt5 deacetylates Cps1 in vitro
  did not show that activation of Cps1 is due to deacetylation
  3 years ago - 3 different groups
  new activity: crystal structure
    pocket to host acetyl group much larger in SIRT5
    large enough for acyl group
  instead acts as protein deacylase
    protein lysine demalonylase and desuccinylase
  lysine malonylation and succinylation - 2 new PTMs
  malonylated / succinylated in vivo - including Cps1
  in Sirt5 knockout mice - Cps1 not activated
Summary:
  1. SIRT5 - NOT a protein deacetylase in vivo
  acts as protein deacylase
  specifically removes malonyl and succinyl groups
how to represent in databases?
Annotation:
  describe activity, structure, all with evidence
  annotate 3D structure
  info added in protein entries of appropriate species
    knockout experiments from mouse
    bovine - malonylation and succinylation
  when info known, it is indicated
  'caution' tag - "does not exist in vivo"
  protein deacylase is confirmed - removes medium and long fatty acid chains
information must be propagated with care
  a family rule was created - automatic annotation
  >4K UniProt entries
collaborate with a number of databases/resources
  GO consortium: request new GO terms when required
    added NOT qualifier - gene product not associated with GO term while it may be expected by automatic methods or previous literature
  PTMs - controlled vocab in collab with RESID
  new catalytic reactions submitted to IUBMB
Summary:
  curation of 3 papers according to standard UniProt workflow:
    reannotation of SIRT5 protein
    addition of PTM sites in 28 protein entries
    new vocab: 4 GO terms, 2 PTMs, 1 catalytic reaction
  UniProt entries regularly updated as new info comes
  only relevant information
    publications read fully and in detail - curated
    other publications - additional bibliography - UniProtKB
  85 papers in PubMed - only 26 used for annotation
    most is taken from 3 papers
    other 59 - not relevant, redundant, or weaker evidence
  a number of papers still published based on in vitro / indirect analysis
    a lot of confusion
Possible Solution:
  how to detect erroneous info in publications
    number of retractions - very small
    reading a range of publications in an area ensures both efficiency and critical analysis of data
    hard when there is no publication contradicting it
  => Nature - 'Enhancing reproducibility'
    guidelines to authors - standards
  Collaboration between resources - essential to limit the number of errors in databases and their propagation

Q & A
Q: There are lots of ways to automatically detect conflict in information. Which fact is the true fact in the literature - that can help you? Software system openly available publicly.
A: We can discuss later, thank you.
Q: Have to dig into annotations - make warnings more conspicuous. Not obvious from the SIRT5 UniProt page that it does not have deacetylase activity.
A: Difficult. Indicated in text. Impacted by users - they dislike it when we criticize their work. Careful when retracting evidence. Still published. We put appropriate number, names, highlight activity, but then as long as there are papers being published we have to display that. Careful.
Q: On a number of high profile papers - showing it's wrong in social media (Retraction Watch).
A: We don't actively monitor social media. Maybe we should look at that.
Q: It seems like biocurators are taking up the role of post-publication review. How much of that do you think you should be doing? Peer review is not perfect. How active do you think you should be?
A: We have to give a complete overview. Important to give information to user. Many conflicting data - sometimes minor, other times major. Responsibility to give information to users. Of course, sometimes there is a mistake, then we correct it.
----------------------------------------------------------------------------------------------------------
11:20 - 11:40
Effective automated classification using ontology-based annotation: experience with analysis of adverse event reports
Melanie Courtot
BCCRC, Canada

Abstract
Analysis of reports of adverse events following immunization is an important way to identify potential problems in vaccine safety and efficacy. However, reports are not annotated in a consistent manner, resulting in tedious, costly and time consuming analysis. In order to address these deficiencies, the Brighton Collaboration has done extensive work towards standardization of case definitions and diagnostic criteria for vaccine adverse events. Within the framework of the OBO Foundry, we built the Adverse Events Reporting Ontology (AERO), a logical formalization of the Brighton guidelines to enable consistent and accurate annotation of adverse event reports. We applied it and validated our results against a manually curated dataset from the Vaccine Adverse Event Reporting System. AERO allows users to unambiguously refer to a specific set of carefully defined signs and symptoms at the time of data entry, as well as an overall diagnosis that remains linked to its associated signs and symptoms. The adverse event diagnosis is formally expressed, making it amenable to further querying. We show that it is possible to automate the classification of adverse events using the AERO with very high specificity (97%). We have demonstrated that AERO can be used with other types of guidelines, and our pipeline relies on open and widely used data standards (RDF, OWL, SPARQL) for implementation, making our system easily transposable to other domains. Within the Integrated Rapid Infectious Disease Analysis project (IRIDA; http://www.irida.ca), we have also initiated development of ontologies for infectious disease outbreak analysis. Our aim is to not only transform infectious disease control, but also illustrate the general benefit of ontology-based annotation to automate classification, which is key to many biocuration efforts.

Slides
http://www.slideshare.net/mcourtot/biocuration-2014

Notes
Next week: Post doc at SFU
Why do we care about adverse events following immunization
  issues with vaccine
  shown that the better information we can communicate on immunization - the more successful the campaign
if we look at how assessed:
  manually assessed by expert (expensive)
  time consuming
Hypothesis: possible to encode clinical guideline - use OWL ontology
  apply to existing datasets - automated classification of adverse effects
Test case - VAERS dataset
  Vaccine Adverse Event Reporting System
  took a few months for experts to manually curate
  manually classified for anaphylaxis
  MedDRA - Medical Dictionary for Regulatory Activities - clinical findings
  free text - clinician entered data
  MedDRA encoded structured data - keywords that represent the free text part
Diagnosis workflow
  A) adverse event reporting
  B) collected dataset - MySQL database
  C) annotations collected (MySQL)
  Put these three pieces together -> run Python pipeline (reasoner)
    OWL/RDF export
    can ask questions
  D) add manually curated dataset
specificity really high - 97%. Fair sensitivity - 57%
  adverse events are rare - we want to be able to detect
  Brighton reports - complex symptoms - miss positive cases
went back to MedDRA - great for screening. Try to increase sensitivity.
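For readers less familiar with the two figures quoted above, here is a minimal sketch of how sensitivity and specificity are computed from a set of automated classifications compared against a manually curated gold standard. The function name and the toy labels are made up for illustration; they are not the speaker's pipeline or VAERS data.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for two equal-length lists of booleans.

    predicted: automated classification (True = flagged as adverse event)
    actual:    manually curated label   (True = confirmed adverse event)
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy labels, invented for illustration only.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, False, False, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(predicted, actual)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

High specificity with low sensitivity, as reported in the notes, means few false alarms but many missed positive cases, which is why it matters for a screening use case.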
take positive and negative reports
find elements significantly associated to outcome
add to existing query
Cosine similarity method
  vector of terms (documents - query and report)
  compare cosine measure of the angle they form
Results:
  Sensitivity 92%
  Specificity 87%
  trying not to miss cases
Discussion:
  using ontology on its own, sensitivity is too low for screening (b/c Brighton guideline is for diagnosis, not screening)
  improve sensitivity with MedDRA
Outcomes:
  encoding standards don't allow for complete representation of events
    chronic?
  important information lacking in reports from surveillance systems
    missing rash - not assessed? not recorded? negative?
  logical translation of guidelines allows for better detection of inconsistencies and errors
    working with Brighton to add logical formalization
Using ontology in future reporting
  fast screening - potentially positive. Send back form, get more information. Based on Brighton guideline.
  future system - implementation of ontology based system at time of data entry
    labels and textual definitions for each term
    consistency checking
Next steps: IRIDA project
  Integrated Rapid Infectious Disease Analysis
  irida.ca
  bioinformatics platform for genomic epidemiology analysis
  academic and public groups together
  ontologies developed to annotate clinical, lab and epidemiology data - integrated for analysis
Poster 15

----------------------------------------------------------------------------------------------------------
11:40 - 12:00
Supervised text mining for functional curation of gene products: how the data and performances have evolved in the last ten years
Julien Gobeill
SIB, Switzerland

Abstract
We describe how text mining can support the literature-based curation tasks for functional proteomics using Gene Ontology (GO) concepts. In bioinformatics, this task is often considered both as time-consuming for humans and not performed with sufficient quality by computers. We will report on our participation in the BioCreative I and IV challenges, and on the progress regarding the data and systems' performances during the decade. In 2004, the first BioCreative already designed a task of automatic GO concepts assignment from articles. At this time, we led the way thanks to a dictionary-based system. Yet, results were judged far from reaching the performances required by real curation workflows. Dictionary-based approaches compute similarities between the input text and the GO terms. These approaches are essentially limited by the complex nature of the GO: identifying GO concepts in text is highly challenging, as they often do not appear literally or even approximately in text. On the other hand, supervised approaches compute similarities between an input text and already curated instances. In 2004, such data-driven approaches were outperformed by dictionary-based systems, due to a lack of training data. Since then, the availability of curation data has massively improved. In 2013, the BioCreative IV Track 4 revisited the automatic GO assignment task. We demonstrate how our supervised classifier GOCat achieved top competitive results, approaching human curators. GOCat learns from 100,000 curated abstracts contained in GOA, in order to deliver accurate but also highly integrable GO concepts. Moreover, the quality of the GOCat predictions continues to improve over time, thanks to the growth of GOA: since 2006, GOCat performances have improved by 50%. Supervised text mining has now reached maturity and is ready for being used in operational semi-automatic curation workflows, or in fully automatic annotation settings for high-throughput databases. GOCat is available at eagl.unige.ch/GOCat.
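The supervised idea in this abstract (compare an input text with already curated instances, then reuse their GO codes) can be illustrated with a small nearest-neighbour sketch. This is not GOCat's implementation: the bag-of-words cosine similarity, the similarity-weighted vote, the function names and the three toy "curated abstracts" are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_go_terms(query, curated, k=2):
    """Rank GO codes by summed similarity over the k most similar curated texts."""
    q = Counter(tokenize(query))
    scored = sorted(
        ((cosine(q, Counter(tokenize(text))), terms) for text, terms in curated),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, terms in scored[:k]:
        if sim == 0:
            continue  # ignore neighbours that share no words with the query
        for term in terms:
            votes[term] += sim
    return [term for term, _ in votes.most_common()]


# Toy knowledge base, invented for illustration (not taken from GOA).
curated = [
    ("caspase activation triggers apoptosis in neurons", ["GO:0006915"]),
    ("glucose is converted to pyruvate during glycolysis", ["GO:0006096"]),
    ("ribosome mediated translation of mrna", ["GO:0006412"]),
]

result = predict_go_terms("caspase dependent apoptosis in cortical neurons", curated)
print(result)
```

In a scheme like this, predictions can only be GO codes that already occur in the curated set, which is consistent with the abstract's point that such approaches depend on how much curated data is available.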
Notes
Working on:
  algorithms
  input: free text (abstract)
  output: GO terms
10 year perspective: BioCreative 2004, BioCreative 2013
  evolution of performances across this time
  GO has grown - adding synonyms in GO
    yet dictionary-based (DB) effectiveness is constant
  GOA - PMID annotations - growing
Background:
  text miners - not biocurators
  data deluge issue - curated data behind knowledge in literature
  classical categorization task: how to predict GO terms from text
state of the art approach:
  1. dictionary based approach
    similarity between input text and GO terms - needs synonyms
    massive effort was put into synonyms (2008 esp)
  2. machine learning approach
    tries to use knowledge base, to infer more likely GO terms
    100K abstracts from MEDLINE in GOA
    4K GO codes w/ at least 10 examples (80% of all annotations in GOA)
GOCat - machine learning system
  query - abstract/free text
  retrieve k most similar abstracts in GOA
  infer most prevalent GO codes
  the more similar the documents, the more likely they share GO codes
Study: both approaches across time
  dictionary based - EAGL
  machine learning - GOCat
  Predict GO terms from just published abstract - simulation over time
    restore data as of 2008
    look at how well engines did predict GO terms
    gold standard: GOA
  benchmark 1K abstracts each year (3K GO terms - 2.8 GO terms/abstract)
Result:
  Dictionary based - constant
  Machine learning - improvement
    for knowledge base - bigger is better
  In 2004 - DB rules
    poor performance
  2013 - ML rules
  protein and free text integrated
GOCat Interface
  http://eagl.unige.ch/GOCat/
Conclusions:
  critical mass is reached for machine learning - but still place for combination
  ML works well for 80% of annotations. But not for more rare GO terms
  curators: let's use available data
  GOCat - good for semi-automatic curation workflows

Q & A
Q: Comment - good to see that the ML approach has improved as more data becomes available. After we ran the challenge we did some internal agreement. Exact match annotation - 50%. Over 60% if we relax that. Still room for text-mining community.