ISB2014 Session3 Functional Annotations

From WormBaseWiki
The Seventh International Biocuration Conference
Functional Annotations
 April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
link to all ISB2014 notes:
    http://etherpad.wikimedia.org/p/isb2014
Help me take collaborative notes on this session! Fill out your name in the box in the top right-hand corner. Add yourself to the editors list, then edit away!
I make a lot of typos. Sorry.
Editors

    name / affiliation / twitter

    Abigail Cabunoc / OICR / @abbycabs

    Marc Perry / OICR / @mdperry


10:00 - 12:00
Session 3 - Functional Annotations, Chaired by Iddo Friedberg
The Great Hall, Hart House
announcements:

----------------------------------------------------------------------------------------------------------
10:00 - 10:15
A Robust Data-Driven Approach for Gene Ontology Annotation 
Hong Yu
University of Massachusetts Medical School, USA
Abstract
The development of gene ontology (GO) has greatly benefited biologists for information seeking, but the annotation of GO is a very time-consuming and labor-intensive procedure, which has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. We present our work in this task as well as the experimental results and analysis after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using the reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, the RDE-based method achieved over 20% relative improvement in F1 and AUC performance against classical supervised learning methods, e.g., SVM and logistic regression. For the GO term prediction subtask, we developed an information retrieval (IR) based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. In addition, the experimental analysis showed our approaches were robust in both tasks.
Notes
Text-mining
Biocuration & text mining

    >16 years (speaker experience)

    a lot of annotation

    >35K GO terms

    >1M gene products

    >6M annotations

    however: from text-mining perspective, not enough for us to use

    not detailed enough for us to learn

    bridge the gap: build a system to assist with biocuration

    BioCreative - GO annotation task

    A. retrieve GO evidence sentences for relevant genes

    B. predict GO terms - retrieve evidence

    build system : UMASS GO System

    Challenges:

    35K GO terms available - data provided to us - only 100 annotated full-text articles (training data)

    not enough training data

    50 articles development data

    50 articles - test data

    try to build expert system that can be as accurate as human building

    not possible - data too small

    unsupervised and semi-supervised learning

    since annotated data not sufficient - use unlabelled data

    if annotated data has 'expression of clec-67 is controlled by ELT-2'

    use term - identify association with other terms

    unsupervised learning

    use particular matrix to do this task

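The association-learning idea above (learning which terms go together from unlabeled sentences) can be sketched with a toy co-occurrence count; this is only an illustrative sketch, not the RDE method described in the talk:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_matrix(sentences):
    """Count how often pairs of terms appear in the same sentence.

    Toy sketch of learning term associations from unlabeled text;
    the actual RDE method from the talk is more sophisticated.
    """
    counts = defaultdict(int)
    for sentence in sentences:
        terms = set(sentence.lower().split())
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return counts

# Illustrative sentences based on the speaker's example
sentences = [
    "expression of clec-67 is controlled by ELT-2",
    "ELT-2 regulates expression in the intestine",
]
m = cooccurrence_matrix(sentences)
```

Terms that repeatedly co-occur (here "elt-2" and "expression") become candidate features even when no labeled annotation links them.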
    Result: best performance-  36% F1 score

    recall highest - 66% (gene names) - not as robust

    highest precision- 39% 

    may not be usable as it is

    but remember- only 100 annotated data

    if we have more data we can substantially improve performance

    Future: 

    identify not just the sentence, but which GO terms are retrieved - a much harder task

    Task B)

    >35K GO terms - not possible to do supervised classification ( not enough data)

    types of class - huge

    unsupervised - collaborative filtering (new approach)

    promising - like recommender systems (e.g., Amazon book recommendations)

    making automatic predictions about the interests of a user by collecting preferences from many users (collaborating)

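The collaborative-filtering idea can be sketched as recommending GO terms to a gene from genes with overlapping annotations; the data and scoring below are illustrative, not the actual BioCreative system:

```python
def recommend_go_terms(target_gene, annotations, top_n=2):
    """Recommend GO terms for a gene from genes that share annotations,
    in the spirit of collaborative filtering (gene = user, term = item).

    `annotations` maps gene -> set of GO term ids; toy data, not the
    system presented in the talk.
    """
    target = annotations[target_gene]
    scores = {}
    for gene, terms in annotations.items():
        if gene == target_gene:
            continue
        overlap = len(target & terms)  # shared annotations = similarity
        for term in terms - target:    # only recommend unseen terms
            scores[term] = scores.get(term, 0) + overlap
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

annotations = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:1", "GO:2", "GO:3"},
    "geneC": {"GO:2", "GO:4"},
    "geneD": {"GO:5"},
}
recs = recommend_go_terms("geneA", annotations)
```

geneB shares two annotations with geneA, so its extra term GO:3 ranks first.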
    distribution of GO annotations - not linear - power-law distribution

    some GO terms heavily annotated to lots of gene products

    majority of GO terms not associated with many gene products

    Hierarchy:

    predict parent nodes - better luck doing this

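Backing off to parent nodes can be sketched by expanding a predicted term with its ancestors in a toy hierarchy (hypothetical term ids, not the real GO structure):

```python
def ancestors(term, parents):
    """All ancestors of a term via a child -> parents map (toy DAG)."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        for p in parents.get(t, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Hypothetical three-level hierarchy
parents = {"GO:leaf": ["GO:mid"], "GO:mid": ["GO:root"]}

# Hierarchical scoring: a prediction also gets credit for its ancestors,
# which is why predicting parent nodes has "better luck"
expanded = {"GO:leaf"} | ancestors("GO:leaf", parents)
```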
    Result: 21% F1 score - remarkable from a computational view

    if we increase # of annotated data, can substantially improve this

    any text mining task/system - we recommend:

    no machines can achieve 100% precision

    the technology might be there

    need more collaborations

    more annotated data

    Future: Integrating existing larger GO annotations -

    work together! curating community and text-mining community

Q & A
Q: Has your team participated in the 1st BioCreative? Dataset much larger, but missing text that was evidence for annotation. Is there really an improvement with annotation provided - it's a lot of work! For this task - curators have to go through every sentence that supports this annotation. Not regular. Very time consuming.
A: did not participate in the first. This is an investment. Do we want to invest something first so we can get something usable later. Will build publicly available system - hope people use it. We automatically learn when they use it. Mostly painless process. We would like to collaborate with the curators to improve the system.
----------------------------------------------------------------------------------------------------------
10:15 - 10:30
Enzyme Prediction and the Metabolic Reconstruction of Probiotic Bacteria
Cedoljub Bundalovic-Torma
The Hospital for Sick Children, Canada
Abstract
The gut microbiome plays a crucial role in human health, and it has been shown that disruptions of this important niche underlie several prevalent diseases, such as Type-I Diabetes, Celiac Disease, and Inflammatory Bowel Disease. Having the potential to ameliorate disrupted gut microbiomes, probiotic bacteria may hold promise for treating the growing incidence of these diseases. Unfortunately, the mechanisms and genetic determinants that lend probiotic bacteria towards therapeutic use are poorly understood at present but will be greatly aided by comparative genomic analyses. In order to do so, the reliable annotation of metabolic enzymes will be essential towards reconstructing and identifying the relevant metabolic pathways utilized by probiotics. We will present our efforts towards this goal through the development of a bioinformatics pipeline to computationally predict enzymes using an integration of a variety of sequence-based approaches, and the application of this pipeline toward the metabolic reconstruction of two commercially utilized probiotic bacterial species. Aided by metabolic network models, we will discuss initial results of genomic comparisons in identifying unique metabolic pathways in probiotics, as well as future work applying our findings through metabolic modeling.
Notes
Problems in enzyme prediction

    Probiotics

    bacteria - known for a while

    benefit to host health when administered

    NGS - wealth of novel bacterial genomes 

    how do pathways differ

    systems bio approach - 

    genes not in isolation

    genes/proteins diverse array of processes - variety of interactions

    highly interconnected

    accuracy decreases when you use blast-based approach

    challenge - given a novel sequence to annotate, want to accurately predict metabolic function

    DETECT - enzyme prediction 

    BLAST results - get a lot of different hits, as typically seen with enzymes

    want to tell which one of these enzymatic functions we can assign to protein

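The idea of weighing all hits for a protein rather than trusting the single best one can be sketched as below; DETECT itself uses probabilistic profiles of alignment scores, so this is only a crude stand-in with made-up data:

```python
def assign_ec(hits):
    """Pick the EC number best supported across all search hits.

    `hits` is a list of (ec_number, score) pairs from a hypothetical
    sequence search; summing evidence across hits is a simplified
    stand-in for the probabilistic profiles DETECT actually uses.
    """
    support = {}
    for ec, score in hits:
        support[ec] = support.get(ec, 0.0) + score
    best = max(support, key=support.get)
    return best, support[best]

# Illustrative hits: two weaker hits to one EC outweigh one strong hit
hits = [("1.1.1.1", 50.0), ("2.7.1.1", 40.0), ("1.1.1.1", 30.0)]
ec, total = assign_ec(hits)
```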
    poster #13

    greater accuracy than blast-based approaches 

    x-validation of swiss-prot EC annotated proteins

    draft sequence of B. subtilis R0179

    gone through some computational annotation 

    re-predicting genes

    DETECT, PRIAM+BLAST

    generate list of likely enzymes in this genome, reactions they catalyze, metabolic pathways

    high confidence enzyme predictions

    different methods don't all overlap

    put enzymes into metabolic network reconstruction

    reference strain - curated metabolic model 

    useful contribution: we can infer potentially missing reactions found in reference but not predicted

    State of the model we have so far:

    missing enzymes in glycolysis - seems to be a missing step

    terpenoid backbone biosynthesis - significant portion predicted

    Summary:

    need accurate functional annotation of genomes - crucial 

    some enzyme categories cannot predict using BLAST

    probabilistic approach of DETECT can accurately assign enzymatic functions to genes

    combining approaches = can create reliable metabolic network

    reference metabolic pathways can aid in finding missing reactions - pathway holes

    Future:

    MetaCyc - tool to fill pathway holes

    construct metabolic model of B. subtilis R0179

    metabolic reconstruction of Lactobacillus helveticus R0052

http://compsysbio.org/projects/DETECT/
Q & A
Q: In the gut - that organism should live as part of a cross-feeding community. Clue in how to factor in other organisms?
A: Pathways reconstruction - create a meta. There are uniquely contributed pathways from particular species where you have a shift, healthy distribution. Disease state - certain overrepresentation - metabolic byproducts implicated in autism and more. Much to be learned. Microbiome not tapped out.
----------------------------------------------------------------------------------------------------------
11:00 - 11:20
Expert curation in UniProtKB: a case study in dealing with conflicting and erroneous data               
Sylvain Poux
SIB, Switzerland
Abstract
UniProtKB/Swiss-Prot provides expert curation with information extracted from literature and curator-evaluated computational analysis. As knowledgebases continue to play an increasingly important role in scientific research, a number of studies have evaluated their accuracy and revealed various errors. While some are curation errors, others are the result of incorrect information published in the scientific literature. By taking the example of a complex annotation case, we will describe the curation procedure of UniProtKB/Swiss-Prot and detail how we report conflicting information in the database. We will demonstrate the importance of collaboration between resources to ensure curation consistency and the value of contributions from the user community in helping to maintain error-free resources.
Notes
Address erroneous data in both literature and databases

    Erroneous data

    databases

    study - misannotation levels in 37 enzyme superfamilies

    erroneous annotation much higher in automatically annotated dbs

    The Economist - growing number of errors - at an alarming level such that science is not possible anymore

    Nature - could only reproduce 6/53 studies

    scientists found only 25% could be reproduced

    e.g. SIRT5 - complex annotation case. how we deal with conflicting information

    member of the Sir2 family in S. cerevisiae

    most members - protein deacetylases

    initial reports

    in an initial report - SIRT5 - protein deacetylase activity in vitro

    later, regulator of urea cycle, activate Cps1 enzyme in vivo

    Sirt5 deacetylates Cps1 in vitro

    did not show activation of Cps1 is due to deacetylation

    3 years ago - 3 different groups found a new activity:

    crystal structure

    pocket to host acetyl group much larger in SIRT5

    large enough for acyl group instead

    acts as protein deacylase

    protein lysine demalonylase and desuccinylase

    lysine malonylation and succinylation - 2 new PTMs

    malonylated / succinylated in vivo - including Cps1

    in Sirt5 knockout mice - Cps1 not activated

    Summary: 

    1. SIRT5 - NOT a protein deacetylase in vivo

    acts as protein deacylase

    specifically removes malonyl and succinyl groups

    how to represent in databases?

    Annotation:

    describe activity, structure, all with evidence

    annotate 3D structure 

    info added in protein entries of appropriate species

    knockout experiments from mouse

    bovine - malonylation and succinylation

    when info known, it is indicated

    'caution' tag - "does not exist in vivo"

    protein deacylase is confirmed - removes medium and long fatty acid chains

    information must be propagated with care

    a family rule was created - automatic annotation

    >4K UniProt entries

    collaborate with a number of databases/resources

    GO Consortium:

    request new GO terms when required

    added NOT qualifier- gene product not associated with GO term while it may be expected by automatic methods or previous literature

    PTMs - controlled vocab in collab with RESID

    new catalytic reactions submitted to IUBMB

    Summary:

    curation of 3 papers according to standard UniProt workflow:

    reannotation of SIRT5 protein

    addition of PTM sites in 28 protein entries

    new vocab: 4 GO terms, 2 PTM , 1 catalytic reaction

    uniprot entries regularly updated

    as new info comes

    only relevant information

    publications read fully and in detail - curated

    other publication - additional bibliography - UniProtKB

    85 papers in PubMed - only 26 used for annotation

    most info taken from 3 papers

    other 59 - not relevant, redundant, or weaker evidence

    number of papers still published based on in vitro - indirect analysis

    a lot of confusion

    Possible Solution: how to detect erroneous info in publication

    number of retractions - very small

    reading a range of publications in an area ensures both efficiency and critical analysis of data

    hard when there is no publication contradicting it

    => Nature - 'Enhancing reproducibility'

    guidelines to authors - standards

    Collaboration between resources - essential to limit number of errors in databases and their propagation

Q & A
Q: There are lots of ways to automatically detect conflicts in information. Which fact is the true fact in the literature - that can help you? Is the software system openly and publicly available?
A: We can discuss later, thank you
Q: Have to dig into annotations - make warnings more conspicuous. Not obvious from the SIRT5 UniProt page that it does not have acetylase activity
A: Difficult. Indicated in text. Impacted by users - dislike when we criticize their work. Careful when we retract evidence. Still published. We put appropriate number, names, highlight activity, but then as long as there are papers being published we have to display that. Careful.
Q: On a number of high profile papers - showing it's wrong in social media (retraction watch). 
A: We don't actively monitor social media. Maybe we should look at that
Q: It seems like biocurators are taking up the role of post-publication review. How much of that do you think you should be doing? Peer review is not perfect. How active do you think you should be having?
A: We have to give a complete overview. Important to give information to user. Many conflicted data - sometimes minor other times major. Responsibility to give information to users. Of course, sometimes mistake, then we correct.
----------------------------------------------------------------------------------------------------------
11:20 - 11:40
Effective automated classification using ontology-based annotation: experience with analysis of adverse event reports 
Melanie Courtot
BCCRC, Canada
Abstract
Analysis of reports of adverse events following immunization is an important way to identify potential problems in vaccine safety and efficacy. However, reports are not annotated in a consistent manner, resulting in tedious, costly and time consuming analysis. In order to address these deficiencies, the Brighton Collaboration has done extensive work towards standardization of case definitions and diagnostic criteria for vaccine adverse events. Within the framework of the OBO Foundry, we built the Adverse Events Reporting Ontology (AERO), a logical formalization of the Brighton guidelines to enable consistent and accurate annotation of adverse event reports. We applied it and validated our results against a manually curated dataset from the Vaccine Adverse Event Reporting System. AERO allows users to unambiguously refer to a specific set of carefully defined signs and symptoms at the time of data entry, as well as an overall diagnosis that remains linked to its associated signs and symptoms. The adverse event diagnosis is formally expressed, making it amenable to further querying. We show that it is possible to automate the classification of adverse events using the AERO with very high specificity (97%). We have demonstrated that AERO can be used with other types of guidelines, and our pipeline relies on open and widely used data standards (RDF, OWL, SPARQL) for implementation, making our system easily transposable to other domains. Within the Integrated Rapid Infectious Disease Analysis project (IRIDA; http://www.irida.ca), we have also initiated development of ontologies for infectious disease outbreak analysis. Our aim is to not only transform infectious disease control, but also illustrate the general benefit of ontology-based annotation to automate classification, which is key to many biocuration efforts.
Slides
    http://www.slideshare.net/mcourtot/biocuration-2014
Notes
Next week: Post doc at SFU

    Why do we care about adverse events following immunization?

    issues with vaccine

    shown that the better information we can communicate on immunization - the more successful the campaign

    if we look at how assessed: manually assessed by expert (expensive)

    time consuming

    Hypothesis: possible to encode clinical guideline -

    use OWL ontology

    apply to existing datasets - automated classification of advse effects

    Test case - VAERS dataset

    Vaccine Adverse Event Reporting System

    took a few months for experts to manually curate

    manually classified for anaphylaxis

    MedDRA - Medical Dictionary for Regulatory Activities - clinical findings

    free text - clinician entered data

    MedDRA encoded structured data - keywords that represent the free text part

    Diagnosis workflow

    A) adverse event reporting

    B) collected dataset - MySQL database

    C) annotations collected (MySQL)

    Put these three pieces together -> run python pipeline (reasoner) 

    OWL/RDF export

    can ask questions

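The reasoning step can be sketched in plain Python as checking a report's findings against a case definition with required and excluding criteria; the criteria below are hypothetical, not the actual Brighton/AERO axioms (which are expressed in OWL and evaluated by a reasoner):

```python
def classify(report_findings, definition):
    """Classify a report against a case definition.

    `definition` lists findings that must all be present and findings
    that rule the case out - a toy stand-in for the OWL reasoning the
    AERO pipeline performs over exported RDF.
    """
    required = set(definition["required"])
    excluding = set(definition["excluding"])
    findings = set(report_findings)
    if findings & excluding:
        return "negative"
    if required <= findings:
        return "positive"
    return "indeterminate"

# Hypothetical criteria for illustration, not the Brighton definition
anaphylaxis_like = {
    "required": ["sudden onset", "skin involvement"],
    "excluding": ["known unrelated allergy"],
}
result = classify(["sudden onset", "skin involvement"], anaphylaxis_like)
```

A report matching only some required criteria comes back "indeterminate", which mirrors why complex symptom definitions can miss positive cases.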
    D) add manually curated dataset

    specificity really high - 97%

    fair sensitivity 57%

    adverse events are rare - we want to be able to detect 

    Brighton reports - complex symptoms - miss positive cases

    went back to MedDRA - great for screening. Try to increase sensitivity.

    take positive and negative reports

    find elements significantly associated to outcome

    add to existing query

    Cosine similarity method

    vector of terms (documents - query and report)

    compare cosine measure of the angle they form

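The cosine measure over bag-of-words term vectors can be sketched as:

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine of the angle between two bag-of-words term vectors.

    Minimal sketch of the similarity step; the actual pipeline builds
    its vectors from the query and report terms described above.
    """
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)       # shared-term weight
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative terms, not real MedDRA codes
s = cosine("fever rash swelling", "fever rash")
```

Identical texts score 1.0; texts with no shared terms score 0.0.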
    Results: Sensitivity 92%

    Specificity 87%

    trying not to miss cases

    Discussion:

    using ontology on its own, sensitivity is too low for screening (b/c of brighton guideline for diagnosis, not screening)

    improve sensitivity with MedDRA

    Outcomes:

    encoding standards don't allow for complete representation of events

    chronic? important

    information lacking in reports from surveillance systems

    missing rash - not assessed? not recorded? negative?

    logical translation of guidelines allows for better detection of inconsistencies and errors

    working with brighton to add logical formalization

    Using ontology in future

    reporting

    fast screening - potentially positive. Send back form, get more information. Based on brighton guideline.

    future system - implementation of ontology based system at time of data entry

    labels and textual definitions for each term

    consistency checking

    Next steps:

    IRIDA project Integrated Rapid Infectious disease analysis

    irida.ca

    bioinformatics platform for genomic epidemiology analysis

    academic and public groups together

    ontologies developed to annotate clinical, lab and epidemiology data - integrated for analysis

Poster 15
----------------------------------------------------------------------------------------------------------
11:40 - 12:00
Supervised text mining for functional curation of gene products: how the data and performances have evolved in the last ten years         
Julien Gobeill
SIB, Switzerland
Abstract
We describe how text mining can support the literature-based curation tasks for functional proteomics using Gene Ontology (GO) concepts. In bioinformatics, this task is often considered both as time-consuming for humans and not performed with sufficient quality by computers. We will report on our participations in the BioCreative I and IV challenges, and on the progress regarding the data and systems' performances during the decade. In 2004, the first BioCreative already designed a task of automatic GO concept assignment from articles. At this time, we led the way thanks to a dictionary-based system. Yet, results were judged far from reaching the performances required by real curation workflows. Dictionary-based approaches compute similarities between the input text and the GO terms. These approaches are essentially limited by the complex nature of the GO: identifying GO concepts in text is highly challenging, as they often do not appear literally or even approximately in text. On the other hand, supervised approaches compute similarities between an input text and already curated instances. In 2004, such data-driven approaches were outperformed by dictionary-based systems, due to a lack of training data. Since then, the availability of curation data has massively improved. In 2013, the BioCreative IV Track 4 revisited the automatic GO assignment task. We demonstrate how our supervised classifier GOCat achieved top competitive results, approaching human curators. GOCat learns from 100,000 curated abstracts contained in GOA, in order to deliver accurate but also highly integrable GO concepts. Moreover, the quality of the GOCat predictions continues to improve across time, thanks to the growth of GOA: since 2006, GOCat performances have improved by 50%. Supervised text mining has now reached maturity and is ready to be used in operational semi-automatic curation workflows, or in fully automatic annotation settings for high-throughput databases. GOCat is available at eagl.unige.ch/GOCat.
Notes
Working on: algorithms
input: free text (abstract)
output: GO terms 

    10 year perspective: BioCreative 2004, BioCreative 2013

    evolution of performances across this time

    GO has grown - adding synonyms in GO

    yet DB effectiveness is constant

    GOA - PMID annotations - growing 

    Background:

    text miners - not biocurators

    data deluge issue - curated data behind knowledge in literature

    classical categorization task: how to predict GO terms from text

    state of the art approach:

    1. dictionary based approach

    similarity between input text and GO terms - needs synonyms

    massive effort was put into synonyms (2008 esp)

    2. machine learning approach

    tries to use knowledge base, to infer more likely GO terms

    100K abstracts from MEDLINE in GOA

    4K GO codes w/ at least 10 examples (80% of all annotations in GOA)

    GOCat-  machine learning sys

    query - abstract/free text

    retrieve k most similar abstracts in GOA

    infer most prevalent GO codes 

    the more similar the document, the more likely it shares GO codes

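The k-nearest-neighbour idea behind GOCat can be sketched with a toy word-overlap similarity; the GO ids below are real, but the corpus and similarity function are illustrative, not GOCat's actual ranking:

```python
from collections import Counter

def knn_go_codes(query, corpus, k=2, top=2):
    """Predict GO codes for a query abstract from its k most similar
    curated abstracts, then vote for the most prevalent codes.

    `corpus` is a list of (abstract_text, [go_codes]) pairs; shared
    word count stands in for a real similarity measure.
    """
    qwords = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda item: len(qwords & set(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter()
    for text, codes in ranked[:k]:
        votes.update(codes)
    return [code for code, _ in votes.most_common(top)]

# Toy curated corpus (real GO ids, invented abstracts)
corpus = [
    ("kinase activity in signal transduction", ["GO:0016301"]),
    ("kinase phosphorylates substrate proteins", ["GO:0016301", "GO:0006468"]),
    ("ribosome assembly in the nucleolus", ["GO:0042254"]),
]
pred = knn_go_codes("a novel kinase in signalling", corpus)
```

Because the two kinase abstracts dominate the neighbourhood, the kinase-related codes win the vote - the "bigger knowledge base is better" effect the talk reports.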
    Study: both approaches across time

    dictionary based EAGL

    machine learning GOCat

    Predict GO terms from just published abstract - simulation over time

    restore data as 2008

    look at how well engines did predict GO terms 

    gold standard: GOA

    benchmark 1K abstract each year (3K GO terms - 2.8 GO terms/abstract)

    Result:

    Dictionary based - constant

    Machine learning - improvement 

    for knowledge base - better is bigger

    In 2004 - DB rules

    poor performance

    2013 - ML rules

    protein and free text integrated

    GOCat Interface

    http://eagl.unige.ch/GOCat/

    Conclusions:

    critical mass is reached for Machine Learning - but still place for combination

    ML works well for 80% of annotation. But not for more rare GO terms

    curators: lets use available data

    GOCat - good for semi-automatic curation workflows

Q & A
Q: Comment - good to see that the ML approach has improved as more data becomes available. After we ran the challenge we did some internal agreement. Exact match annotation - 50%. Over 60% if we relax that. Still room for the text-mining community.