ISB2014 Babbitt

The Seventh International Biocuration Conference
Functional Annotations
 April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
link to all ISB2014 notes:
    http://etherpad.wikimedia.org/p/isb2014
Editors

    name / affiliation / twitter

    Abigail Cabunoc / OICR / @abbycabs

    Marc Perry / OICR / @mdperry

Next session: http://etherpad.wikimedia.org/p/isb2014-microbe
09:00 - 10:00
Keynote Lecture, Chaired by Marc Gillespie 
The Great Hall, Hart House
10:00 - 12:00
Session 3 - Functional Annotations, Chaired by Iddo Friedberg
The Great Hall, Hart House
announcements:
Posters go down tonight
group picture tomorrow
Poster awards - semi-finalists: stand by your posters from 5-6
    3, 28, 42, 60, 68, 75, 96, 120, 131, 145


----------------------------------------------------------------------------------------------------------
A Global Context for Prediction of Functional Trends in Protein Superfamilies
Patricia Babbitt
California Institute for Quantitative Biosciences (QB3), UCSF, USA
NCBI board scientific counsellor
UniProt board
Abstract
As sequence data continues to rapidly increase, the number of new protein 
sequences of unknown function is beginning to dwarf those of known or 
[accurately] predicted function. Likewise, the proportion of proteins that can 
ever be experimentally characterized is becoming vanishingly small. Graphical 
network models that capture sequence similarities among members of large sets of 
homologous sequences offer a global view of their structure-function relationships. 
“Painting” clusters of related sequences with available functional information 
provides valuable clues for prediction of functional properties for proteins of 
unknown or misannotated function along with new insights unavailable from 
smaller scale studies. Further, similarity networks offer a powerful way to suggest 
targets for which experimental characterization may be effectively leveraged to 
predict functional properties of the many uncharacterized unknowns that continue 
to be discovered. Less encouraging, summary network views of large enzyme 
superfamilies suggest that similarity boundaries inferred from sequence and 
structural comparisons often fail to track with functional boundaries, reflecting 
significant challenges for automated function prediction. 
Notes
Challenges - nontrivial cases (assignment of molecular function)

    messiness in biology

    great resources that provide a foundation - context for work we want to do

    many of these tools often require that we use sequence and structural info as a starting point for functional annotation

    Large scale studies - protein superfamilies - trying to get functional trends

    what we learn from nature

    what we learn from biology 

    what we need to know

    our work in functional inference

    enzymes evolved to catalyze the different chemical reactions of living organisms

    used a limited set of 'privileged' scaffolds

    sometimes dozens of reactions, associated with a key functional capability

    what is the range of reactions accessible?

    how has nature remodelled these scaffolds to enable 

    specific superfamilies within fold families

    estimate 1/3 of universe of enzyme superfamilies are functionally diverse

    Enolase SF - example

    27K sequences in this set

    found throughout biosphere

    many roles

    pairwise identity between families low: < 15-65%

    best curated SF she knows of

    develop the idea of chemically constrained evolution

    structural elements - fundamental chemical aspects are conserved

    chemical reactions - how does this conserved active site lead to many different reactions?

    share active site architecture

    model constrains search space

    predicting reaction & substrate specificity

    binding of ligands by substrate/product in active site

    conserved sub-structure

    rest of ligand - all over the place

    figure out parts of the protein that specify these differences

    not trivial (can't just do a multiple alignment)

    b/c these different reaction families evolve at different rates

    'pseudo-convergent' evolution -

    same reaction has evolved from different intermediate ancestors

    promiscuity - underlying

    pretty sure (not exhaustive) none of these catalyze the same reaction, but they are not far from each other

    misannotation - hard to annotate

    proteins with good mechanistic info (know specificity determinants) - how often wrong?

    high levels of mis-annotation in many families

    only Swiss-Prot was really good

    chemistry-constrained - can assign members to a superfamily associated with a fundamental chemical capability/strategy

    can then use sequence and structure data to break the data into more manageable parts

    can do annotation transfer only at the level of granularity with good evidence

    => structure function database - manually curated superfamilies

    different kinds of info (multiple alignments, associated data)

    What do we really need to know about biology to facilitate functional annotation? Clues:

    1. different SFs evolved in unique and complex ways

    different in important ways - prevent simple generalization

    can't rely on thresholds - (above level of similarity)

    examples of different kinds of variations

    1. domain organization - thioredoxin fold class

    varies much more than recognized

    many SF in this fold class with additional domains - do chemistry differently

    similar with respect to thioredoxin 

    2. alkaline phosphatase SF 

    poster child for detailed mechanistic understanding

    all associated with variations and some large inserts

    many of these inserts unrelated

    locations vary

    multiple kinds in a single group

    structural insert patterns don't match... something

    3. vicinal oxygen chelate (VOC) fold

    proteins made of variations of a fundamental sub-domain duplicated 

    in many diff. groups - interesting set of combinatorics of these sub-domains

    not till we got enough structures that we started to understand the problem of annotating these proteins

    active sites: divergence

    you'd think they'd be conserved

    the background active site structure is similar, but different binding (metal dependence sometimes)

    quite different chemistry - complicated

    2. Reaction families evolve at different rates

    OSBS - the only residues conserved are the residues conserved in the /entire/ superfamily

    reason we know the proteins evolve quickly

    looked at ancestry - somewhere back the twist between domains varies

    specificities are different

    identify them through genome context

    means that: when we build markov models for families - separate nicely between families

    OSBS - family members can't be separated nicely - only conserve residues everyone has

    protein similarity network

    SF can be very large - too large to look at easily (trees, mult align)

    took years to curate 12 SF in highly curated set

    offer advantages for getting first pass view on large scale of structure-function mapping

    networks have been useful - but principally as a basis for hypotheses

    how architecture is put together

    what we know about the function of proteins

    different clades in tree represent different functional classes

    Validation:

    BLAST - simple metric to compare structures

    get quite good distances relative 

    generally robust to missing data

    biocomparisons - looking at N-dimensional matrix - complex. 2 dimensions capture it well

    Layout used - organic layout in Cytoscape - couldn't reverse engineer

    edge lengths reflect degree of connectivity (not directly)

    built app to use representative networks of very large SF

    can't build network with too many edges (<1M, cap at 500K to download)
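A minimal sketch of how an edge cap like this could be respected, assuming pairwise BLAST e-values are already in hand. The function name `pick_cutoff`, the candidate cutoffs, and the toy data are illustrative assumptions, not part of the speaker's tool.

```python
def pick_cutoff(pair_evalues, max_edges, candidate_cutoffs):
    """Return (cutoff, edge_count) for the most permissive cutoff within max_edges.

    pair_evalues maps (seq_a, seq_b) to a BLAST e-value (hypothetical input).
    Returns (None, 0) if even the most stringent candidate gives too many edges.
    """
    # Larger e-value cutoffs are more permissive, so try those first.
    for cutoff in sorted(candidate_cutoffs, reverse=True):
        edge_count = sum(1 for evalue in pair_evalues.values() if evalue <= cutoff)
        if edge_count <= max_edges:
            return cutoff, edge_count
    return None, 0


# Toy data: four sequences, four scored pairs.
toy_pairs = {
    ("a", "b"): 1e-50,
    ("a", "c"): 1e-30,
    ("b", "c"): 1e-12,
    ("c", "d"): 1e-5,
}
cutoff, edge_count = pick_cutoff(
    toy_pairs, max_edges=2, candidate_cutoffs=[1e-5, 1e-10, 1e-20, 1e-40]
)
print(cutoff, edge_count)
```

Tightening the cutoff drops weaker edges first, so the network shrinks toward its most confident relationships. A real cap would be in the hundreds of thousands of edges rather than 2.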

    how to build the network

    1. generate representative network

    2. define subgroup clusters

    3. generate full network for each cluster

    4. define criteria for distinguishing families and annotation transfer

    5. map functional properties to nodes and edges
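The steps above can be sketched roughly as follows. This is a simplified illustration, not the speaker's pipeline: it assumes pairwise e-values as input, uses connected components as a stand-in for subgroup clusters (the talk did not say how clusters are defined), and "paints" nodes with a made-up annotation table. All function names and data are hypothetical.

```python
from collections import defaultdict


def build_network(pair_evalues, cutoff):
    """Adjacency sets for pairs whose e-value is at or below the cutoff."""
    graph = defaultdict(set)
    for (a, b), evalue in pair_evalues.items():
        graph[a]  # touch the key so isolated sequences still appear as nodes
        graph[b]
        if evalue <= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_clusters(graph):
    """Connected components, each returned as a sorted list of node names."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(sorted(component))
    return result


def paint(cluster_list, annotations):
    """Label each node with its known function, or 'unknown' (a grey node)."""
    return [{n: annotations.get(n, "unknown") for n in c} for c in cluster_list]


# Toy e-values and annotations, invented for illustration.
toy_evalues = {
    ("e1", "e2"): 1e-60,
    ("e2", "e3"): 1e-45,
    ("m1", "m2"): 1e-70,
    ("e3", "m1"): 1e-8,
}
network = build_network(toy_evalues, cutoff=1e-40)
cluster_list = find_clusters(network)
painted = paint(cluster_list, {"e1": "enolase", "m1": "mandelate racemase"})
print(cluster_list)
print(painted)
```

At this cutoff the weak e3-m1 link is dropped, so the two groups separate. Loosening the cutoff to 1e-5 would merge them into one cluster, which is why the choice of threshold matters for step 4.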

    reached a limit on full network where using all sequences

    too many sequences to do this now

    running against storage capacity limits

    have to use 90% redundancy filtering - to not take up too much space
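A rough sketch of what 90% redundancy filtering means, assuming a greedy longest-first scheme similar in spirit to tools such as CD-HIT. The identity function here is a deliberately naive placeholder (position-by-position, no alignment), and the names and sequences are invented; the notes do not say which tool or identity measure was actually used.

```python
def naive_identity(a, b):
    """Fraction of matching positions, compared without alignment.

    Divides by the longer length. A placeholder for a real
    alignment-based identity calculation.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def filter_redundant(sequences, max_identity=0.9):
    """Greedy filter: keep a sequence only if it is below max_identity
    to every representative kept so far."""
    representatives = []
    # Longest first (ties broken by name), so each representative is
    # the longest member of the group it stands for.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    for name, seq in ordered:
        if all(naive_identity(seq, kept) < max_identity for _, kept in representatives):
            representatives.append((name, seq))
    return [name for name, _ in representatives]


# Toy sequences: s2 differs from s1 at one position out of 20 (95% identical).
toy_sequences = {
    "s1": "MKTAYIAKQRQISFVKSHFS",
    "s2": "MKTAYIAKQRQISFVKSHFA",
    "s3": "MSLLNDWQEPRTVYGHCAAK",
}
kept_names = filter_redundant(toy_sequences)
print(kept_names)
```

Here s2 is dropped as redundant with s1, while s3 is kept. Each surviving node then stands for a group of near-identical sequences, which keeps the edge count and storage manageable.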

    attributes can be mapped to network - genome context, type of life, more

    can download from database

    What we've learned

    Glutathione transferases (GSTs)

    found throughout biosphere (some variation)

    13K nonredundant sequences

    30K refs in PubMed - studies for ~70 years

    reaction types known - difficult to assay

    we don't know the physiological functions of the huge majority

    tested using synthetic substrates

    how little we know

    1500 representative nodes - Swiss-Prot family representations

    grey nodes - sparse information

    many classes to be discovered - used more stringent e-value

    structure similarity networks

    eukaryotic proteins more similar

    promiscuity and multiple reaction types - in most subgroups

    many reaction types found in multiple subgroups

    annotation is going to be hard

    most subgroups contain several types of life

    don't track with phylogeny

    even groups with a lot of metazoans and eukaryotes - some bacteria

    well characterized mechanistically

    Isoprene synthase I superfamily - very sparse annotation

    includes some very large clusters. we know very little

    Radical SAM superfamily

    50K sequences

    mapped 

    not sure that they are all the same fold type

    very complicated problem - easy to make inferences based on conserved

    Enolase SF - unknowns

    think about how to sort out functions 

    evaluating misannotation

    enolase SF - distinct clusters

    a user coming in can search the db with a sequence - sends back a network

    understand why sequence doesn't behave as expected

    may be misannotated
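One simple way to surface candidates like this, sketched under assumptions: flag any annotated sequence whose label disagrees with a clear majority of its annotated network neighbours. The function name, the `min_neighbours` parameter, and the toy data are hypothetical, and this is not presented as the method used in the talk.

```python
from collections import Counter


def flag_suspect_annotations(graph, annotations, min_neighbours=2):
    """Return {node: consensus_label} for nodes that disagree with their neighbours.

    graph maps node -> set of neighbouring nodes.
    annotations maps node -> function label (unannotated nodes are absent).
    """
    suspects = {}
    for node, label in annotations.items():
        neighbour_labels = [
            annotations[n] for n in graph.get(node, ()) if n in annotations
        ]
        # Too little evidence to judge either way.
        if len(neighbour_labels) < min_neighbours:
            continue
        consensus, count = Counter(neighbour_labels).most_common(1)[0]
        # Flag only when a strict majority of neighbours carry a different label.
        if consensus != label and count > len(neighbour_labels) / 2:
            suspects[node] = consensus
    return suspects


# Toy network: x sits inside a tight group of sequences annotated as enolase.
toy_graph = {
    "a": {"b", "c", "x"},
    "b": {"a", "c", "x"},
    "c": {"a", "b", "x"},
    "x": {"a", "b", "c"},
}
toy_annotations = {
    "a": "enolase",
    "b": "enolase",
    "c": "enolase",
    "x": "muconate cycloisomerase",
}
suspects = flag_suspect_annotations(toy_graph, toy_annotations)
print(suspects)
```

A flag is only a prompt to look closer. Since similarity clusters may not track functional boundaries, a sequence that disagrees with its neighbours can also be a genuine functional outlier rather than a misannotation.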

    Substrate specificity

    alkaline phosphatase SF

    missing piece - 

    Target selection - experimental characterisation

    pick pathogens, proteins with 

    different approach: picked 82 targets to cover the unknown space
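The notes do not record how the 82 targets were chosen. As one plausible heuristic, sketched under assumptions, pick a single well-connected sequence from each cluster that has no characterized member, starting with the largest clusters. All names and data below are invented for illustration.

```python
def pick_targets(graph, cluster_list, characterized, max_targets):
    """Choose one target per uncharacterized cluster, largest clusters first.

    graph maps node -> set of neighbours; cluster_list is a list of node lists;
    characterized is the set of nodes with experimental data.
    """
    unknown_clusters = [
        c for c in cluster_list if not any(n in characterized for n in c)
    ]
    # Larger clusters first: one experiment there informs more sequences.
    unknown_clusters.sort(key=lambda c: (-len(c), sorted(c)[0]))
    targets = []
    for cluster in unknown_clusters[:max_targets]:
        members = set(cluster)
        # Prefer the node with the most links inside its own cluster;
        # ties go to the alphabetically first name.
        best = max(sorted(cluster), key=lambda n: len(graph.get(n, set()) & members))
        targets.append(best)
    return targets


toy_graph = {
    "u1": {"u2", "u3"}, "u2": {"u1"}, "u3": {"u1"},
    "k1": {"k2"}, "k2": {"k1"},
    "v1": {"v2"}, "v2": {"v1"},
}
toy_clusters = [["u1", "u2", "u3"], ["k1", "k2"], ["v1", "v2"]]
targets = pick_targets(toy_graph, toy_clusters, characterized={"k1"}, max_targets=2)
print(targets)
```

The cluster containing k1 is skipped because it already has a characterized member, so the experimental effort goes to the parts of the network where nothing is known.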

    Armstrong's lab - assayed, screened for disulfide bond reductase reactions

    typified with one subgroup

    Context offers clues - 'outlier' reactions

    because protein member of enolase SF (canonical rxns)

    can make inferences on mechanical 

    What we've learned: 

    function prediction in these kinds of SF is hard - a lot of these

    typify general problem: a lot of the proteins are very similar to each other. Looking at diverse members of the group we start to see features we've talked about

    context to get started at structure-function relationships

    looking at diverse proteins - clustering of homologous sequences may not track with functional boundaries - complicates functional prediction

    every SF is different in important ways - 

    remarkable - how few proteins in each SF have been looked at experimentally

    Challenges and caveats

    appropriate threshold for visualisation

    interpretation - sensitive to many factors

    relationships span large set of variations

    blast is fast & compares reasonably

some key enzyme resources include:
    SFLD - www.sfld.rbvi.ucsf.edu
    CSA - www.ebi.ac.uk/thornton-srv/databases/csa
    MACiE - www.ebi.ac.uk/thornton-srv/databases/MACiE
A superfamily is defined as:
    evolutionarily related
    conserved active site architecture
    some conserved
Q & A
Q: could you not get additional info from protein-protein interaction networks? Proteins next to each other get function?
A: Of course, it's the only way to go these days. Use multiple kinds of different information. In the experiment, the people who did best did that sort of thing. Bring together multiple kinds of information - use genome context extensively. With similarity and context we get a lot of power. We often don't have the data for enzymes, esp. in bacteria.
Q: Gene Ontology - also does function. Phylogenetic approach to do the inference. Sparse (true). On misannotation - do the misannotations that you uncover get fed back to the source? Can they be corrected at the source?
A: Depends on the database - depends on whether they're correcting annotation. GenBank is different from UniProt. Working with them to make these kinds of corrections (Swiss-Prot will fix). Worry about what the level of information is. Just because we have some experimental information doesn't mean we know everything.
Q: Have you tried to contact RefSeq?
A: Talked to NCBI quite a bit. Not directly working with them. Some of that development is more recent. Working with Swiss-Prot, UniProt. We tried to make these changes years ago. One of the things we're concerned about - that getting changes into the database can be done systematically. Need to define criteria in order to correct this annotation. Poised to do this in future.