https://wiki.wormbase.org/index.php?title=ISB2014_Babbitt&feed=atom&action=historyISB2014 Babbitt - Revision history2024-03-28T15:53:21ZRevision history for this page on the wikiMediaWiki 1.33.0https://wiki.wormbase.org/index.php?title=ISB2014_Babbitt&diff=22870&oldid=prevKyook: Created page with "<pre> The Seventh International Biocuration Conference Functional Annotations April 8, 2014 http://biocuration2014.events.oicr.on.ca/agenda-5 link to all ISB2014 notes: h..."2014-04-11T19:13:26Z<p>Created page with "<pre> The Seventh International Biocuration Conference Functional Annotations April 8, 2014 http://biocuration2014.events.oicr.on.ca/agenda-5 link to all ISB2014 notes: h..."</p>
<p><b>New page</b></p><div><pre><br />
The Seventh International Biocuration Conference<br />
Functional Annotations<br />
April 8, 2014<br />
http://biocuration2014.events.oicr.on.ca/agenda-5<br />
link to all ISB2014 notes:<br />
http://etherpad.wikimedia.org/p/isb2014<br />
Editors<br />
<br />
name / affiliation / twitter<br />
<br />
Abigail Cabunoc / OICR / @abbycabs<br />
<br />
Marc Perry / OICR / @mdperry<br />
<br />
Next session: http://etherpad.wikimedia.org/p/isb2014-microbe<br />
09:00 - 10:00<br />
Keynote Lecture, Chaired by Marc Gillespie <br />
The Great Hall, Hart House<br />
10:00 - 12:00<br />
Session 3 - Functional Annotations, Chaired by Iddo Friedberg<br />
The Great Hall, Hart House<br />
announcements:<br />
Posters go down tonight<br />
group picture tomorrow<br />
Poster awards - semi-finalists: stand by your posters from 5-6<br />
3, 28, 42, 60, 68, 75, 96, 120, 131, 145<br />
<br />
<br />
----------------------------------------------------------------------------------------------------------<br />
A Global Context for Prediction of Functional Trends in Protein Superfamilies<br />
Patricia Babbitt<br />
California Institute for Quantitative Biosciences (QB3), UCSF, USA<br />
NCBI board scientific counsellor<br />
UniProt board<br />
Abstract<br />
As sequence data continues to rapidly increase, the number of new protein <br />
sequences of unknown function is beginning to dwarf those of known or <br />
[accurately] predicted function. Likewise, the proportion of proteins that can <br />
ever be experimentally characterized is becoming vanishingly small. Graphical <br />
network models that capture sequence similarities among members of large sets of <br />
homologous sequences offer a global view of their structure-function relationships. <br />
“Painting” clusters of related sequences with available functional information <br />
provides valuable clues for prediction of functional properties for proteins of <br />
unknown or misannotated function along with new insights unavailable from <br />
smaller scale studies. Further, similarity networks offer a powerful way to suggest <br />
targets for which experimental characterization may be effectively leveraged to <br />
predict functional properties of the many uncharacterized unknowns that continue <br />
to be discovered. Less encouraging, summary network views of large enzyme <br />
superfamilies suggest that similarity boundaries inferred from sequence and <br />
structural comparisons often fail to track with functional boundaries, reflecting <br />
significant challenges for automated function prediction. <br />
Notes<br />
Challenges - non trivial cases (assignment of molecular function)<br />
<br />
messiness in biology<br />
<br />
great resources that provide a foundation - context for work we want to do<br />
<br />
many of these tools often require that we use sequence and structural info as a starting point for functional annotation<br />
<br />
Large scale studies - protein super families - trying to get functional trends<br />
<br />
what we learn from nature<br />
<br />
what we learn from biology <br />
<br />
what we need to know<br />
<br />
our work in functional inference<br />
<br />
enzymes evolved to catalyze the different chemical reactions of living organisms<br />
<br />
used limited set of 'privileged' scaffolds<br />
<br />
sometimes dozens of rxns, associated functional key capability<br />
<br />
what are the range of reactions accessible?<br />
<br />
how has nature remodelled these scaffolds to enable <br />
<br />
specific superfamilies within fold families<br />
<br />
estimate 1/3 of universe of enzyme superfamilies are functionally diverse<br />
<br />
Enolase SF - example<br />
<br />
27K sequences in this set<br />
<br />
found throughout biosphere<br />
<br />
many roles<br />
<br />
pairwise identity between families low < 15-65%<br />
<br />
best curated SF she knows of<br />
<br />
develop the idea of chemically constrained evolution<br />
<br />
structural elements - fundamental chemical aspect are conserved<br />
<br />
chemical reactions - how does this conserved active site lead to many different reactions?<br />
<br />
share active site architecture<br />
<br />
model constrains search space<br />
<br />
predicting reaction & substrate specificity<br />
<br />
binding of ligands by substrate/product in active site<br />
<br />
conserved sub-structure<br />
<br />
rest of ligand - all over the place<br />
<br />
figure out parts of the protein that specify these differences<br />
<br />
not trivial (can't just do a multiple alignment)<br />
<br />
b/c these different reaction families evolve at different rates<br />
<br />
'pseudo-convergent' evolution -<br />
<br />
same reaction has evolved from different intermediate ancestors<br />
<br />
promiscuity - underlying<br />
<br />
pretty sure (not exhaustive) none of these same reaction, but not far from each other<br />
<br />
misannotation - hard to annotate<br />
<br />
proteins with good mechanistic info (know specificity determinants) - how often wrong?<br />
<br />
high levels of mis-annotation in many families<br />
<br />
only Swiss-Prot was really good<br />
<br />
chemistry-constrained - can assign members to superfamily associated with fundamental chemical capability/strategy<br />
<br />
can then use sequence and structure data to break the data into more manageable parts<br />
<br />
can do annotation transfer only at the level of granularity with good evidence<br />
<br />
=> structure function database - manually curated super families<br />
<br />
different kinds of info (mult align, associated data)<br />
<br />
What we really need to know about biology to facilitate functional annotation? clues<br />
<br />
1. different SF evolved in unique and complex ways<br />
<br />
different in important ways - prevent simple generalization<br />
<br />
can't rely on thresholds - (above level of similarity)<br />
<br />
examples of different kinds of variations<br />
<br />
1. domain organization - thioredoxin fold class<br />
<br />
varies much more than recognized<br />
<br />
many SF in this fold class with additional domains - do chemistry differently<br />
<br />
similar with respect to thioredoxin <br />
<br />
2. alkaline phosphatase SF <br />
<br />
poster child for detailed mechanistic understanding<br />
<br />
all associated with variations and some large inserts<br />
<br />
many of these inserts unrelated<br />
<br />
locations vary<br />
<br />
multiple kinds in a single group<br />
<br />
structural insert patterns don't match... something<br />
<br />
3. vicinal oxygen chelate (VOC) fold<br />
<br />
proteins made of variations of a fundamental sub-domain duplicated <br />
<br />
in many diff. groups - interesting set of combinatorics of these sub-domains<br />
<br />
not till we got enough structures that we started to understand the problem of annotating these proteins<br />
<br />
active sites: divergence<br />
<br />
you'd think they'd be conserved<br />
<br />
the background active site structure is similar, but different binding (metal dependence sometimes)<br />
<br />
quite different chemistry - complicated<br />
<br />
2. Reaction families evolve at different rates<br />
<br />
OSBS - only residues conserved are the residues conserved in the /entire/ superfamily<br />
<br />
reason we know the proteins evolve quickly<br />
<br />
looked at ancestry - somewhere back the twist between domains vary<br />
<br />
specificity are different<br />
<br />
identify them through genome context<br />
<br />
means that: when we build markov models for families - separate nicely between families<br />
<br />
OSBS - family members can't be separated nicely - only conserve residues everyone has<br />
<br />
protein similarity network<br />
<br />
SF can be very large - too large to look at easily (trees, mult align)<br />
<br />
took years to curate 12 SF in highly curated set<br />
<br />
offer advantages for getting first pass view on large scale of structure-function mapping<br />
<br />
networks have been useful - but principally as a basis for hypotheses<br />
<br />
how architecture put together<br />
<br />
what we know function of proteins<br />
<br />
different clades in tree represent different functional classes<br />
<br />
Validation:<br />
<br />
BLAST - simple metric to compare structures<br />
<br />
get quite good distances relative <br />
<br />
generally robust to missing data<br />
<br />
biocomparisons - looking at N-dimensional matrix - complex. 2 dimensions capture it well<br />
<br />
Layout used - organic layout cytoscape - couldn't reverse engineer<br />
<br />
edge lengths degree of connectivity (not directly)<br />
<br />
built app to use representative networks of very large SF<br />
<br />
can't build network with too many edges (<1M, cap at 500K to download)<br />
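One way to respect an edge cap like the one noted above is to raise the similarity
threshold until few enough edges survive. The sketch below is only an illustration of
that idea, not the app described in the talk. The function name threshold_for_edge_cap
and the example scores are hypothetical, and scores are assumed to be "higher is more
similar" (for instance -log10 of a BLAST E-value).

```python
def threshold_for_edge_cap(edge_scores, max_edges):
    """Return the least stringent score threshold such that the number of
    edges with score >= threshold stays within max_edges, or None if even
    the strictest threshold keeps too many edges."""
    best = None
    # Try the distinct scores as thresholds, strictest (highest) first.
    for candidate in sorted(set(edge_scores), reverse=True):
        kept = sum(1 for score in edge_scores if score >= candidate)
        if kept > max_edges:
            break
        best = candidate
    return best

# Made-up similarity scores, one per candidate edge (not real data).
example_scores = [70, 60, 50, 50, 45, 12]
print(threshold_for_edge_cap(example_scores, max_edges=3))
print(threshold_for_edge_cap(example_scores, max_edges=4))
```

Tied scores are kept or dropped together, so the result can use fewer edges than the cap allows.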
<br />
how to build the network<br />
<br />
1. generate representative network<br />
<br />
2. define subgroup clusters<br />
<br />
3. generate full network for each cluster<br />
<br />
4. define criteria for distinguishing families and annotation transfer<br />
<br />
5. map functional properties to nodes and edges<br />
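The first two steps above can be sketched roughly as follows. This is a minimal
illustration under stated assumptions, not the speaker's pipeline: the pairwise scores
are invented stand-ins for BLAST-derived similarity scores, clusters are taken to be
connected components of the thresholded network, and the names build_network and
clusters are hypothetical.

```python
from collections import defaultdict

def build_network(scores, threshold):
    """Keep an edge between two sequences when their similarity score
    (assumed: higher means more similar) meets the threshold."""
    graph = defaultdict(set)
    for (a, b), score in scores.items():
        graph[a]  # touch both keys so isolated nodes still appear
        graph[b]
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clusters(graph):
    """Connected components of the thresholded network, as sorted lists."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result

# Hypothetical scores for five sequences (not real data).
scores = {
    ("s1", "s2"): 60, ("s1", "s3"): 45, ("s2", "s3"): 50,
    ("s3", "s4"): 12, ("s4", "s5"): 70,
}
print(clusters(build_network(scores, threshold=40)))
print(clusters(build_network(scores, threshold=10)))
```

With the stricter threshold the five sequences split into two clusters, and with the
looser one they merge into one, which is why the choice of threshold matters for how
subgroups appear.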
<br />
reached a limit on full network where using all sequences<br />
<br />
too many sequences to do this now<br />
<br />
running against storage capacity limits<br />
<br />
have to use 90% redundancy filtering - to not take up too much space<br />
<br />
attributes can be mapped to network - genome context, type of life, more<br />
<br />
can download from database<br />
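Redundancy filtering at a 90% cutoff is often done greedily: a sequence joins an
existing representative if it is at least 90% identical to it, otherwise it becomes a
new representative. The sketch below assumes that greedy scheme. The talk did not say
which method was used, and toy_identity is a crude ungapped stand-in for a real
alignment-based identity. All names and sequences are hypothetical.

```python
def toy_identity(a, b):
    """Fraction of matching positions without gaps; a crude stand-in
    for alignment-based sequence identity."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def pick_representatives(sequences, identity=toy_identity, cutoff=0.90):
    """Greedy redundancy filter: map each representative to the list of
    sequences it stands for (including itself)."""
    members = {}
    # Longest first, so each group is represented by its longest member.
    for name in sorted(sequences, key=lambda n: (-len(sequences[n]), n)):
        for rep in members:
            if identity(sequences[name], sequences[rep]) >= cutoff:
                members[rep].append(name)
                break
        else:
            members[name] = [name]
    return members

# Invented sequences: "b" differs from "a" at one of ten positions.
sequences = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",
    "c": "MSTPLLVWNG",
}
print(pick_representatives(sequences))
```

Only the representatives would then go into the network, with the member lists kept so
that attributes of filtered-out sequences can still be looked up.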
<br />
What we've learned<br />
<br />
Glutathione transferases (GSTs)<br />
<br />
found throughout biosphere (some variation)<br />
<br />
13K nonredundant sequences<br />
<br />
30K refs in PubMed - studied for ~70 years<br />
<br />
reaction types known - difficult to assay <br />
<br />
we don't know huge majority - physiological functions<br />
<br />
tested using synthetic substrates<br />
<br />
how little we know<br />
<br />
1500 representative nodes - Swiss-Prot family representations<br />
<br />
grey nodes- sparse information<br />
<br />
many classes to be discovered - used more stringent e-value<br />
<br />
structure similarity networks<br />
<br />
eukaryotic proteins more similar<br />
<br />
promiscuity and mult. reaction types - in most subgroups<br />
<br />
many reaction types found in multiple subgroups<br />
<br />
annotation is going to be hard<br />
<br />
most subgroups contain several types of life<br />
<br />
don't track with phylogeny<br />
<br />
even groups with lot of metazoans and eukaryotes - some bacteria<br />
<br />
well characterized mechanistically<br />
<br />
Isoprenoid synthase type I superfamily - very sparse annotation<br />
<br />
includes some very large clusters. we know very little<br />
<br />
Radical SAM superfamily<br />
<br />
50K sequences<br />
<br />
mapped <br />
<br />
not sure that they are all the same fold type<br />
<br />
very complicated problem - easy to make inferences based on conserved<br />
<br />
Enolase SF - unknowns<br />
<br />
think about how to sort out functions <br />
<br />
evaluating misannotation<br />
<br />
enolase SF - distinct clusters<br />
<br />
user coming in can search db for sequence - send back network<br />
<br />
understand why sequence doesn't behave as expected<br />
<br />
may be misannotated<br />
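A very rough way to surface candidates for this kind of check is to flag sequences
whose annotation disagrees with the most common annotation in their cluster. This is
an assumed heuristic for illustration only, not the method presented. Given the
caveat in these notes that clusters may not track functional boundaries, a flag is a
prompt for closer inspection and not a verdict. The function name and the example
annotations are hypothetical.

```python
from collections import Counter

def flag_suspect_annotations(cluster_members, annotations):
    """Return annotated members whose label differs from the most common
    label among annotated members of the same cluster."""
    labelled = [m for m in cluster_members if m in annotations]
    if not labelled:
        return []
    majority, _ = Counter(annotations[m] for m in labelled).most_common(1)[0]
    return [m for m in labelled if annotations[m] != majority]

# Invented cluster: p5 has no annotation and is ignored.
cluster = ["p1", "p2", "p3", "p4", "p5"]
annotations = {
    "p1": "enolase",
    "p2": "enolase",
    "p3": "enolase",
    "p4": "mandelate racemase",
}
print(flag_suspect_annotations(cluster, annotations))
```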
<br />
Substrate specificity<br />
<br />
alkaline phosphatase SF<br />
<br />
missing piece - <br />
<br />
Target selection - experimental characterisation<br />
<br />
pick pathogens, proteins with <br />
<br />
different approach: picked 82 targets to cover the unknown space<br />
<br />
Armstrong's lab - assayed, screened for disulfide bond reductase reactions<br />
<br />
typified with one subgroup<br />
<br />
Context offers clues - 'outlier' reactions<br />
<br />
because protein member of enolase SF (canonical rxns)<br />
<br />
can make inferences on mechanical <br />
<br />
What we've learned: <br />
<br />
function prediction in these kinds of SF is hard- a lot of these<br />
<br />
typify general problem: a lot of the proteins are very similar to each other. Looking at diverse members of the group we start to see features we've talked about<br />
<br />
context to get started at structure-function relationships<br />
<br />
looking at diverse proteins - clustering of homologous sequences may not track with functional boundaries - complicates functional prediction<br />
<br />
every SF is different in important ways - <br />
<br />
remarkable - how few proteins in each SF have been looked at experimentally<br />
<br />
Challenges and caveats<br />
<br />
appropriate threshold for visualisation<br />
<br />
interpretation - sensitive to many factors<br />
<br />
relationships span large set of variations<br />
<br />
blast is fast & compares reasonably<br />
<br />
some key enzyme resources include:<br />
sfld - www.sfld.rbvi.ucsf.edu<br />
csa - www.ebi.ac.uk/thornton-srv/databases/csa<br />
macie - www.ebi.ac.uk/thornton-srv/databases/MACiE<br />
A superfamily is defined as:<br />
evolutionarily related<br />
conserved active site architecture<br />
some conserved<br />
Q & A<br />
Q: could you not get additional info from protein-protein interaction networks? Proteins next to each other get function?<br />
A: Of course, only way to go these days. Use multiple kinds of different information. In experiment - the people who did best did that sort of thing. Bring together multiple kinds of information - use genome context extensively. With similarity and context we get a lot of power. We often don't have the data for enzymes - esp. in bacteria.<br />
Q: gene ontology - also do function. Phylogenetic approach to do the inference. Sparse (true). In misannotation - do the misannotations that you uncover get fed back to the source? Can they be corrected at source?<br />
A: depends on the database - depends if they're correcting annotation. GenBank is different from UniProt. Working with them to make these kinds of correction (Swiss-Prot will fix). Worry about what the level of information is. Just because we have some experimental information - doesn't mean we know everything.<br />
Q: Have you tried to contact RefSeq?<br />
A: Talked to NCBI quite a bit. Not directly working. Some of that development is more recent. Working with Swiss-Prot, UniProt. We tried to make these changes years ago. One of the things we're concerned about - whether getting corrections into the database can be done systematically. Need to define criteria in order to correct this annotation. Poised to do in future.<br />
<br />
</pre></div>Kyook