The Seventh International Biocuration Conference
Functional Annotations
April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5

Link to all ISB2014 notes: http://etherpad.wikimedia.org/p/isb2014

Editors
name / affiliation / twitter
Abigail Cabunoc / OICR / @abbycabs
Marc Perry / OICR / @mdperry

Next session: http://etherpad.wikimedia.org/p/isb2014-microbe

09:00 - 10:00 Keynote Lecture, Chaired by Marc Gillespie
The Great Hall, Hart House
10:00 - 12:00 Session 3 - Functional Annotations, Chaired by Iddo Friedberg
The Great Hall, Hart House

Announcements:
  Posters go down tonight
  Group picture tomorrow
  Poster awards - semi-finalists: stand by your posters from 5-6
  3, 28, 42, 60, 68, 75, 96, 120, 131, 145

----------------------------------------------------------------------------------------------------------

A Global Context for Prediction of Functional Trends in Protein Superfamilies
Patricia Babbitt
California Institute for Quantitative Biosciences (QB3), UCSF, USA
NCBI board scientific counsellor
UniProt board

Abstract
As sequence data continues to rapidly increase, the number of new protein sequences of unknown function is beginning to dwarf those of known or [accurately] predicted function. Likewise, the proportion of proteins that can ever be experimentally characterized is becoming vanishingly small. Graphical network models that capture sequence similarities among members of large sets of homologous sequences offer a global view of their structure-function relationships. “Painting” clusters of related sequences with available functional information provides valuable clues for prediction of functional properties for proteins of unknown or misannotated function along with new insights unavailable from smaller scale studies.
Further, similarity networks offer a powerful way to suggest targets for which experimental characterization may be effectively leveraged to predict functional properties of the many uncharacterized unknowns that continue to be discovered. Less encouraging, summary network views of large enzyme superfamilies suggest that similarity boundaries inferred from sequence and structural comparisons often fail to track with functional boundaries, reflecting significant challenges for automated function prediction.

Notes
Challenges - non-trivial cases (assignment of molecular function)
  messiness in biology
Great resources provide a foundation - context for the work we want to do
  many of these tools require that we use sequence and structural info as a starting point for functional annotation
Large-scale studies - protein superfamilies - trying to get functional trends
  what we learn from nature
  what we learn from biology
  what we need to know
  our work in functional inference
Enzymes evolved to catalyze the different chemical reactions of living organisms
  used a limited set of 'privileged' scaffolds
  sometimes dozens of rxns, associated with a key functional capability
  what is the range of reactions accessible?
  how has nature remodelled these scaffolds to enable specific superfamilies within fold families
  estimate: 1/3 of the universe of enzyme superfamilies are functionally diverse

Enolase SF - example
  27K sequences in this set
  found throughout biosphere
  many roles
  pairwise identity between families is low, < 15-65%
  best curated SF she knows of
Develop the idea of chemically constrained evolution
  structural elements - fundamental chemical aspects are conserved
  chemical reactions - how does this conserved active site lead to many different reactions / share active site architecture
  model constrains search space
Predicting reaction & substrate specificity
  binding of ligands by substrate/product in active site
  conserved sub-structure
  rest of ligand - all over the place
  figure out parts of the protein that specify these differences
  not trivial (can't just do a multiple alignment) b/c these different reaction families evolve at different rates
'Pseudo-convergent' evolution - same reaction has evolved from different intermediate ancestors
  promiscuity - underlying
  pretty sure (not exhaustive) none of these are the same reaction, but not far from each other
Misannotation - hard to annotate proteins with good mechanistic info (know specificity determinants) - how often wrong?
  high levels of misannotation in many families
  only Swiss-Prot was really good
Chemistry-constrained - can assign members to a superfamily associated with a fundamental chemical capability/strategy
  can then use sequence and structure data to break the data into more manageable parts
  can do annotation transfer only at the level of granularity with good evidence
  => structure-function database (SFLD) - manually curated superfamilies
  different kinds of info (multiple alignments, associated data)

What do we really need to know about biology to facilitate functional annotation?
Clues
1. Different SFs evolved in unique and complex ways
   different in important ways - prevents simple generalization
   can't rely on thresholds (above a level of similarity)
   examples of different kinds of variations
   1. domain organization - thioredoxin fold class
      varies much more than recognized
      many SFs in this fold class with additional domains - do chemistry differently
      similar with respect to thioredoxin
   2. alkaline phosphatase SF
      poster child for detailed mechanistic understanding
      all associated with variations and some large inserts
      many of these inserts unrelated
      locations vary
      multiple kinds in a single group
      structural insert patterns don't match... something
   3. vicinal oxygen chelate (VOC) fold
      proteins made of variations of a fundamental sub-domain
      duplicated in many diff. groups - interesting set of combinatorics of these sub-domains
      not till we got enough structures that we started to understand the problem of annotating these proteins
   active sites: divergence
      you'd think they'd be conserved
      the background active site structure is similar, but different binding (metal dependence sometimes)
      quite different chemistry - complicated
2. Reaction families evolve at different rates
   OSBS - the only residues conserved are the residues conserved in the /entire/ superfamily
      reason we know the proteins evolve quickly
      looked at ancestry - somewhere back, the twist between domains varies
      specificities are different
      identify them through genome context
   means that: when we build Markov models for families, they separate nicely between families
      OSBS - family members can't be separated nicely - only conserve the residues everyone has

Protein similarity networks
  SFs can be very large - too large to look at easily (trees, multiple alignments)
  took years to curate 12 SFs in the highly curated set
  offer advantages for getting a first-pass view, on a large scale, of structure-function mapping
  networks have been useful - but principally as a basis for hypotheses
  how architecture is put together
  what we know about function of proteins
  different clades in tree represent different functional classes
Validation:
  BLAST - simple metric to compare structures
  get quite good relative distances
  generally robust to missing data
  biocomparisons - looking at N-dimensional matrix - complex; 2 dimensions capture it well
Layout used - organic layout in Cytoscape
  couldn't reverse engineer edge lengths
  degree of connectivity (not directly)
  built app to use representative networks of very large SFs
  can't build network with too many edges (<1M, cap at 500K to download)
How to build a network
1. generate representative network
2. define subgroup clusters
3. generate full network for each cluster
4. define criteria for distinguishing families and annotation transfer
5. map functional properties to nodes and edges
Reached a limit on full networks using all sequences
  too many sequences to do this now
  running against storage capacity limits
  have to use 90% redundancy filtering - to not take up too much space
Attributes can be mapped to network - genome context, type of life, more
  can download from database

What we've learned
Glutathione transferases (GSTs)
  found throughout biosphere (some variation)
  13K nonredundant sequences
  30K refs in PubMed - studied for ~70 years
  reaction types known - difficult to assay
  we don't know the huge majority - physiological functions
  tested using synthetic substrates
  how little we know
  1500 representative nodes - Swiss-Prot family representations
  grey nodes - sparse information
  many classes to be discovered - used more stringent e-value
  structure similarity networks
  eukaryotic proteins more similar
  promiscuity and mult. reaction types - in most subgroups
  many reaction types found in multiple subgroups
  annotation is going to be hard
  most subgroups contain several types of life
  don't track with phylogeny
  even groups with a lot of metazoans and eukaryotes - some bacteria
  well characterized mechanistically
Isoprene synthase I superfamily - very sparse annotation
  includes some very large clusters.
  we know very little
Radical SAM superfamily
  50K sequences mapped
  not sure that they are all the same fold type
  very complicated problem - easy to make inferences based on conserved
Enolase SF - unknowns
  think about how to sort out functions
Evaluating misannotation
  enolase SF - distinct clusters
  using coming in
  can search db for sequence - send back network
  understand why sequence doesn't behave as expected
  may be misannotated
Substrate specificity
  alkaline phosphatase SF
Missing piece - target selection - experimental characterisation
  pick pathogens, proteins with
  different approach: picked 82 targets to cover the unknown space
  Armstrong's lab - assayed, screened for disulfide bond reductase reactions
  typified with one subgroup
Context offers clues - 'outlier' reactions
  because protein is a member of enolase SF (canonical rxns), can make inferences on mechanism

What we've learned:
  function prediction in these kinds of SFs is hard - a lot of these typify a general problem: a lot of the proteins are very similar to each other
  looking at diverse members of the group we start to see features we've talked about
  context to get started at structure-function relationships
  looking at diverse proteins - clustering of homologous sequences may not track with functional boundaries - complicates functional prediction
  every SF is different in important ways
  remarkable - how few proteins in each SF have been looked at experimentally
Challenges and caveats
  appropriate threshold for visualisation
  interpretation - sensitive to many factors
  relationships span large set of variations
  BLAST is fast & compares reasonably
Some key enzyme resources include:
  SFLD - www.sfld.rbvi.ucsf.edu
  CSA - www.ebi.ac.uk/thornton-srv/databases/csa
  MACiE - www.ebi.ac.uk/thornton-srv/databases/MACiE
A superfamily is defined as:
  evolutionarily related
  conserved active site architecture
  some conserved i

Q & A
Q: Could you not get additional info from protein-protein interaction networks?
Proteins next to each other get function?
A: Of course, it's the only way to go these days. Use multiple kinds of different information. In experiment - the people who did best did that sort of thing. Bring together multiple kinds of information - use genome context extensively. With similarity and context we get a lot of power. We often don't have the data in enzymes - esp. bacteria.
Q: Gene Ontology - also does function. Phylogenetic approach to do the inference. Sparse (true). In misannotation - do the misannotations that you uncover get fed back to the source? Can they be corrected at the source?
A: Depends on the database - depends if they're correcting annotation. GenBank is different from UniProt. Working with them to make these kinds of corrections (Swiss-Prot will fix). Worry about what the level of information is. Just because we have some experimental information doesn't mean we know everything.
Q: Have you tried to contact RefSeq?
A: Talked to NCBI quite a bit. Not directly working. Some of that development is more recent. Working with Swiss-Prot, UniProt. We tried to make these changes years ago. One of the things concerned about - get into database can be done systematically. Need to define criteria in order to correct this annotation. Poised to do in future.
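
Side note for readers of these notes: the similarity-network idea from the talk (keep edges above a similarity threshold, define clusters, then "paint" known functions onto them) can be sketched roughly as below. This is only a toy illustration, not the speaker's pipeline. The function names, the scores (imagined as something like -log10 of a BLAST E-value), the node names and the function labels are all made up for the example, and clusters are taken simply as connected components.

```python
from collections import defaultdict

def build_network(pairwise_scores, threshold):
    """Keep an edge only when its similarity score meets the threshold."""
    graph = defaultdict(set)
    for (a, b), score in pairwise_scores.items():
        graph[a]  # touch both keys so isolated nodes still appear
        graph[b]
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clusters(graph):
    """Return connected components, each as a sorted list of node names."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(sorted(component))
    return result

def paint(cluster_list, known_functions):
    """Pair each cluster with the set of known functions found inside it."""
    return [
        (cluster, {known_functions[n] for n in cluster if n in known_functions})
        for cluster in cluster_list
    ]

# Hypothetical data: five sequences, two of them experimentally characterized.
scores = {("A", "B"): 40, ("B", "C"): 35, ("C", "D"): 8, ("D", "E"): 50}
known = {"A": "enolase", "D": "mandelate racemase"}

for threshold in (20, 5):
    painted = paint(clusters(build_network(scores, threshold)), known)
    print(threshold, painted)
```

At the stricter threshold the two characterized sequences fall into separate clusters, so each cluster carries a single function label. At the looser threshold everything merges into one cluster carrying both labels, which is the situation where the notes say annotation transfer should not be trusted on similarity alone.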