ISB2014 Session4 Microbial Informatics
From WormBaseWiki
Latest revision as of 19:15, 11 April 2014
The Seventh International Biocuration Conference
Microbial Informatics
April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
link to all ISB2014 notes: http://etherpad.wikimedia.org/p/isb2014

Help me take collaborative notes on this session! Fill out your name in the box in the top right-hand corner. Add yourself to the editors list, then edit away! I make a lot of typos. Sorry.

Editors (name / affiliation / twitter):
Abigail Cabunoc / OICR / @abbycabs
Melanie Courtot / BCCA-SFU / @mcourtot
Karen Yook / WormBase-Caltech / @wbperson712
Moni Munoz-Torres / LBNL / @monimunozto

13:00 - 14:30 Session 4 - Microbial Informatics, chaired by Fiona Brinkman and John Parkinson
The Great Hall, Hart House
--------------------------------------------------------------------------------------------
13:00 - 13:10 Microbial Informatics in 2014
Fiona Brinkman, Simon Fraser University, Canada

Notes
A high-level overview to set the stage for the talks
lots of data - visualizations; the largest published dataset of microbial genomes to date
IslandViewer - new version in 2014
  lots of curation to allow overlays of information
InnateDB - new version yesterday: beta.innatedb.com
  active curation of innate immunity interactions / allergy
  resources we use for innate immunity analysis, also used by many researchers studying other things
Broadening interest in microbiome research
  13-institute workshop (Canadian microbiome workshop) - all care about microbial research, from different perspectives
  using microbial data as markers for cancer and other chronic diseases - can be better predictors than other molecular markers
  broad interest from nutrition - growing interest in general
Keep in mind: quality of data - how to deal with the issue of growing analysis data
  Stein tomorrow - Big Data
  data grows massively - need standards, data integration, good curation, fed into automated function prediction
Google and Amazon joining the Global Alliance (http://genomicsandhealth.org/)
http://www.globalmicrobialidentifier.org/
genomes, metatranscriptomics, proteomes, pathway tools, collaboration - all being talked about this session
One final note: a mixed bag of perspectives on microbial informatics - relevant not only to microbes

Q: In terms of curation in the future: how much do you see manual curation playing a part in microbial genomics?
A: Big believer in pipelines of information - massive amounts of data; the sequence-structure divide. There is a growing divide between the data in the literature that we can curate and the sequence data. We need manual curation to bring the literature information into the great amount of data already available. Need automated curation that utilizes curation - manual curation shouldn't feel in competition with automation. Make sure each fits the right role - don't over-predict. Manual curation should be focused on the areas where it is most needed.
--------------------------------------------------------------------------------------------
13:10 - 13:30 The importance of newly sequenced genomes and functional annotations for phylogenetic profiling
Nives Skunca, ETH Zurich, Switzerland

Abstract
Phylogenetic profiling methods use patterns of presence and absence of genes in different species to predict protein-protein interactions and functional annotations. Since their introduction by Pellegrini et al. in 1999 [1], numerous methodological refinements have been proposed [2]. But a much greater difference lies in the amount of available genomic and functional data. In my talk, I will explore the extent to which new data improves the performance of phylogenetic profiling. Using a state-of-the-art phylogenetic profiling method [3], we quantified the improvement in prediction accuracy afforded by additional sequence and function information. Firstly, I will discuss an impressive difference in performance between phylogenetic profiles that use only the data available in 2005 and phylogenetic profiles that use the most recently available data.
Further, I will discuss the difference in performance when having more organisms in phylogenetic profiles, compared to having more comprehensive functional annotations. I will briefly reflect on the difference in the performance of phylogenetic profiling in the three kingdoms of life. Finally, I will discuss one avenue of reducing the computational costs related to phylogenetic profiling: a careful selection of organisms that provides similar performance as when using the full set of sequenced organisms.

References:
1. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96: 4285-4288.
2. Kensche PR, van Noort V, Dutilh BE, Huynen MA (2008) Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J R Soc Interface 5: 151-170. doi:10.1098/rsif.2007.1047.
3. Skunca N, Bosnjak M, Krisko A, Panov P, Dzeroski S, et al. (2013) Phyletic profiling with cliques of orthologs is enhanced by signatures of paralogy relationships. PLoS Comput Biol 9: e1002852. doi:10.1371/journal.pcbi.1002852.

Notes
Work done in collaboration with co-supervisor (Dessimoz)
Underlying idea: evaluate the effect that the larger amount of data in current databases has on established functional annotation
Sheer amount of data: automation is inevitable
Phylogenetic profiling - established 15 years ago, Pellegrini et al., PNAS 1999
  works across different organisms; for bacteria, phylogenetic profiling performs best
How does more data influence phylogenetic profiling? If more data is better, we should see an increase in accuracy
Skunca et al., PLoS Comput Biol 2013
  OMA - phylogenetic profiles - closed groups of orthologues
  assigned function to profiles - assume a single function
Took the method as published, changed the data given to it.
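The underlying idea of phylogenetic profiling described in the notes above (groups of genes with similar presence/absence patterns across organisms tend to share function) can be sketched with toy data; the profiles, group names, annotation and similarity cutoff below are invented for illustration and are not from the talk or from OMA:

```python
# Toy sketch of phylogenetic profiling: rows are orthologous groups,
# columns are organisms, 1/0 marks presence/absence of the group.
# Groups with similar profiles are predicted to share function.

def jaccard(p, q):
    """Similarity of two presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

profiles = {
    "groupA": [1, 1, 0, 1, 0, 1],
    "groupB": [1, 1, 0, 1, 0, 0],  # co-occurs with groupA
    "groupC": [0, 0, 1, 0, 1, 0],  # complementary pattern
}
known_function = {"groupA": "flagellar assembly"}  # invented annotation

# Transfer an annotation to unannotated groups whose profile is
# similar enough to an annotated one (cutoff chosen arbitrarily).
for name, profile in profiles.items():
    if name in known_function:
        continue
    best = max(known_function, key=lambda k: jaccard(profile, profiles[k]))
    if jaccard(profile, profiles[best]) >= 0.5:
        print(name, "-> predicted:", known_function[best])
```

Removing columns (organisms) or withholding annotations from such a matrix is exactly the kind of experiment the talk describes for measuring how much data the method needs.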
Gave it only the data available in 2005
  long tail in the number of GO terms - function could be predicted reasonably well in 2005, but a heavy base: for many functions this method could not be used reliably
Step by step, added the data that became available in subsequent years
  accuracy rises - quite a sharp rise; now 777 GO terms with reliable predictions (ML algorithm)
How many organisms are enough?
  more is not always better - only good information would increase AUC; additional genomes can provide {useful | redundant} information
  looked at phylogenetic profiles - removed organisms (columns) but left the function the same, so any change in predictive accuracy is a consequence of the change in the set of organisms
  predictive accuracy grows sharply in the beginning (up to 100 organisms), levels off after 100
  at 400 organisms, adding more makes almost no difference - but there is still a slight (noticeable) increase in predictability as more genomes are added
  contrary to the literature: best accuracy when we use /all/ the organisms
  using a random subset of organisms makes practically no difference
  phylogenetic profiling is nonetheless sensitive to diversity: using only one clade (Actinobacteria) gives a big gap in AUC versus a diverse set
How many annotations are enough? Quantify the effect - how much better?
  look at phylogenetic profiles - only the right-most column (the annotations) changes; the number of organisms stays the same
  decline in predictive accuracy when we remove annotations (plot AUC vs % of annotations used)
  with 20% of annotations, the predictions are useless
  seeming levelling-off between 80% and 100%: by the time we have added 80% of annotations, most proteins have been annotated (general terms)
  repeated using a more specific set of terms - less levelling off
Open World Assumption
  notion: biological DBs are incomplete
  AUPRC - calculated by resampling - inflates false positives that are simply not yet in our dataset
  looked at the effect of the open world assumption on our results
  looked at proteins with >5 annotations in the DB - all else being equal, these should be comprehensively annotated, so the effect of the open world assumption should be small
  other proteins - removed 60% of annotations; the open world assumption should have more effect
Poster #17

Q & A
Q: Plots - huge increase between 2007-2009? What happened?
A: Didn't look at particular influences. Suspect there was an increase in the number of organisms added to the DB.
Q: Open world assumption - you removed 60% of annotations - wouldn't it depend on which ones were removed?
A: We do it many times.
--------------------------------------------------------------------------------------------
13:30 - 13:45 GIST - An ensemble approach to the taxonomic classification of metatranscriptomic reads
Samantha Halliday, University of Toronto, Canada

Abstract
Whole-microbiome gene expression profiling ('meta-transcriptomics') has emerged as a powerful means of gaining a mechanistic understanding of the complex inter-relationships that exist in microbial communities. However, due to the inherent complexity of human-associated microbial communities and a lack of a comprehensive set of reference genomes, currently available algorithms for metatranscriptomic analysis are limited in their ability to functionally classify and organize these sequence datasets.
To overcome this challenge we have been developing methods that combine accurate transcript annotation with systems-level functional interrogation of metatranscriptomic datasets. As part of these methods, we present GIST (Generative Inference of Sequence Taxonomy), which combines several statistical and machine learning methods for compositional analysis of both nucleotide and amino acid content with the output from the Burrows-Wheeler Aligner to produce robust taxonomic assignments of metatranscriptomic RNA reads. A neural network is used to automatically infer the optimal weightings of each taxon and method, providing a phylogenetic classification technique that is robust against variable branch length. In addition to identifying taxon-specific pathways within the context of a pan-microbial functional network, linking taxa with specific functions in a microbiome will produce deeper understanding of how their loss or gain alters microbiome functionality. Applied to real as well as synthetic datasets, generated using an in-house simulation tool termed GENEPUDDLE, we demonstrate an improved performance in taxonomic assignments over existing methods.

Notes
compsysbio.org/gist
GIST - provides improved annotations for species and taxonomic identification from reads - metatranscriptomes
Microbiomes
  a huge amount of the biology on this planet - contain >50% of Earth's biodiversity
  when they form communities and interact, they do so densely; lots of interconnections; pathways between different species, strains and clades
  significance - environmental health concerns; diseases including autism & diabetes are influenced by microbiome health
metatranscriptomics - study mRNA from microbial communities; contrast with metagenomics and marker gene classification
Goals: produce a full network of interconnections
Major problems with the network graph:
1. not compartmentalized
2. the taxonomic units shown are very general groups - phyla. Not very useful.
want to develop more resolution
detect the presence of spores and inactive cells - very resilient bacteria will go into an inactive state - not killed
Two approaches:
1. Alignment (top BLAST hit)
2. not gene content, but composition
Composition: move a window along the sequence and count things we see (e.g. GC content, codon bias), N-mer frequencies
  HGTs are assimilated over time and fade into the background
MG-RAST - based on BLAST - limits precision; not effective with recently added datasets
NBC (2008) - best-performing compositional method
  large window length = large disk space; resource intensive but fairly accurate
GIST
  takes 4 different statistical methods; considers amino acids & nucleotides -> combines with BWA
  better idea of what's going on; different ML techniques; shorter N-mers
  adaptive weighting - learning the correct way to analyze the data
  confidence estimation in the output - if many species/strains have a similar score, can return the parent taxon instead of just taking the top hit
classification process: Naive Bayes, Gaussian, BWA (80% accuracy), nearest neighbour - a wide range of genetic features combined, high levels of accuracy
the secret to how this works - adaptive weighting training pipeline; family-level weights
specificity: untrained GIST performs poorly (12%); MetaCV 63%, though limited as it uses only 6 amino acids
NBC uses up to 16 GB - GIST uses only a fraction of that
Next steps: exploring datasets; increasing the window - memory management
rhetorica@cs.toronto.edu
compsysbio.org/gist

Q & A
Q: Natural datasets - taking water and spiking it with a sample of known bacteria?
A: Currently studying a metastasis(?) set as a benchmark. There are also real-world datasets out there - most of the work has been around improving performance on natural datasets. There is no standard benchmark to do anything like that. First thing - develop the pipeline - simulated datasets.
Q: Have you seen the performance improving with longer reads?
A: Longer reads, better performance.
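The compositional approach described in the notes above (slide a window along a read and count features such as GC content and N-mer frequencies) can be sketched as follows; the window size, step and N-mer length are illustrative choices, not GIST's actual parameters:

```python
# Sketch of sliding-window sequence composition: for each window,
# record the GC fraction and short N-mer counts. Parameter values
# here are arbitrary examples, not those used by GIST.
from collections import Counter

def composition(seq, window=8, step=4, k=2):
    seq = seq.upper()
    features = []
    for i in range(0, len(seq) - window + 1, step):
        win = seq[i:i + window]
        gc = (win.count("G") + win.count("C")) / window
        kmers = Counter(win[j:j + k] for j in range(window - k + 1))
        features.append({"start": i, "gc": gc, "kmers": kmers})
    return features

for f in composition("ATGCGCGATTACAGGCT"):
    print(f["start"], round(f["gc"], 2), f["kmers"].most_common(2))
```

A classifier (Naive Bayes, nearest neighbour, etc.) would then compare such feature vectors against per-taxon models; combining several of these classifiers with learned weights is the ensemble role described for GIST.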
Nearing perfection.
--------------------------------------------------------------------------------------------
13:45 - 14:00 Proteomes at UniProtKB - advancements and challenges in the post-genomic era
Ramona Britto, EMBL-EBI, UK
(not presented by Ramona, who has laryngitis and can't talk)

Abstract
The UniProt Knowledgebase (UniProtKB) is a central resource for high-quality, consistent and richly annotated protein sequence and functional information. UniProtKB has witnessed an exponential growth with a four-fold increase in the number of entries since 2011. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. To keep up with this rapid growth, recent advances at UniProt including a new web interface will enable users to find the most relevant and best-annotated set of sequences for each species. In addition to complete proteomes that are based on translations of completely sequenced genomes, we offer a selected subset of reference proteomes constituting a representative cross-section of the taxonomic diversity found within UniProtKB. We are working closely in collaboration with the INSDC, Ensembl and RefSeq to map all UniProtKB proteins to the underlying genomic assemblies and to offer a consistent set of complete and reference genomes/proteomes to the user community. Also in the pipeline is the concept of a pan-proteome within taxonomic groups that will capture unique sequences not found in reference proteomes and aid in phylogenetic comparisons. To further reduce redundancy within UniProt, a gene-centric view of complete proteomes will be implemented. This will bring together canonical and variant protein sequences into gene-based clusters that will more closely reflect genome size and offer a single reference protein for each gene. For highly redundant proteomes (e.g.
strains of Escherichia coli), the non-reference protein sets will be made available through UniParc, which will be extended to include annotation data. Finally, a new proteome identifier will be introduced that will uniquely identify the set of proteins corresponding to a single assembly of a completely sequenced genome. All of these new developments and future plans, with a particular focus on the microbial context, will be presented.

Notes
Growth of sequence databases
Challenges
  technical (servers, load, infrastructure)
  content of the DBs - much more difficult to explore
Solutions
  number of genomes rising - 4-fold increase since 2011
  lots of new genomes, many strains of the same species - complicates things for DBs and users
  sequence redundancy increases - challenges for annotation, scientific exploration, analysis and visualization
In UniProt - Proteomes DB
  monitors which organisms have complete proteomes
  keeps track of assemblies - whether they are complete
  infrastructure in place to make sense of the increase in the number of proteins
  currently >4500 proteomes, spread over the taxonomic range
Reference proteomes
  representation of a particular taxonomic group - landmarks of proteome space
  try to use the same reference as a base group - references, as has been done in genomics
  we can do much more with the data - annotation created, collaborations; essential for many resources
Redundancy within proteomes
  a single gene can encode multiple proteins through alternative splicing, alternative initiation and varying promoter usage - so there are many more protein sequences
Gene-centric view of proteomes
New UniProt interface
  working for the last year and a half on a new interface for UniProt
  present the data in the best possible way for users - talking to many users, many in this audience - give feedback on how they can best explore the data
  this week: new proteomes interface - any kind of proteomic information, a description of the data - report page - everything related to the genome sequence as well
Pan-proteomes provide a representative
set of all sequences in a taxonomic group

Q & A
Q: Reference proteomes are key - a pared-down dataset that's distributed nicely. What are the criteria for choosing that phyletic spread of the references?
A: There may be a user community interested in a particular one - they are the experts. Look at the annotations - how many sequences can be annotated, are annotated. Take into account all these criteria to decide.
--------------------------------------------------------------------------------------------
14:00 - 14:15 Development of the EcoCyc and MetaCyc Databases and the Pathway Tools Software
Ingrid M. Keseler, SRI International, USA

Abstract
EcoCyc (EcoCyc.org) is a comprehensive database resource for Escherichia coli K-12 that provides an overview of the current state of knowledge of E. coli biology. EcoCyc is continuously updated and contains references to more than 26,000 publications covering the gene products, metabolic pathways, and regulatory network of E. coli. Recent work with EcoCyc includes the development of EcoCyc-17.5-GEM, a genome-scale model of the E. coli K-12 MG1655 metabolic network that was automatically generated from EcoCyc using the MetaFlux component of the Pathway Tools software. While EcoCyc is focused on a single model organism, the MetaCyc database (MetaCyc.org) is a curated collection of more than 2000 experimentally elucidated metabolic pathways and more than 9000 enzymes from organisms that cover a wide taxonomic range, serving as an encyclopedic reference on pathways and enzymes for basic and applied research including metabolic engineering. MetaCyc also provides the data source that supports computational metabolic network prediction for organisms with sequenced and annotated genomes, resulting in the BioCyc collection of more than 3000 pathway/genome databases. The Pathway Tools software system supports both the creation of and end-user access to pathway/genome databases.
It enables development of organism-specific databases by providing tools to automatically build databases from annotated genome sequences, to infer the presence of metabolic pathways, to create and edit database entries, to generate metabolic models, and to publish the resulting databases. For end users of the published databases, the software provides a suite of query, visualization and analysis tools. Results can be captured and further analyzed or exported with Smart Tables (formerly called Web Groups). It is now possible to create temporary tables and, conversely, to publish tables that cannot be changed and can therefore be referred to in publications.

Notes
EcoCyc - the E. coli encyclopedia
  started 15 years ago; curated everything people would want to know about E. coli
  started with metabolic pathways; a few high-throughput datasets incorporated
  current funding situation - seeking funding for BsubCyc
New in EcoCyc
  generated a metabolic model from EcoCyc - generated from the curated data in the database, not kept separate from the curated info
  plan frequent releases - 2x a year
  can be inspected easily against what is curated - metabolic pathways
MetaCyc
  DB of metabolic pathways from many different organisms: human, mouse, rat, plants, archaea, bacteria, etc.
  literature, manual curation; 40K literature citations
  new: atom mappings for the reactions - not manually curated, all computational - give an idea of where the parts of the molecules go; useful for understanding reactions
  can be used to predict metabolic pathways in other organisms - predict the pathway complement of that organism
BioCyc
  3500+ databases - metabolic pathways predicted
  built on a common schema/vocab/DB objects - a compound structure updated in MetaCyc can be translated to BioCyc easily
Software used to curate the databases
  curation interface, publish to web, query tools, visualization tools, analysis tools
Metabolic route search
  input desired compounds (start and end); search in the DB, or DB+MetaCyc, for the optimal route from start to end
Renamed from Groups: SmartTables - create and manage
lists of 'things' (NB: pretty sure this is like InterMine??)
  methods for creating - search, import a list
  do analysis on lists - transform lists into other lists (enzymes -> pathways || genes -> regulators known to regulate those genes)
  set operations
  share tables, publish, export
Biocuration Accuracy Assessment
  got feedback from a grant review panel: manual curation is expensive (curators not the best paid??), not scalable; what is the accuracy? how good are the curators?
  => the claim: manual curation should be replaced
Validation task: validators - outside the DB - verify curated assertions ('facts' with literature citations, per gene/protein), then evaluate the results
  validator errors (false positives)
  metadata errors (the fact claimed was true, but the citation was incorrect)
  factual errors
EcoCyc & CGD final error rate (factual errors): 5 for each DB, <2% error rate
  very low. But we do make mistakes. GO term curation is hard.
  very difficult to measure errors - Ingrid doesn't point out what kinds of errors there are, or how significant the errors are

Q & A
Q: 1. Nature has optimized these pathways - does your route search take energy requirements into account? 2. Metadata - how does the validator know?
A: 2. Decided by conferring with the validator. Is that fact in the paper? A simple task. Decided in the end that a curator would look at the fact. Detailed analysis of everything reported. 1. Can't answer - need to talk to Mario(?). Intended for metabolic engineering - where there is no natural path. May catalyze a possible step in a possible pathway. A different way of using it: pathway hole filling - not done yet.
--------------------------------------------------------------------------------------------
14:15 - 14:30 Three's a crowd-source: Observations on Collaborative Genome Annotation
Monica C. Munoz-Torres
(Suzana Lewis talking in her place - because Moni's visa did not arrive in time.)
Lawrence Berkeley National Laboratory, USA

Abstract
It is impossible for a single individual to fully curate a genome with precise biological fidelity. Beyond the problem of scale, curators need second opinions and insights from colleagues with domain and gene family expertise, but the communications constraints imposed in earlier applications made this inherently collaborative task difficult. Apollo, a client-side JavaScript application allowing extensive changes to be rapidly made without server round-trips, placed us in a position to assess the difference this real-time interactivity would make to researchers' productivity and the quality of downstream scientific analysis. To evaluate this, we trained and supported geographically dispersed scientific communities (hundreds of scientists and agreed-upon gatekeepers, in 100 institutions around the world) to perform biologically supported manual annotations, and monitored their findings. We observed that: 1) Previously disconnected researchers were more productive when obtaining immediate feedback in dialogs with collaborators. 2) Unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore curators now face additional work correcting for more frequent assembly errors and annotating genes that are split across multiple contigs. 3) Automated annotations were improved, as exemplified by discoveries made based on revised annotations; for example, 2800 manually annotated genes from three species of ants granted further insight into the evolution of sociality in this group, and 3600 manual annotations contributed to a better understanding of immune function, reproduction, lactation and metabolism in cattle. 4) There is a notable trend shifting from whole-genome annotation to annotation of specific gene families or other gene groups linked by ecological and evolutionary significance.
5) The distributed nature of these efforts still demands strong, goal-oriented (i.e. publication of findings) leadership and coordination, as these are crucial to the success of each project. Here we detail these and other observations on collaborative genome annotation efforts.

Notes
There are a lot more potential curators out there than there are in this room. There's a whole world of people out there with experience that could be helpful.
Consequences of technology changes - sequencing is a lot more available
  crowd-sourcing - there are more people doing sequencing; we need to give all of those people the same opportunity we have to annotate the genome
Apollo
  automated systems: gene prediction
  curation - identify every element that might have a functional role; eliminate elements - systemic errors
  functional roles - compare what we do find with other species
  curators - achieve precise biological fidelity
Apollo: a JS application on the client side. The server is where the genome lives; edits can be pushed out to everyone looking at the genome at the same time.
  working with people, helping them get set up - education, training, workshops
Apollo - genomic annotation editing platform - modify locations
  you can write things, add things, upload data, adjust things
  privacy handling
  search - lower-coverage genomes
  configure - toggle strands, reverse complement, highlighting
  standard navigation - zoom, pan control
  edit - like a Google Doc, but for a genome
  fine-tune edits - non-canonical splice sites - simple click and drag; insertions and deletions in the genome assembly itself
  can get feedback
  history tracking - see changes - signed and dated - can look back and revert
Kinds of data: coverage plots - heat maps - transcriptome data
Just got renewed!
  want to be able to show variants and their associated haplotypes
  use the same techniques on low-coverage genomes
Working on for next year: 'folded' genome view (here now) - a gene split across contigs can be visualized together. Translocations. See paralogues.
dynamic thresholding - grab and drag up and down to adjust the threshold - creates a new track
  people seem to like it! Familiar with browsers - click and drag
What have we learned by working together?
  Bovine - 100 people annotating; 3600 genes contributed
  i5K project - 5K insect genomes - social genes, labour
New technologies - challenges
Lessons learned: training - can be used for educational purposes; the technology changes - take advantage of the latest technologies out there; standards

Q & A
Q: 1. Like Web Apollo. Track history - do you track evidence? 2. Other communities are using this to curate select gene sets that live in resource A/B. Lack of communication between curator groups? How can we all work together?
A: 1. No - this came up long, long ago. Evidence changes. New transcriptome... no, we don't track the evidence. If you're interested in determining the evidence, you can do that dynamically, looking at what you have now. 2. Broader discussion. Try to get convergence. Version control - you don't want a bunch of branches you never merge at some point. Wanted to make it very open - didn't want to require registration to download. We don't know how many servers there are! A policing function? Encourage people to report these. Not sure what will happen. Export function - we need to have one.
Q: Local BAM files - test server? Can you feed them in remotely?
A: Can feed them in remotely. Remote - each time you start a new session you need to reload it.
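The dynamic-thresholding feature mentioned in the Apollo notes above (drag a cutoff over a coverage plot and get a new track of the regions that clear it) boils down to interval extraction; a toy sketch with invented coverage values, not Apollo's actual code:

```python
# Toy sketch of dynamic thresholding: given per-base coverage values
# and a user-chosen cutoff, return the half-open intervals whose
# coverage meets the cutoff, as a new "track". Not Apollo's code.

def threshold_track(coverage, cutoff):
    track, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= cutoff and start is None:
            start = pos           # entering a region above the cutoff
        elif depth < cutoff and start is not None:
            track.append((start, pos))  # leaving the region
            start = None
    if start is not None:
        track.append((start, len(coverage)))  # region runs to the end
    return track

cov = [0, 2, 5, 7, 6, 1, 0, 4, 8, 3]
print(threshold_track(cov, 4))  # -> [(2, 5), (7, 9)]
```

Dragging the cutoff up or down and re-running this extraction is what makes the track "dynamic" in the browser.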