ISB2014 Session4 Microbial Informatics

From WormBaseWiki
The Seventh International Biocuration Conference
Microbial Informatics
April 8, 2014
link to all ISB2014 notes:
Help me take collaborative notes on this session! Fill out your name in the box in the top right-hand corner. Add yourself to the editors list, then edit away!
I make a lot of typos. Sorry.

    name / affiliation / twitter

    Abigail Cabunoc / OICR / @abbycabs

    Melanie Courtot / BCCA-SFU / @mcourtot

    Karen Yook / WormBase-Caltech / @wbperson712

    Moni Munoz-Torres / LBNL / @monimunozto

13:00 - 14:30
Session 4 - Microbial Informatics, Chaired by Fiona Brinkman and John Parkinson
The Great Hall, Hart House
13:00 - 13:10
Microbial Informatics in 2014 
Fiona Brinkman 
Simon Fraser University, Canada
High-level overview of things

    set the stage for the talks

    lots of data - visualizations

    largest published dataset of microbial genomes today (lots)

    IslandViewer - new version in 2014

    curating lots to allow overlays of information

    InnateDB - new version yesterday

    active curation of innate immunity interactions/allergy

    resources we used for innate immunity analysis and used by many researchers studying other things

    Broadening interest in microbiome research

    13 institutes - workshop (Canadian microbiome workshop)

    all care about microbial research from different perspectives

    using microbial data as markers for cancer and other chronic diseases

    can be better predictors than other molecular markers

    broad interest from nutrition - growing interest in general

    keep in mind: quality of data - how to deal with issue of growing analysis data

    Stein tomorrow - Big Data

    data grows massively - need standards, data integration, good curation fed into automated function prediction

    use this data effectively

    Google and Amazon joining - global alliance

    genomes, metatranscriptomic, proteomes, pathway tools, collaboration - all being talked about this session

    One final note: mixed bag of perspectives on microbial informatics: relevant not only to microbes

Q: In terms of curation in the future: how much do you see manual curation playing a part in microbial genomics?
A: Big believer in a pipeline of information - massive amounts of data. Sequence-structure divide. Growing divide between the data in the literature, what we can curate, and sequence data.
We need manual curation to bring literature info into the great amount of data already available.
Need automated curation that utilizes manual curation - manual curation shouldn't feel in competition with automation. Make sure each fits the right role - don't over-predict. Manual curation should be focused on areas where it's most needed.
13:10 - 13:30
The importance of newly sequenced genomes and functional annotations for phylogenetic profiling 
Nives Skunca
ETH Zurich, Switzerland
Phylogenetic profiling methods use patterns of presence and absence of genes in different species to predict protein-protein interactions and functional annotations. Since their introduction by Pellegrini et al. in 1999 [1], numerous methodological refinements have been proposed [2]. But a much greater difference lies in the amount of available genomic and functional data. In my talk, I will explore the extent to which new data improves the performance of phylogenetic profiling. Using a state-of-the-art phylogenetic profiling method [3], we quantified the improvement in prediction accuracy afforded by additional sequence and function information. Firstly, I will discuss an impressive difference in performance between phylogenetic profiles that use only the data available in 2005 and phylogenetic profiles that use the most recently available data. Further, I will discuss the difference in performance when having more organisms in phylogenetic profiles, compared to having more comprehensive functional annotations. I will briefly reflect on the difference in the performance of phylogenetic profiling in the three kingdoms of life. Finally, I will discuss one avenue of reducing the computational costs related to phylogenetic profiling: a careful selection of organisms that provides similar performance as when using the full set of sequenced organisms.
References:
1. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96: 4285-4288.
2. Kensche PR, van Noort V, Dutilh BE, Huynen MA (2008) Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J R Soc Interface 5: 151-170. doi:10.1098/rsif.2007.1047.
3. Skunca N, Bosnjak M, Krisko A, Panov P, Dzeroski S, et al. (2013) Phyletic profiling with cliques of orthologs is enhanced by signatures of paralogy relationships. PLoS Comput Biol 9: e1002852. doi:10.1371/journal.pcbi.1002852.

    Done in collaboration with co-supervisor (Dessimoz)

    underlying idea: want to evaluate the effect of having a larger amount of data in the current databases on the established functional annotation

    sheer amount of data: automation is inevitable

    phylogenetic profiling - established 15 years ago (Pellegrini et al., PNAS 1999)

    different organisms. for bacteria phylogenetic profiling performs best

    how does more data influence phylogenetic profiling?

    more data better -> should see increase in accuracy (Skunca et al., PLoS CB, 2013)

    OMA - phylogenetic profiles - closed groups of orthologues

    assigned function to profiles - assume single function

    took method the way published, changed data given.

    give data available in 2005

    long tail - number of GO terms reasonably well predict function in 2005

    but heavy base - many functions, could not use this method reliably

    step by step added data that became available in subsequent years

    accuracy rises - quite sharp rise to now

    777 GO terms with reliable predictions (ML algorithm)

    How many organisms are enough? more not always better - good info would increase AUC

    additional genomes can provide {useful | redundant} information

    looked at phylogenetic profiles - removed organisms (columns), but left the function the same

    any change in predictive accuracy is a consequence of a change in the set of organisms

    predictive accuracy grows sharply in the beginning (up to 100 organisms)

    levels off after 100

    400 organisms - almost doesn't make any difference to add more - but there is still a slight, noticeable increase in predictive accuracy as more genomes are added

    contrary to literature: best accuracy when we use /all/ the organisms

    using random subset of organisms doesn't make any difference practically.

    phylogenetic profiling sensitive to diversity nonetheless: using only one clade: actinobacteria, big gap in AUC value vs diverse set

    How many annotations are enough?

    quantify the effect - how much better?

    look at phylogenetic profiles - only look at rightmost column, but number of orgs the same

    decline in predictive accuracy when we remove annotations (plot AUC vs % of used annotations)

    20% of annotations -> predictions are useless

    seemingly leveling off - 80%-100%

    by the time we have added 80%, most have been annotated (general terms)

    repeat using a more specific set, less levelling off

    Open World Assumption

    notion: biological dbs are incomplete

    AUPRC - calc by resampling - inflating false positives 

    not yet in our dataset

    Look at effect of open world assumption on our results

    looked at proteins with >5 annotations in db - all else equal, these should be comprehensively annotated, effect of open world assumption should be small

    other proteins - removed 60% of annotations, open world assumption should have more effect
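
A minimal Python sketch of the column-removal experiment noted above. It is not the method from Skunca et al.: the toy presence/absence profiles, the group and organism names, the "shares a function with grp1" labels and the use of Jaccard similarity as the score are all assumptions made for illustration. AUC is computed as the probability that a same-function group outscores a different-function group.

```python
# Toy presence/absence profiles: one row per orthologous group, one column
# per organism (1 = gene present). All names and values are hypothetical.
PROFILES = {
    "grp1": (1, 1, 0, 0, 1, 0),
    "grp2": (1, 1, 0, 0, 1, 0),
    "grp3": (0, 0, 1, 1, 0, 1),
    "grp4": (1, 1, 0, 1, 1, 0),
}
# Assumed ground truth: does the group share a function with grp1?
SHARES_FUNCTION = {"grp2": True, "grp3": False, "grp4": True}


def jaccard(p, q):
    """Similarity of two profiles: shared presences / total presences."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def keep_columns(profiles, columns):
    """Drop organisms (columns) while leaving the annotations untouched."""
    return {g: tuple(p[i] for i in columns) for g, p in profiles.items()}


def auc(scores, labels):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_for_query(profiles, query="grp1"):
    """Rank all other groups by profile similarity to the query group."""
    others = [g for g in profiles if g != query]
    scores = [jaccard(profiles[query], profiles[g]) for g in others]
    labels = [SHARES_FUNCTION[g] for g in others]
    return auc(scores, labels)


auc_all = auc_for_query(PROFILES)                     # all six organisms
auc_one = auc_for_query(keep_columns(PROFILES, [3]))  # a single organism
print(auc_all, auc_one)
```

With all six toy organisms the ranking is perfect; with a single organism every score ties and the AUC falls to chance. The annotation-removal experiment would be the mirror image: keep the columns and hide part of the labels.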

Poster #17
Q & A
Q: Plots- huge increase between 2007-2009? What happened?
A: Didn't look at particular influences. Suspect there was an increase in # of organisms we added to db
Q: Open world assumption - removed 60% of annotations - wouldn't it depend on which ones are removed?
A: Do it many times
13:30 – 13:45
GIST - An ensemble approach to the taxonomic classification of metatranscriptomic reads 
Samantha Halliday
University of Toronto, Canada
Whole-microbiome gene expression profiling ('meta-transcriptomics') has emerged as a powerful means of gaining a mechanistic understanding of the complex inter-relationships that exist in microbial communities. However, due to the inherent complexity of human-associated microbial communities and a lack of a comprehensive set of reference genomes, currently available algorithms for metatranscriptomic analysis are limited in their ability to functionally classify and organize these sequence datasets. To overcome this challenge we have been developing methods that combine accurate transcript annotation with systems-level functional interrogation of metatranscriptomic datasets. As part of these methods, we present GIST (Generative Inference of Sequence Taxonomy), which combines several statistical and machine learning methods for compositional analysis of both nucleotide and amino acid content with the output from the Burrows-Wheeler Aligner to produce robust taxonomic assignments of metatranscriptomic RNA reads. A neural network is used to automatically infer the optimal weightings of each taxon and method, providing a phylogenetic classification technique that is robust against variable branch length. In addition to identifying taxon-specific pathways within the context of a pan-microbial functional network, linking taxa with specific functions in a microbiome will produce deeper understanding of how their loss or gain alters microbiome functionality. Applied to real as well as synthetic datasets, generated using an in-house simulation tool termed GENEPUDDLE, we demonstrate an improved performance in taxonomic assignments over existing methods.
GIST - to provide improved annotations for species and taxonomic identification from reads - metatranscriptomes


    huge amount of biology on this planet

    contain >50% of earth's biodiversity

    when they form communities and interact, they do so densely. Lots of interconnections. Pathways between different species, strains and clades

    significance - environmental health concerns

    diseases including autism & diabetes

    influenced by microbiome health

    metatranscriptomics - study mRNA from microbial communities

    contrast metagenomics and marker gene classification

    Goals: produce full network of interconnected 

    major problems with network graph

    1. not compartmentalized

    2. taxonomic units shown are very general groups - phyla. Not very useful.

    want to develop more resolution

    detect presence of spores and inactive cells

    very resilient bacteria will go to inactive state - did not kill

    Two approaches:

    1. Alignment (top BLAST hit)

    2. not gene content, but composition


    move window along sequence and count things we see (e.g. GC content, codon bias) N-mer frequencies

    HTGs assimilated over time and fading in background

    MG-RAST - based on BLAST - limits precision

    not effective with datasets recently added

    best-performing compositional method NBC - 2008

    large window length = large disk space

    resource intensive

    but fairly accurate
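
A minimal sketch of the sliding-window counting described above, in Python. It is not the NBC or GIST implementation; the function names and the example read are made up for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Slide a window of length k along the sequence and count each N-mer."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


read = "ATGATG"  # hypothetical read
counts = kmer_counts(read, 3)
print(counts, gc_content(read))
```

The number of possible N-mers grows as 4^k for nucleotides, which is why a large window length translates into large disk and memory use.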


    takes 4 different statistical methods

    considers aa & nucleotides -> combines with BWA

    better idea of what's going on, different ML techniques

    shorter N-mers

    adaptive weighting - learning correct way to analyze data

    confidence estimation in output - if many species/strains have similar score, can return parent taxon instead of just taking top hit

    classification process

    Naive Bayes, Gaussian, BWA (80% accuracy), nearest neighbor - wide range of genetic features

    combined, high levels of accuracy

    the secret to how this works - adaptive weighting

    training pipeline 

    family-level weights
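
A rough Python sketch of the two ideas in this part of the notes: weighting several classifiers and falling back to the parent taxon when the top hits are too close to call. GIST learns its weights with a neural network; here the weights are fixed by hand, and the taxon names, method names, scores and the 0.05 margin are all hypothetical.

```python
# Hypothetical child -> parent taxonomy.
PARENT = {"strain_a1": "genus_a", "strain_a2": "genus_a", "strain_b1": "genus_b"}


def combine(method_scores, weights):
    """Weighted sum of per-taxon scores across classification methods."""
    total = {}
    for method, scores in method_scores.items():
        for taxon, score in scores.items():
            total[taxon] = total.get(taxon, 0.0) + weights[method] * score
    return total


def classify(method_scores, weights, parent, margin=0.05):
    """Return the top taxon, or the shared parent if the top two nearly tie."""
    ranked = sorted(combine(method_scores, weights).items(),
                    key=lambda item: item[1], reverse=True)
    best, best_score = ranked[0]
    if len(ranked) > 1:
        runner_up, runner_score = ranked[1]
        same_parent = parent.get(best) is not None and parent.get(best) == parent.get(runner_up)
        if best_score - runner_score < margin and same_parent:
            return parent[best]
    return best


weights = {"naive_bayes": 0.5, "bwa": 0.5}  # assumed, not learned
close_call = {
    "naive_bayes": {"strain_a1": 0.60, "strain_a2": 0.58, "strain_b1": 0.10},
    "bwa": {"strain_a1": 0.50, "strain_a2": 0.50, "strain_b1": 0.20},
}
clear_call = {
    "naive_bayes": {"strain_a1": 0.90, "strain_a2": 0.20, "strain_b1": 0.10},
    "bwa": {"strain_a1": 0.80, "strain_a2": 0.30, "strain_b1": 0.20},
}
print(classify(close_call, weights, PARENT), classify(clear_call, weights, PARENT))
```

Returning the parent taxon trades resolution for confidence: the read is still placed, just one level higher in the tree.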


    untrained GIST performs poorly (12%)

    MetaCV 63%, however limited as it uses only 6 amino acids

    NBC uses up to 16GB - GIST uses only a fraction

    Next steps: exploring datasets

    increase window - memory management
Q & A
Q: natural datasets - taking water and spiking with sample? known bacteria?
A: currently studying metastatis(?) set. benchmark. also real world datasets out there - most of the work has been around improving performance on natural datasets. There is no standard benchmark to do anything like that. First thing - develop pipeline - simulated datasets.
Q: Have you seen the performance improve with longer reads?
A: longer read, better performance. Nearing perfection
13:45 – 14:00
Proteomes at UniProtKB – advancements and challenges in the post-genomic era 
Ramona Britto
(not Ramona who has laryngitis and can't talk)
The UniProt Knowledgebase (UniProtKB) is a central resource for high-quality, consistent and richly annotated protein sequence and functional information. UniProtKB has witnessed an exponential growth with a four-fold increase in the number of entries since 2011. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. To keep up with this rapid growth, recent advances at UniProt including a new web interface will enable users to find the most relevant and best-annotated set of sequences for each species. In addition to complete proteomes that are based on translations of completely sequenced genomes, we offer a selected subset of reference proteomes constituting a representative cross-section of the taxonomic diversity found within UniProtKB. We are working closely in collaboration with the INSDC, Ensembl and RefSeq to map all UniProtKB proteins to the underlying genomic assemblies and to offer a consistent set of complete and reference genomes/proteomes to the user community. Also in the pipeline is the concept of a pan-proteome within taxonomic groups that will capture unique sequences not found in reference proteomes and aid in phylogenetic comparisons. To further reduce redundancy within UniProt, a gene-centric view of complete proteomes will be implemented. This will bring together canonical and variant protein sequences into gene-based clusters that will more closely reflect genome size and offer a single reference protein for each gene. For highly redundant proteomes (e.g. strains of Escherichia coli), the non-reference protein sets will be made available through UniParc which will be extended to include annotation data. Finally, a new proteome identifier will be introduced that will uniquely identify the set of proteins corresponding to a single assembly of a completely sequenced genome.
All of these new developments and future plans with a particular focus on the microbial context will be presented. 

    Growth of sequence databases


     technical (servers, load, infrastructure)

    content of dbs - much more difficult to explore


    # of genomes rising

    4 fold increase since 2011 

    lot of new genomes, many strains for the same species - complicates things for dbs and users

    sequence redundancy increases

    challenges for annotation, scientific exploration, analysis and visualization

    In uniprot - Proteomes DB

    monitors which orgs have complete proteomes

    keep track of assemblies - whether complete

    infrastructure in place, make sense for increase in number of proteins

    currently >4500 proteomes and spread over taxonomic range

    Reference proteomes

    representation of a particular taxonomic group

    landmarks of proteome space

    try to use the same reference as a base

    group references as they have done in genomics

    we can do much more with the data annotation

    created , collaborations

    essentials for many resources

    Redundancy within proteomes

    a single gene can encode multiple proteins through alternative splicing, alternative initiation and varying promoter usage

    there are many more protein sequences

    Gene-centric view of proteomes

    new UniProt interface

    working the last year and a half on a new interface for uniprot

    present the data in the best possible way for users

    talking to many users - many in this audience

    give feedback on how they can explore the data in the best way

    this week

    new proteomes interface

    any kind of proteomic information

    description of the data - report page

    everything related to genome sequence as well


    provide representative set of all sequences in a taxonomic group
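
A small Python sketch of the gene-centric view mentioned in the notes and the abstract: protein entries are grouped by gene and one reference is picked per cluster. The accessions, gene names and sequences are made up, and the "longest sequence wins" rule is only an assumption for illustration; the talk did not state how UniProt chooses the reference protein.

```python
from collections import defaultdict

# (accession, gene, sequence) triples; every value here is hypothetical.
ENTRIES = [
    ("ACC1", "geneA", "MKTLLVAG"),
    ("ACC2", "geneA", "MKTLL"),
    ("ACC3", "geneB", "MSSQ"),
    ("ACC4", "geneA", "MKTLLVAGQR"),
]


def gene_clusters(entries):
    """Group canonical and variant protein entries by the gene they come from."""
    clusters = defaultdict(list)
    for accession, gene, sequence in entries:
        clusters[gene].append((accession, sequence))
    return dict(clusters)


def pick_reference(members):
    """Assumed rule: the longest sequence wins, ties broken by accession."""
    return min(members, key=lambda member: (-len(member[1]), member[0]))[0]


clusters = gene_clusters(ENTRIES)
references = {gene: pick_reference(members) for gene, members in clusters.items()}
print(references)
```

Four protein entries collapse to two gene-level clusters, so the cluster count tracks the number of genes and not the number of sequences.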

Q & A
Q: reference proteomes are key - a pared-down dataset that's distributed nicely. What are the criteria for encoding that phylogenetic spread of the references?
A: There is a user community interested in a particular one - they are experts. Look at annotations - how many sequences can be annotated, are annotated. Take into account all these criteria to decide.
14:00 - 14:15
Development of the EcoCyc and MetaCyc Databases and the Pathway Tools Software 
Ingrid M. Keseler
SRI International, USA
EcoCyc is a comprehensive database resource for Escherichia coli K-12 that provides an overview of the current state of knowledge of E. coli biology. EcoCyc is continuously updated and contains references to more than 26,000 publications covering the gene products, metabolic pathways, and regulatory network of E. coli. Recent work with EcoCyc includes the development of EcoCyc-17.5-GEM, a genome-scale model of the E. coli K-12 MG1655 metabolic network that was automatically generated from EcoCyc using the MetaFlux component of the Pathway Tools software. While EcoCyc is focused on a single model organism, the MetaCyc database is a curated collection of more than 2000 experimentally elucidated metabolic pathways and more than 9000 enzymes from organisms that cover a wide taxonomic range, serving as an encyclopedic reference on pathways and enzymes for basic and applied research including metabolic engineering. MetaCyc also provides the data source that supports computational metabolic network prediction for organisms with sequenced and annotated genomes, resulting in the BioCyc collection of more than 3000 pathway/genome databases. The Pathway Tools software system supports both the creation of and end-user access to pathway/genome databases. It enables development of organism-specific databases by providing tools to automatically build databases from annotated genome sequences, to infer the presence of metabolic pathways, to create and edit database entries, to generate metabolic models, and to publish the resulting databases. For end users of the published databases, the software provides a suite of query, visualization and analysis tools. Results can be captured and further analyzed or exported with Smart Tables (formerly called Web Groups). It is now possible to create temporary tables and conversely, to publish tables that can not be changed and can therefore be referred to in publications.
EcoCyc - E. coli encyclopedia

    15 years ago

    curated everything ppl would want to know about E. coli

    started with metabolic pathways

    few high-throughput datasets incorporated

    current funding situation - seeking funding for BsubCyc

    new in EcoCyc

    generated metabolic model from EcoCyc

    generated from curated data in database

    not separated curated info

    plan frequent releases - 2x a year

    can be inspected easily from what is curated - metabolic pathways


    db of metabolic pathways from many different organisms

    human, mouse, rat, plants, archaea, bacteria, etc

    literature, manual curation

    40K literature citations


    Atom mappings for the reactions

    not manually curated

    all computational

    give idea where parts of the molecules go

    useful for understanding reactions

    can be used to predict metabolic pathways in other organisms

    predict pathways complement of that organism

    BioCyc  3500+ - metabolic pathways predicted

    built on common schema/vocab/db object

    compound structure updated in metacyc - can translate to biocyc easily

    Software used to curate the databases

    curation interface

    publish to web

    query tools

    visualization tools

    analysis tools

    metabolic research

    input desired compounds (start and end), search in db, or db+metacyc for optimal route from start to end

    renamed from Groups - SmartTables- create and manage lists of 'things' (NB: pretty sure this is InterMine??)

    methods for creating - search, import list, do analysis on lists - transform lists to other lists (enzymes -> pathways || genes -> regulators known to regulate genes)

    set operations

    share tables, publish, export

    Biocuration Accuracy Assessment

    got feedback from grant review panel:

    manual curation is expensive (curators not the best paid??)

    not scalable

    what is the accuracy? how good are the curators?

    => manual curation should be replaced 

    validator task:

    validators - validate task. Outside the db

    curated assertions ('facts' with literature citations per gene/protein) - verify curation

    evaluate the results 

    validator errors (false positive)

    metadata errors (fact claimed was true, but citation incorrect)

    factual errors

    EcoCyc & CGD

    final error rate (factual errors): 5 for each db, <2%

    error rate very low. But we do make mistakes.

    GO term curation is hard

    very difficult to measure errors - Ingrid doesn't point out what kind of errors, how significant are the errors? 
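
The route search mentioned earlier in these notes (give a start and an end compound, get a route between them) can be pictured as a shortest-path search over a reaction graph. The Python sketch below is only that picture: it is not the Pathway Tools algorithm, the compound names are placeholders, and "optimal" is assumed to mean "fewest reaction steps", which ignores energy and other costs.

```python
from collections import deque

# Hypothetical reaction graph: compound -> compounds reachable in one reaction.
REACTIONS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
}


def shortest_route(reactions, start, end):
    """Breadth-first search for a route with the fewest reaction steps."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        compound = path[-1]
        if compound == end:
            return path
        for product in reactions.get(compound, []):
            if product not in seen:
                seen.add(product)
                queue.append(path + [product])
    return None  # no route between the two compounds


route = shortest_route(REACTIONS, "A", "F")
print(route)
```

Searching "db + MetaCyc" would correspond to merging a second reaction graph into the first before running the search.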

Q & A

Q: Nature has optimized these pathways - does your search algorithm take energy requirements into account?
A: Can't answer - need to talk to Mario(?). Intended for metabolic engineering - where there is no natural path. May catalyze a possible step in a possible pathway. Different way of using: pathway hole filling - not done yet.
Q: Metadata errors - how does the validator know?
A: Decided by conferring with the validator. Is that fact in the paper? Simple task. Decided in the end - curator would look at the fact. Detailed analysis of all analysis reported.
14:15 - 14:30
Three's a crowd-source: Observations on Collaborative Genome Annotation 
Monica C. Munoz-Torres (Suzanna Lewis speaking in her place, because Moni's visa did not arrive in time.)
Lawrence Berkeley National Laboratory, USA
It is impossible for a single individual to fully curate a genome with precise biological fidelity. Beyond the problem of scale, curators need second opinions and insights from colleagues with domain and gene family expertise, but the communications constraints imposed in earlier applications made this inherently collaborative task difficult. Apollo, a client-side, JavaScript application allowing extensive changes to be rapidly made without server round-trips, placed us in a position to assess the difference this real-time interactivity would make to researchers’ productivity and the quality of downstream scientific analysis. To evaluate this, we trained and supported geographically dispersed scientific communities (hundreds of scientists and agreed-upon gatekeepers, in 100 institutions around the world) to perform biologically supported manual annotations, and monitored their findings. We observed that: 1) Previously disconnected researchers were more productive when obtaining immediate feedback in dialogs with collaborators. 2) Unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore curators now face additional work correcting for more frequent assembly errors and annotating genes that are split across multiple contigs. 3) Automated annotations were improved as exemplified by discoveries made based on revised annotations, for example 2800 manually annotated genes from three species of ants granted further insight into the evolution of sociality in this group, and 3600 manual annotations contributed to a better understanding of immune function, reproduction, lactation and metabolism in cattle. 4) There is a notable trend shifting from whole-genome annotation to annotation of specific gene families or other gene groups linked by ecological and evolutionary significance. 5) The distributed nature of these efforts still demands strong, goal-oriented (i.e. publication of findings) leadership and coordination, as these are crucial to the success of each project. Here we detail these and other observations on collaborative genome annotation efforts.
There are a lot more potential curators out there than there are in this room. There's a whole world of ppl out there with experience that could be helpful.

    Consequences of technology changes - sequencing is a lot more available

    crowd-sourcing - there are more ppl doing sequencing; we need to give all of those ppl the same opportunity we have to annotate the genome.


    Automated systems:

    gene prediction

    curation -

    identify every element that might have a functional role

    eliminate elements - systemic errors

    functional roles- compare what we do find with other species

    curators - achieve precise biological fidelity- 

    Apollo: JS application on client side. Server where genome is, edits can be pushed out to everyone looking at genome at the same time.

    working with ppl, helping get set up

    education, training, workshops

    Apollo - genomic annotation editing platform - modify locations

    you can write things, add things, upload data, adjust things

    privacy handling

    search - lower coverage genomes

    configure - toggle strands, reverse complement, highlighting

    standard navigation - zoom, pan

    control edit - google doc, but for a genome

    fine tune edits - non-canonical splice site - simple click and drag

    insertions and deletions, genome assembly itself can get feedback

    history tracking - see changes - signed and dated - can look back and revert back

    kinds of data

    coverage plots - 

    heat maps

    transcriptome data

    Just got renewed! Want to be able to show variants and haplotypes associated

    use same technique on low coverage genomes

    working on next years

    'folded' genome view (here now)

    genes split across contigs can be visualized together. Translocation. See paralogues.

    dynamic thresholding - grab and drag up and down to adjust threshold - create new track

    ppl seem to like it! Familiar with browsers - click and drag

    What have we learned by working together?

    Bovine - 100 ppl, annotations. 3600 genes contributed

    i5K project - 5K insect genomes - social genes, labour

    New Technologies - challenges

    Lessons learned:

    training - can be used for educational purposes

    the technology changes - take advantage of the latest technologies out there
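
The history tracking described in these notes (changes are signed and dated, and an earlier state can be restored) can be sketched in a few lines of Python. This is an illustration only, not Apollo's implementation (Apollo is a JavaScript application); the class names, fields and coordinates are hypothetical. Reverting appends the old state as a new edit, so the history itself is never rewritten.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Edit:
    author: str
    when: datetime
    start: int
    end: int


@dataclass
class Annotation:
    name: str
    history: list = field(default_factory=list)

    def edit(self, author, start, end):
        """Record a signed, dated change to the feature's coordinates."""
        self.history.append(Edit(author, datetime.now(timezone.utc), start, end))

    @property
    def current(self):
        """The most recent state of the annotation."""
        return self.history[-1]

    def revert_to(self, index, author):
        """Restore an earlier state by appending it as a new edit."""
        old = self.history[index]
        self.edit(author, old.start, old.end)


gene = Annotation("gene1")
gene.edit("alice", 100, 500)
gene.edit("bob", 120, 500)
gene.revert_to(0, "carol")
print(gene.current)
```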


Q & A
Q: Like Web Apollo. Track history - do you track evidence?
A: No, this came up long long ago. Evidence changes. New transcriptome... no, we don't track the evidence. If you're interested in determining the evidence, you can do that dynamically, looking around at what you have now.
Q: Other communities are using this to curate select gene sets - live in resource A/B. Lack of communication between curator groups? How can we all work together?
A: Broader discussion. Try to get convergence. Version control - you don't want a bunch of branches you don't merge at some point. Wanted to make it very open - didn't want to require registration to download. Don't know how many servers there are! Policing function? Encourage ppl to report these. Not sure what will happen. Export function - need to have one.
Q: Local BAM files - test server? Can feed it in remotely?
A: Can feed it in remotely. Remote - each time you start a new session you need to reload it.