ISB2014 Session4 Microbial Informatics

The Seventh International Biocuration Conference
Microbial Informatics
April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
link to all ISB2014 notes:
    http://etherpad.wikimedia.org/p/isb2014
Help me take collaborative notes on this session! Fill out your name in the box in the top right-hand corner. Add yourself to the editors list, then edit away!
I make a lot of typos. Sorry.
Editors

    name / affiliation / twitter

    Abigail Cabunoc / OICR / @abbycabs

    Melanie Courtot / BCCA-SFU / @mcourtot

    Karen Yook / WormBase-Caltech / @wbperson712

    Moni Munoz-Torres / LBNL / @monimunozto

13:00 - 14:30
Session 4 - Microbial Informatics, Chaired by Fiona Brinkman and John Parkinson
The Great Hall, Hart House
--------------------------------------------------------------------------------------------
13:00 - 13:10
Microbial Informatics in 2014 
Fiona Brinkman 
Simon Fraser University, Canada
High-level overview of things

    set the stage for the talks

    lots of data - visualizations

    largest published dataset of microbial genomes to date (lots)

    IslandViewer - new version in 2014

    curating a lot to allow overlays of information

    InnateDB - new version yesterday

    beta.innatedb.com

    active curation of innate immunity interactions/allergy

    resources built for innate immunity analysis, also used by many researchers studying other things

    Broadening interest in microbiome research

    13 institutes - workshop (Canadian microbiome workshop)

    all care about microbial research from different perspectives

    using microbial data as markers for cancer and other chronic diseases

    can be better predictors than other molecular markers

    broad interest from nutrition - growing interest in general

    keep in mind: data quality - how to deal with the issue of growing analysis data

    Stein's talk tomorrow - Big Data

    data grows massively - need standards, data integration, good curation feeding into automated function prediction

    use this data effectively

    Google and Amazon joining - Global Alliance (http://genomicsandhealth.org/)

    http://www.globalmicrobialidentifier.org/

    genomes, metatranscriptomes, proteomes, pathway tools, collaboration - all being talked about this session

    One final note: mixed bag of perspectives on microbial informatics: relevant not only to microbes

Q: In terms of curation in the future: how much do you see manual curation playing a part in microbial genomics?
A: big believer in pipelines of information - massive amounts of data. Sequence-structure divide. Growing divide between data in the literature and what we can curate, and sequence data.
we need manual curation to bring literature info into the great amount of data already available
Need automated curation that utilizes curation - manual curation shouldn't feel in competition with automation. Make sure each fits the right role - don't over-predict. Manual curation should be focused on areas where it's most needed.
--------------------------------------------------------------------------------------------
13:10 - 13:30
The importance of newly sequenced genomes and functional annotations for phylogenetic profiling 
Nives Skunca
ETH Zurich, Switzerland
Abstract
Phylogenetic profiling methods use patterns of presence and absence of genes in different species to predict protein-protein interactions and functional annotations. Since their introduction by Pellegrini et al. in 1999 [1], numerous methodological refinements have been proposed [2]. But a much greater difference lies in the amount of available genomic and functional data. In my talk, I will explore the extent to which new data improves the performance of phylogenetic profiling. Using a state-of-the-art phylogenetic profiling method [3], we quantified the improvement in prediction accuracy afforded by additional sequence and function information. Firstly, I will discuss an impressive difference in performance between phylogenetic profiles that use only the data available in 2005 and phylogenetic profiles that use the most recently available data. Further, I will discuss the difference in performance when having more organisms in phylogenetic profiles, compared to having more comprehensive functional annotations. I will briefly reflect on the difference in the performance of phylogenetic profiling in the three kingdoms of life. Finally, I will discuss one avenue of reducing the computational costs related to phylogenetic profiling: a careful selection of organisms that provides similar performance as when using the full set of sequenced organisms. References: 1. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96: 4285-4288. 2. Kensche PR, van Noort V, Dutilh BE, Huynen MA (2008) Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J R Soc Interface 5: 151-170. doi:10.1098/rsif.2007.1047. 3. Skunca N, Bosnjak M, Krisko A, Panov P, Dzeroski S, et al. (2013) Phyletic profiling with cliques of orthologs is enhanced by signatures of paralogy relationships. PLoS Comput Biol 9: e1002852. doi:10.1371/journal.pcbi.1002852.
 
Notes

    Work done in collaboration with co-supervisor (Dessimoz)

    underlying idea: evaluate the effect of having a larger amount of data in current databases on established functional annotation

    sheer amount of data: automation is inevitable

    phylogenetic profiling - established 15 years ago, Pellegrini et al., PNAS 1999

    different organisms: phylogenetic profiling performs best for bacteria

    how does more data influence phylogenetic profiling?

    if more data is better -> should see an increase in accuracy (Skunca et al., PLoS Comput Biol, 2013)

    OMA - phylogenetic profiles - closed groups of orthologues

    assigned a function to each profile - assume a single function (a toy sketch of profile-based function transfer appears after these notes)

    took the method as published, changed only the data given to it

    gave it the data available in 2005

    long tail - a number of GO terms predicted function reasonably well with 2005 data

    but heavy base - for many functions, the method could not be used reliably

    step by step, added data that became available in subsequent years

    accuracy rises - quite a sharp rise up to the present

    777 GO terms with reliable predictions (ML algorithm)

    How many organisms are enough? more is not always better - good info would increase AUC

    additional genomes can provide {useful | redundant} information

    looked at phylogenetic profiles - removed organisms (columns), but left the function the same

    any change in predictive accuracy is a consequence of a change in the set of organisms

    predictive accuracy grows sharply in the beginning (up to 100 organisms)

    levels off after 100

    400 organisms - adding more makes almost no difference - but there is a slight (yet noticeable) increase in predictive accuracy as more genomes are added

    contrary to literature: best accuracy when we use /all/ the organisms

    using a random subset of organisms makes practically no difference

    phylogenetic profiling is sensitive to diversity nonetheless: using only one clade (Actinobacteria) leaves a big gap in AUC vs a diverse set

    How many annotations are enough?

    quantify the effect - how much better?

    look at phylogenetic profiles - only the rightmost column changes, the number of organisms stays the same

    decline in predictive accuracy when we remove annotations (plot AUC vs % of annotations used)

    20% of annotations -> predictions are useless

    seemingly leveling off - 80%-100%

    by the time we have added 80%, most have been annotated (general terms)

    repeat using a more specific set, less levelling off

    Open World Assumption

    notion: biological dbs are incomplete

    AUPRC - calculated by resampling - false positives are inflated by annotations not yet in our dataset

    Look at effect of open world assumption on our results

    looked at proteins with >5 annotations in db - all else equal, these should be comprehensively annotated, effect of open world assumption should be small

    other proteins - removed 60% of annotations, open world assumption should have more effect
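
A toy sketch of the profile-based function transfer discussed in these notes: genes become presence/absence vectors across organisms, and a GO term is transferred from the annotated group with the most similar profile. This is only an illustration of the underlying idea, not the cliques-of-orthologs method of Skunca et al. 2013; all profiles and GO terms below are invented.

```python
import numpy as np

# Rows: orthologous groups; columns: organisms (1 = present, 0 = absent)
profiles = np.array([
    [1, 1, 0, 1, 0, 1],   # group A
    [1, 1, 0, 1, 0, 0],   # group B
    [0, 0, 1, 0, 1, 1],   # group C
])

# Known functional annotations for some groups (hypothetical GO terms)
annotations = {0: "GO:0006810", 2: "GO:0008152"}

def predict_function(query: np.ndarray) -> str:
    """Transfer the GO term of the annotated group whose profile is
    closest to the query profile in Hamming distance."""
    best = min(annotations, key=lambda i: int(np.sum(profiles[i] != query)))
    return annotations[best]

# A gene present in organisms 1, 2, 4, 5 gets group A's annotation
print(predict_function(np.array([1, 1, 0, 1, 1, 0])))  # -> GO:0006810
```

Removing columns from this matrix mimics the "fewer organisms" experiment; removing entries from `annotations` mimics the "fewer annotations" experiment described above.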

Poster #17
Q & A
Q: Plots- huge increase between 2007-2009? What happened?
A: Didn't look at particular influences. Suspect there was an increase in # of organisms we added to db
Q: Open world assumption - removed 60% of annotations - wouldn't it depend on which ones were removed?
A: We do it many times (repeated random removals)
--------------------------------------------------------------------------------------------
13:30 – 13:45
GIST - An ensemble approach to the taxonomic classification of metatranscriptomic reads 
Samantha Halliday
University of Toronto, Canada
Abstract
Whole-microbiome gene expression profiling (’meta-transcriptomics’) has emerged as a powerful means of gaining a mechanistic understanding of the complex inter-relationships that exist in microbial communities. However, due to the inherent complexity of human-associated microbial communities and a lack of a comprehensive set of reference genomes, currently available algorithms for metatranscriptomic analysis are limited in their ability to functionally classify and organize these sequence datasets. To overcome this challenge we have been developing methods that combine accurate transcript annotation with systems-level functional interrogation of metatranscriptomic datasets. As part of these methods, we present GIST (Generative Inference of Sequence Taxonomy), which combines several statistical and machine learning methods for compositional analysis of both nucleotide and amino acid content with the output from the Burrows-Wheeler Aligner to produce robust taxonomic assignments of metatranscriptomic RNA reads. A neural network is used to automatically infer the optimal weightings of each taxon and method, providing a phylogenetic classification technique that is robust against variable branch length. In addition to identifying taxon-specific pathways within the context of a pan-microbial functional network, linking taxa with specific functions in a microbiome will produce deeper understanding of how their loss or gain alters microbiome functionality. Applied to real as well as synthetic datasets, generated using an in-house simulation tool termed GENEPUDDLE, we demonstrate an improved performance in taxonomic assignments over existing methods.
Notes
compsysbio.org/gist
GIST - provides improved annotations for species and taxonomic identification from metatranscriptomic reads

    Microbiomes

    huge amount of biology on this planet

    contain >50% of earth's biodiversity

    when they form communities and interact - they do so densely. Lots of interconnections. Pathways between different species, strains and clades

    significance - environmental and health concerns

    diseases including autism & diabetes

    influenced by microbiome health

    metatranscriptomics - study mRNA from microbial communities

    contrast with metagenomics and marker gene classification

    Goals: produce a full network of interconnections

    major problems with the network graph

    1. not compartmentalized

    2. taxonomic units shown are very general groups - phyla. Not very useful.

    want to develop more resolution

    detect presence of spores and inactive cells

    very resilient bacteria will go into an inactive state - not killed

    Two approaches:

    1. Alignment (top BLAST hit)

    2. not gene content, but composition

    Composition:

    move a window along the sequence and count what we see (e.g. GC content, codon bias, N-mer frequencies) - see the feature-extraction sketch at the end of these notes

    HTGs (horizontally transferred genes) assimilated over time, fading into the background

    MG-RAST - based on BLAST - limits precision

    not effective with datasets recently added

    best-performing compositional method NBC - 2008

    large window length = large disk space

    resource intensive

    but fairly accurate

    GIST

    takes 4 different statistical methods

    considers amino acids & nucleotides -> combines with BWA

    better idea of what's going on; different ML techniques

    shorter N-mers

    adaptive weighting - learning the correct way to analyze the data (see the ensemble sketch at the end of these notes)

    confidence estimation in the output - if many species/strains have similar scores, can return the parent taxon instead of just taking the top hit

    classification process

    Naive Bayes, Gaussian, BWA (80% accuracy), nearest neighbor - wide range of genetic features

    combined, high levels of accuracy

    the secret of how this works - adaptive weighting

    training pipeline 

    family-level weights

    specificity: 

    untrained GIST performs poorly (12%)

    MetaCV: 63%; however, it is limited as it uses only 6 amino acids

    NBC uses up to 16 GB - GIST uses only a fraction of that

    Next steps: exploring datasets

    increase window - memory management
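
A rough sketch of the kind of compositional features mentioned above (GC content and N-mer frequencies computed over a sliding window). The window size, step, and k value are arbitrary choices for illustration; the features GIST actually uses are not specified in these notes.

```python
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence."""
    return sum(seq.count(b) for b in "GC") / len(seq)

def kmer_freqs(seq: str, k: int = 3) -> dict:
    """Normalized k-mer (N-mer) frequencies."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def window_features(seq: str, size: int = 100, step: int = 50):
    """Slide a window along the read and emit per-window features."""
    for start in range(0, max(len(seq) - size + 1, 1), step):
        win = seq[start:start + size]
        yield {"gc": gc_content(win), **kmer_freqs(win)}

read = "ATGCGCGTATAGCGCGATCGTACGATCGGCTAGCTAGGCTAGCATCG" * 3
for feats in window_features(read):
    print(round(feats["gc"], 3))
```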

rhetorica@cs.toronto.edu
compsysbio.org/gist
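
A hedged sketch of the ensemble idea described above: per-method scores for each candidate taxon are combined using learned weights, and when the top candidates are too close together, the classifier backs off to their shared parent taxon. The weights, scores, and margin below are invented; in GIST the weightings are inferred by a neural network (e.g. at family level).

```python
# Learned per-method weights (illustrative values only)
weights = {"naive_bayes": 0.4, "gaussian": 0.1, "bwa": 0.35, "knn": 0.15}

# Per-method scores for each candidate taxon (hypothetical read)
scores = {
    "E. coli":     {"naive_bayes": 0.90, "gaussian": 0.60, "bwa": 0.95, "knn": 0.80},
    "S. flexneri": {"naive_bayes": 0.85, "gaussian": 0.55, "bwa": 0.90, "knn": 0.75},
}
parent = {"E. coli": "Enterobacteriaceae", "S. flexneri": "Enterobacteriaceae"}

def classify(scores: dict, margin: float = 0.1) -> str:
    combined = {taxon: sum(weights[m] * s for m, s in per.items())
                for taxon, per in scores.items()}
    ranked = sorted(combined, key=combined.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Confidence estimation: if the top two are nearly tied, return
    # their shared parent taxon instead of the raw top hit
    if combined[best] - combined[runner_up] < margin:
        return parent[best]
    return best

print(classify(scores))  # -> "Enterobacteriaceae" (top hits too close)
```
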
Q & A
Q: natural datasets - taking water and spiking with sample? known bacteria?
A: currently studying metastatis(?) set. benchmark. also real world datasets out there - most of the work has been around improving performance on natural datasets. There is no standard benchmark to do anything like that. First thing - develop pipeline - simulated datasets.
Q: Have you seen the performance improving with longer reads?
A: longer read, better performance. Nearing perfection
--------------------------------------------------------------------------------------------
13:45 – 14:00
Proteomes at UniProtKB – advancements and challenges in the post-genomic era 
Ramona Britto
EMBL-EBI, UK
(presented in place of Ramona, who has laryngitis and can't talk)
Abstract
The UniProt Knowledgebase (UniProtKB) is a central resource for high-quality, consistent and richly annotated protein sequence and functional information. UniProtKB has witnessed an exponential growth with a four-fold increase in the number of entries since 2011. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. To keep up with this rapid growth, recent advances at UniProt including a new web interface will enable users to find the most relevant and best-annotated set of sequences for each species. In addition to complete proteomes that are based on translations of completely sequenced genomes, we offer a selected subset of reference proteomes constituting a representative cross-section of the taxonomic diversity found within UniProtKB. We are working closely in collaboration with the INSDC, Ensembl and RefSeq to map all UniProtKB proteins to the underlying genomic assemblies and to offer a consistent set of complete and reference genomes/proteomes to the user community. Also in the pipeline is the concept of a pan-proteome within taxonomic groups that will capture unique sequences not found in reference proteomes and aid in phylogenetic comparisons. To further reduce redundancy within UniProt, a gene-centric view of complete proteomes will be implemented. This will bring together canonical and variant protein sequences into gene-based clusters that will more closely reflect genome size and offer a single reference protein for each gene. For highly redundant proteomes (e.g. strains of Escherichia coli), the non-reference protein sets will be made available through UniParc which will be extended to include annotation data. Finally, a new proteome identifier will be introduced that will uniquely identify the set of proteins corresponding to a single assembly of a completely sequenced genome. All of these new developments and future plans with a particular focus on the microbial context will be presented.
Notes

    Growth of sequence databases

    Challenges 

     technical (servers, load, infrastructure)

    content of dbs - much more difficult to explore

    Solutions

    # of genomes rising

    4-fold increase since 2011

    lots of new genomes, many strains for the same species - complicates things for dbs and users

    sequence redundancy increases

    challenges for annotation, scientific exploration, analysis and visualization

    In UniProt - Proteomes DB

    monitors which orgs have complete proteomes

    keep track of assemblies - whether complete

    infrastructure in place to make sense of the increase in the number of proteins

    currently >4500 proteomes, spread over the taxonomic range

    Reference proteomes

    representation of a particular taxonomic group

    landmarks of proteome space

    try to use the same reference as a base

    group references as they have done in genomics

    we can do much more with the data annotation

    created through collaborations

    essential for many resources

    Redundancy within proteomes

    a single gene can encode multiple proteins through alternative splicing, alternative initiation and varying promoter usage

    there are many more protein sequences

    Gene-centric view of proteomes (see the grouping sketch after these notes)

    new UniProt interface

    working the last year and a half on a new interface for UniProt

    present the data in the best possible way for users

    talking to many users - many in this audience

    give feedback on how they can explore the data in the best way

    this week

    new proteomes interface

    any kind of proteomic information

    description of the data - report page

    everything related to genome sequence as well

    pan-proteomes

    provide representative set of all sequences in a taxonomic group
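
A minimal sketch of the gene-centric grouping described in these notes: canonical and variant protein sequences are clustered by gene, with one sequence per gene chosen as the reference. The record layout and the longest-isoform rule are assumptions for illustration, not UniProt's actual procedure.

```python
from collections import defaultdict

# (accession, gene, sequence) - hypothetical entries
proteins = [
    ("P00001",   "thrA", "MKRISTKI"),
    ("P00001-2", "thrA", "MKRISTKIGLLVA"),  # splice/initiation variant
    ("P00002",   "thrB", "MVKVYAP"),
]

# Cluster canonical and variant sequences by gene
clusters = defaultdict(list)
for acc, gene, seq in proteins:
    clusters[gene].append((acc, seq))

# Pick one reference protein per gene (here: simply the longest sequence)
references = {gene: max(members, key=lambda m: len(m[1]))[0]
              for gene, members in clusters.items()}
print(references)  # {'thrA': 'P00001-2', 'thrB': 'P00002'}
```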

Q & A
Q: reference proteomes are key - a pared-down dataset that's distributed nicely. What are the criteria for encoding the phyletic spread of the references?
A: There is a user community interested in a particular one - they are experts. Look at annotations - how many sequences can be annotated / are annotated. Take into account all these criteria to decide.
--------------------------------------------------------------------------------------------
14:00 - 14:15
Development of the EcoCyc and MetaCyc Databases and the Pathway Tools Software 
Ingrid M. Keseler
SRI International, USA
Abstract
EcoCyc (EcoCyc.org) is a comprehensive database resource for Escherichia coli K-12 that provides an overview of the current state of knowledge of E. coli biology. EcoCyc is continuously updated and contains references to more than 26,000 publications covering the gene products, metabolic pathways, and regulatory network of E. coli. Recent work with EcoCyc includes the development of EcoCyc-17.5-GEM, a genome-scale model of the E. coli K-12 MG1655 metabolic network that was automatically generated from EcoCyc using the MetaFlux component of the Pathway Tools software. While EcoCyc is focused on a single model organism, the MetaCyc database (MetaCyc.org) is a curated collection of more than 2000 experimentally elucidated metabolic pathways and more than 9000 enzymes from organisms that cover a wide taxonomic range, serving as an encyclopedic reference on pathways and enzymes for basic and applied research including metabolic engineering. MetaCyc also provides the data source that supports computational metabolic network prediction for organisms with sequenced and annotated genomes, resulting in the BioCyc collection of more than 3000 pathway/genome databases. The Pathway Tools software system supports both the creation of and end-user access to pathway/genome databases. It enables development of organism-specific databases by providing tools to automatically build databases from annotated genome sequences, to infer the presence of metabolic pathways, to create and edit database entries, to generate metabolic models, and to publish the resulting databases. For end users of the published databases, the software provides a suite of query, visualization and analysis tools. Results can be captured and further analyzed or exported with Smart Tables (formerly called Web Groups). It is now possible to create temporary tables and conversely, to publish tables that cannot be changed and can therefore be referred to in publications.
Notes
EcoCyc - E. coli encyclopedia

    15 years ago

    curated everything people would want to know about E. coli

    started with metabolic pathways

    a few high-throughput datasets incorporated

    current funding situation - seeking funding for BsubCyc

    new in EcoCyc

    generated metabolic model from EcoCyc

    generated from curated data in database

    not separate from the curated info

    plan frequent releases - 2x a year

    can be inspected easily from what is curated - metabolic pathways

    MetaCyc

    db of metabolic pathways from many different organisms

    human, mouse, rat, plants, archaea, bacteria, etc

    literature, manual curation

    40K literature citations

    new:

    Atom mappings for the reactions

    not manually curated

    all computational

    give idea where parts of the molecules go

    useful for understanding reactions

    can be used to predict metabolic pathways in other organisms

    predict the pathway complement of that organism

    BioCyc - 3500+ pathway/genome databases with predicted metabolic pathways

    built on common schema/vocab/db object

    compound structure updated in MetaCyc - can translate to BioCyc easily

    Software used to curate the databases

    curation interface

    publish to web

    query tools

    visualization tools

    analysis tools

    metabolic route search

    input desired compounds (start and end); search in the db, or db+MetaCyc, for an optimal route from start to end (see the toy route-search sketch below, after the SmartTables notes)

    renamed from Groups - SmartTables - create and manage lists of 'things' (NB: pretty sure this is like InterMine lists??)

    methods for creating - search, import a list, do analysis on lists - transform lists to other lists (enzymes -> pathways || genes -> regulators known to regulate those genes)

    set operations

    share tables, publish, export
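
A toy sketch of the route-search idea mentioned above: treat the metabolic network as a graph of compounds connected by reactions and find a shortest path from a start compound to a target. Pathway Tools uses far richer criteria than hop count; the small network below is invented for illustration.

```python
from collections import deque

# compound -> compounds reachable by one reaction (hypothetical network)
network = {
    "glucose": ["glucose-6-P"],
    "glucose-6-P": ["fructose-6-P", "6-P-gluconolactone"],
    "fructose-6-P": ["fructose-1,6-bisP"],
    "fructose-1,6-bisP": ["DHAP", "G3P"],
    "G3P": ["pyruvate"],
}

def find_route(start: str, goal: str):
    """Breadth-first search for the shortest reaction path."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in network.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route found

print(find_route("glucose", "pyruvate"))
```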

    Biocuration Accuracy Assessment

    got feedback from grant review panel:

    manual curation is expensive (curators not the best paid??)

    not scalable

    what is the accuracy? how good are the curators?

    => manual curation should be replaced 

    validator task:

    validators - perform the validation task, from outside the db

    curated assertions ('facts' with literature citations per gene/protein) - verify curation

    evaluate the results 

    validator errors (false positives)

    metadata errors (fact claimed was true, but citation incorrect)

    factual errors

    EcoCyc & CGD

    final error rate (factual errors): 5 for each db, <2%

    error rate very low. But we do make mistakes.

    GO term curation is hard

    very difficult to measure errors - Ingrid doesn't point out what kinds of errors there were, or how significant the errors are

Q & A
Q: 

    1. Does your algorithm - nature has optimized these pathways, does your search take into account energy requirements?

    2. Metadata - how does the validator know?

A:
     2. decided by conferring with the validator. Is that fact in the paper? Simple task. Decided in the end - the curator would look at the fact. Detailed analysis of all cases reported.
     1. Can't answer - need to talk to Mario(?). Intended for metabolic engineering - where there is no natural path. May catalyze a possible step in a possible pathway. Different way of using: pathway hole filling - not done yet.
--------------------------------------------------------------------------------------------
14:15 - 14:30
Three's a crowd-source: Observations on Collaborative Genome Annotation 
Monica C. Munoz-Torres (Suzanna Lewis talking in place -- because Moni's visa did not arrive in time.)
Lawrence Berkeley National Laboratory, USA
Abstract
It is impossible for a single individual to fully curate a genome with precise biological fidelity. Beyond the problem of scale, curators need second opinions and insights from colleagues with domain and gene family expertise, but the communications constraints imposed in earlier applications made this inherently collaborative task difficult. Apollo, a client-side JavaScript application allowing extensive changes to be rapidly made without server round-trips, placed us in a position to assess the difference this real-time interactivity would make to researchers’ productivity and the quality of downstream scientific analysis. To evaluate this, we trained and supported geographically dispersed scientific communities (hundreds of scientists and agreed-upon gatekeepers, in 100 institutions around the world) to perform biologically supported manual annotations, and monitored their findings. We observed that: 1) Previously disconnected researchers were more productive when obtaining immediate feedback in dialogs with collaborators. 2) Unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore curators now face additional work correcting for more frequent assembly errors and annotating genes that are split across multiple contigs. 3) Automated annotations were improved as exemplified by discoveries made based on revised annotations, for example 2800 manually annotated genes from three species of ants granted further insight into the evolution of sociality in this group, and 3600 manual annotations contributed to a better understanding of immune function, reproduction, lactation and metabolism in cattle. 4) There is a notable trend shifting from whole-genome annotation to annotation of specific gene families or other gene groups linked by ecological and evolutionary significance. 5) The distributed nature of these efforts still demands strong, goal-oriented (i.e. publication of findings) leadership and coordination, as these are crucial to the success of each project. Here we detail these and other observations on collaborative genome annotation efforts.
Notes
There are a lot more potential curators out there than there are in this room. There's a whole world of people out there with experience that could be helpful.

    Consequences of technology changes - sequencing is a lot more available

    crowd-sourcing - there are more people doing sequencing; we need to give all of them the same opportunity we have to annotate the genome

    Apollo

    Automated systems:

    gene prediction

    curation -

    identify every element that might have a functional role

    eliminate elements - systemic errors

    functional roles - compare what we find with other species

    curators - achieve precise biological fidelity

    Apollo: JS application on the client side. Server where the genome is; edits can be pushed out to everyone looking at the genome at the same time.

    working with people, helping them get set up

    education, training, workshops

    Apollo - genomic annotation editing platform - modify locations

    you can write things, add things, upload data, adjust things

    privacy handling

    search - lower coverage genomes

    configure - toggle strands, reverse complement, highlighting

    standard navigation - zoom, pan

    controlled editing - like a Google Doc, but for a genome

    fine-tune edits - non-canonical splice sites - simple click and drag

    insertions and deletions, genome assembly itself can get feedback

    history tracking - see changes - signed and dated - can look back and revert (see the history sketch after these notes)

    kinds of data

    coverage plots

    heat maps

    transcriptome data

    Just got renewed! Want to be able to show variants and associated haplotypes

    use same technique on low coverage genomes

    working on next years

    'folded' genome view (here now)

    genes split across contigs can be visualized together. Translocations. See paralogues.

    dynamic thresholding - grab and drag up and down to adjust threshold - create new track

    people seem to like it! Familiar with browsers - click and drag

    What have we learned by working together?

    Bovine - 100 people annotating; 3600 gene annotations contributed

    i5K project - 5K insect genomes - social genes, labour

    New Technologies - challenges

    Lessons learned:

    training - can be used for educational purposes

    the technology changes - take advantage of the latest technologies out there

    standards
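
A generic sketch of the signed-and-dated history tracking described above: an append-only log of annotation edits that supports looking back and reverting. Apollo's real data model is not shown in these notes; everything below is illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Edit:
    editor: str       # who made the change ("signed")
    description: str  # what was changed
    state: dict       # annotation state after the edit
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # "dated"

class AnnotationHistory:
    def __init__(self, initial: dict):
        self.log = [Edit("system", "initial import", dict(initial))]

    def record(self, editor: str, description: str, state: dict) -> None:
        self.log.append(Edit(editor, description, dict(state)))

    def revert(self, index: int) -> dict:
        """Roll back by re-recording an earlier state as a new edit,
        so the full history is preserved."""
        old = self.log[index]
        self.record("system", f"revert to edit {index}", old.state)
        return dict(old.state)

history = AnnotationHistory({"gene-1": {"start": 100, "end": 900}})
history.record("curator_a", "adjust 3' end", {"gene-1": {"start": 100, "end": 950}})
print(history.revert(0))  # back to the imported coordinates
```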

Q & A
Q: 
    1. Like Web Apollo. Track history - do you track evidence?
    2. Other communities are using this to curate select gene sets - they live in resource A/B. Lack of communication between curator groups? How can we all work together?
    
A:
    1. No, this came up long, long ago. Evidence changes. New transcriptome... no, we don't track the evidence. If you're interested in determining the evidence, you can do that dynamically, looking at what you have now.
    2. Broader discussion. Try to get convergence. Version control - you don't want a bunch of branches you never merge. Wanted to make it very open - didn't want to require registration to download. Don't know how many servers there are! Policing function? Encourage people to report these. Not sure what will happen. Export function - need to have one.
Q: Local BAM files - test server? Can you feed it in remotely?
A: Can feed it in remotely. Remote - each time you start a new session you need to reload it.