ISB2014 Session5 Data Integration and Sharing

Time: Wednesday, April 9, 2014, 09:00 - 10:00
Keynote Lecture, Chaired by Weimin Zhu 
The Great Hall, Hart House
link to all ISB2014 notes:
    http://etherpad.wikimedia.org/p/isb2014
April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
Editors

    name / affiliation / twitter

    Abigail Cabunoc / OICR / @abbycabs

    Marc Perry / OICR / @mdperry

    Karen Yook / WormBase-Caltech / @wbperson712

Will Big Data Crush Curation?
Lincoln Stein
OICR, Canada
10:00 - 12:00
Session 5 - Data Integration and Sharing, Chaired by Gary Bader and Winston Hide, Sponsored by Heidelberg Institute for Theoretical Studies 
Bioinformatics will die?
http://www.oreillynet.com/pub/a/network/biocon2003/stein.html

    Big Data - trend

    organized Big Data in Biology - Keystone

    2008 - watershed moment in Big Data

    Google announced that domain-agnostic stats could do a better job than the CDC at predicting the timing and scale of flu outbreaks

    http://www.google.org/flutrends/ca/#CA

    using people's search terms to predict flu outbreaks

    tracks closely - didn't need to be a doctor or know about the flu. Just apply clever algos to large amounts of data

    Nate Silver - NYT statistician

    highly accurate predictions for 2012 presidential and congressional races - not traditional polling agencies

    trick: gather info from many polls

    historical data: error corrections

    down to state level

    Feb 4, 2014

    big data predicting EVERYTHING (sports, politics, hollywood)

    financial reporting: business model before (subjective) - now quants (big data trends)

    Our field - big data rolling over us - driven by tech fact, cost of sequencing fell off ('07-08, NGS)

    less expensive to sequence genome than before

    vast increase in primary sequencing data and ChIP-seq

    gives rise to large, quantitatively and qualitatively complex datasets (UCSC ENCODE)

    ICGC Project - International Cancer Genome Consortium project

    Data release in Feb - 42 projects, 50K genomes, 18 cancer types, 10K+ donors, 4M somatic mutations, clinical data

    huge db

    We're no longer looking at individual datasets in conventional publications - is curation necessary?

    has the dataload overturned the donkey

    Reading papers - doesn't scale

    Hand editing gene models (C. elegans)  - doesn't scale

    Is there a clear path for biocurators?

    (metadata) massage therapist

    (data) wrangler

    (complex data) modeler

    Data /doesn't/ speak to itself

    Google flu - they may have overtrained their system...

    Did pretty well last couple of years

    but last few years - completely missed the mark -> getting more inaccurate

    Needs ppl who understand flu to make this work again

    Changing role of the biocurator

    curation of primary literature - less important

    Future: Working hand in hand with computational biologists - domain experts

    define the data model for complex data in ways that make sense and are resilient and flexible

    curate metadata - (usually a complete mess)

    spend a lot of time thinking about the data. QCs, sanity checks

    anecdote: at site visit for another cancer genome project

    put together appropriate budget for curating data

    'think you underestimate curator by half' - needs a lot of massaging

    invited to double the numbers of curators

    indication: curation is not going to be a lost art

    ICGC - (http://www.icgc.org/)

    example of curators/computational bio/software engineers - working hand in hand

    all continents (except africa & antarctica)

    Identify - cancer data pathways, drug targets

    new methods of treating, diagnosing, predicting outcome

    simple project: for each major cancer type, each project looks at ~500 patients who have that type

    sequence their genomes (tumour and normal tissue - usually blood)

    comparing the sequences - identify the cancer related mutations (point mutations and larger scale - rearrangements, transpositions, etc)

    relate mutations to tumor biology

    translate knowledge to diagnosis, treatment

    5 years now

    15 data releases - next release in May

    rapid growth in # of donors - 20K genomes (10K donors)

    Ingest data - software engineering team put together data processing pipeline

    data in, validation system, combined with other data systems (COSMIC, references, more)

    quality control

    index with Elasticsearch

    output on website -> depth of data

    http://dcc.icgc.org/search

    Key to making the system work - data dictionary

    machine readable and human readable

    controlled vocabularies, constrains what data can be submitted to us and how we present it

    used in all parts of the system

    formatting submission docs

    running quality control check

    create the datamodel - integration and search engine

    Hardeep Nahal, Marc Perry, Junjun Zhang - Data curators! - Ouellette & Ferretti

    created standards, worked with software team to create interface
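
A minimal sketch of how a machine-readable data dictionary with controlled vocabularies can drive submission checks. The field names and allowed terms below are made up for illustration and are not the real ICGC dictionary; the notes only say that the dictionary constrains what can be submitted and is reused for quality control.

```python
# Hypothetical, simplified data dictionary: field names and allowed terms are
# illustrative assumptions, not the real ICGC dictionary.
DATA_DICTIONARY = {
    "donor_id": {"required": True},
    "donor_sex": {"required": True, "allowed": {"male", "female"}},
    "specimen_type": {"required": False, "allowed": {"tumour", "normal"}},
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one submitted record."""
    problems = []
    for field, rule in dictionary.items():
        value = record.get(field)
        if value is None or value == "":
            if rule.get("required"):
                problems.append(f"{field}: missing required value")
            continue
        allowed = rule.get("allowed")
        if allowed is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not in the controlled vocabulary")
    # Fields the dictionary does not know about are also reported.
    for field in record:
        if field not in dictionary:
            problems.append(f"{field}: not defined in the data dictionary")
    return problems
```

Because the same dictionary object is used for every check, changing a vocabulary in one place changes what is accepted everywhere, which is the point made in the talk about using it in all parts of the system.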

    Biocurators - needed more than ever!

    ICGC alone (not the only big data project in our field) will hit ~10PB mark

    maybe as early as 2018

    metadata & raw read data

    Big challenges!

    who can download this to their laptop and mine it?

    probably only a couple of groups in the world that have the storage space and cluster resources and bandwidth to download and do whole genome analysis on it

    we do not want a world where only a couple of groups can do it

    ICGC & TCGA => met to create Pan-Cancer Whole Genome Analysis Project (PAWG)

    Goals: understand what's going on in the 95% of the cancer genome that isn't protein coding

    whole genome analysis

    integrated fashion -> to date, using a silo type of pipeline

    gives us insights to particular tumor type

    but comparing one tumor to another - can't easily do

    each pipeline is different enough - methodological differences

    get together all whole genome tumor/normal pairs from ICGC

    reanalyze in a standard way- same alignment, variant calling, qc filters

    completely uniform dataset - work at similarities and differences among them

    Need raw data

    already 0.5 PB - getting larger

    Using cloud computing to make this possible

    collaborating with 6 datacenters - hold dataset and run uniform alignment and variant calling in their centers

    after done alignment and variant calling - synchronizing data across centers

    July: Opening up most of the centers to log in from members of the working groups - access the data in the cloud

    launch vms on the cloud centers - have direct access to reads and interpreted data

    share with members of working groups

    no downloading involved - run cluster jobs in cloud center

    Working on:

    Develop metadata specs - BAM?

    Design data submission and validation - make sure data going to the right place

    wrangle data

    training

    Coming soon: The Cancer Genome Collaboratory

    Would be a waste if we put all the data in these clouds then turned them off

    Long term project - PAWG continued

    all cancer data in a few datacenters along with vms and access control needed to authenticate ppl and do analysis on them

    but instead - open to general public!

    not limited to ICGC working groups

    Not free - get to data for free, download for free

    running vms - pay per cpu hour (same model as amazon/google/MS for commercial clouds)

    Teams

    One team building infrastructure/hardware/software

    Ethics - usage agreements

    Driving biological projects - test infrastructure, develop use cases

    training and outreach

    benchmarking

    Biocurators are playing a role in each one of these teams!

    Two data centers: Chicago & Toronto - connected by high speed link

    Up and running for internal testing - end of this year

    public testing - end of 2015

    You can do it!

    Without us, it's not going to happen.

==== This Segment written by Marc Perry (who has worked on Lincoln's team since 2008) ===== Comments are my own ======
Quotation Highlights (in some cases paraphrased):
LDS: Re: Death of Bioinformatics in 2010: Caveat Emptor: "Whatever I say is going to be wrong"
"We have gone from the sublime to the absurd" -- Media hype on Big Data as the end-all and be-all for forecasting the future
ISB2014 Bingo!!! Moore's Law on this slide!! (see Lincoln's original article in Genome Biology with this plot (2010))
"ENCODE is complex"
Hey, ICGC DCC Data Portal, that is the project that I work on now!! 10,068 Donors as of Feb. 2014
"Has the data load overturned the donkey?" "I don't want to call you guys donkeys, but that _is_ the implication."
metadata massage therapist, that sounds so much nicer than data wrangler for a job title
Google Flu Predictions sort of sound like weather forecasting
"Big Data is going to push through the change in biocurator job activities"
1. Define model for complex data
2. Curate metadata
3. Thinking about the Data (is this different from day dreaming?)
Yay! ICGC, my new project again!! That is the same map that I had on my poster #108 (in case you missed it)
The Biocurators create the Data Dictionary, it is both human and machine readable (JSON).  Man, LDS put up a slide with us on it!!
Hmmmmm, that is sweet, being acknowledged during a keynote at a biocuration meeting.
10 PB of data and metadata by ~ 2018 (is that another "prediction")
We are going to the Cloud now . . . .
Now Lincoln is channeling my poster #108 from Biocuration 2014 (in case you missed it)
Creating a level playing field for downstream analysis in the Pan-Cancer Analysis Project, by reanalyzing them in a uniform way.
I think that guy in the deck chair is supposed to be Todd Harris (there was a slide about people analyzing the data)
"Don't tell my boss about this" - seconding staff to PCAP
"It's big and it's scary"
======================================
Q & A
Q: Possible to get ICGC data programmatically? Find all samples with this mutation in this particular gene?
A: Available via REST interface. All connected via REST API.
Q: You mentioned there's a few international data centers - in each of these centers include whole dataset? Or each have their own data set they contribute to?
A: Recruited 6 centers, access to compute clusters. EBI, Tokyo, Barcelona, University of Chicago, DKFZ, Korea. Mirroring data to each of these in a way to distribute the analytic work that they do. Can encourage working groups to move centers if one is overloaded - This is PAWG. After done - will clear data out and go back to business.
However, Chicago and new datacenter in Toronto will keep copies of the data and open to the community as cloud compute resource. Will bring in the rest of the ICGC data - exome, genomes sequenced between now and 2018.
Q: How do you deal with data transfer?
A: FTP doesn't really cut it. Better to have a fat pipe - multi connection protocol. Using GeneTorrent (based on BitTorrent) - multiple connections, transfer bits of file simultaneously. Peer-to-peer, uses bandwidth appropriately. Other protocols that we can use - Aspera, UDR
Q: PAWG - using certain parameters for the calls - how did you come up with these parameters? Collectively done?
A: Benchmarking. Co-sponsoring DREAM challenge - using winners
Q: 6 curators, tons of samples, many sites. Challenges?
A: Main challenges - the centers have to stop doing what they've been doing (stop calling variants themselves and submitting the calls, start packaging everything together using different submission software). Education, trial and error, testing, refinement. Communication.
Q: How can we make sure that the NCI is one of your supporting groups?
A: Will put NCI logo up... open ended question. The NCI has been a great support. There continue to be policy differences between TCGA and ICGC - logistic problems. Biggest - the current policy of dbGaP that prevents TCGA datasets from being analyzed on commercial clouds. It would give us a great deal of flexibility if that policy were reviewed and reversed. And allowing TCGA data to mingle with data in Europe and Asia. Changing circumstance.
----------------------------------
10:00 - 10:15
The eGenVar data management system (eGDMS) - cataloguing and sharing sensitive data and meta-data for the life sciences
Sabry Razick
Norwegian University of Science and Technology, Norway
"My data is my data, your data is my data too" - Sydney Brenner
Abstract
Systematic data management and controlled data sharing aim at increasing reproducibility, reducing redundancy in work, and providing a way to efficiently locate complementing or contradicting information. One method of achieving this is collecting data in a central repository or in a location that is part of a federated system and providing interfaces to the data. However, certain data, such as data from biobanks or clinical studies, may, for legal and privacy reasons, often not be stored in public repositories. Instead, we describe a metadata cataloguing system and a software suite for reporting the presence of data from the life sciences domain. The system stores three types of metadata: file information, file provenance with data lineage, and content descriptions. Our software suite includes both graphical and command line interfaces that allow users to report and tag files with these different metadata types. Importantly, the files remain in their original locations with their existing access control mechanisms in place, while our system provides descriptions of their contents and relationships. Our system and software suite thereby provide a common framework for cataloguing and sharing both public and private data.
Notes

    Mainly talking about - the data that you can't have

    eGenVar - sensitive data

    special use case from bio-bank

    data ppl won't integrate and share - not allowed to look at the data, but must come up with a system to somehow share

    why share?

    data sharing is very important

    published data vs unpublished data - paper not finished yet, or secretive reasons

    why couldn't share:

    privacy

    secrecy

    publication not ready - you don't want to put the data out there

    ethical issues

    Sharing strategies

    email

    ssh

    advanced fis - iRODS, TwinNEET, 

    extract data, describe data

    parse files, extract

    pipeline analysis - invite ppl to use your data

    Sensitive data

    strict standards, difficult to follow-up. Try to collect data, ppl will do that as necessity, not freely

    eGenVar - data management

    used to report extended set of metadata about existing sensitive data

    metadata

    provenance

    content description - tagged. Described with strict standards. 

    Tags - easy, less effort

    standardized, controlled vocab
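
A rough sketch of the cataloguing idea from the abstract: the catalogue records file information, provenance and content tags, while the files themselves stay where they are. The class and field names are hypothetical and do not reflect the actual eGDMS schema or interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    # File information: where the file lives; the file itself is never copied.
    location: str
    size_bytes: int
    checksum: str
    # Provenance: locations of the files this one was derived from.
    derived_from: list = field(default_factory=list)
    # Content description: tags, ideally drawn from a controlled vocabulary.
    tags: set = field(default_factory=set)


def find_by_tag(catalog, tag):
    """Return the locations of catalogued files carrying the given tag."""
    return [entry.location for entry in catalog if tag in entry.tags]


def lineage(catalog, location):
    """Walk provenance links back from one file to all of its ancestors."""
    by_location = {entry.location: entry for entry in catalog}
    seen, stack = [], [location]
    while stack:
        current = by_location.get(stack.pop())
        if current is None:
            continue  # ancestor is referenced but not catalogued
        for parent in current.derived_from:
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen
```

A search only ever returns locations and descriptions; getting at the data still goes through whatever access control protects the original file.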

Q: Can a user apply for permission to see raw data?
A: If you want it, apply for permission
----------------------------------
10:15 - 10:30
A controlled vocabulary for entities and events in the Reactome database 
Steven Jupe
EMBL-EBI, UK
Abstract
Reactome is a database of human pathways and processes accessible via a website, available as downloads in standard reusable formats and via a RESTful API. Entities involved in pathways require descriptive and unambiguous names that are not available elsewhere. We have devised and partially incorporated a controlled vocabulary (CV) that creates unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve. This will significantly improve naming consistency and readability, with consequent benefits for searching, data mining and data exchange.
Notes 

    Controlled vocab - pathway and events

    when it comes to naming things - ppl get very uptight

    CS background - why do names matter? Why don't you just use the db identifier?

    Users are biologists- we want them to access our data

    we shouldn't expect them to learn some special magic to find the information in our resources

    should be able to find nicely without learning

    testing ppl who were expert in looking up information in a particular system

    surprised how quickly they gave up when they couldn't find what they were looking for!

    users will give up quite quickly

    users will search for things with names they're familiar with

    How do you name things?

    naming events - reactions that constitute the steps in pathways

    we can just use the names from the literature?

    no: rather ambiguous

    don't necessarily accurately describe the things they're being applied to

    typically more than one

    need to make a selection

    upset someone

    may be acceptable in context, but in broad sense of db, inappropriate

    Often trying to name things - bc they're intermediates in pathways - don't have a name in literature

    Solution: controlled vocab - used for naming all events and entities in Reactome - could be applied to any other system

    improve consistency and interpretability of names, searches, finding things, data mining, reduce burden on curators -> takes time to select a good name

    Peptide Name

    no universal authoritative source of names for peptides (no name for all the peptide fragments you can derive from the initial translated product)

    no agreed vocabulary for peptide products

    no universal authoritative source for all co- and post- translational modifications

    Complex, set and event names

    even more tricky- groups or sets of proteins

    names ought to reflect components

    event names - identify event participants and category of event

    want to use simple text searches to find all events - molecule of interest

    Peptide CV Names

    Gene symbol core - reactome is very human centric. HGNC approved gene symbols

    Peptide coordinates suffix

    uniprot - chain feature

    if peptide represented is not identical to peptide in chain feature

    uncertain - '?'

    Post-translational modification PTM prefix

    PSI-MOD - contains majority of post-translational modifications, short abbreviations

    uniquely labels peptide most of the time

    Small molecule (chemical) CV

    sources from ChEBI, KEGG compounds, PubChem, literature

    http://www.allacronyms.com/

    Complex and set CV names

    if you have a complex with more than one member, concat names of members

    similar for sets

    complex: separate with ':'

    set: separate with ':'

    occurs more than once: name preceded by nx

    Optional extension:

    not generally applicable, useful in Reactome

    candidates set - has additional possible members

    in round brackets

    precedence/hierarchy - square brackets
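
The complex naming rule above (concatenate member names, prefix repeated members with nx) can be sketched as follows. This is an illustration only: keeping members in first-seen order is an assumption, the member names in the usage note are placeholders, and the separator is a parameter because these notes record ':' for both complexes and sets.

```python
from collections import Counter


def complex_name(members, separator=":"):
    """Join member names; a member occurring n > 1 times is written as nxNAME."""
    counts = Counter(members)
    parts = []
    for name in counts:  # Counter preserves first-seen order (Python 3.7+)
        n = counts[name]
        parts.append(f"{n}x{name}" if n > 1 else name)
    return separator.join(parts)
```

For example, `complex_name(["A", "B", "A"])` gives `2xA:B`. The optional extensions (round brackets for candidate sets, square brackets for precedence) are not covered by this sketch.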

    Pathway event (reaction) CV names

    defined small set of terms - classify, define what's happening in a pathway event

    e.g. transformation, binding, transfer, active transport, and more

    anything with a defined catalyst with a definable molecular function

    GO molecular function - verbs derived from GO terms

Q & A
----------------------------------
11:00 - 11:15
Pathway Commons: A public library of biological pathways 
Emek Demir
MSKCC, USA
Abstract 
Pathway Commons (http://www.pathwaycommons.org) is a network biology resource and acts as a convenient point of access to biological pathway information collected from public pathway databases, which you can search, visualize and download. All data is freely available, under the license terms of each contributing database. Biologists can search, visualize and download Pathway Commons pathways as part of an integrated network analysis workflow. Computational biologists and software developers can download all pathways in BioPAX, SIF and other formats for pathway and network analysis. They can also build software on top of Pathway Commons using our web service API. Database providers can share their pathway data via a common repository. Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats (BioPAX, PSI-MI). A completely redesigned database and query system was developed in late 2013 to support a powerful web service API. Recent work focuses on developing a Pathway Commons query app for Cytoscape and new web based visualization tools, as well as expanding the data in Pathway Commons.
Notes 
Data-driven discovery cycle:

    read a lot of papers, form hypothesis, etc

    combine with high throughput profiles - don't play well with thousands of datapoints

    TCGA paper: pathway covers 60% of patients. Go to pathways commons - missed it.

    missing opportunity - not using systematic computational work

    computational cell map: 

    still use traditional site

    culture change in computational biology

    historically - pathway dbs from different rules, use cases, schemas

    hard to combine and integrate

    Want to find a common language - standardize - interoperability (Pathway Commons)

    use easily to answer biological questions

    a lot of barriers - hard to solve this problem

    BioPAX - standard language - process language. Similar to what Reactome/BioCyc/etc does

    Pathway Commons: workflow - once we get exports in biopax, can merge and align pathways

    curation differences - curators pay more attention to different parts

    reading different papers, reading different parts. Different preferences

    reluctant to merge automatically

    want to find solution for this

    Paxtools

    started as simple reading/writing library

    now expanded

    import, export, huge library, can just use parts if you want

    Pathway Commons access

    multiple ways

    removing barriers

    webservice - graph search, fields search, path traversal

    SPARQL endpoint

    batch downloads

    formats: BioPAX, SIF, GSEA, SBGN (exports topology and automated layout)

    Clients - Cytoscape, ChiBE 2, Virtual Cell, PC Java client, paxtoolsR
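
Of the download formats listed above, SIF is simple enough to read without a library. A minimal sketch, assuming the common three-column layout (participant A, interaction type, participant B); the example lines and gene names are made up, and nothing here calls the Pathway Commons web service.

```python
from collections import defaultdict

# Made-up example lines in the three-column SIF layout:
# participant A, interaction type, participant B.
EXAMPLE_SIF = """\
GENE_A\tinteracts-with\tGENE_B
GENE_A\tcontrols-state-change-of\tGENE_C
GENE_B\tinteracts-with\tGENE_C
"""


def parse_sif(text):
    """Return {source: [(interaction_type, target), ...]} from SIF text."""
    graph = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and isolated nodes in this sketch
        source, interaction, targets = fields[0], fields[1], fields[2:]
        # SIF allows several targets on one line; record one edge per target.
        for target in targets:
            graph[source].append((interaction, target))
    return dict(graph)
```

The resulting dictionary is enough for simple questions such as "what does GENE_A act on"; anything that needs the full process detail would have to start from the BioPAX export instead.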

    Use cases

    much can be done using pathway information

Q & A
----------------------------------
11:15 - 11:30
The UniRule system for data integration and sharing 
Claire O'Donovan
EMBL-EBI, United Kingdom
Abstract
The UniProt Knowledgebase is a central hub for the collection of functional information on proteins. It consists of 2 sections: UniProtKB/Swiss-Prot which is manually annotated and reviewed and UniProtKB/TrEMBL which is automatically annotated and not reviewed. Over the last few years, we have developed the UniRule system that leverages experimental UniProtKB/Swiss-Prot curation for the automatic annotation of UniProtKB/TrEMBL in order to address the exponential growth in uncharacterized proteins from the genome sequencing centers. UniRule consists of manually created annotation rules that specify functional annotations and the conditions which must be satisfied for them to apply (such as taxonomic scope, family membership as defined by the 11 InterPro Consortium members (GENE3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs) and the presence of specific sequence features) and as such is an illustrative example of data integration and sharing. The UniRules are applied at each UniProt release, and their predictions are continuously evaluated against the content of matching UniProtKB/Swiss-Prot entries, guaranteeing that the predictions remain in synch with the expert curated knowledge of UniProtKB/Swiss-Prot. The collaboration with the InterPro Consortium is an illustrative example of data integration and sharing and UniProt now wants to extend this collaboration to the functional annotation community as a whole, exchanging rules and setting standards for consistent annotation.
Notes 
Why automatic annotation necessary?

    Lots of new sequences

    lots of new sequences - no experimental work will be done on them

    mission to attach information to it

    automatic annotation enables us to provide accurate and standardized data

    major issue:

    lots of data coming in is somewhat mixed in quality - not just correcting but updating. While you might get funding at creation - no funding to keep it up to date

    annotation could have been correct at some point, but need to maintain

    helps us in our own db annotation

    we were able to find new knowledge 

    highlights inconsistent annotation across family

    Automatic Annotation

    UniRule and SAAS

    Steps in Rule curation

    start with a protein family - look in papers, dbs, resources, distil that to manual curation

    common annotation - across a certain family/orthology range

    some annotation not appropriate, but as you're annotating you're thinking about that

    rules become useful

    when biocurators and developers get together - supportive programming environment (rule maintenance)

    data refreshed every month

    tool will tell you when the source data for your rule has been updated for any reason.

    Automatic Annotation in UniProtKB

    rule creation - manual

    anatomy of a rule:

    conditions & annotations
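
The conditions-plus-annotations anatomy can be sketched like this. It is a simplified illustration, not the UniRule format: the class, the field names and the two condition types shown (taxonomic scope and family signature) are assumptions based on the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    # Conditions: all must hold for the rule to fire.
    taxon: str      # required taxonomic scope
    signature: str  # required family signature identifier
    # Annotations: what gets attached when the conditions hold.
    annotations: dict = field(default_factory=dict)


def apply_rules(protein, rules):
    """Return the merged annotations of every rule whose conditions match."""
    predicted = {}
    for rule in rules:
        in_scope = rule.taxon in protein["lineage"]
        in_family = rule.signature in protein["signatures"]
        if in_scope and in_family:
            predicted.update(rule.annotations)
    return predicted
```

Keeping conditions and annotations as plain data is what makes the features mentioned in the talk (history of changes, batch editing, XML upload/download) straightforward to build around a rule.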

    UniRule tool:

    webapp

    repository for rules in prod and dev

    navigation - filtering by annotation/condition, searching (free text, field)

    batch editing - across system

    history of changes for each rule - QA

    you can see what the curator was thinking about - don't have to second guess

    developers: drop down menus, enforce standards, enabling curators to do what they want to do

    very easy to add/change

    new feature - sequence features

    using predictors

    signatures from InterPro and elsewhere are good for diagnostics, but not as specific as biocurators would like

    align component to template sequence- check if feature is actually there

    new version beta available in september

    Data sharing!

    webapp - use from anywhere

    upload/download in xml

    changes documented and reversible

    experts can contribute knowledge

    plan: make available to scientific community

    tested with visiting expert from Spain

    want to include more researchers

    want to work with others in automated function prediction

    want to organize automated annotation workshop

Q & A
Q: Igor from Bader lab: Don't often see someone so excited about a software tool. Question about Rule: possible to use external resources?
A: Conditions and annotation - software engineers set it up so that new ones can be easily integrated
----------------------------------
11:30 - 11:45
Metadata audit between European Genome-phenome Archive and International Cancer Genome Consortium 
Hardeep K. Nahal
OICR, Canada
Abstract
The International Cancer Genome Consortium (ICGC) currently comprises data from over 10,000 cancer genomes from over 40 different tumour types. Processed mutational data is submitted and stored at the ICGC, while the accompanying raw sequencing data (FASTQ and BAM) is required by ICGC member projects to be submitted to the European Bioinformatics Institute’s (EBI) European Genome-phenome Archive (EGA). Currently, EGA has 91 controlled-access ICGC datasets. Mapping of data in ICGC with read data file submissions in EGA is facilitated by requiring ICGC member projects to submit sample/donor metadata information captured in XML files. A major challenge in distributed projects is to keep in sync submissions and data which are split between multiple resources. In response to user feedback which highlighted difficulties in linking data across the EGA and ICGC datasets, a recent preliminary file audit was performed for the purpose of identifying current problems in order to target corrective measures. This exercise identified differences in the formats used for sample and donor identifiers in the metadata information submitted to EGA and clinical data submitted to the ICGC. These identifier format differences pose problems for researchers and web portal developers trying to obtain or point to raw sequencing data for mutations they may find of interest in the ICGC. Here, we present the current status of the ICGC/EGA audit and details on our procedures to track and coordinate our efforts to correctly curate and map a consistent set of sample and donor metadata identifiers between EGA and ICGC. Future goals will involve implementing a validation step or enforcing quality control measures to ensure metadata information submitted to EGA maps to information submitted to ICGC. This audit will be critical for the Pan-Cancer Analysis project, which will require obtaining raw sequencing data from EGA based on information about sample/donors in ICGC.
Notes 
Bioinformatician at OICR - Francis group
Metadata audit between EGA & ICGC

    ICGC - 71 cancer projects, 18 countries

    42 cancer types 10K donors

    submission cycle - 3-4 releases each year

    challenges: we have to make sure that submissions and data between different resources are in sync

    collect mutation data: 

    simple somatic, cnv, structural somatic, gene expression, splicing, miRNA, and more

    open and controlled data

    open: publicly accessible - no individual identification

    controlled: identifiable. Apply for permission from DACO - approved researchers can get raw sequence (BAM, FASTQ)

    ICGC DCC Pipeline

    connection icgc, researcher, ega

    ICGC - collect tumor/normal sample from donor, sequence

    send to DCC ICGC

    validated, DCC data portal

    raw data -> EGA

    also metadata and xml files

    researchers - interact with DCC data portal

    further analysis, apply to DACO, get BAM files from EGA

    EGA - 110 controlled access datasets

    ICGC Data portal:

    search for mutation, go to donors in project

    interest in specific donor - specimen level - obtain clinical info. Raw data

    apply to DACO, get approval, go to EGA

    but then can't find BAM files associated to patient???

    metadata not always complete

    ICGC & EGA

    mapping between submissions, study, sample/donor - submitters 

    hierarchy to get to sequence file - lots of potential discrepancies

    in response to user feedback - difficulty linking EGA and ICGC datasets

    identify current problem

    corrective measures

    took XML files from EGA and cross-referenced with ICGC

    compiled some project info, sample/donor identifiers, accessions

    audit reports - sent to individual ICGC projects

    what's missing? work individually
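
The cross-referencing step can be sketched as a set comparison between identifiers found in the EGA metadata XML and identifiers known to ICGC. The XML layout and the identifiers below are simplified, made-up examples; real EGA submissions are richer, and this is not the audit tool used by the team.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up XML; element and attribute names are assumptions.
EGA_XML = """
<SAMPLE_SET>
  <SAMPLE alias="DONOR-001-T"/>
  <SAMPLE alias="DONOR-002-T"/>
  <SAMPLE alias="donor_3_tumour"/>
</SAMPLE_SET>
"""

# Made-up identifiers standing in for what was submitted to ICGC.
ICGC_SAMPLE_IDS = {"DONOR-001-T", "DONOR-002-T", "DONOR-003-T"}


def audit(ega_xml, icgc_ids):
    """Compare sample identifiers found in EGA metadata with those in ICGC."""
    root = ET.fromstring(ega_xml)
    ega_ids = {
        sample.get("alias")
        for sample in root.iter("SAMPLE")
        if sample.get("alias")  # ignore samples without an alias
    }
    return {
        "matched": sorted(ega_ids & icgc_ids),
        "only_in_ega": sorted(ega_ids - icgc_ids),
        "only_in_icgc": sorted(icgc_ids - ega_ids),
    }
```

In this toy data the third sample shows up on both "only" lists under two spellings, which is the kind of identifier format mismatch that has to go back to the submitting project to resolve.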

    issues: 

    differences in formats used for clinical identifiers

    sample donor identifiers were not necessarily the same

    which sample tumor or normal?? will be aligning tumor and normal (pancan)

    EGA dataset can match a number of different studies

    older projects: EGA no longer existed, mapped to different study

    metadata issues - donor identifier submitted to EGA and ICGC were different

    also, no info on normal or tumor

    sent to project and asked to match/fill out

    sometimes information captured in free text - hard to parse programmatically

    use controlled vocabulary

    Status:

    still working with projects to clean up the data

    study level matching between ICGC and EGA

    where projects submit data to ICGC - required to fill in accession identifier - cross field validation check. improve validation criteria

    need to improve documentation and guidelines

    don't realize that identifier info is what cross links the information together

    also working with EGA  - fix future data 

    validate that metadata is consistent

Q & A
Q: How does it come to be that it's not so easy for submitters to submit the information? Web interface?
A: That's an option. Right now, they put an xml file together.
----------------------------------
11:45 - 12:00
The ENCODE metadata standard to integrate diverse experimental data sets 
Eurie L. Hong 
Stanford University, USA
Abstract 
The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Now in its 9th year, ENCODE has grown to include more than 40 experimental techniques to survey DNA-binding proteins, RNA-binding proteins, the transcriptional landscape, and chromatin structure in 400+ cell lines and tissues. All experimental data and computational analyses of these data are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. Metadata describing important experimental conditions, such as the biological samples, specific reagents, and protocols necessary to replicate the assays and their analysis, as well as data standards, such as quality metrics for data files and antibody characterization documents, have been expanded and are being submitted to a newly-formed DCC. As the volume of data increases, the identification and organization of data sets becomes challenging. Here, we describe the design principles of how metadata are organized and annotated at the ENCODE DCC in order to facilitate the identification and comparison of data sets generated by the ENCODE project. The metadata are stored in a structured data model and annotated using ontologies to ensure high quality metadata that is interoperable between multiple projects. In addition, the breadth of metadata that describe the biological samples, reagents, methods, and protocols support reproducibility of assays and promote easy identification of shared resources in the project. The organization of the metadata will allow flexible and powerful searches on the revamped ENCODE Portal, the public website of the ENCODE project, as well as support intuitive displays of biological samples, reagents, and protocols used for an experiment. Data from the ENCODE project can be accessed via the ENCODE portal (http://www.encodeproject.org) and the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway).
Notes
3 ways working to integrate diverse experimental sets
1) within ENCODE - datamodel
2) Ontologies to annotate metadata
3) ENCODE technical implementation of website- access metadata

    ENCODE DCC

    Mike Cherry, Kent

    Data wranglers and software engineers

    qa, sysadmin, admin

    What do we do

    Data producers

    Analysis group - computational methods/predictions

    DCC - store all these files, track progress, make portal so the rest of the community can consume

    Challenge: Can you define a metadata standard for diverse assays in multiple species?

    mouse, human, fly, worm

    understand what labs are doing: generating data, want to retrieve data, what questions they want to ask

    set of principles driving metadata to collect:

    transparency - how assays being done. provide protocols

    reproducibility - capture what files being used to generate additional files

    Capture experimental design

    experiment:

    all assays - 2 biological replicates

    control - replicate

    data files

    generate results files

    capture this all flexibly

    identify reusable experimental variables

    biosample - reusable reagents

    run on the same samples - broad picture of genomic state of sample

    antibodies

    libraries

    files

    help labs uniquely identify the specific metadata, sources, ids

    define relationship to each other

    labs have been using this for almost a year to submit their metadata

    battle tested - working pretty well

    Annotating terms - using Ontologies

    common ontologies among different data projects

    can really improve the integration of data

    ENCODE & Roadmap Epigenomics - both internally consistent

    but only 3 biosamples match exactly between projects

    need to read every single term and description to figure out what samples you need to look at!

    Remap to ontology terms

    uberon for tissues

    biosamples - uberon - ontologies

    talking to other nih projects - key points of interaction: assays that were done, samples used, ontologies can integrate the data

    faceted browsing

    can use the ontologies to drive searches
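
A rough sketch of the remapping idea in the notes above: lab-specific biosample labels are mapped onto shared ontology terms, and the shared terms then drive a faceted grouping. The lookup table, the dataset names and the `EX:` identifiers are placeholders invented for this example; they are not real UBERON accessions or ENCODE/Roadmap data.

```python
from collections import defaultdict

# Hypothetical lookup: free-text label -> (ontology ID, preferred name).
# "EX:" IDs are placeholders, not real ontology accessions.
TERM_MAP = {
    "liver": ("EX:0001", "liver"),
    "hepatic tissue": ("EX:0001", "liver"),
    "adult liver": ("EX:0001", "liver"),
    "lung": ("EX:0002", "lung"),
}

def remap(label):
    """Return the (ID, name) pair for a label, or None if it is unmapped."""
    return TERM_MAP.get(label.strip().lower())

def facet_by_term(datasets):
    """Group dataset names by ontology ID; unmapped labels get their own facet."""
    facets = defaultdict(list)
    for d in datasets:
        term = remap(d["biosample"])
        key = term[0] if term else "unmapped"
        facets[key].append(d["name"])
    return dict(facets)

# Made-up datasets from two projects that label the same tissue differently
datasets = [
    {"name": "projA-1", "biosample": "Liver"},
    {"name": "projB-7", "biosample": "hepatic tissue"},
    {"name": "projB-9", "biosample": "Lung "},
    {"name": "projA-4", "biosample": "HepG2-like"},
]

facets = facet_by_term(datasets)
for term_id, names in facets.items():
    print(term_id, names)
```

Once both projects point at the same term ID, matching samples no longer requires reading every free-text description; the "unmapped" facet shows what still needs a curator.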

    Technical implementation of website

    integrate with other resources

    metadata in JSON-LD - viewed as web page

    labs can query and submit their metadata using REST API

    integration with other resources - grab json object and parse for identifier

    Conclusion: datamodel

    biocurators - set data standards - how to best define data across consortiums

    standards we should be setting

    data access - how to make available for all the other communities

    programmatically access
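
A small sketch of the "grab json object and parse for identifier" step, using only the Python standard library. The document below is made up for illustration: `@context`, `@id` and `@type` are standard JSON-LD keywords, but the paths, the `accession` and `files` fields and their values are assumptions, not the actual ENCODE schema or API output.

```python
import json

# Made-up JSON-LD document; in practice this text would come from a REST call
doc_text = """
{
  "@context": "/terms/",
  "@id": "/experiments/EXAMPLE001/",
  "@type": ["experiment"],
  "accession": "EXAMPLE001",
  "files": [
    {"@id": "/files/EXAMPLEF01/", "accession": "EXAMPLEF01"},
    {"@id": "/files/EXAMPLEF02/", "accession": "EXAMPLEF02"}
  ]
}
"""

def collect_ids(node):
    """Recursively collect every '@id' value in a parsed JSON-LD structure."""
    found = []
    if isinstance(node, dict):
        if "@id" in node:
            found.append(node["@id"])
        for value in node.values():
            found.extend(collect_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_ids(item))
    return found

doc = json.loads(doc_text)
ids = collect_ids(doc)
print(ids)
```

Because each object carries its own `@id`, another resource can link to an experiment or file without knowing anything else about the schema.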

Q & A
Q: LIMS systems - did you try to write a tool/plugin for the LIMS to make them generate the metadata you want?
A: We've been talking to production labs using LIMS systems to use ontologies. How to control their LIMS system. Production labs - providing their own scripts to access our API. Short scripts. Just do a mapping - doesn't matter what their LIMS system is. Every single lab has made scripts to submit data.