ISB2014 Session5 Data Integration and Sharing
Time: Wednesday, April 9, 2014

09:00 - 10:00 Keynote Lecture, Chaired by Weimin Zhu
The Great Hall, Hart House

link to all ISB2014 notes: http://etherpad.wikimedia.org/p/isb2014
The Great Hall, Hart House, April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5

Help me take collaborative notes on this session! Fill out your name in the box in the top right-hand corner. Add yourself to the editors list, then edit away! I make a lot of typos. Sorry.

Editors: name / affiliation / twitter
Abigail Cabunoc / OICR / @abbycabs
Marc Perry / OICR / @mdperry
Karen Yook / WormBase-Caltech / @wbperson712

Will Big Data Crush Curation?
Lincoln Stein
OICR, Canada

10:00 - 12:00 Session 5 - Data Integration and Sharing, Chaired by Gary Bader and Winston Hide, Sponsored by Heidelberg Institute for Theoretical Studies
The Great Hall, Hart House

Bioinformatics will die? http://www.oreillynet.com/pub/a/network/biocon2003/stein.html
Big Data - trend
organized Big Data in Biology - Keystone 2008
- watershed moment in Big Data
Google announced that, using domain-agnostic stats, it could do a better job than the CDC at predicting the timing and scale of flu outbreaks
http://www.google.org/flutrends/ca/#CA
using ppl's search terms to predict flu outbreaks
tracks closely - didn't need to be a doctor or know about the flu.
Just apply clever algos to large amounts of data
Nate Silver - NYT statistician
highly accurate predictions for 2012 presidential and congressional races - not traditional polling agencies
trick: gather info from many polls
historical data: error corrections down to state level
Feb 4, 2014
big data predicting EVERYTHING (sports, politics, hollywood)
financial reporting: business model before (subjective) - now quants (big data trends)

Our field - big data rolling over us
- driven by tech
fact, cost of sequencing fell off ('07-08, NGS)
less expensive to sequence a genome than before
vast increase in primary sequencing data and chipseq
rise to large quantitative and qualitatively complex datasets (UCSC ENCODE)

ICGC Project - International Cancer Genome Consortium project
Data release in Feb - 42 projects, 50K genomes, 18 cancer types, 10K+ donors, 4M somatic mutations, clinical data
huge db
We're no longer looking at individual datasets in conventional publications
- is curation necessary? has the dataload overturned the donkey
Reading papers - doesn't scale
Hand editing gene models (C. elegans) - doesn't scale
Is there a clear path for biocurators?
(metadata) massage therapist
(data) wrangler
(complex data) modeler

Data /doesn't/ speak for itself
Google flu - they may have overtrained their system...
Did pretty well for a couple of years, but the last few years - completely missed the mark -> getting more inaccurate
Needs ppl who understand flu to make this work again

Changing role of the biocurator
curation of primary literature - less important
Future: Working hand in hand with computational biologists - domain experts
define the data model for complex data in ways that make sense, are resilient and flexible
curate metadata - (usually a complete mess)
spend a lot of time thinking about the data.
QCs, sanity checks
anecdote: at site visit for another cancer genome project
put together appropriate budget for curating data
'think you underestimate curator by half' - needs a lot of massaging
invited to double the number of curators
indication: curation is not going to be a lost art

ICGC - (http://www.icgc.org/)
example of curators/computational bio/software engineers - working hand in hand
all continents (except africa & antarctica)
Identify - cancer data pathways, drug targets
new methods of treating, diagnosing, predicting outcome
simple project:
for each major cancer type, each project looks at ~500 patients who have that type
sequence their genomes (tumour and normal tissue - usually blood)
comparing the sequences - identify the cancer related mutations (point mutations and larger scale - rearrangements, transpositions, etc)
relate mutations to tumor biology
translate knowledge to diagnosis, treatment
5 years now
15 data releases - next release in May
rapid growth in # of donors - 20K genomes (10K donors)

Ingest data - software engineering team put together data processing pipeline
data in, validation system, combined with other data systems (cosmic, references, more)
quality control
index with elastic search
output on website -> depth of data
http://dcc.icgc.org/search

Key to making the system work - data dictionary
machine readable and human readable
controlled vocabularies, constrains what data can be submitted to us and how we present it
used in all parts of the system
formatting submission docs
running quality control check
create the datamodel - integration and search engine
Hardeep Nahal, Marc Perry, Junjun Zhang - Data curators! - Ouellette & Ferretti
created standards, worked with software team to create interface

Biocurators - needed more than ever!
ICGC alone (not the only big data project in our field) will hit ~10PB mark maybe as early as 2018
metadata & raw read data
Big challenges!
who can download this to their laptop and mine it?
probably only a couple of groups in the world have the storage space, cluster resources and bandwidth to download it and do whole genome analysis on it
we do not want a world where only a couple of groups can do it

ICGC & TCGA => met to create Pan-Cancer Whole Genome Analysis Project (PAWG)
Goals:
understand what's going on in the 95% of the cancer genomes that isn't protein coding
whole genome analysis in an integrated fashion -> to date, using a silo type of pipeline
gives us insights into a particular tumor type
but comparing one tumor to another - can't easily do
each pipeline is different enough - methodological differences
get together all whole genome tumor/normal pairs from ICGC
reanalyze in a standard way - same alignment, variant calling, qc filters
completely uniform dataset - look at similarities and differences among them
Need raw data
already 0.5 PB - getting larger
Using cloud computing to make this possible
collaborating with 6 datacenters - hold dataset and run uniform alignment and variant calling in their centers
after alignment and variant calling are done - synchronizing data across centers
July: Opening up most of the centers to log in from members of the working groups - access the data in the cloud
launch vms on the cloud centers - have direct access to reads and interpreted data
share with members of working groups
no downloading involved - run cluster jobs in cloud center
Working on:
Develop metadata specs - BAM?
Design data submission and validation - make sure data going to the right place
wrangle data
training

Coming soon: The Cancer Genome Collaboratory
Would be a waste if we put all the data in these clouds then turned them off
Long term project - PAWG continued
all cancer data in a few datacenters along with vms and access control needed to authenticate ppl and do analysis on them
but instead - open to general public!
not limited to ICGC working groups
Not free - get to data for free, download for free
running vms - pay per cpu hour (same model as amazon/google/MS for commercial clouds)
Teams
One team building infrastructure/hardware/software
Ethics - usage agreements
Driving biological projects - test infrastructure, develop use cases
training and outreach
benchmarking
Biocurators are playing a role in each one of these teams!
Two data centers: Chicago & Toronto - connected by high speed link
Up and running for internal testing - end of this year
public testing - end of 2015
You can do it! Without us, it's not going to happen.

==== This Segment written by Marc Perry (who has worked on Lincoln's team since 2008) ===== Comments are my own ======
Quotation Highlights (in some cases paraphrased):
LDS: Re: Death of Bioinformatics in 2010: Caveat Emptor: "Whatever I say is going to be wrong"
"We have gone from the sublime to the absurd" -- Media hype on Big Data as the end-all and be-all for forecasting the future
ISB2014 Bingo!!! Moore's Law on this slide!! (see Lincoln's original article in Genome Biology with this plot (2010))
"ENCODE is complex"
Hey, ICGC DCC Data Portal, that is the project that I work on now!!
10,068 Donors as of Feb. 2014
"Has the data load overturned the donkey?" "I don't want to call you guys donkeys, but that _is_ the implication."
metadata massage therapist, that sounds so much nicer than data wrangler for a job title
Google Flu Predictions sort of sound like weather forecasting
"Big Data is going to push through the change in biocurator job activities"
1. Define model for complex data
2. Curate metadata
3. Thinking about the Data (is this different from day dreaming?)
Yay! ICGC, my new project again!! That is the same map that I had on my poster #108 (in case you missed it)
The Biocurators create the Data Dictionary, it is both human and machine readable (JSON).
Man, LDS put up a slide with us on it!!
Hmmmmm, that is sweet, being acknowledged during a keynote at a biocuration meeting.
10 PB of data and metadata by ~ 2018 (is that another "prediction")
We are going to the Cloud now . . . . Now Lincoln is channeling my poster #108 from Biocuration 2014 (in case you missed it)
Creating a level playing field for downstream analysis in the Pan-Cancer Analysis Project, by reanalyzing the data in a uniform way.
I think that guy in the deck chair is supposed to be Todd Harris (there was a slide about people analyzing the data)
"Don't tell my boss about this" - seconding staff to PCAP
"It's big and it's scary"
======================================
Q & A
Q: Possible to get ICGC data programmatically? Find all samples with this mutation in this particular gene?
A: Available via REST interface. All connected via REST API.
Q: You mentioned there are a few international data centers - does each of these centers include the whole dataset? Or does each have its own data set that it contributes?
A: Recruited 6 centers, access to compute clusters. EBI, Tokyo, Barcelona, University of Chicago, DKFZ, Korea. Mirroring data to each of these in a way to distribute the analytic work that they do. Can encourage working groups to move centers if one is overloaded - This is PAWG. After done - will clear data out and go back to business. However, Chicago and new datacenter in Toronto will keep copies of the data and open to the community as cloud compute resource. Will bring in the rest of the ICGC data - exome, genomes sequenced between now and 2018.
Q: How do you deal with data transfer?
A: FTP doesn't really cut it. Better to have a fat pipe - multi connection protocol. Using GeneTorrent (based on bittorrent) - multiple connections, transfer bits of the file simultaneously. Peer-to-peer, uses bandwidth appropriately. Other protocols that we can use - Aspera, UDR
Q: PAWG - using certain parameters for the calls - how did you come up with these parameters? Collectively done?
A: Benchmarking.
Co-sponsoring DREAM challenge - using winners
Q: 6 curators, tons of samples, many sites. Challenges?
A: Main challenges - the centers have to stop doing what they've been doing (stop calling themselves and submitting, start packaging everything together using different submission software). Education, trial and error, testing, refinement. Communication.
Q: How can we make sure that the NCI is one of your supporting groups?
A: Will put NCI logo up... open ended question. The NCI has been a great support. There continue to be policy differences between TCGA and ICGC - logistic problems. Biggest - the current policy of dbGaP that prevents TCGA datasets from being analyzed on commercial clouds. It would give us a great deal of flexibility if that policy were reviewed and reversed. And allowing TCGA data to mingle with data in Europe and Asia. Changing circumstance.
----------------------------------
10:00 - 10:15
The eGenVar data management system (eGDMS) - cataloguing and sharing sensitive data and meta-data for the life sciences
Sabry Razick
Norwegian University of Science and Technology, Norway

"My data is my data, your data is my data too" - Sydney Brenner

Abstract
Systematic data management and controlled data sharing aim at increasing reproducibility, reducing redundancy in work, and providing a way to efficiently locate complementing or contradicting information. One method of achieving this is collecting data in a central repository or in a location that is part of a federated system and providing interfaces to the data. However, certain data, such as data from biobanks or clinical studies, may, for legal and privacy reasons, often not be stored in public repositories. Instead, we describe a metadata cataloguing system and a software suite for reporting the presence of data from the life sciences domain. The system stores three types of metadata: file information, file provenance with data lineage, and content descriptions.
Our software suite includes both graphical and command line interfaces that allow users to report and tag files with these different metadata types. Importantly, the files remain in their original locations with their existing access control mechanisms in place, while our system provides descriptions of their contents and relationships. Our system and software suite thereby provide a common framework for cataloguing and sharing both public and private data.

Notes
Mainly talking about - the data that you can't have
eGenVar - sensitive data
special use case from bio-bank data
ppl won't integrate and share - not allowed to look at the data, but must come up with a system to somehow share
why share?
data sharing is very important
published data vs unpublished data - paper not finished yet, or secretive reasons
why couldn't share:
privacy
secrecy
publication not ready - you don't want to put the data out there
ethical issues
Sharing strategies
email
ssh
advanced fis - iRODS, TwinNEET,
extract data, describe data
parse files, extract
pipeline analysis - invite ppl to use your data
Sensitive data
strict standards, difficult to follow-up.
Try to collect data, ppl will do that as necessity, not freely
eGenVar - data management
used to report extended set of metadata about existing sensitive data
metadata
provenance
content description - tagged.
Described with strict standards.
Tags - easy, less effort
standardized, controlled vocab
Q: Can a user apply for permission to see raw data?
A: If you want it, apply for permission
----------------------------------
10:15 - 10:30
A controlled vocabulary for entities and events in the Reactome database
Steven Jupe
EMBL-EBI, UK

Abstract
Reactome is a database of human pathways and processes accessible via a website, available as downloads in standard reusable formats and via a RESTful API. Entities involved in pathways require descriptive and unambiguous names that are not available elsewhere.
We have devised and partially incorporated a controlled vocabulary (CV) that creates unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve. This will significantly improve naming consistency and readability, with consequent benefits for searching, data mining and data exchange.

Notes
Controlled vocab - pathway and events
when it comes to naming things - ppl get very uptight
CS background - why do names matter? Why don't you just use the db identifier?
Users are biologists - we want them to access our data
we shouldn't expect them to learn some special magic to find the information in our resources
should be able to find it nicely without learning
testing ppl who were expert in looking up information in a particular system
surprised how quickly they gave up when they couldn't find what they were looking for!
users will give up quite quickly
users will search for things with names they're familiar with
How do you name things?
naming events - reactions that constitute the steps in pathways
we can just use the names from the literature?
no:
rather ambiguous
don't necessarily accurately describe the things they're being applied to
typically more than one
need to make a selection
upset someone
may be acceptable in context, but in broad sense of db, inappropriate
Often trying to name things - bc they're intermediates in pathways - that don't have a name in literature
Solution: controlled vocab - used for naming all events and entities in Reactome - could be applied to any other system
improve consistency and interpretability of names, searches, finding things, data mining, reduce burden on curators -> takes time to select a good name
Peptide Name
no universal authoritative source of names for peptides (no name for all the peptide fragments you can derive from the initial translated product)
no agreed vocabulary for peptide products
no universal authoritative source for all co- and post-translational modifications
Complex, set and event names
even more tricky - groups or sets of proteins
names ought to reflect components
event names - identify event participants and category of event
want to use simple text searches to find all events - molecule of interest
Peptide CV Names
Gene symbol core - reactome is very human centric. HGNC approved gene symbols
Peptide coordinates suffix
uniprot - chain feature
if peptide represented is not identical to peptide in chain feature
uncertain - '?'
Post-translational modification PTM prefix
PSI-MOD - contains majority of post-translational modifications, short abbreviations
uniquely labels peptide most of the time
Small molecule (chemical) CV
sources from ChEBI, KEGG compounds, PubChem, literature
http://www.allacronyms.com/
Complex and set CV names
if you have a complex with more than one member, concat names of members
similar for sets
complex: separate with ':'
set: separate with ':'
occurs more than once: name preceded by nx
Optional extension: not generally applicable, useful in Reactome
candidates set - has additional possible members in round brackets
precedence/hierarchy - square brackets
Pathway event (reaction) CV names
defined small set of terms - classify, define what's happening in a pathway event
e.g. transformation, binding, transfer, active transport, and more
anything with a defined catalyst with a definable molecular function
GO molecular function - verbs derived from GO terms
Q & A
----------------------------------
11:00 - 11:15
Pathway Commons: A public library of biological pathways
Emek Demir
MSKCC, USA

Abstract
Pathway Commons (http://www.pathwaycommons.org) is a network biology resource and acts as a convenient point of access to biological pathway information collected from public pathway databases, which you can search, visualize and download. All data is freely available, under the license terms of each contributing database. Biologists can search, visualize and download Pathway Commons pathways as part of an integrated network analysis workflow. Computational biologists and software developers can download all pathways in BioPAX, SIF and other formats for pathway and network analysis. They can also build software on top of Pathway Commons using our web service API. Database providers can share their pathway data via a common repository.
Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats (BioPAX, PSI-MI). A completely redesigned database and query system was developed in late 2013 to support a powerful web service API. Recent work focuses on developing a Pathway Commons query app for Cytoscape and new web based visualization tools, as well as expanding the data in Pathway Commons.

Notes
Data-driven discovery cycle: read a lot of papers, form hypothesis, etc
combine with high throughput profiles - don't play well with thousands of datapoints
TCGA paper: pathway covers 60% of patients. Go to Pathway Commons - missed it.
missing opportunity - not using systematic computational work
computational cell map: still use traditional site
culture change in computational biology
historically - pathway dbs from different rules, use cases, schemas
hard to combine and integrate
Want to find a common language - standardize - interoperability (Pathway Commons)
use easily to answer biological questions
a lot of barriers - hard to solve this problem
BioPAX - standard language - process language. Similar to what Reactome/biocyc/etc does
Pathway Commons:
workflow - once we get exports in biopax, can merge and align pathways
curation differences - curators pay more attention to different parts
reading different papers, reading different parts.
Different preferences
reluctant to merge automatically
want to find solution for this
Paxtools
started as simple reading/writing library
now expanded
import, export, huge library, can just use parts if you want
Pathway Commons access
multiple ways
removing barriers
webservice - graph search, fields search, path traversal
SPARQL endpoint
batch downloads
formats: BioPAX, SIF, GSEA, SBGN (exports topology and automated layout)
Clients - cytoscape, chibe2, virtual cell, pc java client, paxtoolsR
Use cases
much can be done using pathway information
Q & A
----------------------------------
11:15 - 11:30
The UniRule system for data integration and sharing
Claire O'Donovan
EMBL-EBI, United Kingdom

Abstract
The UniProt Knowledgebase is a central hub for the collection of functional information on proteins. It consists of 2 sections: UniProtKB/Swiss-Prot which is manually annotated and reviewed and UniProtKB/TrEMBL which is automatically annotated and not reviewed. Over the last few years, we have developed the UniRule system that leverages experimental UniProtKB/Swiss-Prot curation for the automatic annotation of UniProtKB/TrEMBL in order to address the exponential growth in uncharacterized proteins from the genome sequencing centers. UniRule consists of manually created annotation rules that specify functional annotations and the conditions which must be satisfied for them to apply (such as taxonomic scope, family membership as defined by the 11 InterPro Consortium members (GENE3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs) and the presence of specific sequence features) and as such is an illustrative example of data integration and sharing. The UniRules are applied at each UniProt release, and their predictions are continuously evaluated against the content of matching UniProtKB/Swiss-Prot entries, guaranteeing that the predictions remain in sync with the expert curated knowledge of UniProtKB/Swiss-Prot.
The collaboration with the InterPro Consortium is an illustrative example of data integration and sharing and UniProt now wants to extend this collaboration to the functional annotation community as a whole, exchanging rules and setting standards for consistent annotation.

Notes
Why automatic annotation necessary?
Lots of new sequences
lots of new sequences - no experimental work will be done on them
mission to attach information to them
automatic annotation enables us to provide accurate and standardized data
major issue: lots of data coming in is somewhat mixed in quality - not just correcting but updating.
While you might get funding at creation - no funding to keep it up to date
annotation could have been correct at some point, but need to maintain
helps us in our own db annotation
we were able to find new knowledge
highlights inconsistent annotation across family
Automatic Annotation
UniRule and SAAS
Steps in Rule curation
start with a protein family - look in papers, dbs, resources, distil that to manual curation
common annotation - across a certain family/orthology range
some annotation not appropriate, but as you're annotating you're thinking about that
rules become useful when biocurators and developers get together - supportive programming environment (rule maintenance)
data refreshed every month
tool will tell you when the source data for your rule has been updated for any reason.
Automatic Annotation in UniProtKB
rule creation - manual
anatomy of a rule: conditions & annotations
UniRule tool: webapp
repository for rules in prod and dev
navigation - filtering by annotation/condition, searching (free text, field)
batch editing - across system
history of changes for each rule - QA
you can see what the curator was thinking about - don't have to second guess
developers: drop down menus, enforce standards, enabling curators to do what they want to do
very easy to add/change
new feature - sequence features using predictors
signatures from interpro and elsewhere are good for diagnostic, but not as specific as biocurators would like
align component to template sequence - check if feature is actually there
new version beta available in september
Data sharing!
webapp - use from anywhere
upload/download in xml
changes documented and reversible
experts can contribute knowledge
plan: make available to scientific community
tested with visiting expert from Spain
want to include more researchers
want to work with others in automated function prediction
want to organize automated annotation workshop
Q & A
Q: Igor from Bader lab: Don't often see someone so excited abt software tool. Question about Rule: possible to use external resources?
A: Conditions and annotation - software engineers set it up so that new ones can be easily integrated
----------------------------------
11:30 - 11:45
Metadata audit between European Genome-phenome Archive and International Cancer Genome Consortium
Hardeep K. Nahal
OICR, Canada

Abstract
The International Cancer Genome Consortium (ICGC) currently comprises data from over 10,000 cancer genomes from over 40 different tumour types. Processed mutational data is submitted and stored at the ICGC, while the accompanying raw sequencing data (FASTQ and BAM) is required by ICGC member projects to be submitted to the European Bioinformatics Institute’s (EBI) European Genome-phenome Archive (EGA).
Currently, EGA has 91 controlled-access ICGC datasets. Mapping of data in ICGC with read data file submissions in EGA is facilitated by requiring ICGC member projects to submit sample/donor metadata information captured in XML files. A major challenge in distributed projects is to keep in sync submissions and data which are split between multiple resources. In response to user feedback which highlighted difficulties in linking data across the EGA and ICGC datasets, a recent preliminary file audit was performed for the purpose of identifying current problems in order to target corrective measures. This exercise identified differences in the formats used for sample and donor identifiers in the metadata information submitted to EGA and clinical data submitted to the ICGC. These identifier format differences pose problems for researchers and web portal developers trying to obtain or point to raw sequencing data for mutations they may find of interest in the ICGC. Here, we present the current status of the ICGC/EGA audit and details on our procedures to track and coordinate our efforts to correctly curate and map a consistent set of sample and donor metadata identifiers between EGA and ICGC. Future goals will involve implementing a validation step or enforcing quality control measures to ensure metadata information submitted to EGA maps to information submitted to ICGC. This audit will be critical for the Pan-Cancer Analysis project, which will require obtaining raw sequencing data from EGA based on information about sample/donors in ICGC.
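
A rough sketch of the kind of cross-reference the abstract describes, not the audit tool itself: it compares two lists of donor/sample identifiers and separates exact matches from identifiers that differ only in formatting and from identifiers missing on one side. The normalization rule (ignore case and `-`, `_`, space separators), the function names and the example identifiers are all assumptions made for illustration; the real EGA XML and ICGC clinical files are not parsed here.

```python
def normalize(identifier):
    """Reduce an identifier to a comparable form.

    Assumed rule (hypothetical): ignore case, surrounding whitespace,
    and the separators '-', '_' and ' '.
    """
    return "".join(ch for ch in identifier.strip().lower() if ch not in "-_ ")


def audit(icgc_ids, ega_ids):
    """Cross-reference ICGC identifiers against EGA identifiers."""
    ega_exact = set(ega_ids)
    # Index EGA identifiers by their normalized form.
    ega_by_norm = {}
    for ega_id in ega_ids:
        ega_by_norm.setdefault(normalize(ega_id), []).append(ega_id)

    report = {
        "exact": [],            # same string on both sides
        "format_mismatch": [],  # same identifier, different formatting
        "missing_in_ega": [],   # in ICGC only
        "missing_in_icgc": [],  # in EGA only
    }
    matched_ega = set()
    for icgc_id in icgc_ids:
        if icgc_id in ega_exact:
            report["exact"].append(icgc_id)
            matched_ega.add(icgc_id)
        elif normalize(icgc_id) in ega_by_norm:
            candidates = ega_by_norm[normalize(icgc_id)]
            report["format_mismatch"].append((icgc_id, candidates[0]))
            matched_ega.update(candidates)
        else:
            report["missing_in_ega"].append(icgc_id)
    report["missing_in_icgc"] = [e for e in ega_ids if e not in matched_ega]
    return report


# Made-up identifiers, for illustration only.
icgc = ["DO-001", "DO-002", "DO-003"]
ega = ["DO-001", "do_002", "DO-999"]
result = audit(icgc, ega)
for category, entries in result.items():
    print(category, entries)
```

A per-project report built this way could be sent back to submitters, with the format-mismatch pairs being the ones a curator can likely reconcile without further input.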
Notes
Bioinformatician at OICR - Francis group
Metadata audit between EGA & ICGC
ICGC - 71 cancer projects, 18 countries
42 cancer types
10K donors
submission cycle - 3-4 releases each year
challenges: we have to make sure that submissions and data between different resources are in sync
collect mutation data: simple somatic, cnv, structural somatic, gene expression, splicing, miRNA, and more
open and controlled data
open: publicly accessible - no individual identification
controlled: identifiable. Apply for permission from DACO - approved researcher can get raw sequence (BAM, fastq)
ICGC DCC Pipeline
connection icgc, researcher, ega
ICGC - collect tumor/normal sample from donor, sequence
send to DCC
ICGC validated, DCC data portal
raw data -> EGA
also metadata and xml files
researchers - interact with DCC data portal
further analysis, apply to DACO, get BAM files from EGA
EGA - 110 controlled access datasets
ICGC Data portal: search for mutation, go to donors in project
interest in specific donor - specimen level - obtain clinical info.
Raw data
apply to DACO, get approval, go to EGA
but then can't find BAM files associated to patient???
metadata not always complete
ICGC & EGA
mapping between submissions, study, sample/donor - submitters
hierarchy to get to sequence file - lots of potential discrepancies
in response to user feedback - difficulty linking EGA and ICGC datasets
identify current problem
corrective measures
tool
xml files from EGA and x-ref with ICGC
compiled some project info, sample/donor identifiers, accessions
audit reports - sent to individual ICGC projects
what's missing? work individually
issues:
differences in formats used for clinical identifiers
sample donor identifiers were not necessarily the same
which sample tumor or normal??
will be aligning tumor and normal (pancan)
EGA dataset can match a number of different studies
older projects: EGA no longer existed, mapped to different study
metadata issues - donor identifier submitted to EGA and ICGC were different
also, no info on normal or tumor
sent to project and asked to match/fill out
sometimes information captured in free text - hard to parse programmatically
use controlled vocabulary
Status:
still working with projects to clean up the data
study level matching between ICGC and EGA
where projects submit data to ICGC - required to fill in accession identifier - cross field validation check.
improve validation criteria
need to improve documentation and guidelines
don't realize that identifier info is what cross links the information together
also working with EGA - fix future data
validate metadata consistently
Q & A
Q: How does it come to be that it's not so easy for submitters to submit the information? Web interface?
A: That's an option. Right now, they put an xml file together.
----------------------------------
11:45 - 12:00
The ENCODE metadata standard to integrate diverse experimental data sets
Eurie L. Hong
Stanford University, USA

Abstract
The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Now in its 9th year, ENCODE has grown to include more than 40 experimental techniques to survey DNA-binding proteins, RNA-binding proteins, the transcriptional landscape, and chromatin structure in 400+ cell lines and tissues. All experimental data and computational analyses of these data are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community.
Metadata describing important experimental conditions, such as the biological samples, specific reagents, and protocols necessary to replicate the assays and their analysis, as well as data standards, such as quality metrics for data files and antibody characterization documents, have been expanded and are being submitted to a newly-formed DCC. As the volume of data increases, the identification and organization of data sets becomes challenging. Here, we describe the design principles of how metadata are organized and annotated at the ENCODE DCC in order to facilitate the identification and comparison of data sets generated by the ENCODE project. The metadata are stored in a structured data model and annotated using ontologies to ensure high quality metadata that is interoperable between multiple projects. In addition, the breadth of metadata that describe the biological samples, reagents, methods, and protocols support reproducibility of assays and promote easy identification of shared resources in the project. The organization of the metadata will allow flexible and powerful searches on the revamped ENCODE Portal, the public website of the ENCODE project, as well as support intuitive displays of biological samples, reagents, and protocols used for an experiment. Data from the ENCODE project can be accessed via the ENCODE portal (http://www.encodeproject.org) and the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway).

Notes
3 ways working to integrate diverse experimental sets
1) within ENCODE - datamodel
2) Ontologies to annotate metadata
3) ENCODE technical implementation of website - access metadata
ENCODE DCC
Mike Cherry, Kent
Data Wranglers and soft engineers
qa, sysadmin, admin
What do we do
Data producers
Analysis group - computational methods/predictions
DCC - store all these files, track progress, make portal so the rest of the community can consume
Challenge: Can you define a metadata standard for diverse assays in multiple species?
mouse, human, fly, worm
understand what labs doing, generating data, want to retrieve data, what questions want to ask
set of principles driving metadata to collect:
transparency - how assays being done. provide protocols
reproducibility - capture what files being used to generate additional files
Capture experimental design
experiment: all assays - 2 biological replicates
control - replicate
data files generate results files
capture this all flexibly
identify reusable experimental variables
biosample - reusable reagents
run on the same samples - broad picture of genomic state of sample
antibodies
libraries
files
help labs uniquely identify the specific metadata, sources, ids
define relationship to each other
labs have been using this for almost a year to submit their metadata
battle tested - working pretty well
Annotating terms - using Ontologies
common ontologies among different data projects can really improve the integration of data
ENCODE & Roadmap Epigenomics - both internally consistent
but only 3 biosamples match exactly between projects
need to read every single term and description to figure out what samples you need to look at!
Remap to ontology terms
uberon for tissues
biosamples - uberon - ontologies
talking to other nih projects - key points of interaction: assays that were done, samples used, ontologies can integrate the data
faceted browsing
can use the ontologies to drive searches
Technical implementation of website
integrate with other resources
metadata in JSON-LD - viewed as web page
labs can query and submit their metadata using REST API
integration with other resources - grab json object and parse for identifier
Conclusion:
datamodel
biocurators - set data standards - how to best define data across consortiums
standards should be setting
data access - how to make available for all the other communities
programmatically access
Q & A
Q: LIMS systems - did you try to write a tool/plugin for the LIMS to make them generate the metadata you want?
A: we've been talking to production labs using LIMS systems to use ontologies. How to control their LIMS system. Production labs - providing their own scripts to access our API. Short scripts. Just do a mapping - doesn't matter what their LIMS system is. Every single lab has made scripts to submit data.
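
To make the "just do a mapping" answer concrete, here is a minimal sketch of what such a short lab-side script might look like. Everything in it is an assumption for illustration: the LIMS column names, the target field names and the list of required fields are invented, and the step that actually sends the JSON to the DCC's REST API is left out because the notes do not describe it.

```python
import json

# Hypothetical mapping from one lab's LIMS column names to the field names
# a metadata schema might expect. Both sides are invented for illustration.
FIELD_MAP = {
    "SampleName": "biosample_name",
    "Tissue": "biosample_term_name",
    "TissueOntologyId": "biosample_term_id",
    "Antibody": "antibody",
    "Rep": "biological_replicate_number",
}

# Fields this sketch treats as mandatory (an assumption, not a real schema).
REQUIRED = ("biosample_term_id", "biological_replicate_number")


def lims_to_metadata(record):
    """Rename known LIMS columns, drop unknown ones, check required fields."""
    doc = {FIELD_MAP[key]: value for key, value in record.items() if key in FIELD_MAP}
    missing = [field for field in REQUIRED if field not in doc]
    if missing:
        raise ValueError("missing required metadata: " + ", ".join(missing))
    # LIMS exports are often all strings; store the replicate number as an int.
    doc["biological_replicate_number"] = int(doc["biological_replicate_number"])
    return doc


# Example LIMS row; "Freezer" has no mapping and is dropped.
lims_row = {
    "SampleName": "liver-01",
    "Tissue": "liver",
    "TissueOntologyId": "UBERON:0002107",
    "Rep": "2",
    "Freezer": "B4",
}
payload = json.dumps(lims_to_metadata(lims_row), sort_keys=True)
print(payload)
```

Keeping the mapping in one dictionary means each lab only has to maintain its own `FIELD_MAP`, whatever LIMS it runs, which matches the point made in the answer above.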