ISB2014 Workshop4 Big Data Curation
The Seventh International Biocuration Conference
Workshop 4 - Big Data Curation, Chaired by Owen White & Francis Ouellette
The Great Hall, Hart House
April 8, 2014
http://biocuration2014.events.oicr.on.ca/agenda-5
Link to all ISB2014 notes: http://etherpad.wikimedia.org/p/isb2014
Editors (name / affiliation / twitter):
Abigail Cabunoc / OICR / @abbycabs
Panel:
Owen White (OW) - chair
Francis Ouellette (BFO) - chair @bffo
What is Big Data?
BFO: Asked panelists for a 140-character definition of big data.
Chris Hunter (CH) - GigaScience/BGI-Hong Kong @GigaScience
David Landsman (DL) - NCBI @DavidLandsman
New name for an old thing - deal with multiple layers and multiple types of information.
Mike Livstone (ML) - Princeton
The more data you're working with, the more important the metadata. What makes big data big - it pushes the boundaries of what you're able to handle. How much data is big? Recuration of metadata of datasets. Collect more and more datasets, they become more and more diverse.
Raja Mazumder (RM) - @rmazumde
Big data today might not be big tomorrow. All of the tools for biocurators may not apply for big data analysis. Effort has gone into annotation. Data shouldn't move, computation should move.
Victoria Newman (VN) - @SciData_VN
Editorial biocurator at Scientific Data - new journal from Nature.
Marc Perry (MP) - OICR - @mdperry
When tech lets us abandon hypothesis-driven questions in favour of data-driven research. Curator @ modENCODE last five years. Time & energy. Sometimes $$. How long to copy files, launch VM, computation.
BFO: Curation of big data - what big data looks like. Do we have to curate big data? Or just let machines process it? Do we have to have biologists look at it?
DL: I think that what we do is collaborate with experimentalists who make datasets for us, then we analyze the datasets, then we talk to them about the results. Largely ChIP-seq data, RNA-seq data. Big experiment: merged into one analysis. There are so many pitfalls in preparing the data. It's difficult to know what the outcome's going to be till you see it on the chromosome or averaged out in some way. End up putting garbage into place - antibody incorrectly applied, amplification at wrong place. Once we have a view of what we're looking at, that information is very useful. But how accurate is it? How easy to annotate it? Hard to say. How would one expect a biocurator to apply that? Challenged by idea of curation of big data - ChIP-seq data. RNA-seq data - can maybe do differently.
OW: Example in mind: generating lots of ESTs. Faster to manage the data by searching the protein archives against the ESTs. Inflection point. Another point: instead of moving the data to computational resources, bring the analysis to the data. Have we hit one of those inflection points?
RM: There has to be a paradigm shift in biocuration because big data is big. Has a lot of information. Means ppl are concerned with security. Platform we are thinking of - analyze big data, not wait 10 days, share analysis results where everything that was done is recorded - evidence tracking. Transparent to ppl who have access to the system. Store it in a way that is smaller than what we started off with. Platform should have the ability for other ppl to write applications on top of it.
OW: 2 inflection points you mentioned. 1) Reduce data. Not just raw reports. 2) There is something useful about apps.
BFO: 3) Security. Everyone else's genome.
Q: Inflection point: humble BLAST search - where big data has become unhelpful. Top 500 hits same gene. Swimming in uncharacterized stuff - focus on curating that set.
Suzi Lewis: What do you want to use it for? Locating things - not interested in that.
What are the relationships, how is it tied together. Just located - one thing. Seeing relationships between data - another thing. Variant - Apollo.
Warren: From a biological standpoint: thinking of networks of interactions. Not diving deep in one particular gene. Seeing ppl now throughout the biomed community thinking about things from a fundamentally different perspective. A lot of things. Fundamental shift in the way we start thinking about biology and clinical care.
Q: Bad as well as Big. Also curate in our field. Challenge - takes well curated data set from ChIP-seq. Very excited, then very depressed. Even though all of us have access to datasets - how many replicates, how reproducible, can I put my name behind it. Big data - small number of datasets. Significant challenges - moving into new technology space, degree of erroneous reporting of findings is going to increase -> noise is very high. Poor quality - ppl are unused to working with big data. Credibility challenge. Work closely with publication environment.
Iddo: Analogy - microbiology - rare biosphere - really long tail of distribution of species. Very few species out there we can see. Others barely exist - still there, still exchanging genes, waiting for next environment shift. Big data like rare biosphere - most data we won't look at. 10^40 phages on earth - will not look at all of them. Ppl dealing with data - duty to maintain and curate in the old sense - as scientists we know that we may need it at some future date. Big data is taking care of what is interesting and beneficial and good for us right now. Idea that we may need the long tail of data.
BFO: Function of journals and publishers in big data space. Responsibilities, functions, wishes? Disk storage is expensive. Have journals and authors thought about how to work in that space?
CH: GigaScience - relatively new journal, new way of thinking about it. Make sure all reproducible. Galaxy installation - reproduce ppl's pipelines and analysis.
Anyone can reproduce using same infrastructure. Comes at cost - need an informatician in place for ppl to use Galaxy. Funded - so it's free now. Continue as long as we can. Will have to address cost at some point.
BFO: PB storage? Entry point for this space?
RM: It is so expensive to store data. Compute on it - even more expensive. Completely replicate a study - it will be really hard for a journal to store all of the data indefinitely. Don't have an answer, but skeptical for a journal. Maybe some other institute should step up.
ML: Turn question around? Journal storing data - is a responsibility and also a privilege. Control the access to it. Yes, there are costs. Perhaps thinking about passing costs along to ppl who want to use the data. A lot of the data was paid for by public funds. Does the journal own the data? Funding agencies? Can public access? Should it go to a repo controlled by funding agency?
Suzi Lewis: Question for journals - there is always going to be cost involved. We've already figured out how to pay for things (i.e. editors). Bunch of ppl in this room that are effectively data editors. What would the world be like if we had not just copy editors but data editors?
CH: Good point - I would like to see it - GigaScience covers all of science. Expert reviews coming in to make sure data is intact and curate it. Looking at that.
VN: Scientific Data is working with 6 different repos catering to science. Hope to outsource the storage - have in-house curation and data review (editorial board, academic scientists in various domains). Big data is not a new thing - Romans with census. Now we have the tools and capacity to generate granular data at significant volume. Find relationships among datasets unlike before. Garbage in, garbage out. Generate as much data as you like, but without expert curation - need quality control on the data. Important to have peer reviews. We have had some editorial board members re-doing statistical analysis.
High level of engagement with work that's coming out. Cut down on retractions - waste of public funding. I think that curation forms an important part of this whole pipeline.
DL: Inflection point - good thought process. A lot of the data we derive is publicly funded. Apply for grants. Open source library of APIs you can interact with. Data would then have to be open to everyone else. It's in that environment // Sorry, zoned out
BFO: As a government employee - what kind of timeframe?
DL: More than one effort. There's a lot to be ironed out. Cost effective? Will it promote the right activities?
OW: Instead of archive - some agency pay-as-you-go? Apply resources to NCBI to host that information? Another inflection point: Concept of the archive is eroding. Before: Generate sequence - pristine object. Sequencing is an assay - why would you hold on to it?
DL: One pile you can play with. Another pile with manual annotation.
MP: At modENCODE - announced that SRA is going to stop accepting sequences. Why do we need to archive this stuff? As it gets cheaper, I can run a ChIP-seq and throw it away. Hinting at this down the road for certain things. One-off - do you really need to keep it?
BFO: Another inflection point - unlimited amounts of DNA - cheaper to store the DNA than the sequence.
DL: Sometimes a sample might be so rare and invaluable.
ML: Storing in cloud: Google doc.
Q: Value of sequence - cancer genome. Irreproducible. Many things we do are highly reproducible. No value behind tissue - what is the value of a specific data item, dataset, collection?
OW: Seen at NIH - sequence itself not relevant at all. Doesn't compare in relevance to the metadata associated with it. Role of this society can be stronger. Arguing more strongly for value of curation. Precious - information that
Q: Provenance: how can we do a better job of describing the datasets? How can we predict what is important from that? What will be useful to keep in the future?
OW: World where thousands of microbiome samples. Label 'stool' on thousands of them. Other types of quality issues. Wonderful things from Google - takes thousands of links and brings you to more relevant sites than not. SRA doesn't really do that. There's lots of different levels of quality - ranking index. Publication, methods, metadata, other publications cite - biochem pathways. Nice to have information ranked for you.
RM: How do you train biocurators to handle a different datatype than what other biocurators are used to?
OW: Small number of donations to do this, develop committees to look at minimal set of datatypes to track. Narrow down to something like 21 variables. Best way to describe these variables. Build software - make it much easier for the users - naive users to mark up their variables and submit to dbGaP. High degree of scholarship required to do that. Diversity of data - specialists, how to retrain. Be prepared to be generalists.
BFO: Degrees, workshops, papers, online tutorials - many venues available. Put it in a blog - curators will find it.
Q: Inflection point question: WebApollo - our data is connecting many ppl's small data. Generate data - big data.
ML: Make sure researchers are working from the same playbook, common set of information. Inflection point: having curators side by side with actual researchers doing the bench work/fieldwork, working together on that. Poster 88.
Q: We need to train ppl to use some of the tools - need to make distinction between professional curators and amateur annotators. It's all research - one big thing. We're working a bit on the side since we're developing specific tools.
BFO: Differentiation between professional and amateur curators?
OW: I find the possibility of using systems like Mechanical Turk exhilarating. Few years ago tried to take a step further - two levels of contributors. Received more pay to review output of new guys.
From the presentation, you can paste together - how much $$ (didn't care, just participating)
VN: Working in publishing - not so much amateur curation. Lots comes out with zero curation. Release data with papers - some stuffed somewhere - very little quality control. Someone comes along, tries to use the data - can't always get at it. Amateur curation one step up from zero curation.
Q: Amazon Mechanical Turk - gene mutation capture - will talk at workshop. We found that throughput is terrific, paying 7 cents an item. Good way of recruiting good turkers. If you frame the task appropriately, you may not need the deep expertise that curators have. You can get pretty good accuracy with smart aggregation. Expert curator needs to review it.
Suzi Lewis: Every little bit helps. Every single point - getting ppl involved. How can it be more embedded and integrated so there isn't zero curation? Every little bit we can add is all to the good. Figure out lots of solutions.
Q: Dorothy Riley: Doing the same kind of thing - require bench scientists to annotate some minimal things. Before samples submitted to in-house lab - captive audience. Where publicly could you put that requirement in?
ML: Resistance. Having actual curators working next to ppl Dorothy just described. It would help to have curators there at the beginning. Help take responsibility off the bench scientists.
Q: Amateur curator: In my experience - complete specialists care about just their area. Provide standards with ways to do it. Standards with ontologies. Software tools (WebApollo and other). Use this to generate the data. They're going to do this a lot.
Q: Still ppl who do a project, go annotations available for their species, their project, and it's not good. They have useful information that is out there, it doesn't come back into the databases. Try to integrate them into the databases.
Suzi Lewis: Biodiversity folks - process of writing up and submitting for publication.
They have a lot of data - full of specimens. Good to look beyond our current community - other forms of curation. Museum collections.
CH: More about analysis in present. Broad spectrum of things being discussed. GigaScience is completely open - no paywall.
RM: In terms of big data, right now one of the big challenges is NGS data - create secondary and tertiary dbs on top of datasets. Metadata curation. Can be traced back to UniProt, PANTHER, etc. Technical capabilities to analyze data not there for most groups. Expensive and challenging.
BFO: Even on that front - ICGC is doing that. Processing/interpreting large scale human genome data (part controlled, part open access). There are several efforts working on this. Inflection point: hasn't happened to everyone yet. Know there's a problem, but how to think differently about it? Software to data is an inflection point.
DL: Expand definition. Big data not only at NCBI - all over the place. Point I want to make about big data: although we are required to deposit our experimental data in open data repositories, suspect the number of ppl who actually do that is very low -- <50%. Why don't they do it? It's too hard!
OW: When I was invited here to co-chair, first reaction: big data is sort of a buzzword - all this is nonsense. We're not presented with anything new. As someone who has the perspective over the long haul - we have identified some inflection points that are real. But there are some things that are very different. Unanticipated, crept up on us, some obvious, some not. Went to one of the first big bacterial meetings - stunned silence - b/c everyone in the room was dealing with the fact that we are in a different era - wonder if in our own way we are going through a period where we are wrestling with how to fit in with this. Huge opportunities.