WBConfCall 2011.11.03-Agenda and Minutes

WormBase Site-Wide Conference Call Meeting Minutes

November 3rd, 2011

modENCODE Update

- 1) Gene regulation networks paper
- 2) Transcriptome paper
- 3) Chromatin paper

Data providers trying to get data in
Deadline was midnight on Monday (Halloween)
409 submissions are being prepared for release (work continuing over the next month); they will then go into modMine
New tracks will be on WormBase
Lead project, potentially 38 datasets
Snyder, Gerstein - chromatin/transcription factors
Waterston: a couple hundred (~233) datasets being vetted:
- RNASeq datasets
- briggsae, remanei, japonica, brenneri alignments
- Transcript profiling of embryos was done before; now from the 4-cell stage on, every 30 minutes
  - Spot checking number of cells at peak of time points?
  - Aligned with other data in WormBase, timing needs to be calibrated
  - Hopefully embryos accurately staged
  - Example, "2 hours" vs "80-cell stage"
- daf-2, him-8, fem-2, etc. mutants
- pathogen-infected strains
Last data will come in on March 31, 2012; the 10 labs (data providers) will have no more funding
Waterston et al. will be providing data up to the very last day
DCC is being funded longer (March 31, 2013?)
DCC working on archiving the data and making data usable, accessible before closing shop
DCC has to clean up after big papers; supplemental analyses that were not provided adequately
Data analysis protocols will be documented on the DCC Wiki

Todd's Points

Couple possible options:
- 1) Dispense with the Tiers strategy OR
- 2) Keep it internal

Can ACEDB handle the Tiers system?
Maintaining identifiers between builds is a fairly major commitment
Manual curation associated with gene identifiers
What happens if we don't maintain unique identifiers? Versions, more digits on ID?

For the assemblies we have, each gene has an ID, just not a WBGene ID
Treat a genome as just a genome
Don't need to worry about conditional checks for each species
- OK for build and website?
- Website, yes
- Build - we don't have all the data for Tier III species like we do Tier II species, etc.
- Todd - Absence of data isn't a problem
Curatorially keep separate, but from user perspective keep consistent
Need to consider before majorly scaling up number of genomes
Build may require more thought
Different data types will pop up for other species; e.g. RNASeq data for other Caenorhabditis
Erich Schwarz planning to do a reassembly of C. angaria
Brugia will get reassembled
Tiering, to some degree, is just internal; have tried to keep Tier system away from users
Non-core species, objects will not stick around in current state
Generally, genomes will not be reanalyzed substantially

Two threads of conversation
- 1) What does an identifier mean? Entering into contract with users when providing identifiers
- 2) Some genomes get identifiers, others don't; this should be addressed

Virtually no work on the build side to add many genomes with GFF files, etc.
How do we develop concise descriptions, build pathways without IDs?
What do we plan to manually curate?
May want to see C. elegans concise descriptions on other genomes' gene pages (orthologs)?

We could get IDs from third parties; datasets already have identifiers
Instead of applying WBGene IDs, use existing IDs from data providers
Problem for new assemblies; actively reworking assemblies
Part of data submission standards; if you re-annotate genome, mandate mapping IDs over?
Stable genomes may not be an issue
How onerous is remapping the genome?
Add fine print to contract with community, requires user remapping
Let gene ID perish? Keep sequence for gene model, but don't re-use it?
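
Rough illustration of the "map IDs over" and "let gene ID perish" options above. This is a sketch only, not WormBase build code: the GeneRegistry class, its method names, and the IDs in the usage note are all made up for the example. It assumes the data provider supplies an old-ID to new-ID mapping with each re-annotation.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRegistry:
    """Hypothetical registry of provider gene IDs for one genome."""
    live: dict = field(default_factory=dict)     # gene ID -> current gene model sequence
    retired: dict = field(default_factory=dict)  # gene ID -> last known sequence

    def add(self, gene_id, sequence):
        # An ID that was ever issued (live or retired) is never handed out again.
        if gene_id in self.live or gene_id in self.retired:
            raise ValueError(f"{gene_id} has already been used")
        self.live[gene_id] = sequence

    def reannotate(self, new_models, mapping):
        """Apply a re-annotation and return the IDs that were retired.

        new_models: provider ID in the new annotation -> sequence
        mapping:    old ID -> provider ID in the new annotation
        """
        carried, claimed = {}, set()
        for old_id, new_id in mapping.items():
            if old_id in self.live:
                # Mapped gene: the old ID survives and points at the new model.
                carried[old_id] = new_models[new_id]
                claimed.add(new_id)
        lost = sorted(set(self.live) - set(carried))
        for old_id in lost:
            # Unmapped gene: the ID perishes, but its sequence is kept.
            self.retired[old_id] = self.live[old_id]
        self.live = carried
        for new_id, sequence in new_models.items():
            if new_id not in claimed:
                # Genuinely new gene model: register it under its provider ID.
                self.add(new_id, sequence)
        return lost
```

Usage, with invented IDs: after `reg.add("g1", "ATG")` and `reg.add("g2", "CCC")`, calling `reg.reannotate({"n1": "ATGA", "n3": "GGG"}, {"g1": "n1"})` keeps g1 (now pointing at the new model), registers n3 as a new gene, and retires g2 with its old sequence. A later `reg.add("g2", ...)` is refused.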
