Dear All,

Here are minutes from the conference call yesterday. Thank you to Matt for checking over the notes I made about the incorporation of (many) other genomes. Please send corrections to me.

Thanks,
Mary Ann

Minutes from the conference call Nov 17th 2011

* 1) Infrastructure changes.
*
* WormBook mirror: will move from wormbase.sanger.ac.uk to
* wormbook.ensemblgenomes.org
* => the link on the main WormBook site needs to be changed to the new URL
*
* WormBase mirror: will go offline next week Friday
* => the link should be removed from the respective WormBase master and
* mirror sites
*
* I announced that already a few weeks/months back, but as the deadline
* is approaching, I think it is worth bringing that topic up again.

Michael P: These changes happen on 25th Nov.

Discussion about Sanger FTP site.

Kev: We are still using Sanger FTP. Code will not be ported to EBI until 2012.

Todd: There should be a single FTP site - ftp.wormbase.org
This mirrors the Sanger site, but there is a delay between data being available on Sanger FTP and on ftp.wormbase

Raymond: Would be good if there were two dynamic links which were updated as data became available (for current and dev release).
Q. Is there a plan to have this on ftp.wormbase?
A. Yes. There will be automatically updated symbolic links pointing to appropriate releases (Todd)

Kev:
Q. Is the link to the dev release created when it is staged to staging.wormbase.org?
A. Yes (Todd)

There is a delay of 1 month between a release being made available at Sanger and at ftp.wormbase.
Todd: It does not take that long but it allows for any problems to be resolved.

Todd: Release is not staged on ftp.wormbase but in(by?) a staging machine:
1) data mirrored from Sanger
2) convolutions
3) push to ftp site and create sym. link.

Raymond: it's useful to have new release (dev) ASAP. Would like to continue to get latest release by cron job to Sanger.
Resolution: Todd will push development release to a Caltech server after it has been synched from Hinxton to OICR

* Models
*
* 1) From Michael Paulini
* I would like to clean up some tags in the Homology_group and Accession_number
*
* In Homology_group i would like to change
* DB_info Database ?Database ?Database_field ?Accession_number XREF Homology_group
* to
* DB_info Database ?Database ?Database_field ?Accession_number
*
* and in Accession_number remove following then unused tags:
* Web_location
* 3d_data
* Homology_group

This model needs further consideration. Other classes should also be reviewed.

* 2) Extending WormBase beyond nematodes.
*
* Another thing for the agenda would be, if there has been a decision on
* extending outside of nematodes (based on the flatworm thing last
* week).

Useful information:
SchistoDB http://schistodb.net/schistodb20/ (is this the one?) Uses GUS.
Schistosoma japonicum available at GeneDB http://www.genedb.org/Homepage/Sjaponicum
also being loaded into a GUS database (a collaboration involving J.Kissinger and a group from Shanghai).

Are we able to accommodate flatworms in WormBase?

SchistoDB (Brazil)
- has the infrastructure to handle incoming data.
- The PI (Guilherme Oliveira) is writing a Brazilian application to establish flatwormDB using existing GUS infrastructure

Matt:
- 50 genomes project - including flatworms (10-20%)
- picking high profile genomes across the phylum
- generating preliminary draft genomes for each one. This way, many groups in the community get a little information, rather than a few groups getting a lot of information.
- a new MRC funded position is available for an informatician working on gene finding. Assemblies will be done within the Berriman group
- from an overall list of ~100 genomes, effort will be put into the ~50 most tractable first.
- at the time of writing, 30 are in progress, 3 are assembled.
- Example: Onchocerca volvulus (roundworm causing river blindness) has 1/2 of its genome in ~250 pieces.
- data will be available by the end of Dec 2011.

The pipeline:
@ get sequences (fasta)
@ submit to INSDC - can we skip this step? (Matt)
@ analysis - homologies, repeats
@ FTP
@ to Todd
@ make GBrowse and BlastDB

Exploring doing more than GBrowse and Blasts:
Gary W created a 100 genomes acedb. It worked, but was slow. It certainly simplifies certain processes if all genomes are in acedb, but with clunky availability, is it necessary?

Todd: We could use ace for data integration and then create a separate database for driving the website. It's hard work getting data in and out of acedb. Would be easier with a relational database e.g. Oracle, though this is not an option we would seriously consider at the moment.

Take-home message: acedb doesn't scale well for speed, but it can cope.

Q. How deep do we want integration of all these new genomes to be?
Matt:
- initially, the ability to browse and blast will be fine.
- the species have been chosen because groups are working on them i.e. there is community interest.
- we therefore do need scope to go deeper in the future and as groups do further work.
- the initial findings will lead to publications which will then create the need for further integration, and this should be addressed in the grant renewal.

Currently Blasts are not available for all species (in WormBase). This is because they are hard coded. The new website will look at what's available and then create Blasts accordingly - Xiaoqi is working on this.
Ascaris - if there are no orthologs, certainly need Blast.
Trichinella - available on gbrowse, but no Blasts
S. ratti - there is a problem - should be resolved with new site.

Michael P: Heterorhabditis bacteriophora will be WS229. Trichinella should have been in WS226.
Todd: it's 1/2 way in place.

Michael P: Orthologies to elegans genes are in GFF file, along with approved-style gene names.
Tier I and II genes have orthologies to tier III proteins. This info. is in the acedb database (not flatfiles) and appears under Other_orthologs in the Gene summary page.

Q. Website displays - what do we need?
Matt:
- browsable way to highlight clade specific genes
- host information e.g. Mark Blaxter has decorative icons in his Nat. Gen. nematoda phylogeny paper or e.g. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2432500/figure/pntd-0000258-g001/
- all 50 genomes are parasites, but not all of humans.
- In orthoMCL you can click and select what you want to see when defining a group e.g. IN clade I & II BUT NOT V. (see http://orthomcl.org/cgi-bin/OrthoMclWeb.cgi?rm=groupQueryForm&type=ppform )
- Interactive phylogeny browser - would be good.

Raymond: acedb has benefits. Could it be used to store all data other than sequence data?
Kevin: The website would need a way to link from non-seq to seq data

50 genomes - RNA seq data.
- no - will sequence RNA if provided, but in most cases this is not practical given the time-frame.
- although there will be a mixture of analysis methods, mostly it will be highly automated gene finding.
- if someone is doing RNAi, it will be relatively trivial to do additional RNA sequencing, resulting in better gene models.

Raymond: Is it possible to include genomic and RNA seq when assembling?
Matt: It could be done, but currently the pipeline does not allow it. It will be done later though.
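
Illustration of the clade selection raised under website displays ("IN clade I & II BUT NOT V"): a minimal sketch, not orthoMCL's or WormBase's actual implementation. It assumes each ortholog group is recorded with the set of clades in which it has members, and that "I & II" means present in both. The function name, the group IDs and the example data are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def select_groups(
    groups: Dict[str, Set[str]],
    required: Iterable[str],
    excluded: Iterable[str],
) -> List[str]:
    """Return IDs of groups present in every required clade and in no excluded clade."""
    required_set = set(required)
    excluded_set = set(excluded)
    return sorted(
        group_id
        for group_id, clades in groups.items()
        # subset test: all required clades present; intersection test: no excluded clade present
        if required_set <= clades and not (excluded_set & clades)
    )


# Hypothetical data: group ID -> clades containing at least one member gene
example_groups = {
    "group_A": {"I", "II"},
    "group_B": {"I", "II", "V"},
    "group_C": {"I"},
    "group_D": {"I", "II", "III"},
}

if __name__ == "__main__":
    # "IN clade I & II BUT NOT V"
    print(select_groups(example_groups, required={"I", "II"}, excluded={"V"}))
```

With the example data this selects group_A and group_D. A website display would presumably run the same kind of filter over whatever ortholog groupings end up being stored.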