
CCC for dicty Timeline - 2011

  1. November 2012 - Weekly Pipeline Set-up
  2. June 2012 - Weekly Pipeline Set-up
  3. Implement Textpresso 2.5 for dictyBase - Arun DONE
  4. Automating dicty literature upload - Arun, Petra, Sidd - in progress
  5. Clone CCC curation form for dictyBase - Juancarlos, Kimberly
  6. Test searches - from web site or by script? - Arun, Kimberly, Juancarlos
  7. dicty curation - Petra, Kimberly
  8. Output files - Petra, Kimberly, Sidd, Juancarlos

Conference Call Notes - 05/05/2011

Review dicty Paper Pipeline and Textpresso

The current Textpresso for dicty is probably used more by dictyBase users than by curators

Papers have been added to the Textpresso corpus as they are curated; fewer were added last year because the curation focus was on gene models

PubMed searches using keywords (e.g. Dictyostelium) find papers; PDFs are then downloaded manually and the relevant genes attached

This is a bottleneck; is there an easier way to do this, e.g. upload multiple papers at once?
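
As one possible starting point for scripting the keyword search step, the sketch below builds a query URL for the NCBI E-utilities `esearch` endpoint. It only constructs the URL and does not download anything. The function name, the default `retmax`, and the date values in the usage note are assumptions made for illustration, not part of the existing dicty pipeline.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(keyword, mindate, maxdate, retmax=200):
    """Return an esearch URL for a keyword restricted to a publication date range.

    Dates are given as YYYY/MM/DD. This helper is hypothetical and is only
    meant to show which query parameters are involved.
    """
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": keyword,       # e.g. "Dictyostelium"
        "datetype": "pdat",    # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,      # maximum number of PMIDs to return
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)
```

Calling `build_pubmed_search_url("Dictyostelium", "2010/04/01", "2010/11/30")` yields a URL whose response lists matching PMIDs; fetching the PDFs themselves would still need a separate step.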

Other Options for Paper Download

Can use scp or another file transfer protocol, or give Arun an account on a machine at dictyBase and he can get the papers via a script

Preprints are okay for Textpresso, so can download the full text as soon as it's available

Alternatively, automated downloads from PMC or directly from journal web sites could be put into place, although downloading from journal sites involves more site-specific details

dictyBase could provide Textpresso with relevant PMIDs and Textpresso could set up the download pipeline

PMC has a six-month delay, but right now that should not be a problem; we can revisit it in the future

Future Plans

Once new papers are in the corpus, perform CCC search on dicty papers from April 2010 - November 2010. This will allow dicty curators to look over the results and determine how well the searches are working for them.

Update dicty gene list (last update was August 2008), add synonyms, and consider if there are any gene names that might contribute to false positives (e.g., ER for TAIR)
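
A minimal sketch of how the gene list might be screened for names likely to cause false positives, in the spirit of the ER example. It assumes two simple heuristics that are not stated in the notes: a name is suspicious if it matches a cellular component term or abbreviation, or if it is very short. The function name, the threshold, and the sample names are hypothetical.

```python
def flag_ambiguous_names(gene_names, ambiguous_terms, min_length=3):
    """Return a dict mapping each suspicious gene name to the reasons it was flagged.

    gene_names: iterable of gene names and synonyms.
    ambiguous_terms: component terms or abbreviations that could collide
        with gene names (compared case-insensitively).
    min_length: names shorter than this are flagged (an assumed cutoff).
    """
    ambiguous = {term.lower() for term in ambiguous_terms}
    flagged = {}
    for name in gene_names:
        reasons = []
        if name.lower() in ambiguous:
            reasons.append("matches a component term or abbreviation")
        if len(name) < min_length:
            reasons.append("shorter than %d characters" % min_length)
        if reasons:
            flagged[name] = reasons
    return flagged
```

Flagged names would still need a curator's judgment; the heuristics only narrow down which entries to look at first.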

Time frame: mid-June for generating source files for testing search results

Previous Work - 2009/2010

Proposed Curation Pipeline

  1. New papers are processed with an SVM developed for dicty cellular component curation
  2. Sentences from resulting positives (confidence level TBD) are subject to Textpresso category searches
  3. Positive sentences are presented to curators along with suggested GO terms and GO IDs
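
The three steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the SVM is represented by an arbitrary scoring function applied to whole papers, a plain keyword lookup stands in for the Textpresso category searches, and the small term table is a placeholder rather than the actual CCC categories. All function and variable names are hypothetical.

```python
import re

# Placeholder keyword -> (GO term, GO ID) table; the real categories are far larger.
COMPONENT_TERMS = {
    "nucleus": ("nucleus", "GO:0005634"),
    "mitochondrion": ("mitochondrion", "GO:0005739"),
}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def run_pipeline(papers, score_paper, threshold):
    """Return (paper_id, sentence, GO term, GO ID) tuples for curator review.

    papers: dict mapping paper ID to full text.
    score_paper: callable standing in for the SVM; returns a confidence score.
    threshold: minimum score for a paper to count as positive (TBD in the notes).
    """
    results = []
    for paper_id, text in papers.items():
        # Step 1: keep only papers the classifier scores as positive.
        if score_paper(text) < threshold:
            continue
        # Step 2: search each sentence for component keywords.
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            for keyword, (term, go_id) in COMPONENT_TERMS.items():
                if keyword in lowered:
                    # Step 3: pair the sentence with a suggested GO term and ID.
                    results.append((paper_id, sentence, term, go_id))
    return results
```

Keeping the classifier as a passed-in function means the confidence threshold and the model can be changed without touching the sentence search.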

11/11/2009 Pascale and Petra finished uploading dicty papers for the Cellular Component SVM.

11/20/2009 Status of SVM?

11/20/2009 Categories on dicty site - two groups of CCC categories, remove Curator Specific categories?

04/07/2010 Awaiting analysis of dicty SVM - high confidence positives and checkFN file.
