Difference between revisions of "June 2012 - Weekly Pipeline Set-up"

From WormBaseWiki
(Created page with 'Goal: To set up a weekly CCC pipeline for dictyBase. *Proposed workflow from Petra **On weekly basis:')
 
 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
Back to [[DictyBase]]
 +
 
Goal: To set up a weekly CCC pipeline for dictyBase.
 
Goal: To set up a weekly CCC pipeline for dictyBase.
  
 
*Proposed workflow from Petra
 
*Proposed workflow from Petra
**On weekly basis:
+
**On a weekly basis:
 +
***Monday - dictyBase will identify PDFs for curation
 +
***Tuesday - Textpresso will process new PDFs, perform CCC search, and send source file output to CCC curation form on tazendra
 +
***CCC curation performed by dictyBase curators will be stored on tazendra and exported to protein2go database at EBI (UniProt-GOA)
 +
**We will use four Textpresso categories for the searches
 +
***dicty gene
 +
***CCC TAIR
 +
***CCC assay terms
 +
***CCC verbs
 +
**For now, the source file format will be the same as what we used for the dicty searches for the BioCreative task.
 +
**However, we should consider naming the file by the date the search was performed so curators see that on the curation form.
 +
***For an example, please see:  [http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/dicty/dicty_ccc.cgi dicty curation form]
 +
 
 +
 
 +
*Outstanding issues
 +
**dicty gene names
 +
***known issues: 'myosin' vs 'myosin IE' - Petra's suggestion: remove 'myosin' from the gene list
 +
***Greek characters in gene names - can this be addressed at all?
 +
**Textpresso web sites for dicty
 +
***currently, two web sites:  [http://textpresso-dev.caltech.edu/ccc_for_dicty/ ccc_for_dicty] and [http://www.textpresso.org/dicty/ Textpresso for dicty]
 +
****we should choose one name and stick with it; I would propose we go with www.textpresso.org/dicty, to keep it simple
 +
****ccc_for_dicty site is currently running the most up-to-date Textpresso software, but I cannot find the dicty gene or dicty phenotype categories on this site right now
 +
****we need to clean up the categories on the site we're going to use - make sure we have the correct dicty categories and remove some of the C. elegans-specific categories (for example, WormBase C. elegans phenotype is the second category listed under Biological Concepts while a dicty curator has to search down the list for any dicty-related categories)
 +
**CCC curation form
 +
***We will need a better interface for source file display - perhaps a scrolling menu like the category menu on the Textpresso site.
 +
***We could sort by year, month, week, etc.
 +
**Output to protein2go tool
 +
***What file format? GPAD or GAF?
 +
***Will need to consult with Tony and Rachael on this.

Latest revision as of 20:49, 28 June 2012

Back to DictyBase

Goal: To set up a weekly CCC pipeline for dictyBase.

  • Proposed workflow from Petra
    • On a weekly basis:
      • Monday - dictyBase will identify PDFs for curation
      • Tuesday - Textpresso will process new PDFs, perform CCC search, and send source file output to CCC curation form on tazendra
      • CCC curation performed by dictyBase curators will be stored on tazendra and exported to protein2go database at EBI (UniProt-GOA)
    • We will use four Textpresso categories for the searches
      • dicty gene
      • CCC TAIR
      • CCC assay terms
      • CCC verbs
    • For now, the source file format will be the same as what we used for the dicty searches for the BioCreative task.
    • However, we should consider naming the file by the date the search was performed so curators see that on the curation form.
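A minimal sketch of the date-based file naming suggested above. The `dicty_ccc` prefix, the `.txt` extension, and the function name are assumptions for illustration, not the format currently used on tazendra. An ISO date (YYYY-MM-DD) is used because such names sort chronologically when sorted as plain strings.

```python
from datetime import date


def source_file_name(search_date: date, prefix: str = "dicty_ccc") -> str:
    """Build a source file name that embeds the date the search was run.

    The prefix and extension are hypothetical placeholders.
    """
    return f"{prefix}_{search_date.isoformat()}.txt"


# Example: a search run on 26 June 2012
print(source_file_name(date(2012, 6, 26)))  # dicty_ccc_2012-06-26.txt
```

With names like these, the curation form could list source files in date order with an ordinary string sort.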


  • Outstanding issues
    • dicty gene names
      • known issues: 'myosin' vs 'myosin IE' - Petra's suggestion: remove 'myosin' from the gene list
      • Greek characters in gene names - can this be addressed at all?
    • Textpresso web sites for dicty
      • currently, two web sites: ccc_for_dicty and Textpresso for dicty
        • we should choose one name and stick with it; I would propose we go with www.textpresso.org/dicty, to keep it simple
        • ccc_for_dicty site is currently running the most up-to-date Textpresso software, but I cannot find the dicty gene or dicty phenotype categories on this site right now
        • we need to clean up the categories on the site we're going to use - make sure we have the correct dicty categories and remove some of the C. elegans-specific categories (for example, WormBase C. elegans phenotype is the second category listed under Biological Concepts while a dicty curator has to search down the list for any dicty-related categories)
    • CCC curation form
      • We will need a better interface for source file display - perhaps a scrolling menu like the category menu on the Textpresso site.
      • We could sort by year, month, week, etc.
    • Output to protein2go tool
      • What file format? GPAD or GAF?
      • Will need to consult with Tony and Rachael on this.
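To make the two gene-name issues listed above concrete, here is a rough sketch of one possible approach. It is not how Textpresso matches categories, and it is an alternative to (not a replacement for) removing 'myosin' from the gene list. It assumes gene names are matched with a regular expression that tries longer names first, so 'myosin IE' is preferred over 'myosin', and that Greek letters are spelled out before matching. The Greek mapping is partial, and `xyzα` is a made-up placeholder, not a real dicty gene name.

```python
import re

# Partial, illustrative mapping; a real list would need every letter in use.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def normalize(text: str) -> str:
    """Spell out Greek letters so names and text compare on equal terms."""
    for letter, name in GREEK.items():
        text = text.replace(letter, name)
    return text


def build_matcher(gene_names):
    """Compile a pattern that tries longer gene names before shorter ones."""
    names = sorted({normalize(n) for n in gene_names}, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in names)
    # Lookarounds keep a name from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


def find_genes(text: str, matcher) -> list:
    """Return gene-name matches in the normalized text, in order."""
    return matcher.findall(normalize(text))


# 'xyzα' is a hypothetical name used only to exercise the Greek handling.
matcher = build_matcher(["myosin", "myosin IE", "xyzα"])
print(find_genes("myosin IE binds xyzα and myosin", matcher))
```

In this sketch a bare 'myosin' is still reported when it appears on its own, so it addresses only the case where the generic name hides the more specific one.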