CCC Form 2.0 Specifications

From WormBaseWiki
Revision as of 20:22, 15 November 2012 by Vanaukenk (talk | contribs)

This page documents specifications for the next version of the Textpresso for Cellular Component Curation (CCC) tool. The changes to the tool and pipeline were suggested by curators and are also part of the broader plan for Textpresso-based curation pipelines and the GO's Common Annotation Framework.

Tool Features

Textpresso search specifications

  • Frequency of searches - this will vary by group (e.g., weekly for dictyBase)
  • Corpus - this will also vary by group
  • Categories - gene/protein name, CCC, assay term, verb
  • Filtering (Textpresso)
  1. Journal
  2. Date
  3. Document IDs
  • Filtering (non-Textpresso)
  1. SVM
  2. Gene Ontology Gene Association File
  • Ranking search results - e.g., highest scoring papers presented first
  • Naming search results file
  • Storing search histories
  1. Recording versions of pdf2text conversion
  2. Recording version of categories used
  3. Recording search criteria, i.e. categories, corpus, filters
  4. Recording curator or group and date of search
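
The four items under "Storing search histories" could be captured in one record per search. The following is only a minimal sketch in Python: the class name, field names, and JSON serialization are all assumptions made for illustration, not part of the existing tool.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class SearchRecord:
    """One stored search; every field name here is hypothetical."""
    results_file: str          # name of the search results file
    curator_or_group: str      # who ran the search
    search_date: date          # when the search was run
    pdf2text_version: str      # version of the pdf2text conversion
    category_version: str      # version of the categories used
    categories: List[str]      # e.g. gene/protein name, CCC, assay term, verb
    corpus: str                # corpus searched (varies by group)
    filters: Dict[str, str] = field(default_factory=dict)  # journal, date, ...

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the results file."""
        data = asdict(self)
        data["search_date"] = self.search_date.isoformat()
        return json.dumps(data, sort_keys=True)
```

Keeping the conversion and category versions in the same record as the search criteria would make it possible to tell later which results were produced under which conditions.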

Curation form

  • The form presents sentences for annotation - some sentences will lead to a GO annotation, some will not
  • Not all sentences will be classified (although it'd be great if they were)
  • What data to store for a GO annotation:
  1. Name of search results file
  2. Paper identifier
  3. Gene/gene product identifier
  4. Textpresso component term
  5. GO component term
  6. Evidence code
  7. Sentence ID
  8. Sentence classification
  9. Curator
  10. Annotation date
  11. Annotation history
  • What data to store if no GO annotation:
  1. Name of search results file
  2. Paper identifier
  3. Sentence ID
  4. Sentence classification
  5. Curator
  6. Annotation date
  7. Annotation history
  • Curator login
  • Import of search results files
  1. Can this be automated? dictyBase searches will be run weekly; can the results files be automatically transferred to tazendra?
  • Organization of search results file
  1. If we have many results files, perhaps we can organize them in a neater way than just one long list. See the Textpresso categories for an example of cascading menus, one possible solution.
  • Selection of search results file for curation
  • Display of paper bibliographic information
  1. This refers to adding more information than what we currently display, as well as how we display it. I think we could get the additional information from Textpresso and then just pretty up the display by adding some spacing and some bold text, etc. See PubMed for one possible example: [1]
  • Search functionality on form - this includes some new features
  1. Gene - search for all sentences that mention a specified gene and/or synonym (all sentences or specific sentence classification)
  2. Paper - search for sentences from a paper (all or specific sentence classifications)
  3. Curator - search for all sentences classified by a given curator (all or specific sentence classification)
  4. Annotation date - search for all work done for a given date (use wild cards)
  5. Component term in sentence - search for all sentences that matched a given Textpresso component term (all or sentence classification)
  6. GO term used for annotation - search for all sentences that used a specific GO term for annotation (Curate or Already Curated)
  • Curation when all entities are recognized - straightforward
  • Curation when one or more entities are not recognized - add a value to either of the first two columns
  1. Enter a new gene name and database identifier
  2. Enter a new component term in sentence
  • Feedback from form to Textpresso
  1. Add gene name or synonym plus database identifier
  2. Add component term to Textpresso cellular component category
  • Evidence codes
  1. IDA (default), IPI (complex membership)
  • Sentence classification
  1. Curate - select one or more entities from each column, will add new GO annotations to database
  2. Already curated - select one or more entities from each column, will record as already annotated. Open question: mark in red, filter, or show in a different location on the page?
  3. Scrambled sentence
  4. Run-on sentence
  5. Positive for localization, not for GO curation (formerly "not GO curatable")
  6. False positive
  • Edit a previous annotation
  1. Change gene annotated, change component term used, change GO term assigned, change evidence code
  • Edit relationship index
  1. This would be a separate functionality, but would allow a curator to view and edit the relationship index if needed.
  • Delete a search results file
  1. This could be tricky. We'd need to make sure there are no annotations associated with that search file.
  • Export annotations
  1. To a MOD
  2. To Protein2GO
  3. As a file - GO Gene Association File (GAF)
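
As an illustration of how the stored fields, the sentence classifications, and the GAF export listed above might fit together, here is a minimal Python sketch. It assumes the 17-column, tab-separated GAF 2.0 layout with aspect `C` for cellular component. The class, field, and function names are hypothetical, and annotation history is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class SentenceClassification(Enum):
    CURATE = "Curate"
    ALREADY_CURATED = "Already curated"
    SCRAMBLED = "Scrambled sentence"
    RUN_ON = "Run-on sentence"
    POSITIVE_NOT_GO = "Positive for localization, not for GO curation"
    FALSE_POSITIVE = "False positive"


@dataclass
class SentenceRecord:
    # Stored for every classified sentence
    results_file: str
    paper_id: str
    sentence_id: str
    classification: SentenceClassification
    curator: str
    annotation_date: date
    # Stored only when the sentence leads to a GO annotation
    gene_db: Optional[str] = None
    gene_id: Optional[str] = None
    gene_symbol: Optional[str] = None
    textpresso_term: Optional[str] = None
    go_id: Optional[str] = None
    evidence_code: str = "IDA"      # IDA is the default; IPI for complex membership
    with_from: str = ""             # GAF With/From column, typically used with IPI
    taxon: str = ""
    object_type: str = "protein"    # assumption: annotated object type


def to_gaf_line(rec: SentenceRecord, assigned_by: str) -> Optional[str]:
    """Return one GAF 2.0 row, or None if the sentence is not classified Curate."""
    if rec.classification is not SentenceClassification.CURATE:
        return None
    required = {
        "gene_db": rec.gene_db, "gene_id": rec.gene_id,
        "gene_symbol": rec.gene_symbol, "go_id": rec.go_id, "taxon": rec.taxon,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError("missing fields for GAF export: " + ", ".join(missing))
    columns = [
        rec.gene_db, rec.gene_id, rec.gene_symbol,
        "",                                   # qualifier
        rec.go_id, rec.paper_id, rec.evidence_code, rec.with_from,
        "C",                                  # aspect: cellular component
        "", "",                               # object name, synonyms
        rec.object_type, rec.taxon,
        rec.annotation_date.strftime("%Y%m%d"),
        assigned_by,
        "", "",                               # annotation extension, product form
    ]
    return "\t".join(columns)
```

Using a single record type with optional annotation fields is only one possible design; two separate tables, one for sentence classifications and one for GO annotations, would match the two lists of stored data just as well.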

Files needed

  • Mapping file for gene names and synonyms to MOD identifier and UniProtKB identifier
  1. GO's gpi file format would have all of the information we need
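
A name-to-identifier lookup could be built from such a file roughly as follows. This is a sketch under stated assumptions: it assumes a tab-separated layout with the database in column 1, the object ID in column 2, the symbol in column 3, pipe-separated synonyms in column 5, and pipe-separated cross-references (including `UniProtKB:` entries) in column 9. Column positions differ between gpi versions, so they are parameters here and should be checked against the version actually used. The function and type names are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class GeneIds(NamedTuple):
    mod_id: str                  # e.g. "DB:ObjectID"
    uniprot_id: Optional[str]    # "UniProtKB:..." cross-reference, if present


def build_name_lookup(
    lines: Iterable[str],
    db_col: int = 0,
    id_col: int = 1,
    symbol_col: int = 2,
    synonym_col: int = 4,
    xref_col: int = 8,
) -> Dict[str, List[GeneIds]]:
    """Map each lowercased gene name or synonym to the identifiers it may refer to."""
    lookup: Dict[str, List[GeneIds]] = {}
    needed = max(db_col, id_col, symbol_col, synonym_col, xref_col)
    for line in lines:
        # Skip blank lines and header/comment lines, which start with "!"
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= needed:
            continue  # malformed or shorter row; ignored in this sketch
        mod_id = cols[db_col] + ":" + cols[id_col]
        uniprot = next(
            (x for x in cols[xref_col].split("|") if x.startswith("UniProtKB:")),
            None,
        )
        names = [cols[symbol_col]] + cols[synonym_col].split("|")
        for name in names:
            if name:
                # A name can be ambiguous, so keep every candidate
                lookup.setdefault(name.lower(), []).append(GeneIds(mod_id, uniprot))
    return lookup
```

Returning a list per name keeps ambiguous synonyms visible to the curator instead of silently picking one gene.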

Other issues

  • What to do with old data - can we map all old data onto new tables? Some information may be missing in old data; is that okay?