Difference between revisions of "CCC Form 2.0 Specifications"

From WormBaseWiki

Revision as of 16:11, 4 December 2012

This page is intended to document specifications for the next version of the Textpresso for Cellular Component Curation (CCC) tool. The changes to the tool, and pipeline, have been suggested by curators and are also part of the broader plan for Textpresso-based curation pipelines and the GO's Common Annotation Framework.


Textpresso search specifications

  • Frequency of searches - this will vary by group (e.g., weekly for dictyBase)
  • Corpus - this will also vary by group
  • Categories - gene/protein name, CCC, assay term, verb
  • Filtering (Textpresso)
  1. Journal
  2. Date
  3. Document IDs
  • Filtering (non-Textpresso)
  1. SVM
  2. Gene Ontology Gene Association File
  • Ranking search results - e.g., highest scoring papers presented first
  • Naming search results file
  • Storing search histories
  1. Recording versions of pdf2text conversion
  2. Recording version of categories used
  3. Recording search criteria, i.e. categories, corpus, filters
  4. Recording curator or group and date of search
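
The four history items above could be captured as one record per search. The following is a minimal sketch only: the class name, field names, file name, version strings, and filter keys are all hypothetical, and nothing here reflects an agreed Textpresso format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class SearchRecord:
    """One stored search-history entry (all field names are assumptions)."""
    results_file: str        # name given to the search results file
    pdf2text_version: str    # version of the pdf2text conversion
    categories_version: str  # version of the Textpresso categories used
    categories: tuple        # search criteria: categories
    corpus: str              # search criteria: corpus
    filters: dict            # search criteria: Textpresso and non-Textpresso filters
    curator_or_group: str    # who ran the search
    search_date: date        # when the search was run

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the results file."""
        row = asdict(self)
        row["search_date"] = self.search_date.isoformat()
        return json.dumps(row, sort_keys=True)


# Illustrative values only; none of these come from a real search.
example_search = SearchRecord(
    results_file="dictybase_ccc_2012-12-04.txt",
    pdf2text_version="1.0",
    categories_version="1.0",
    categories=("gene/protein name", "CCC", "assay term", "verb"),
    corpus="dictyBase",
    filters={"svm": True},
    curator_or_group="dictyBase",
    search_date=date(2012, 12, 4),
)
print(example_search.to_json())
```

Keeping the record immutable and serializable means a search can be re-run or audited later from the stored criteria alone.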

Curation form

Overall Workflow

  • Curator will log in
  • Curator is directed to a blank annotation page with a drop-down menu for Textpresso sentence file selection
  • Curator will select a sentence file for annotation
  • If it is a new sentence file, curators can begin to create annotations and classify sentences (if desired)
  • If it is a previously annotated file, curators can modify existing annotations or add new annotations and classify sentences (if desired)
  • Any new annotations can be sent to Protein2GO annotation tool when curator is ready
  • Searches can also be performed using gene names, paper IDs, component terms in sentences, and GO terms

Detailed Workflow

  • The form presents sentences for annotation - some sentences will lead to a GO annotation, some will not
  • Not all sentences will be classified (although it'd be great if they were)
    • Does that mean that we need a way to permanently skip sentences, or can they be set as "not classified" to skip them (thus classifying them)?
      • Sentences without a classification could be stored as 'unclassified'. We could add 'unclassified' as a sentence classification check box option?
  • What data to store for a GO annotation:
  1. Name of search results file
  2. Paper identifier - this will come from the Textpresso sentences file and will be a PMID or DOI
  3. Gene/gene product identifier - this will be derived from the match between the gene product in the sentence and the information in the 'gpi' file that maps a gene product name to a database identifier
  4. Textpresso component term - column 2 in current form
  5. GO component term - column 3 in current form
  6. Evidence code - previously all annotations via the CCC form used the IDA evidence code; the new form will also allow for the IPI evidence code.
  7. Sentence ID
  8. Sentence classification
  9. Curator
  10. Annotation date - recorded upon pressing submit button
  11. Annotation history - this would be a way to track how often an annotation or classification was touched and by whom; akin to the curator history field on the concise description OA
  • What data to store if no GO annotation:
  1. Name of search results file
  2. Paper identifier
  3. Sentence ID
  4. Sentence classification
  5. Curator
  6. Annotation date
  7. Annotation history
  • Curator login
  • Import of search results files
    • Can this be automated? dictyBase searches will be run weekly, can the results files be automatically transferred to tazendra?
      • If textpresso does this in a way where the output is always in a specific public_html/ directory with a sensible filename structure, then we can have a cronjob pick them up.
  • Organization of search results file
    • If we have many results files, perhaps we can organize them in a neater way than just one long list. See the Textpresso categories for an example of cascading menus, one possible solution.
      • One possibility would be by date of search - from top to bottom level, the menu would progress from year to month to day
  • Selection of search results file for curation - this would be one starting point for curation, while more specific searches for gene product, paper ID, Textpresso component term, GO term searches might be another
  • Display of paper bibliographic information
    • This refers to adding more information than what we currently display, as well as how we display it. I think we could get the additional information from Textpresso and then just pretty up the display by adding some spacing and some bold text, etc. See PubMed for one possible example: [1]
    • For C. elegans papers, the information would be in postgres, but for dictyBase and TAIR papers (and any other group that used the form), that information would come from Textpresso. We could make getting this information uniform across all groups, though, and just get everything from Textpresso.
  • Search functionality on form - this includes some new features, allowing curators to search previous annotations, individual papers, etc.
  1. Gene - search for all sentences that mention a specified gene and/or synonym (all sentences or specific sentence classification)
  2. Paper - search for sentences from a paper (all or specific sentence classifications)
  3. Curator - search for all sentences classified by a given curator (all or specific sentence classification)
  4. Annotation date - search for all work done for a given date (use wild cards)
  5. Component term in sentence - search for all sentences that matched a given Textpresso component term (all or sentence classification)
  6. GO term used for annotation - search for all sentences that used a specific GO term for annotation (Curate or Already Curated)
  • Curation when all entities are recognized - straightforward
  • Curation when one or more entities are not recognized - add a value to either of the first two columns
  1. Enter a new gene name and database identifier
  2. Enter a new component term in sentence
  • Feedback from form to Textpresso - this will have to be worked out with Textpresso; we would have to establish a mechanism for automatically updating a category
  1. Add gene name or synonym plus database identifier
  2. Add component term to Textpresso cellular component category
  • Evidence codes
  • IDA (default), IPI (complex membership) - only one evidence code is allowed per annotation; the curator would have to select one
  • If a single sentence contained evidence for more than one annotation, could we have duplicate functionality? How best to handle this?
  • Sentence classification - should be check boxes, as it is possible, for example, to have a run-on sentence that is also a false positive
  1. Curate - select one or more entities from each column, will add new GO annotations to database
  2. Already curated - select one or more entities from each column, will record as already annotated, mark in red, filter, different location on page?
  3. Scrambled sentence
  4. Run-on sentence
  5. Positive for localization, not for GO curation (formerly not go curatable)
  6. False positive
  • Edit a previous annotation
  1. Change gene annotated, change component term used, change GO term assigned, change evidence code
  • Edit relationship index
  • This would be a separate functionality, but would allow a curator to view and edit the relationship index if needed.
  • It would be a way to add, for example, a new relationship even if it didn't directly result from an annotation, but was instead a relationship a curator thought might be useful.
  • Delete a search results file
  • This could be tricky. We'd need to make sure there are no annotations associated with that search file.
    • If this didn't happen often, it would probably be easiest if you did it manually through the shell. If this happens often, we'd probably have two directories, 1) stuff to use and 2) stuff not to use, and a way to list all of them and move them back and forth. This assumes that Textpresso files will always have a unique name and that we never want to get the same file twice.
      • I don't honestly know how often this might happen. Perhaps having a separate drop down for files used and files deleted would be reasonable?
  • Export annotations
  1. To a MOD - GO Gene Association File (GAF) (preferred) or three-column file (Gene ID, GO ID, Paper ID)
  2. To Protein2GO - using web services. See Protein2GO Web Services (http://www.ebi.ac.uk/seqdb/confluence/display/GOAP/Protein2GO+Web+Services)
  • This would happen whenever a curator clicked on a 'Submit to Protein2GO' button.
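
As a rough illustration of the stored annotation fields and the three-column MOD export described in the workflow above, here is a minimal sketch. The class, field, and function names are hypothetical, the identifiers in the example are placeholders, and tab is assumed as the column separator (the specification does not name one). It does not cover GAF output or the Protein2GO web services.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable

# The new form allows IDA (default) and IPI, one code per annotation.
ALLOWED_EVIDENCE = {"IDA", "IPI"}


@dataclass
class Annotation:
    """Fields stored for a GO annotation (names are assumptions)."""
    results_file: str
    paper_id: str          # PMID or DOI from the Textpresso sentences file
    gene_id: str           # database identifier resolved via the gpi file
    textpresso_term: str   # component term matched in the sentence
    go_id: str             # GO component term chosen by the curator
    evidence_code: str = "IDA"
    sentence_id: str = ""
    classifications: tuple = ("Curate",)
    curator: str = ""
    annotation_date: str = ""  # recorded when the submit button is pressed

    def __post_init__(self) -> None:
        # Reject any evidence code the form does not offer.
        if self.evidence_code not in ALLOWED_EVIDENCE:
            raise ValueError(f"unsupported evidence code: {self.evidence_code}")


def export_three_column(annotations: Iterable[Annotation]) -> str:
    """Return one 'Gene ID, GO ID, Paper ID' line per annotation, tab-separated."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for ann in annotations:
        writer.writerow([ann.gene_id, ann.go_id, ann.paper_id])
    return buf.getvalue()


# Placeholder identifiers for illustration only.
example_annotation = Annotation(
    results_file="dictybase_ccc_2012-12-04.txt",
    paper_id="PMID:12345678",
    gene_id="WB:WBGene00000001",
    textpresso_term="nucleus",
    go_id="GO:0005634",
)
print(export_three_column([example_annotation]), end="")
```

Validating the evidence code when the record is created keeps the one-code-per-annotation rule out of the export step.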


Files needed

  • Mapping file for gene names and synonyms to MOD identifier and UniProtKB identifier
  1. GO's gpi file format would have all of the information we need
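
A mapping from gene names and synonyms to identifiers could be built from a gpi file roughly as follows. This is a sketch under assumptions: it presumes the ten-column, tab-separated GPI 1.1 layout (DB, object ID, symbol, name, pipe-separated synonyms, type, taxon, parent ID, pipe-separated xrefs, properties) with "!" comment lines, and that the UniProtKB identifier appears among the xrefs. The column positions should be checked against the header of the actual file, and the function name and sample row are made up.

```python
import io
from typing import Dict, List, Optional, TextIO, Tuple

# name -> list of (MOD identifier, UniProtKB identifier or None)
NameMap = Dict[str, List[Tuple[str, Optional[str]]]]


def load_gpi_name_map(handle: TextIO) -> NameMap:
    """Index symbols and synonyms from a gpi file (assumed GPI 1.1 columns)."""
    name_map: NameMap = {}
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 9:
            continue  # row too short to carry the xref column
        db, obj_id, symbol = cols[0], cols[1], cols[2]
        synonyms = [s for s in cols[4].split("|") if s]
        xrefs = [x for x in cols[8].split("|") if x]
        uniprot = next((x for x in xrefs if x.startswith("UniProtKB:")), None)
        for name in [symbol, *synonyms]:
            # A name may be shared by several gene products, so keep a list.
            name_map.setdefault(name, []).append((f"{db}:{obj_id}", uniprot))
    return name_map


# Made-up sample row; the identifiers are placeholders.
SAMPLE_GPI = (
    "!gpi-version: 1.1\n"
    "WB\tWBGene00000001\tabc-1\t\tsyn-1|syn-2\tgene\ttaxon:6239\t\tUniProtKB:P00000\t\n"
)
name_map = load_gpi_name_map(io.StringIO(SAMPLE_GPI))
print(name_map["syn-2"])
```

Storing a list per name leaves ambiguous names visible to the curator instead of silently picking one gene product.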

Other issues

  • What to do with old data - can we map all old data onto new tables? Some information may be missing in old data; is that okay?