Difference between revisions of "CCC Form 2.0 Specifications"

Revision as of 16:11, 4 December 2012

This page is intended to document specifications for the next version of the Textpresso for Cellular Component Curation (CCC) tool. The changes to the tool, and pipeline, have been suggested by curators and are also part of the broader plan for Textpresso-based curation pipelines and the GO's Common Annotation Framework.

Textpresso search specifications

Frequency of searches - this will vary by group (e.g., weekly for dictyBase)
Corpus - this will also vary by group
Categories - gene/protein name, CCC, assay term, verb
Filtering (Textpresso)

Journal
Date
Document IDs

Filtering (non-Textpresso)

SVM
Gene Ontology Gene Association File

Ranking search results - e.g., highest scoring papers presented first
Naming search results file
Storing search histories

Recording versions of pdf2text conversion
Recording version of categories used
Recording search criteria, i.e. categories, corpus, filters
Recording curator or group and date of search
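
The four items to record could be captured as one record per search. Below is a minimal sketch in Python, not the tool's actual schema: the class name `SearchRecord` and every field name are assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class SearchRecord:
    """One stored search. All field names here are illustrative assumptions."""
    results_file: str          # name of the search results file
    pdf2text_version: str      # version of the pdf2text conversion
    categories_version: str    # version of the Textpresso categories used
    categories: list           # e.g. gene/protein name, CCC, assay term, verb
    corpus: str
    filters: dict = field(default_factory=dict)  # journal, date, document IDs
    curator_or_group: str = ""
    search_date: date = field(default_factory=date.today)

    def to_json(self) -> str:
        """Serialize the record so it can be kept alongside the results file."""
        data = asdict(self)
        data["search_date"] = self.search_date.isoformat()
        return json.dumps(data, sort_keys=True)
```

Storing the record as JSON next to each results file is only one option; the same fields could equally be columns in a database table.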

Curation form

Overall Workflow

Curator will login
Directed to a blank annotation page with drop down menu for Textpresso sentence file selection
Curators will select a sentence file for annotation
If it is a new sentence file, curators can begin to create annotations and classify sentences (if desired)
If it is a previously annotated file, curators can modify existing annotations or add new annotations and classify sentences (if desired)
Any new annotations can be sent to Protein2GO annotation tool when curator is ready
Searches can also be performed using gene names, paper IDs, component terms in sentences, and GO terms

Detailed Workflow

The form presents sentences for annotation - some sentences will lead to a GO annotation, some will not
Not all sentences will be classified (although it'd be great if they were)
- Does that mean that we need a way to permanently skip sentences, or can they be set as "not classified" to skip them (thus classifying them)?
  - Sentences without a classification could be stored as 'unclassified'. We could add 'unclassified' as a sentence classification check box option?
What data to store for a GO annotation:

Name of search results file
Paper identifier - this will come from the Textpresso sentences file and will be a PMID or doi
Gene/gene product identifier - this will be derived from the match between the gene product in the sentence and the information in the 'gpi' file that maps a gene product name to a database identifier
Textpresso component term - column 2 in current form
GO component term - column 3 in current form
Evidence code - previously all annotations via the CCC form used the IDA evidence code; the new form will also allow for the IPI evidence code.
Sentence ID
Sentence classification
Curator
Annotation date - recorded upon pressing submit button
Annotation history - this would be a way to track how often an annotation or classification was touched and by whom; akin to the curator history field on the concise description oa

What data to store if no GO annotation:

Name of search results file
Paper identifier
Sentence ID
Sentence classification
Curator
Annotation date
Annotation history
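
The two lists above share a common core, with the GO-specific fields added only when a sentence leads to an annotation. This can be sketched as a single record type with optional fields. It assumes hypothetical names throughout (`SentenceRecord`, `validate`, and all field names are illustrative), and the only rule it enforces from this page is the IDA/IPI restriction.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

ALLOWED_EVIDENCE_CODES = {"IDA", "IPI"}  # the two codes named in this spec


@dataclass
class SentenceRecord:
    # Stored for every classified sentence, with or without a GO annotation.
    results_file: str
    paper_id: str                  # PMID or doi from the sentences file
    sentence_id: str
    classification: List[str]
    curator: str
    annotation_date: date
    history: List[str] = field(default_factory=list)
    # Stored only when the sentence leads to a GO annotation.
    gene_product_id: Optional[str] = None
    textpresso_component_term: Optional[str] = None
    go_component_term: Optional[str] = None
    evidence_code: Optional[str] = None

    def has_go_annotation(self) -> bool:
        return self.go_component_term is not None

    def validate(self) -> None:
        """Raise ValueError if a GO annotation is incomplete or uses an unknown code."""
        if not self.has_go_annotation():
            return
        if not self.gene_product_id or not self.textpresso_component_term:
            raise ValueError("GO annotation needs a gene product and a component term")
        if self.evidence_code not in ALLOWED_EVIDENCE_CODES:
            raise ValueError(f"unsupported evidence code: {self.evidence_code!r}")
```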

Curator login
Import of search results files
- Can this be automated? dictyBase searches will be run weekly, can the results files be automatically transferred to tazendra?
  - If textpresso does this in a way where the output is always in a specific public_html/ directory with a sensible filename structure, then we can have a cronjob pick them up.
Organization of search results file
- If we have many results files, perhaps we can organize them in a neater way than just one long list. See the Textpresso categories for an example of cascading menus, one possible solution.
  - One possibility would be by date of search - from top to bottom level, the menu would progress from year to month to day
Selection of search results file for curation - this would be one starting point for curation, while more specific searches for gene product, paper ID, Textpresso component term, GO term searches might be another
Display of paper bibliographic information
- This refers to adding more information than what we currently display, as well as how we display it. I think we could get the additional information from Textpresso and then just pretty up the display by adding some spacing and some bold text, etc. See PubMed for one possible example: [1]
- For C. elegans papers, the information would be in postgres, but for dictyBase and TAIR papers (and any other group that used the form), that information would come from Textpresso. We could make getting this information uniform across all groups, though, and just get everything from Textpresso.
Search functionality on form - this includes some new features, allowing curators to search previous annotations, individual papers, etc.

Gene - search for all sentences that mention a specified gene and/or synonym (all sentences or specific sentence classification)
Paper - search for sentences from a paper (all or specific sentence classifications)
Curator - search for all sentences classified by a given curator (all or specific sentence classification)
Annotation date - search for all work done for a given date (use wild cards)
Component term in sentence - search for all sentences that matched a given Textpresso component term (all or sentence classification)
GO term used for annotation - search for all sentences that used a specific GO term for annotation (Curate or Already Curated)
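
Several of the searches listed above amount to filtering stored records on one or more fields, with wildcards on the date. The following is a rough sketch under stated assumptions: records are plain dictionaries, the keys (`genes`, `paper_id`, `curator`, `date`, `classification`) are hypothetical, and dates are stored as ISO strings so that shell-style wildcards work.

```python
from fnmatch import fnmatchcase


def search_records(records, gene=None, paper=None, curator=None,
                   date_pattern=None, classification=None):
    """Return the records that match every criterion that was supplied."""
    hits = []
    for rec in records:
        if gene is not None and gene not in rec.get("genes", []):
            continue
        if paper is not None and rec.get("paper_id") != paper:
            continue
        if curator is not None and rec.get("curator") != curator:
            continue
        # Dates are ISO strings, so "2012-12-*" matches all of December 2012.
        if date_pattern is not None and not fnmatchcase(rec.get("date", ""), date_pattern):
            continue
        if classification is not None and classification not in rec.get("classification", []):
            continue
        hits.append(rec)
    return hits
```

Each criterion is optional, which mirrors the "all or specific sentence classification" wording in the list: leaving `classification` unset returns every matching sentence.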

Curation when all entities are recognized - straightforward
Curation when one or more entities are not recognized - add a value to either of the first two columns

Enter a new gene name and database identifier
Enter a new component term in sentence

Feedback from form to Textpresso - this will have to be worked out with Textpresso; we would have to establish a mechanism for automatically updating a category

Add gene name or synonym plus database identifier
Add component term to Textpresso cellular component category

Evidence codes
IDA (default), IPI (complex membership) - only one evidence code is allowed per annotation, so the curator would have to select one
If a single sentence contained evidence for more than one annotation, could we have duplicate functionality? How best to handle this?
Sentence classification - should be check boxes, as it is possible, for example, to have a run-on sentence that is also a false positive

Curate - select one or more entities from each column, will add new GO annotations to database
Already curated - select one or more entities from each column, will record as already annotated, mark in red, filter, different location on page?
Scrambled sentence
Run-on sentence
Positive for localization, not for GO curation (formerly not go curatable)
False positive
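
Because the classifications are check boxes rather than a single choice, a sentence can carry several at once. One way to model this is a flag enumeration, sketched below. The member names are assumptions derived from the labels above, and `from_checkboxes` is a hypothetical helper.

```python
from enum import Flag, auto


class SentenceClass(Flag):
    CURATE = auto()
    ALREADY_CURATED = auto()
    SCRAMBLED = auto()
    RUN_ON = auto()
    POSITIVE_NOT_GO = auto()   # positive for localization, not for GO curation
    FALSE_POSITIVE = auto()


def from_checkboxes(checked):
    """Combine the names of the ticked check boxes into one value."""
    result = SentenceClass(0)
    for name in checked:
        result |= SentenceClass[name]
    return result
```

For example, `from_checkboxes(["RUN_ON", "FALSE_POSITIVE"])` represents a run-on sentence that is also a false positive, and an empty selection leaves the sentence unclassified.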

Edit a previous annotation

Change gene annotated, change component term used, change GO term assigned, change evidence code

Edit relationship index
This would be a separate functionality, but would allow a curator to view and edit the relationship index if needed.
It would be a way to add, for example, a new relationship even if it didn't directly result from an annotation, but was instead a relationship a curator thought might be useful.
Delete a search results file
This could be tricky. We'd need to make sure there are no annotations associated with that search file.
- If this didn't happen often, it would probably be easiest if you did it manually through the shell. If this happens often, we'd probably have two directories: 1) stuff to use and 2) stuff not to use, and a way to list all of them and move them back and forth. This assumes that Textpresso files will always have a unique name and we never want to get the same file twice.

  - I don't honestly know how often this might happen. Perhaps having a separate drop down for files used and files deleted would be reasonable?
Export annotations

To a MOD - GO Gene Association File (GAF)(preferred) or three-column file (Gene ID, GO ID, Paper ID)
To Protein2GO - using web services; see Protein2GO Web Services

This would happen whenever a curator clicked on a 'Submit to Protein2GO' button.
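
The simpler of the two MOD export options, the three-column file, could be produced along these lines. This is a sketch only: the dictionary keys `gene_id`, `go_id`, and `paper_id` are hypothetical, the tab delimiter is an assumption, and skipping exact duplicate rows is a design choice made here rather than something this page requires.

```python
import csv
import io


def export_three_column(annotations, out):
    """Write one tab-separated line per annotation: Gene ID, GO ID, Paper ID."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    seen = set()
    for ann in annotations:
        row = (ann["gene_id"], ann["go_id"], ann["paper_id"])
        if row in seen:   # skip exact duplicates (an assumption, see above)
            continue
        seen.add(row)
        writer.writerow(row)


# Usage with placeholder identifiers, writing to an in-memory buffer.
buffer = io.StringIO()
export_three_column(
    [{"gene_id": "GENE:0001", "go_id": "GO:0005634", "paper_id": "PMID:12345"}],
    buffer,
)
```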

Files needed

Mapping file for gene names and synonyms to MOD identifier and UniProtKB identifier

GO's gpi file format would have all of the information we need
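
A mapping from gene names and synonyms to identifiers could be built from a gpi file roughly as follows. The sketch assumes a GPI 1.1-style layout (tab-separated, `!` comment lines, DB in column 1, object ID in column 2, symbol in column 3, pipe-separated synonyms in column 5); the column positions should be checked against the gpi version actually supplied. The function name `load_name_map` is hypothetical.

```python
def load_name_map(lines):
    """Map each symbol and synonym to the set of 'DB:ID' identifiers that use it."""
    name_to_ids = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):   # header and comment lines
            continue
        cols = line.split("\t")
        if len(cols) < 5:                      # too short to hold a synonym column
            continue
        db, object_id, symbol = cols[0], cols[1], cols[2]
        synonyms = cols[4].split("|") if cols[4] else []
        for name in [symbol] + synonyms:
            # A set is used because one name may map to several gene products.
            name_to_ids.setdefault(name, set()).add(f"{db}:{object_id}")
    return name_to_ids
```

Returning a set per name keeps ambiguous names visible, so the form could ask the curator to choose rather than silently picking one identifier.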

Other issues

What to do with old data - can we map all old data onto new tables? Some information may be missing in old data; is that okay?

@@ Line 105: / Line 105: @@
 #To a MOD - GO Gene Association File (GAF)(preferred) or three-column file (Gene ID, GO ID, Paper ID)
 #To Protein2GO - using web services  See [http://www.ebi.ac.uk/seqdb/confluence/display/GOAP/Protein2GO+Web+Services Protein2GO Web Services]
+*This would happen whenever a curator clicked on a 'Submit to Protein2GO' button.
