Specifications for CCC Curation from Textpresso Search Page

From WormBaseWiki

Revision as of 21:09, 14 January 2011

Requirements for Using Textpresso Search Results in General CCC Curation

These specifications describe how a curator can search any Textpresso implementation using the CCC categories, submit the resulting sentences to a curation form, make annotations, and download the annotations either in a simple three-column format or in the GO's gene_association file (GAF) format.

Searches

Curators would be able to search Textpresso using their chosen criteria and export the sentences to the CCC curation form. The XML mark-up of each returned sentence will be used to populate the curation boxes on the form and color-code the search results.

Future options for filtering search results may include excluding papers that are in the corpus but not appropriate for curation (e.g., WB's 'functional annotation papers' or TAIR's blacklist of papers on other organisms) or restricting searches to a particular level of SVM classification (e.g., high-confidence SVM papers only). For the former, would we need a consistent tag? For the latter, would we need to integrate the SVM classification results into postgres?
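Both future filters could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Hit` record, the `excluded_papers` set, and the low/medium/high labels in `SVM_LEVELS` are hypothetical and do not describe how Textpresso or postgres currently store this information.

```python
from dataclasses import dataclass

# Assumed ordering of SVM confidence labels; the real labels may differ.
SVM_LEVELS = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Hit:
    """One returned sentence (hypothetical record layout)."""
    paper_id: str
    sentence: str
    svm_level: str


def filter_hits(hits, excluded_papers=frozenset(), min_svm_level="low"):
    """Drop hits from excluded papers and hits below the requested SVM level."""
    threshold = SVM_LEVELS[min_svm_level]
    return [
        h for h in hits
        if h.paper_id not in excluded_papers
        and SVM_LEVELS[h.svm_level] >= threshold
    ]
```

A group would pass its own exclusion list (for example, TAIR's blacklist) as `excluded_papers`, and `min_svm_level="high"` for high-confidence papers only.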

From Searches to the Curation Form

This pipeline would make use of the XML format of returned sentences to construct a version of the current CCC curation form; a version for TAIR is shown here:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

1) Keep three action buttons at the top: Submit, Search for Sentence, and Search for Paper.

2) Display all information from within the <bibliography> tags.

This will display more information than is currently shown, but seems easier than picking out just a few tags from within the bibliography section. The additional information may also be helpful to curators.
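Displaying everything within <bibliography> means the form does not need to know the child tags in advance. A minimal sketch, assuming the children are simple text elements; the wrapper element, the child tag names, and the title in `SAMPLE` are placeholders, not the actual Textpresso XML:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the <bibliography> tag name comes from the spec.
SAMPLE = """<result>
  <bibliography>
    <title>An example title</title>
    <year>2010</year>
    <docid>WBPaper00037859</docid>
  </bibliography>
</result>"""


def bibliography_fields(xml_text):
    """Return (tag, text) pairs for every child of <bibliography>, in document order."""
    root = ET.fromstring(xml_text)
    bib = root if root.tag == "bibliography" else root.find(".//bibliography")
    if bib is None:
        return []
    return [(child.tag, (child.text or "").strip()) for child in bib]


for tag, text in bibliography_fields(SAMPLE):
    print(f"{tag}: {text}")
```

Because the function walks whatever children are present, any tag added to the bibliography section later would appear on the form without a code change.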

3) Do not display information within any <field_references> tags.

Is this how scrambled sentences are typically identified?
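Point 3 could be sketched as follows: keep the text of every <field_results> element and never read <field_references>. The wrapper element and the sentences in `SAMPLE` are made up for illustration, and whether <field_references> always holds the scrambled sentences is still the open question above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; GENE-1 and both sentences are placeholders.
SAMPLE = """<results>
  <field_results>GENE-1 localizes to the nucleus.</field_results>
  <field_references>nucleus the et al GENE-1 to</field_references>
  <field_results>GENE-1 is found at the plasma membrane.</field_results>
</results>"""


def curatable_sentences(xml_text):
    """Return the text of each <field_results> element, ignoring <field_references>."""
    root = ET.fromstring(xml_text)
    # itertext() also collects text inside any nested mark-up tags.
    return ["".join(el.itertext()).strip() for el in root.iter("field_results")]
```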

4) Potentially curatable sentences are found within the <field_results> tags. Working from left to right on the curation form, the information from the XML would map to the boxes as follows:

First box, Gene/Protein Name

This box lists all entities within the species-specific protein or gene tag. Right now, the category name for this is different for each implementation, for example:

protein_celegans

genes_arabidopsis

dicty_genes

For TAIR we displayed both the name of the gene as presented in the sentence, as well as the name in the sentence mapped to a specific TAIR gene ID. This is because some Arabidopsis gene names are used for more than one locus and the TAIR curators wanted a way to ensure they'd be making the annotation to the correct locus. It would be fine to implement this universally, I think. This requires a mapping file from each group and a regular pipeline for updating the gene or protein names file.
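The mapping step could look something like the sketch below. It assumes each group supplies a tab-separated file of name and gene ID pairs; that file layout, the function names, and the IDs in the usage example are hypothetical.

```python
from collections import defaultdict


def load_name_map(lines):
    """Parse 'name<TAB>gene_id' lines into a dict of name -> sorted list of gene IDs."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, gene_id = line.split("\t")
        mapping[name].add(gene_id)
    return {name: sorted(ids) for name, ids in mapping.items()}


def display_choices(name_in_sentence, name_map):
    """List 'name (gene ID)' options for the first box, one per matching locus."""
    ids = name_map.get(name_in_sentence, [])
    # Fall back to the bare name when the mapping file has no entry for it.
    return [f"{name_in_sentence} ({gene_id})" for gene_id in ids] or [name_in_sentence]


# Made-up names and IDs, for illustration only.
name_map = load_name_map(["ABC1\tAT1G00001", "ABC1\tAT2G00002", "XYZ2\tAT3G00003"])
print(display_choices("ABC1", name_map))
```

A name shared by two loci yields two choices, so the curator picks the locus explicitly rather than having the form guess.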


Second box, Component Term in Sentence

This box lists the component term as identified by the component category.

For the various implementations there are currently several component categories.

For C. elegans and Arabidopsis:

localization_cell_components_082208

CCC_TAIR

For dicty:

localization_cell_components_050808

localization_cell_components_082208

If more than one cellular component category is used in the mark-up, we will need to figure out how to determine which one the curator wants to display on the curation form. Alternatively, we could restrict each implementation to only one component category at a time.
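One way to handle this is to let the curator name a category and otherwise fall back to a per-implementation default. The sketch below is only an illustration: the implementation keys, the choice of the first listed category as the default, and the reading that both listed categories are available to both C. elegans and Arabidopsis are assumptions.

```python
# Category names are taken from the lists above; the dictionary keys and
# the grouping per implementation are assumptions.
COMPONENT_CATEGORIES = {
    "celegans": ["localization_cell_components_082208", "CCC_TAIR"],
    "arabidopsis": ["localization_cell_components_082208", "CCC_TAIR"],
    "dicty": [
        "localization_cell_components_050808",
        "localization_cell_components_082208",
    ],
}


def choose_component_category(implementation, preferred=None):
    """Return the component category to show in the second box."""
    available = COMPONENT_CATEGORIES[implementation]
    if preferred is None:
        return available[0]  # assumed default: first category listed
    if preferred not in available:
        raise ValueError(f"{preferred!r} is not a component category for {implementation}")
    return preferred
```

Restricting each implementation to a single category would amount to keeping one entry per list, in which case the `preferred` argument is no longer needed.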

Currently, this box only displays the component terms in sentences that matched the category. It might be helpful, though, if in the


Third box, CC Term in GO

This box displays, if available, any GO terms that have already been curated from component terms in sentences, or allows curators to enter a new GO term, if needed.
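Once the three boxes are filled in, the simple three-column download could be produced along these lines. The column order (gene or protein name, component term in sentence, GO term) is an assumption based on the three boxes described here, and the function name is hypothetical.

```python
import csv
import io


def to_three_column(annotations):
    """Serialize (gene, component term, GO term) tuples as tab-separated lines."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for gene, component_term, go_term in annotations:
        writer.writerow([gene, component_term, go_term])
    return buf.getvalue()


# GENE-1 is a placeholder; GO:0005634 is the GO ID for nucleus.
print(to_three_column([("GENE-1", "nucleus", "GO:0005634")]), end="")
```

A GAF export would need additional columns (reference, evidence code, taxon, and so on) that come from the paper and from curator input rather than from these three boxes.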




