Difference between revisions of "TAIR"

From WormBaseWiki

Revision as of 21:54, 3 November 2010

Gene Ontology Curation at TAIR

Specifications for Curation Pipeline

Summary

This document is an outline of the Arabidopsis Textpresso for CCC pipeline. The initial trial run will be for all papers in the Textpresso for Arabidopsis corpus published in 2008. Results will be split into two files: results from papers already curated for GO Cellular Component and results from papers not curated for GO Cellular Component. Curation made using an online curation form will be output in Gene Association file format for incorporation into the main TAIR file submitted to GO.

Pipeline

Paper Acquisition

  • ~2500 papers/year
  • TAIR curator sends PDFs of Arabidopsis or other relevant papers to Textpresso (Michael) approximately every six months.
  • Other relevant papers without a PDF are downloaded by Textpresso team, if possible.

Textpresso Search for Cellular Component Annotations

At present, the search criteria have been specified by personal communication. In the future, a web-based form for inputting search criteria might be helpful.

  • Search Arabidopsis corpus using four categories:

Note: get the exact category names, paths, and file names from Michael.

1) CCC assay terms

2) CCC cellular component

3) CCC verbs

4) genes (Arabidopsis)

  • Filters

Year: 2008

  • Running this search on the Textpresso for Arabidopsis site on 10/28/2010 gives 55814 matches in 1421 documents, with total paper scores ranging from 660 to 7.
  • A preliminary check of the results suggests that sorting by score, high to low, would be most productive. This can always be changed.

? for TAIR - for papers curated for CCC in 2008 (may be able to do this with Textpresso), do they want a static filter, or one that is updated dynamically as their annotations are updated? If dynamic, then we need a pipeline/workflow to do this.

? which machine to do this on, i.e., search results and curation

From WormBase documentation:

Script explanation - Juancarlos

The script that gets stuff to tazendra from textpresso is :

 /home/postgres/public_html/cgi-bin/data/ccc_gocuration/get_newset.pl

called by :

 /home/postgres/work/pgpopulation/textpresso/wrapper.sh             

(For the ontology annotator, there have been no new matches since September 13, 2010)

The full results are on textpresso-dev at :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/good_sentences_file.*

The new matches are at :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_sentences_file.*

The wrapper script /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/wrapper.pl runs every Wednesday at 2 am. It compares the svm results from :

 http://caprica.caltech.edu/celegans/svm_results/Juancarlos/otherexpr

to the previous svm results stored in /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/otherexpr

If the results are the same, nothing happens. If there are new svm results, it runs :

* /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_go_gene_component.pl 

which outputs good results to :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/good_sentences_file.<date>

* then runs /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_recentsent.pl 

which compares the new good_sentences_file.<date> with the previous good_sentences_file.<date> and outputs a /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_sentences_file.<date> with the current results that are in the svm list. It also outputs /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_cccfile, which points at the new recent_sentences_file.<date> so that the tazendra cronjob knows there is a new file.
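The comparison steps described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Perl scripts: the function names, the one-sentence-per-line layout, and the idea that each line starts with a tab-separated paper ID are all assumptions, and "new results that are in the svm list" is just one reading of the description.

```python
def svm_results_changed(new_text, stored_text):
    """True when the freshly fetched svm results differ from the stored copy."""
    return new_text != stored_text


def paper_id_of(line):
    # Assumption: the paper ID is the first tab-separated field of a line.
    return line.split("\t", 1)[0]


def recent_sentences(new_lines, previous_lines, svm_papers):
    """Lines that are new since the previous file and whose paper is in the svm list."""
    previous = set(previous_lines)
    return [
        line
        for line in new_lines
        if line not in previous and paper_id_of(line) in svm_papers
    ]


# Hypothetical data, for illustration only.
new = ["paper1\tsentence a", "paper2\tsentence b", "paper3\tsentence c"]
old = ["paper1\tsentence a"]
recent = recent_sentences(new, old, {"paper2"})
print(recent)
```

If the svm results are unchanged, the wrapper would stop before any of this runs; the sketch keeps that check as a separate function so it can be tested on its own.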

You can check the recent good_sentences_file.* and see if any of those should be in SVM, or you can look at the full text of the SVM results and see if any of those should be in the good_sentences_file.* If you don't have a textpresso-dev account, you can ask Michael, and he'll ask the ITS people. I log on with my ITS account.

If stuff should be in the good_sentences_file.* and isn't, check that the categories have what they should have, and let me know which category isn't matching what it should.

The good_sentences_file.* is generated by :

 /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_go_gene_component.pl

The category files being used are in :

 /data2/data-processing/data/celegans/Data/indices/body/semantic/categories

files :

 protein_celegans 
 localization_cell_components_082208 
 localization_verbs_082208 
 localization_other_120107


Curation - Sentence Input and Annotation

  • The output of the searches will be sentences sorted into two separate files. The criterion for sorting is whether any Cellular Component Annotations exist for that paper in TAIR's Gene Association file.

The TAIR gene association file can be downloaded from here:

http://www.geneontology.org/GO.downloads.annotations.shtml

References are in Column 6 of the GAF.

The Cellular Component aspect (C) is in Column 9 of the GAF.
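A minimal sketch of the sorting criterion, assuming the GAF is read as tab-separated lines and that each search result carries a paper reference written the same way as in Column 6. The function names and the sample rows are hypothetical; only the column positions come from the text above.

```python
def papers_with_cc_annotation(gaf_lines):
    """Collect references (Column 6) from rows whose aspect (Column 9) is C."""
    refs = set()
    for line in gaf_lines:
        if line.startswith("!"):  # GAF comment and header lines start with "!"
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip malformed or truncated rows
        if cols[8] == "C":
            # Column 6 may hold several references separated by "|".
            refs.update(cols[5].split("|"))
    return refs


def split_by_curation(results, curated_refs):
    """Split (paper_ref, sentence) pairs into already-curated and not-yet-curated."""
    curated, uncurated = [], []
    for paper_ref, sentence in results:
        target = curated if paper_ref in curated_refs else uncurated
        target.append((paper_ref, sentence))
    return curated, uncurated


# Hypothetical rows, truncated to the first nine columns for illustration.
gaf = [
    "!gaf-version: 2.0",
    "\t".join(["TAIR", "id1", "SYM1", "", "GO:0005634", "REF:1|REF:2", "IDA", "", "C"]),
    "\t".join(["TAIR", "id2", "SYM2", "", "GO:0008150", "REF:3", "IDA", "", "P"]),
]
refs = papers_with_cc_annotation(gaf)
curated, uncurated = split_by_curation([("REF:1", "s1"), ("REF:3", "s2")], refs)
```

In this example REF:3 lands in the uncurated file, because its only annotation is for a different aspect.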

  • Curation will use a version of the CCC curation form found here:

http://tazendra.caltech.edu/~postgres/cgi-bin/ccc_go_curation.cgi

  • Each file will be available as a source file in the curation form for curators to assess search results.
  • Results of searches will be presented in the curation form, in a three-column format: 1) gene or protein name, 2) cellular component category term, 3) if available, a suggested GO term from the existing category term-GO term relationship index.

The suggested GO term will come from the existing relationship index. New relationships will be added.

? for Michael - put forms/results on textpresso or textpresso-dev

Curation - Output to Gene Association File

Curation will be output in the form of a Gene Association file (GAF) that can be picked up by the TAIR curators and added to the main Gene Association file that they submit to GO.

One possible variation on the output file is a file that maps the sentence to the GO annotation and the paper ID.

Specifications for GAF2.0 can be found here:

http://www.geneontology.org/GO.format.gaf-2_0.shtml
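As a rough sketch of what writing one GAF 2.0 row could look like, the snippet below lays out the 17 tab-separated columns of the format and fills the optional ones with empty strings. The helper name and every field value in the example are placeholders, not real TAIR identifiers or annotations, and the choice of evidence code and object type is an assumption.

```python
GAF2_COLUMNS = [
    "DB", "DB Object ID", "DB Object Symbol", "Qualifier", "GO ID",
    "DB:Reference", "Evidence Code", "With (or) From", "Aspect",
    "DB Object Name", "DB Object Synonym", "DB Object Type", "Taxon",
    "Date", "Assigned By", "Annotation Extension", "Gene Product Form ID",
]


def gaf_row(values):
    """Build one tab-separated GAF 2.0 row from a dict keyed by column name."""
    unknown = set(values) - set(GAF2_COLUMNS)
    if unknown:
        raise ValueError("unknown GAF columns: %s" % sorted(unknown))
    # Columns that are not supplied are written as empty fields.
    return "\t".join(values.get(col, "") for col in GAF2_COLUMNS)


# Placeholder annotation; IDs, reference, and date are made up for illustration.
row = gaf_row({
    "DB": "TAIR",
    "DB Object ID": "placeholder_id",
    "DB Object Symbol": "SYM1",
    "GO ID": "GO:0005634",          # nucleus
    "DB:Reference": "REF:1",
    "Evidence Code": "IDA",
    "Aspect": "C",                  # Cellular Component
    "DB Object Type": "protein",
    "Taxon": "taxon:3702",          # Arabidopsis thaliana
    "Date": "20101103",             # YYYYMMDD
    "Assigned By": "TAIR",
})
print("!gaf-version: 2.0")
print(row)
```

The variant output mentioned above, mapping sentence to annotation and paper ID, could reuse the same row and append the sentence in a separate file rather than adding columns to the GAF itself.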

Output will need to reference the Arabidopsis gene mapping file here:

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

There are three columns:

1. locus name (can be a symbol, if this is a genetic locus, or an AGI identifier, if it is cloned)

2. symbol

3. full name (not always available)
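One way the mapping file might be loaded for lookups is sketched below. It assumes the three columns are tab-separated, that any header line is skipped by the caller, and that symbol matching should be case-insensitive; none of this is confirmed by the file description, and the sample lines are invented.

```python
from collections import defaultdict


def load_aliases(lines):
    """Map each symbol (Column 2) to the set of locus names (Column 1) that use it."""
    symbol_to_loci = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # The full name (Column 3) is not always available, so only two columns are required.
        if len(cols) < 2 or not cols[0] or not cols[1]:
            continue
        symbol_to_loci[cols[1].upper()].add(cols[0])
    return symbol_to_loci


# Invented example lines; the locus names are placeholders, not real AGI identifiers.
sample = [
    "LOCUS_A\tabc1\tsome full name",
    "LOCUS_B\tABC1",
    "LOCUS_C\txyz2\t",
]
aliases = load_aliases(sample)
print(sorted(aliases["ABC1"]))
```

Keeping a set per symbol makes it visible when one symbol maps to more than one locus, which a curator would need to resolve before the annotation is written out.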

? Is there a mapping of locus names to TAIR gene ids already in Textpresso?

Check other columns in TAIR file to get output right.

Features to Implement for Future Use

  • Paper sectioning - this can reduce the number of false positives by allowing curators to select from which section of the paper they will retrieve sentences.
  • Web form for specifying search criteria, filters.
  • SVM results - combining the results of an SVM document classification algorithm with Textpresso search results readily identifies potential false negative and false positive papers.
  • Arabidopsis-specific categories - editing the categories created for C. elegans curation to better reflect use of terms in the Arabidopsis literature.

Collaboration with GO

  • Possibly propose and populate a new class of synonym, Published_as, in GO to catalog how component terms are expressed in the published literature. This would be a general aid to all text mining efforts.


Back to Gene Ontology