Difference between revisions of "TAIR CCC"

From WormBaseWiki

Revision as of 16:59, 9 December 2010

Specifications for Curation Pipeline

Summary

This document outlines the Textpresso for Arabidopsis CCC pipeline for the initial trial run.

The trial run will be a search on all papers in the Textpresso for Arabidopsis corpus published in 2008.

Search results will be stored in three files:

1) all sentences returned by the search

2) sentences from papers already curated by TAIR for GO Cellular Component

3) sentences from papers not curated by TAIR for GO Cellular Component
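
The three-way split of search results can be sketched as follows. This is only a minimal illustration, not the pipeline's actual code: the (paper_id, sentence) layout, the function name split_results, and the example identifiers and sentences are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record layout: (paper_id, sentence)
Hit = Tuple[str, str]


def split_results(hits: Iterable[Hit], curated_ids: Set[str]):
    """Return (all_hits, curated_hits, uncurated_hits).

    curated_ids is assumed to hold the IDs of papers TAIR has already
    curated for GO Cellular Component.
    """
    all_hits: List[Hit] = list(hits)
    curated = [h for h in all_hits if h[0] in curated_ids]
    uncurated = [h for h in all_hits if h[0] not in curated_ids]
    return all_hits, curated, uncurated


# Made-up example data, for illustration only
hits = [
    ("paper_A", "Protein X localizes to the nucleus."),
    ("paper_B", "Protein Y was detected in the chloroplast."),
]
all_hits, curated, uncurated = split_results(hits, curated_ids={"paper_A"})
```

Each of the three lists would then be written to its own file; the file names and on-disk format are not specified on this page.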

Annotations can be made using an on-line curation form:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

The form produces two different outputs:

1) a three-column 'user submission' output

2) a standard GO Gene Association File (GAF) format
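
For orientation, a GAF record is a single tab-separated line. The sketch below assumes the 17-column GAF 2.0 layout; this page does not say which GAF version the form writes or how the form's fields map to columns, so the key names, the helper gaf_line, and every identifier marked as a placeholder are assumptions. The columns of the three-column 'user submission' output are not described here, so that output is not sketched.

```python
# Column order of a GAF 2.0 line (17 tab-separated columns).
# The underscored key names are labels chosen for this sketch.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_Code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type",
    "Taxon", "Date", "Assigned_By", "Annotation_Extension",
    "Gene_Product_Form_ID",
]


def gaf_line(annotation: dict) -> str:
    """Join the annotation into one GAF line; missing columns stay empty."""
    return "\t".join(annotation.get(col, "") for col in GAF_COLUMNS)


example = {
    "DB": "TAIR",
    "DB_Object_ID": "locus:0000000",   # placeholder, not a real identifier
    "DB_Object_Symbol": "GENE1",       # placeholder gene symbol
    "GO_ID": "GO:0005634",             # nucleus
    "DB_Reference": "PMID:0000000",    # placeholder reference
    "Evidence_Code": "IDA",
    "Aspect": "C",                     # C = Cellular Component
    "DB_Object_Type": "gene",
    "Taxon": "taxon:3702",             # Arabidopsis thaliana
    "Date": "20101209",                # YYYYMMDD, made-up value
    "Assigned_By": "TAIR",
}
line = gaf_line(example)
```

For Cellular Component annotations the Aspect column is always C, which is the only column value this pipeline fixes by design.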


Pipeline Details

Paper Acquisition

  • TAIR corpus averages ~2500 papers/year
  • TAIR curator sends PDFs of papers to be included in the corpus to Textpresso (Michael) approximately every six months
  • Any additional papers for which TAIR does not have a PDF are downloaded by the Textpresso team, if possible

Textpresso Search

Categories
  • Search Arabidopsis corpus on:

http://www.textpresso.org/arabidopsis/

Using these four categories:

1) CCC assay terms

2) CCC cellular components

3) CCC verbs

4) genes (arabidopsis)
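
One way to picture a search on four categories: a sentence is kept only if it contains at least one term from every category. The sketch below assumes that sentence-level, all-categories behavior and uses tiny made-up term lists; the real lexica and Textpresso's own matching logic are not described on this page, and this simple version matches single words only.

```python
import re
from typing import Dict, List, Set

# Hypothetical, tiny stand-ins for the four category lexica.
CATEGORIES: Dict[str, Set[str]] = {
    "CCC assay terms": {"gfp", "immunofluorescence"},
    "CCC cellular components": {"nucleus", "chloroplast"},
    "CCC verbs": {"localizes", "detected"},
    "genes (arabidopsis)": {"gene1", "gene2"},
}


def tokens(sentence: str) -> Set[str]:
    """Lowercase the sentence and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def matches_all_categories(sentence: str,
                           categories: Dict[str, Set[str]] = CATEGORIES) -> bool:
    """True if the sentence has at least one term from every category."""
    words = tokens(sentence)
    return all(words & terms for terms in categories.values())


# Made-up sentences, for illustration only
sentences: List[str] = [
    "GENE1 fused to GFP localizes to the nucleus.",
    "GENE2 is expressed in roots.",
]
hits = [s for s in sentences if matches_all_categories(s)]
```

Here only the first sentence is kept, because the second contains a gene name but no assay term, cellular component, or verb from the example lists.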


Paths for categories on textpresso-dev.caltech.edu:

/data2/data-processing/data/arabidopsis/Data/ontology/lexica

1) CCC assay terms

2) CCC cellular components

3) CCC verbs

Juancarlos, can you fill in the exact path names here? --K

Path name for arabidopsis genes on textpresso-dev.caltech.edu:

4) genes (arabidopsis)

/data2/data-processing/data/arabidopsis/Data/ontology/lexica/genes_arabidopsis.0-gram

Juancarlos, can you confirm that this is the path we used for genes (arabidopsis)? --K


