Difference between revisions of "TAIR CCC"

From WormBaseWiki

Revision as of 18:02, 9 December 2010

Specifications for Curation Pipeline

Summary

This document outlines the Textpresso for Arabidopsis CCC pipeline for the initial trial run.

The trial run will be a search on all papers in the Textpresso for Arabidopsis corpus published in 2008.

Search results will be stored in three files:

1) all sentences returned by the search

2) sentences from papers already curated by TAIR for GO Cellular Component

3) sentences from papers not curated by TAIR for GO Cellular Component

Annotations can be made using an on-line curation form:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

with two different outputs:

1) a three-column 'user submission' output

2) a standard GO Gene Association File (GAF) format


Search Details

Paper Acquisition

  • TAIR corpus averages ~2500 papers/year
  • TAIR curator sends PDFs of papers to be included in the corpus to Textpresso (Michael) approximately every six months
  • Any additional papers for which TAIR does not have a PDF are downloaded by the Textpresso team, if possible


Textpresso Search

Categories
  • Search Arabidopsis corpus on:

http://www.textpresso.org/arabidopsis/

Using these four categories:

1) CCC assay terms

2) CCC cellular components

3) CCC verbs

4) genes (arabidopsis)
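As a rough illustration of what a four-category search means at the sentence level, the sketch below keeps a sentence only if it contains at least one term from every category. It assumes the categories are combined with AND and that matching is a simple case-insensitive whole-word lookup. The miniature lexica and the function names are invented for this example; they are not the Textpresso lexica or the Textpresso implementation.

```python
import re

# Hypothetical miniature lexica, for illustration only.
# The real category lexica are the ones used by the Textpresso search.
LEXICA = {
    "assay": {"gfp", "immunofluorescence"},
    "component": {"nucleus", "chloroplast"},
    "verb": {"localizes", "localized"},
    "gene": {"pap1"},
}

def matched_categories(sentence, lexica=LEXICA):
    """Return the names of the categories with at least one term in the sentence."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {name for name, terms in lexica.items() if words & terms}

def is_hit(sentence, lexica=LEXICA):
    """A sentence is a hit only if every category is represented (assumed AND)."""
    return matched_categories(sentence, lexica) == set(lexica)
```

With these toy lexica, a sentence such as "PAP1 GFP localizes to the nucleus." is a hit, while "PAP1 is expressed in roots." matches only the gene category and is dropped.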

Paths for categories on textpresso-dev.caltech.edu:

/data2/data-processing/data/arabidopsis/Data/ontology/lexica

1) CCC assay terms

2) CCC cellular component

3) CCC verbs

Juancarlos, can you fill in the exact path names we used here? --K

Path name for arabidopsis genes on textpresso-dev.caltech.edu:

4) genes (arabidopsis)

/data2/data-processing/data/arabidopsis/Data/ontology/lexica/genes_arabidopsis.0-gram

Juancarlos, can you confirm that this is the exact path we used for genes (arabidopsis)? --K

Filtering by Paper Section

Sectioning (i.e., Introduction, Materials and Methods, Results, etc.) of the Arabidopsis corpus was scheduled to be included with the next update in November 2010. Yuling - status?

Filtering by Year

For the initial round of searches for TAIR, the results will be filtered by year.

Year: 2008

Year information for Arabidopsis papers is found here:

/data2/data-processing/data/arabidopsis/Data/processedfiles/year/


Curation Details

Source Files

Juancarlos, just to confirm, will TAIR curators need the username and password to access the form? --K

The output of the searches will be sentences from papers published in 2008.

These sentences will then be available as three separate files:

1) results_2008_ccc_genesarabidopsis - this file contains all of the sentences returned by the search

2) results_2008_in_geneassociation - this file contains all of the sentences from papers that are already in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.

3) results_2008_not_geneassociation - this file contains all of the sentences from papers that are NOT in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.

The content of each of the files can be seen here:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/tair_ccc_datafiles/results_2008_ccc_and_genesarabidopsis

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/tair_ccc_datafiles/results_2008_in_geneassociation

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/tair_ccc_datafiles/results_2008_not_geneassociation

Files 2 and 3 above are generated by sorting the sentences in File 1 according to the reference identifiers and ontology aspect found in the TAIR gene_association file, which is available here:

ftp://ftp.geneontology.org/pub/go/gene-associations/

The file is zipped and named: gene_association.tair.gz

The GO FTP site can be accessed anonymously using the username anonymous and your email address as the password. Command-line FTP clients can use the command:

ftp ftp.geneontology.org:/pub/go/
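The generation of Files 2 and 3 from File 1 described above could be sketched roughly as follows. The column positions follow the standard GAF layout (column 6 is DB:Reference, column 9 is Aspect). Everything else is an assumption made for illustration: the function names are hypothetical, and each sentence is assumed to arrive as a (reference_id, sentence) pair whose reference identifier is written the same way as in the gene_association file. The actual scripts used for the trial run are not shown in this document.

```python
import gzip

def cc_references(gaf_lines):
    """Collect every reference cited by a Cellular Component (aspect C) annotation."""
    refs = set()
    for line in gaf_lines:
        if line.startswith("!"):              # GAF comment or header line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":   # column 9 holds the ontology aspect
            continue
        # Column 6 holds one or more pipe-separated reference identifiers.
        refs.update(ref for ref in cols[5].split("|") if ref)
    return refs

def split_sentences(sentences, curated_refs):
    """Split (reference_id, sentence) pairs by whether the paper has a C annotation."""
    curated, uncurated = [], []
    for reference_id, sentence in sentences:
        target = curated if reference_id in curated_refs else uncurated
        target.append((reference_id, sentence))
    return curated, uncurated

def read_cc_references(path):
    """Read a gzipped GAF file such as gene_association.tair.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return cc_references(handle)
```

Under these assumptions, the first list returned by split_sentences corresponds to results_2008_in_geneassociation and the second to results_2008_not_geneassociation.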

Curation Form

The curation form for TAIR can be found here: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

Features of the Curation Form

1) The Title and Abstract for each paper are displayed on the form

This information is found here:

/data2/data-processing/data/arabidopsis/Data/processedfiles/title/

/data2/data-processing/data/arabidopsis/Data/processedfiles/abstract/

Juancarlos, please confirm that this is where this information is found. --K

2) The three boxes on the left side of the form are labeled:

First: Gene/Protein Name

Second: Component Term in Sentence

Third: CC Term in GO

3) There is a color-coded key to category terms above the sentences:

blue = gene product, green = verb, orange = assay term, and red/brown = component term

4) The display of Gene/Protein name includes each individual symbol in the sentence as well as each symbol mapped to a TAIR locus name. The mappings are taken from the gene_aliases file on the TAIR ftp site:

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

For example, the symbol PAP1 maps to more than one locus name, so each combination is listed:

PAP1:AT2G27190

PAP1:AT3G16500
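A minimal sketch of how such SYMBOL:LOCUS entries could be built from an alias file is given below. It assumes a tab-delimited file whose first two columns are the locus name and the symbol, possibly preceded by a header row starting with locus_name. That layout and both function names are assumptions for illustration and have not been checked against gene_aliases.20101027.

```python
from collections import defaultdict

def load_aliases(lines):
    """Map each gene symbol to the list of locus names that use it."""
    symbol_to_loci = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip short lines and the assumed header row.
        if len(fields) < 2 or fields[0] == "locus_name":
            continue
        locus, symbol = fields[0], fields[1]
        if locus not in symbol_to_loci[symbol]:
            symbol_to_loci[symbol].append(locus)
    return symbol_to_loci

def display_options(symbol, symbol_to_loci):
    """Return the bare symbol followed by one SYMBOL:LOCUS entry per mapped locus."""
    loci = symbol_to_loci.get(symbol, [])
    return [symbol] + [f"{symbol}:{locus}" for locus in loci]
```

A symbol with no entry in the alias file yields only the bare symbol, so the curator still sees every symbol found in the sentence.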

Kimberly, check with Tanya about what the update schedule is for this file, i.e. how often should we update this file on the curation site, weekly, monthly?

5) The paper object above each matching sentence is a link out to TAIR's paper object in their curation database.

Base URL: http://germany.tairgroup.org:8090/pub/DisplayArticle?article_id=19905
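The link-out could be assembled along these lines. The sketch assumes that article_id is the only part of the URL that changes from paper to paper; the function name is hypothetical, and how the form obtains the TAIR article_id for a given paper is not specified here.

```python
from urllib.parse import urlencode

# Endpoint taken from the Base URL above, without the example article_id.
TAIR_ARTICLE_ENDPOINT = "http://germany.tairgroup.org:8090/pub/DisplayArticle"

def tair_article_url(article_id):
    """Build the link-out URL for one TAIR paper object."""
    query = urlencode({"article_id": article_id})
    return f"{TAIR_ARTICLE_ENDPOINT}?{query}"
```

Calling tair_article_url(19905) reproduces the example URL given above.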


