TAIR CCC

Specifications for Curation Pipeline

Summary

This document outlines the Textpresso for Arabidopsis pipeline for GO Cellular Component Curation (CCC), as set up for the initial trial run.

The trial run will search all papers published in 2008 that are in the Textpresso for Arabidopsis corpus.

Search results will be stored in three files:

1) all sentences returned by the search

2) sentences from papers already curated by TAIR for GO Cellular Component

3) sentences from papers not curated by TAIR for GO Cellular Component

Annotations can be made using an on-line curation form:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

with two different outputs:

1) a three-column 'user submission' output

2) a standard GO Gene Association File (GAF) format


Search Details

Paper Acquisition

  • The TAIR corpus averages ~2500 papers/year
  • A TAIR curator sends PDFs of papers to be included in the corpus to Textpresso (Michael) approximately every six months
  • Any additional papers for which TAIR does not have a PDF are downloaded by the Textpresso team, if possible

Textpresso Search

Categories
  • Search Arabidopsis corpus on:

http://www.textpresso.org/arabidopsis/

Using these four categories:

1) CCC assay terms

2) CCC cellular components

3) CCC verbs

4) genes (arabidopsis)
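
The four-category search can be illustrated with a minimal sketch. It assumes a sentence qualifies only when it contains at least one term from each category; this page does not spell out the matching rule, so treat that as an assumption. The terms in LEXICA are toy examples rather than the contents of the real lexica, and the function names are hypothetical. Only single-word terms are handled.

```python
import re

# Toy lexica, keyed by category name. The terms are illustrative only;
# the real category lexica live on the Textpresso server.
LEXICA = {
    "CCC assay terms": {"gfp", "immunofluorescence"},
    "CCC cellular components": {"nucleus", "chloroplast", "plastid"},
    "CCC verbs": {"localizes", "localized", "accumulates"},
    "genes (arabidopsis)": {"phot1", "cry2"},
}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def matches_all_categories(sentence, lexica):
    """Return True if the sentence has at least one term from every category."""
    tokens = set(tokenize(sentence))
    return all(tokens & terms for terms in lexica.values())


if __name__ == "__main__":
    example = "The CRY2-GFP fusion localized to the nucleus."
    print(matches_all_categories(example, LEXICA))
```

A sentence that mentions a gene but no assay term, component, or verb would be rejected under this rule.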

Paths for categories on textpresso-dev.caltech.edu:

/data2/data-processing/data/arabidopsis/Data/ontology/lexica

1) CCC assay terms

2) CCC cellular component

3) CCC verbs

Juancarlos, can you fill in the exact path names we used here? --K

Path name for arabidopsis genes on textpresso-dev.caltech.edu:

4) genes (arabidopsis)

/data2/data-processing/data/arabidopsis/Data/ontology/lexica/genes_arabidopsis.0-gram

Juancarlos, can you confirm that this is the exact path we used for genes (arabidopsis)? --K

Filtering by Paper Section

Sectioning (i.e., Introduction, Materials and Methods, Results, etc.) of the Arabidopsis corpus was scheduled to be included with the next update in November 2010. Yuling - status?

Filtering by Year

For the initial round of searches for TAIR, the results will be filtered by year.

Year: 2008

Year information for Arabidopsis papers is found here:

/data2/data-processing/data/arabidopsis/Data/processedfiles/year/
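
The year filter can be sketched as below. The layout of the files in the year directory is not described on this page, so the sketch assumes the year information has already been loaded into a dictionary mapping paper identifier to publication year. The identifiers and the function name are hypothetical.

```python
def filter_by_year(sentences, year_by_paper, year):
    """Keep (paper_id, sentence) pairs whose paper was published in `year`.

    Papers with no recorded year are dropped.
    """
    return [
        (paper_id, sentence)
        for paper_id, sentence in sentences
        if year_by_paper.get(paper_id) == year
    ]


if __name__ == "__main__":
    # Hypothetical identifiers and years, for illustration only.
    years = {"paperA": 2008, "paperB": 2007}
    hits = [("paperA", "sentence 1"), ("paperB", "sentence 2"), ("paperC", "sentence 3")]
    print(filter_by_year(hits, years, 2008))
```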


Curation Details

Curation form: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

  • Output of the searches will be sentences from papers published in 2008.

These sentences will then be sorted into two separate files based on whether any Cellular Component annotations exist for that paper in TAIR's gene association file.

To do this, we will need to sort by both reference identifier and ontology aspect:

1) The TAIR gene association file (GAF) can be downloaded from the GO FTP site:

ftp://ftp.geneontology.org/pub/go/gene-associations/

The file is gzip-compressed and named: gene_association.tair.gz

The GO FTP site can be accessed anonymously using the username anonymous and your email address as the password. Command-line FTP clients can use the command:

ftp ftp.geneontology.org:/pub/go/
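
The split into curated and uncurated papers can be sketched as follows. The sketch relies on the standard GAF column layout, in which column 6 holds the reference (several identifiers may be separated by "|") and column 9 holds the ontology aspect, with "C" meaning Cellular Component. It assumes each search result carries a reference identifier in the same form as the GAF uses; the function names are hypothetical and this is not the script used for the trial run.

```python
import gzip


def cc_curated_references(gaf_path):
    """Collect reference identifiers that have at least one Cellular Component
    annotation in a gzip-compressed GAF file."""
    refs = set()
    with gzip.open(gaf_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!"):  # header and comment lines
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 9:  # skip malformed rows
                continue
            if cols[8] == "C":  # column 9: aspect
                # column 6: one or more references separated by "|"
                refs.update(ref for ref in cols[5].split("|") if ref)
    return refs


def partition_sentences(sentences, curated_refs):
    """Split (reference, sentence) pairs into curated and uncurated lists."""
    curated, uncurated = [], []
    for ref, sentence in sentences:
        target = curated if ref in curated_refs else uncurated
        target.append((ref, sentence))
    return curated, uncurated
```

Calling cc_curated_references("gene_association.tair.gz") on the downloaded file and passing the result to partition_sentences would yield the contents of the second and third output files described in the Summary.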


