TAIR

Gene Ontology Curation at TAIR

Specifications for Curation Pipeline

Summary

This document outlines the Textpresso for Arabidopsis pipeline for Cellular Component Curation (CCC). The initial trial run will cover all papers in the Textpresso for Arabidopsis corpus published in 2008.

Search results will be stored in three files:

1) all sentences returned by the search

2) sentences from papers already curated by TAIR for GO Cellular Component

3) sentences from papers not curated by TAIR for GO Cellular Component

Annotations can be made using an on-line curation form with two different outputs:

1) a three-column 'user submission' output

2) a standard GO Gene Association File (GAF) format

Pipeline Details

Paper Acquisition

TAIR corpus averages ~2500 papers/year

TAIR curator sends PDFs of papers to be included in the corpus to Textpresso (Michael) approximately every six months

Any additional papers for which TAIR does not have a PDF are downloaded by Textpresso team, if possible

Textpresso Search: Categories

Search Arabidopsis corpus using four categories:

Path name for first three categories on textpresso-dev.caltech.edu:

/data2/data-processing/data/arabidopsis/Data/ontology/lexica

1) CCC assay terms

2) CCC cellular component

3) CCC verbs

Path name for arabidopsis genes on textpresso-dev.caltech.edu:

/data2/data-processing/data/arabidopsis/Data/ontology/lexica/genes_arabidopsis.0-gram

4) genes (arabidopsis)
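
As an illustration of the four-category search, here is a minimal sketch. It assumes (this is not stated above) that a sentence counts as a match only when it contains at least one term from each of the four categories. The tiny lexica and the function names are hypothetical stand-ins, not the contents of the real lexicon files.

```python
import re

# Hypothetical miniature lexica; the real ones live in the lexica directory above.
LEXICA = {
    "assay": {"gfp", "immunofluorescence"},
    "component": {"nucleus", "chloroplast"},
    "verb": {"localizes", "localized"},
    "gene": {"pap1", "src2"},
}

def matched_categories(sentence, lexica=LEXICA):
    """Return the set of category names with at least one term in the sentence."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {name for name, terms in lexica.items() if tokens & terms}

def is_ccc_match(sentence, lexica=LEXICA):
    """Assumption: a sentence qualifies only if every category is represented."""
    return matched_categories(sentence, lexica) == set(lexica)
```

Single-token matching is used here only to keep the sketch short; multi-word category terms would need phrase matching.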

For celegans, we use the path :

 /data2/data-processing/data/celegans/Data/indices

followed by the subdirectory of each of these categories :

 body
 introduction
 results
 materials
 discussion
 references

followed by these four files :

 protein_celegans
 localization_cell_components_082208
 localization_verbs_082208
 localization_experimental_082208

But for arabidopsis, do we want to use the path above? I don't see the three ccc files in that directory. In the genes_arabidopsis.0-gram I see stuff like :

_ORB_ AT _CRB_ SRC2 source='arabidopsis' ##### _ORB_ AT _CRB_ VAP source='arabidopsis' #####

While the /data2/data-processing/data/celegans/Data/indices/body/semantic/categories/protein_celegans file we use has data like :

WBPaper00001038# 113-17

WBPaper00001320# 90-3 72-19 72-28 47-14 47-26

WBPaper00001364# 248-1 220-18

I don't think we're talking about the same thing, I don't know what files to use.

At /data2/data-processing/data/celegans/Data/indices/body/semantic/categories/ there are :

 ccc
 ccc_ps2_mc2
 ccc_ps3_mc2
 ccc_ps4_mc2

as well as genes_arabidopsis. I don't want to write stuff until there's confirmation that this is correct (I don't think it is). -- J

Textpresso Search for Cellular Component Annotations: Filtering on Year

For the initial round of searches for TAIR, the results will be filtered by year.

Year: 2008

To find year information for TAIR papers:

/data2/data-processing/data/arabidopsis/Data/processedfiles/year/

Running this search on the Textpresso for Arabidopsis site on 10/28/2010 gives 55814 matches in 1421 documents, with total paper scores ranging from 660 to 7.

A preliminary check of the results suggests that sorting by score, high to low, would be most productive. This can always be changed.
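
The year filter and score sort could look like the sketch below. It assumes each paper's year is available as a mapping from paper ID to year (how that mapping is read from the year directory is not specified here) and that total paper scores are already computed. The function name and the sample values in the usage note are illustrative only.

```python
def papers_for_year(scores, years, year):
    """Return (paper_id, score) pairs for one publication year, highest score first.

    scores: dict of paper_id -> total paper score
    years:  dict of paper_id -> publication year (source layout is an assumption)
    """
    selected = [(pid, s) for pid, s in scores.items() if years.get(pid) == year]
    # Sort by score descending; ties are broken by paper ID for a stable order.
    return sorted(selected, key=lambda item: (-item[1], item[0]))
```

For example, with scores {"a": 7, "b": 660, "c": 50} and years {"a": 2008, "b": 2008, "c": 2007}, calling papers_for_year(scores, years, 2008) returns [("b", 660), ("a", 7)].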

Papers curated for CCC in 2008: we may be able to identify these with Textpresso. Does TAIR want a static filter, or one that changes dynamically as their annotations are updated? If dynamic, then we need a pipeline/workflow to do this.

Which machine should this run on, i.e. for search results and curation?

Search results and curation forms will be stored on textpresso-dev.

From WormBase documentation:

Script explanation - Juancarlos

The script that gets stuff to tazendra from textpresso is :

 /home/postgres/public_html/cgi-bin/data/ccc_gocuration/get_newset.pl

called by :

 /home/postgres/work/pgpopulation/textpresso/wrapper.sh

(For the ontology annotator, there have been no new matches since September 13, 2010)

The full results are on textpresso-dev at :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/good_sentences_file.*

The new matches are at :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_sentences_file.*

The wrapper script /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/wrapper.pl runs every Wednesday at 2am. It compares the SVM results from :

 http://caprica.caltech.edu/celegans/svm_results/Juancarlos/otherexpr

to the previous SVM results stored in /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/otherexpr

It emails Kimberly whether or not there are new SVM results.

If the results are the same, nothing happens. If there are new SVM results, it runs :

* /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_go_gene_component.pl

which outputs good results to :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/good_sentences_file.<date>

* then runs /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_recentsent.pl

which compares the new good_sentences_file.<date> with the previous good_sentences_file.<date> and outputs :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_sentences_file.<date>

with the current results that are in the SVM list. It also outputs :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_cccfile

which points at the new recent_sentences_file.<date>, so that the tazendra cronjob knows there's a new file.
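
The comparison step can be pictured with the following sketch. It is not the Perl in get_recentsent.pl. It assumes each sentence record is a (paper ID, sentence) pair, that the SVM list is a set of paper IDs, and that "recent" means present in the new file but absent from the previous one. The function name is hypothetical.

```python
def recent_sentences(new_records, old_records, svm_papers):
    """Return records that are in the new file but not the previous one,
    restricted to papers in the SVM list. Order follows new_records."""
    old = set(old_records)
    return [rec for rec in new_records if rec not in old and rec[0] in svm_papers]
```

If the new and previous files are identical, the result is an empty list, which matches the "nothing happens" case described above.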

You can check the recent good_sentences_file.* and see if any of those should be in SVM, or you can look at the full text of the SVM results and see if any of those should be in the good_sentences_file.*

If you don't have a textpresso-dev account, you can ask Michael, and he'll ask the ITS people. I log on with my ITS account.

If stuff should be in the good_sentences_file.* and isn't, check that the categories have what they should have, and let me know which category isn't matching what it should.

The good_sentences_file.* is generated by :

 /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_go_gene_component.pl

The category files being used are in :

 /data2/data-processing/data/celegans/Data/indices/body/semantic/categories

files :

 protein_celegans 
 localization_cell_components_082208 
 localization_verbs_082208 
 localization_other_120107

Curation - Source Files for the Curation Form: Sorting by Curation Status

Curation form: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

Output of the searches will be sentences from papers published in 2008.

These sentences will then be sorted into two separate files based on whether any Cellular Component Annotations exist for that paper in TAIR's Gene Association file.

To do this we will need to sort by both reference identifier and ontology aspect:

1) The TAIR gene association file (GAF) can be downloaded from the GO FTP site:

ftp://ftp.geneontology.org/pub/go/gene-associations/

The file is zipped and named: gene_association.tair.gz

The GO FTP site can be accessed anonymously using the username anonymous and your email address as the password. Command-line FTP clients can use the instruction

ftp ftp.geneontology.org:/pub/go/

Modifications to CCC Curation Form for TAIR

1) Display Title and Abstract on the form

Note: Space for this already exists on the form; it is just not being filled in.

Check with Michael

/data2/data-processing/data/arabidopsis/Data/processedfiles/year/

2) Label each of the three curation boxes at the top for clarity:

First: Gene/Protein Name

Second: Component Term in Sentence

Third: CC Term in GO

3) Provide color-coded key to category terms, i.e. blue = gene product, green = verb, orange = assay term, and red/brown = component term

4) Display of Gene/Protein name is more complicated for TAIR, because they have cases where the same symbol can map to multiple, unique locus names. For example, PAP1 is a symbol for at least four different locus names.

PAP1 = AT2G27190

PAP1 = AT3G16500 etc.

We will need to use the information in the ftp file below to present, when applicable, all possible combinations of symbol and locus names so that the TAIR curators can select the right one.

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

Check with Tanya about what their update schedule is
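
One way to present every symbol and locus combination is to index the alias file by symbol, as in this sketch. It assumes the file is tab-delimited with locus name in the first column and symbol in the second; the function name is hypothetical. The sample rows reuse the PAP1 example above, with the full-name column left empty.

```python
from collections import defaultdict

def loci_by_symbol(lines):
    """Map each symbol to the sorted list of locus names that use it."""
    index = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # skip blank or malformed rows
        locus, symbol = fields[0], fields[1]
        index[symbol].add(locus)
    return {symbol: sorted(loci) for symbol, loci in index.items()}

sample = [
    "AT2G27190\tPAP1\t\n",
    "AT3G16500\tPAP1\t\n",
]
print(loci_by_symbol(sample)["PAP1"])  # ['AT2G27190', 'AT3G16500']
```

The form could then show a selection list whenever a symbol maps to more than one locus. A header row, if the file has one, would need to be skipped before indexing.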

5) Make each paper object above the matching sentence a link out to TAIR's paper object in their curation database.

Base URL (shown here with an example article_id): http://germany.tairgroup.org:8090/pub/DisplayArticle?article_id=19905

Output Files

TAIR would like two output files:

1) 3-column user submission file

Locus Name	GO ID	Paper ID
Column 1 of ftp mapping file, e.g. AT1G47389	GO:0005654	PMID:21074051 or TAIR:42184

For GO ID: map entries in Name or Synonym Field to GO ID using this file:

http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo
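
The name or synonym lookup could be built from the OBO file roughly as follows. This is a sketch under assumptions: it reads only [Term] stanzas, lower-cases names and synonyms for matching, and keeps the first GO ID seen when two terms share a string. The function name and the short sample text are illustrative.

```python
import re

# Matches the quoted text of a synonym line, allowing escaped characters.
SYNONYM_RE = re.compile(r'synonym: "((?:[^"\\]|\\.)*)"')

def name_to_go_id(obo_text):
    """Map lower-cased term names and synonyms to GO IDs from OBO 1.2 text."""
    mapping = {}
    current_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif not in_term:
            continue
        elif line.startswith("id: "):
            current_id = line[4:]
        elif line.startswith("name: ") and current_id:
            mapping.setdefault(line[6:].lower(), current_id)
        elif line.startswith("synonym: ") and current_id:
            match = SYNONYM_RE.match(line)
            if match:
                mapping.setdefault(match.group(1).lower(), current_id)
    return mapping

sample = """format-version: 1.2

[Term]
id: GO:0005634
name: nucleus
namespace: cellular_component
synonym: "cell nucleus" EXACT []

[Typedef]
id: part_of
name: part_of
"""
print(name_to_go_id(sample)["cell nucleus"])  # GO:0005634
```

For this pipeline it would probably make sense to also skip obsolete terms and keep only terms whose namespace is cellular_component; that filtering is left out here for brevity.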

2) a gene_association file (GAF, the standard GO annotation file)

Output - Gene Association File (GAF)

We'll use GAF2.0.

Column	Content	Required	Cardinality	TAIR Entry
1	DB	Required	1	TAIR
2	DB Object ID	Required	1	locus:ASK TANYA
3	DB Object Symbol	Required	1	Column 2 of ftp mapping file, ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027
4	Qualifier	Optional	0 or greater	NULL
5	GO ID	Required	1	GO:0005654
6	DB Reference	Required	1 or greater	PMID:21074051 or, if no PMID, TAIR:42184
7	Evidence Code	Required	1	IDA
8	With or From	Optional	0 or greater	NULL
9	Aspect	Required	1	C
10	DB Object Name	Optional	0 or 1	Column 3 of ftp mapping file, ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027
11	DB Object Synonym	Optional	0 or greater	ASK TANYA
12	DB Object Type	Required	1	protein
13	Taxon(|taxon)	Required	1 or 2	taxon:3702
14	Date	Required	1	Date annotation is made
15	Assigned By	Required	1	TAIR
16	Annotation Extension	Optional	0 or greater	NULL
17	Gene Product Form ID	Optional	0 or 1	NULL
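
A single annotation line following the table above could be assembled as in this sketch. It assumes that NULL means an empty field and that the date is written as YYYYMMDD, as in the GAF format. The values for columns 2 and 11 are still open questions, so column 2 is passed in by the caller and column 11 is left empty. The function name is hypothetical.

```python
from datetime import date

def gaf_line(db_object_id, symbol, go_id, reference, full_name="", when=None):
    """Build one tab-delimited GAF 2.0 line using the TAIR entries in the table."""
    when = when or date.today()
    columns = [
        "TAIR",                   # 1  DB
        db_object_id,             # 2  DB Object ID (format still to be confirmed)
        symbol,                   # 3  DB Object Symbol
        "",                       # 4  Qualifier
        go_id,                    # 5  GO ID
        reference,                # 6  DB Reference
        "IDA",                    # 7  Evidence Code
        "",                       # 8  With or From
        "C",                      # 9  Aspect
        full_name,                # 10 DB Object Name
        "",                       # 11 DB Object Synonym (still to be confirmed)
        "protein",                # 12 DB Object Type
        "taxon:3702",             # 13 Taxon
        when.strftime("%Y%m%d"),  # 14 Date
        "TAIR",                   # 15 Assigned By
        "",                       # 16 Annotation Extension
        "",                       # 17 Gene Product Form ID
    ]
    return "\t".join(columns)
```

Keeping the column order in one list makes it easy to check the output against the table row by row.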

2) References are in Column 6 of the GAF.

In TAIR's file, references are displayed in column 6 as: TAIR:Publication:501681566|PMID:12068095

We will need to use the PMID for filtering.

On textpresso-dev, the TAIR PMIDs can be found here (for example):

/data2/data-processing/data/arabidopsis/Data/processedfiles/accession/1234

contains the PMID number of TAIR pub 1234

The Cellular Component aspect (C) is in Column 9 of the GAF.
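
The sorting by reference identifier and aspect might be sketched like this. It assumes the GAF is tab-delimited with comment lines starting with "!", that column 6 separates multiple references with "|", and that sentence records are (PMID, sentence) pairs. Both function names are hypothetical.

```python
def curated_cc_pmids(gaf_lines):
    """Collect PMIDs that have at least one Cellular Component (aspect C) annotation."""
    pmids = set()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":
            continue  # column 9 holds the aspect
        for ref in cols[5].split("|"):  # column 6 holds the references
            if ref.startswith("PMID:"):
                pmids.add(ref[len("PMID:"):])
    return pmids

def split_by_curation(sentences, pmids):
    """Split (pmid, sentence) pairs into (curated, not_curated) lists."""
    curated = [s for s in sentences if s[0] in pmids]
    not_curated = [s for s in sentences if s[0] not in pmids]
    return curated, not_curated
```

The two lists returned by split_by_curation correspond to the "already curated" and "not curated" sentence files described in the Summary.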

Curation will use a version of the CCC curation form found here:

Location of elegans curation form:

http://tazendra.caltech.edu/~postgres/cgi-bin/ccc_go_curation.cgi

Each file will be available as a source file in the curation form for curators to assess search results.

http://textpresso-dev.caltech.edu/azurebrd/tair_ccc_datafiles/

Results of searches will be presented in the curation form, in a three-column format: 1) gene or protein name, 2) cellular component category term, 3) if available, a suggested GO term from the existing category term-GO term relationship index.

The suggested GO term will come from the existing relationship index. New relationships will be added.

? for Michael - put forms/results on textpresso or textpresso-dev

Curation - Output to Gene Association File

Curation will be output in the form of a Gene Association File (GAF) that can be picked up by the TAIR curators and added to the main Gene Association File that they submit to GO.

One possible variation on the output file is a file that maps the sentence to the GO annotation and the paper ID.

Specifications for GAF2.0 can be found here:

http://www.geneontology.org/GO.format.gaf-2_0.shtml

Output will need to reference the Arabidopsis gene mapping file here:

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

There are three columns:

1. locus name (can be a symbol, if this is a genetic locus, or an AGI identifier, if it is cloned)

2. symbol

3. full name (not always available)

? Is there a mapping of locus names to TAIR gene ids already in Textpresso?

Check other columns in TAIR file to get output right.

Features to Implement for Future Use

Paper sectioning - this can reduce the number of false positives by allowing curators to select from which section of the paper they will retrieve sentences.

Use the Textpresso home page for searches; output could be sent to the curation form with a link 'Use these results for curation'.

SVM results - combining the results of an SVM document classification algorithm with Textpresso search results readily identifies potential false negative and false positive papers.

Arabidopsis-specific categories - editing the categories created for C. elegans curation to better reflect use of terms in the Arabidopsis literature.

Collaboration with GO

Possibly propose and populate a new class of synonym, Published_as, in GO to catalog how component terms are expressed in the published literature. This would be a general aid to all text mining efforts.
