Difference between revisions of "TAIR CCC"

From WormBaseWiki
 
2) References are in Column 6 of the GAF.

In TAIR's file, references are displayed in column 6 as: TAIR:Publication:501681566|PMID:12068095

We will need to use the PMID for filtering.

On textpresso-dev, the TAIR PMIDs can be found here (for example):

/data2/data-processing/data/arabidopsis/Data/processedfiles/accession/1234

contains the PMID number of TAIR pub 1234

The Cellular Component aspect (C) is in Column 9 of the GAF.

  • Curation will use a version of the CCC curation form found here:

Location of elegans curation form:

http://tazendra.caltech.edu/~postgres/cgi-bin/ccc_go_curation.cgi

  • Each file will be available as a source file in the curation form for curators to assess search
 

Revision as of 22:21, 9 December 2010

Specifications for Curation Pipeline

Summary

This document outlines the Arabidopsis Textpresso for CCC pipeline as it will be used for the initial trial run.

The trial run will be a search on all papers in the Textpresso for Arabidopsis corpus published in 2008.

Search results will be stored in three files:

1) all sentences returned by the search

2) sentences from papers already curated by TAIR for GO Cellular Component

3) sentences from papers not curated by TAIR for GO Cellular Component

Annotations can be made using an on-line curation form:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

with two different outputs:

1) a three-column 'user submission' output

2) a standard GO Gene Association File (GAF) format


Search Details

Paper Acquisition

  • TAIR corpus averages ~2500 papers/year
  • TAIR curator sends PDFs of papers to be included in the corpus to Textpresso (Michael) approximately every six months
  • Any additional papers for which TAIR does not have a PDF are downloaded by the Textpresso team, if possible


Textpresso Search

Categories
  • Search Arabidopsis corpus on:

http://www.textpresso.org/arabidopsis/

Using these four categories:

1) CCC assay terms

2) CCC cellular components

3) CCC verbs

4) genes (arabidopsis)

Paths for categories on textpresso-dev.caltech.edu:

/data2/data-processing/data/arabidopsis/Data/ontology/lexica

1) CCC assay terms

2) CCC cellular component

3) CCC verbs

Juancarlos, can you fill in the exact path names we used here? --K

Path name for arabidopsis genes on textpresso-dev.caltech.edu:

4) genes (arabidopsis)

/data2/data-processing/data/arabidopsis/Data/ontology/lexica/genes_arabidopsis.0-gram

Juancarlos, can you confirm that this is the exact path we used for genes (arabidopsis)? --K

Filtering by Paper Section

Sectioning (i.e., Introduction, Materials and Methods, Results, etc.) of the Arabidopsis corpus was scheduled to be included with the next update in November 2010. Yuling - status?

Filtering by Year

For the initial round of searches for TAIR, the results will be filtered by year.

Year: 2008

Year information for Arabidopsis papers is found here:

/data2/data-processing/data/arabidopsis/Data/processedfiles/year/
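
As a rough illustration of the year filter, the sketch below assumes that each file in the year directory is named after a paper ID and contains only that paper's publication year. That layout is an assumption and has not been confirmed for the Textpresso processed files; the function name papers_published_in is hypothetical.

```python
from pathlib import Path


def papers_published_in(year_dir, year):
    """Return the IDs of papers whose year file contains the given year.

    Assumption: each file in year_dir is named after a paper ID and
    its contents are just the publication year (e.g. "2008").
    """
    wanted = str(year)
    paper_ids = []
    for path in sorted(Path(year_dir).iterdir()):
        if not path.is_file():
            continue
        # Compare the stripped file contents with the requested year.
        if path.read_text().strip() == wanted:
            paper_ids.append(path.name)
    return paper_ids
```

For the trial run this would be called with the year directory above and 2008, and the resulting paper IDs used to restrict the search results.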


Curation Details

Source Files

Juancarlos, just to confirm, will TAIR curators need the username and password to access the form? --K

The output of the searches will be sentences from papers published in 2008.

These sentences will then be available as three separate files:

1) results_2008_ccc_genesarabidopsis - this file contains all of the sentences returned by the search

2) results_2008_in_geneassociation - this file contains all of the sentences from papers that are already in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.

3) results_2008_not_geneassociation - this file contains all of the sentences from papers that are NOT in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.

The content of each of the files can be seen here:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/tair_ccc_datafiles/results_2008_ccc_and_genesarabidopsis

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/tair_ccc_datafiles/results_2008_in_geneassociation

http://tazendra.caltech.edu/~azurebrd/cgi-bin/data/tair_ccc_datafiles/results_2008_not_geneassociation

Files 2 and 3 above are generated by sorting the sentences in File 1 into two groups, using the reference identifiers and ontology aspect found in the TAIR gene_association file, available here:

ftp://ftp.geneontology.org/pub/go/gene-associations/

The file is zipped and named: gene_association.tair.gz

The GO FTP site can be accessed anonymously using the username anonymous and your email address as the password. Command-line FTP clients can use the instruction

ftp ftp.geneontology.org:/pub/go/
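
The split of File 1 into Files 2 and 3 can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes that each search result has already been paired with a reference string of the form PMID:nnn, and that the gene_association file lists pipe-separated references in column 6 and the aspect in column 9. The function names are hypothetical.

```python
def curated_cc_pmids(gaf_lines):
    """Collect PMIDs cited by Cellular Component (aspect C) annotations."""
    pmids = set()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != "C":  # column 9 = Aspect
            continue
        for ref in cols[5].split("|"):  # column 6 = DB Reference
            if ref.startswith("PMID:"):
                pmids.add(ref)
    return pmids


def split_by_curation_status(sentences, pmids):
    """Split (pmid, sentence) pairs into already-curated and not-yet-curated."""
    in_ga, not_in_ga = [], []
    for pmid, sentence in sentences:
        # Assumption: each sentence carries its paper's reference as "PMID:nnn".
        (in_ga if pmid in pmids else not_in_ga).append((pmid, sentence))
    return in_ga, not_in_ga
```

The zipped file can be read line by line with Python's gzip.open(path, "rt") and passed directly to curated_cc_pmids.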

Curation Form

The curation form for TAIR can be found here: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

Features of the Curation Form

1) The Title and Abstract for each paper are displayed on the form

This information is found here:

/data2/data-processing/data/arabidopsis/Data/processedfiles/title/

/data2/data-processing/data/arabidopsis/Data/processedfiles/abstract/

Juancarlos, please confirm that this is where this information is found. --K

2) The three boxes on the left side of the form are labeled:

  • First: Gene/Protein Name
  • Second: Component Term in Sentence
  • Third: CC Term in GO

3) There is a color-coded key to category terms above the sentences:

blue = gene product, green = verb, orange = assay term, and red/brown = component term

4) The display of Gene/Protein name includes each individual symbol in the sentence as well as each symbol mapped to a TAIR locus name. The mappings are taken from the gene_aliases file on the TAIR ftp site:

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

For example, the symbol PAP1 is displayed with both of the locus names it maps to:

PAP1:AT2G27190 PAP1:AT3G16500
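
The symbol-to-locus lookup behind this display could look roughly like the sketch below. It assumes the gene_aliases file is tab-delimited with the locus name in column 1 and the symbol in column 2, and that any header line has been removed; the function names are hypothetical and this is not the form's actual code.

```python
from collections import defaultdict


def load_symbol_to_loci(lines):
    """Map each gene symbol to every locus name it is listed under."""
    symbol_to_loci = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2 or not cols[0] or not cols[1]:
            continue  # skip blank or incomplete rows
        locus, symbol = cols[0], cols[1]
        if locus not in symbol_to_loci[symbol]:
            symbol_to_loci[symbol].append(locus)
    return dict(symbol_to_loci)


def display_pairs(symbol, symbol_to_loci):
    """Format symbol:locus pairs, one pair per mapped locus."""
    return " ".join(f"{symbol}:{locus}" for locus in symbol_to_loci.get(symbol, []))
```

A symbol that maps to more than one locus is kept as a list so that the curator, rather than the script, decides which locus applies.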

Kimberly, check with Tanya about what the update schedule is for this file, i.e. how often should we update this file on the curation site, weekly, monthly?

5) The paper object ID above the matching sentence links out to TAIR's paper object in their curation database. The curator would need to be logged into the database in order for the link to work.

Base URL: http://germany.tairgroup.org:8090/pub/DisplayArticle?article_id=19905

Making an Annotation

To make a new GO annotation, curators need to select, or create, an entry in each of the three boxes on the left-hand side of the form. Once a selection is made, the region will be highlighted in blue. For the last box, CC Term in GO, curators can either select one of the suggested GO terms in the list or enter a new one. Curators then select curate from the list of radio buttons above the sentence and, when ready to enter the annotations, click on Make connections:Submit at the top of the page.

If there is no list of GO terms in the third box, that means that a GO annotation using the term in the second box has not yet been made. In this case, the curator will need to enter the new GO term manually. For the next iteration of the form, we will work on adding an autocomplete and drop-down feature to this step.

Marking Sentences Not Used for Curation

If an annotation cannot be made from a sentence, then curators may record the reason that an annotation was not made. Keeping track of these sentences will help build up a training set for improving search results.

The reasons are described below:

Already curated: if a curator does not wish to make another annotation for information previously curated, they can select this radio button. In C. elegans curation, we are starting to handle these cases a bit differently: some common markers no longer show up in the curation boxes, since they arise very frequently.

Scrambled sentence: if a sentence has become scrambled during the PDF-to-text conversion, it can be marked as such here. These are becoming less frequent as the conversion improves.

False positive: if a returned sentence has nothing to do with subcellular localization, then it is marked as a false positive. For example:

In contrast, PP2AA3 rescues root tip organization weakly even when expression is driven by the RCN1 promoter, demonstrating a more stringent requirement for A subunit function in the root apical meristem.

Not curatable: This is intended to mark sentences that may describe subcellular localization, but the information contained in them would not normally be curated for GO. For example, the localization is for a mutant protein, or the localization is for the wild-type protein in a mutant background. An example sentence:

No alteration in expression levels of soluble GFP or GFP::RAB-3 was observed in the synapses, cell body or axon in uba-1 animals (Figure 1A: d1-d6, f1-f6, 1C, 1D, Figure S4F, S4G, S4H).

Dumping Annotation Files

The annotation files can be dumped in two different formats by selecting the desired format from the drop-down menu. For TAIR, the two options are a three-column, tab-delimited format or a 17-column gene_association file (GAF 2.0) format.

Here are examples of what would be contained in the files using the sample sentence below:

SentenceID 9 -- S 7 P 43065 S s3 E The first enzyme, gamma-glutamate cysteine ligase (GSH1), responsible for synthesis of gamma-glutamylcysteine (gamma-EC), is, in Arabidopsis, exclusively located in the plastids, whereas the second enzyme, glutathione synthetase (GSH2), is located in both plastids and cytosol.


1) Three-column, tab-delimited format

AT4G23100 GO:0009536 TAIR:43065

AT5G27380 GO:0009536 TAIR:43065

AT5G27380 GO:0005829 TAIR:43065

For column 1, map the gene symbol (second column) to the locus name (first column) in the gene_aliases file:

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

For column 2, use the geneontology.obo file to map the GO term to the GO ID:

http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo

For column 3, use the TAIR paper ID, prefixed with TAIR: (for example, TAIR:43065).
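
The three mappings above can be sketched together as follows. This is an illustrative outline only: it assumes the OBO file uses standard [Term] stanzas with id: and name: lines, and that each symbol has already been resolved to a single locus name. The function names are hypothetical.

```python
def load_go_name_to_id(obo_lines):
    """Build a term-name -> GO ID lookup from the [Term] stanzas of an OBO file."""
    name_to_id = {}
    current_id = None
    in_term = False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
        elif in_term and line.startswith("name: ") and current_id:
            name_to_id[line[len("name: "):]] = current_id
    return name_to_id


def three_column_rows(annotations, symbol_to_locus, name_to_id, paper_id):
    """Return 'locus<TAB>GO ID<TAB>TAIR:paper' lines for (symbol, GO term) pairs."""
    rows = []
    for symbol, term in annotations:
        rows.append("\t".join([
            symbol_to_locus[symbol],  # column 1: locus name
            name_to_id[term],         # column 2: GO ID
            f"TAIR:{paper_id}",       # column 3: TAIR paper ID
        ]))
    return rows
```

With GSH1 and GSH2 mapped to AT4G23100 and AT5G27380, and the terms plastid and cytosol looked up in the OBO file, this reproduces the three example lines shown for paper 43065.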


2) 17-column, tab-delimited GAF 2.0

Column  Content               Required  Cardinality   TAIR Entry
1       DB                    Required  1             TAIR
2       DB Object ID          Required  1             Column 1 of gene_aliases file
3       DB Object Symbol      Required  1             Column 2 of gene_aliases file
4       Qualifier             Optional  0 or greater  NULL
5       GO ID                 Required  1             GO:0005654
6       DB Reference          Required  1 or greater  PMID:21074051 or, if no PMID, TAIR:42184
7       Evidence Code         Required  1             IDA
8       With or From          Optional  0 or greater  NULL
9       Aspect                Required  1             C
10      DB Object Name        Optional  0 or 1        Column 3 of gene_aliases file
11      DB Object Synonym     Optional  0 or greater  ASK TANYA
12      DB Object Type        Required  1             protein
13      Taxon(|taxon)         Required  1 or 2        taxon:3702
14      Date                  Required  1             Date annotation is made
15      Assigned By           Required  1             TAIR
16      Annotation Extension  Optional  0 or greater  NULL
17      Gene Product Form ID  Optional  0 or greater  NULL
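
A single GAF line following the table above might be assembled as in the sketch below. It is only an illustration: the function name gaf_row is hypothetical, NULL entries are written as empty fields (an assumption about how the dump represents them), and column 11 is left empty because its source is still unresolved in the table.

```python
from datetime import date


def gaf_row(locus, symbol, go_id, reference, full_name="", annotation_date=None):
    """Return one 17-column, tab-delimited GAF 2.0 line for a CC annotation."""
    when = annotation_date or date.today()
    columns = [
        "TAIR",                   # 1  DB
        locus,                    # 2  DB Object ID (gene_aliases column 1)
        symbol,                   # 3  DB Object Symbol (gene_aliases column 2)
        "",                       # 4  Qualifier (NULL, assumed empty)
        go_id,                    # 5  GO ID
        reference,                # 6  DB Reference (PMID:... or TAIR:...)
        "IDA",                    # 7  Evidence Code
        "",                       # 8  With or From (NULL, assumed empty)
        "C",                      # 9  Aspect
        full_name,                # 10 DB Object Name (gene_aliases column 3)
        "",                       # 11 DB Object Synonym (source unresolved)
        "protein",                # 12 DB Object Type
        "taxon:3702",             # 13 Taxon
        when.strftime("%Y%m%d"),  # 14 Date the annotation is made (YYYYMMDD)
        "TAIR",                   # 15 Assigned By
        "",                       # 16 Annotation Extension (NULL, assumed empty)
        "",                       # 17 Gene Product Form ID (NULL, assumed empty)
    ]
    return "\t".join(columns)
```

Keeping the column order in one list, with the column number beside each entry, makes it easy to check the output against the table.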

Adding New Terms to the Categories

It will immediately be apparent that there are terms missing from the Arabidopsis categories, particularly the component category (see the sample sentence above: there are no plastids in worms!). For now, please keep track of the terms that need to be added in a text file, and we'll give that to the Textpresso team to add for the next mark-up.

For future iterations of the form, I'd like to be able to add the missing term to the second box of the curation form and have the term then automatically be added to the Textpresso component category.

