TAIR CCC
Specifications for Curation Pipeline
Summary
This document is an outline of the Arabidopsis Textpresso for CCC pipeline for the initial trial run.
The trial run will be a search on all papers in the Textpresso for Arabidopsis corpus published in 2008.
After the initial trial run, Tanya realized that the corpus includes papers describing experiments in other organisms that would be unlikely to lead to Arabidopsis experimental annotations. The search should exclude these 1243 papers, if possible; otherwise the results from these papers (i.e., sentences) should be filtered out of the sentence files used on the curation form. I have the list of papers. --K.
Search results will be stored in three files:
1) all sentences returned by the search
2) sentences from papers already curated by TAIR for GO Cellular Component
3) sentences from papers not curated by TAIR for GO Cellular Component
Annotations can be made using an on-line curation form:
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi
with two different outputs:
1) a three-column 'user submission' output
For now, the three-column output is the only output that TAIR requires. See below for specifications. --K.
2) a standard GO Gene Association File (GAF) format
Not needed right now, but should be an option for future implementations of the pipeline. --K.
Search Details
Paper Acquisition
- TAIR corpus averages ~2500 papers/year
- TAIR curator sends PDFs of papers to be included in the corpus to the Textpresso team approximately every six months
- Any additional papers for which TAIR does not have a PDF are downloaded by Textpresso team, if possible
Textpresso Search
Categories
- Search Arabidopsis corpus on:
http://www.textpresso.org/arabidopsis/
Using these four categories and the script getting_TAIR_CCC_data.pl (name?):
1) CCC assay terms
/data2/data-processing/data/arabidopsis/Data/ontology/lexica/localization_experimental_082208
2) CCC TAIR
There is a new cellular component category called CCC TAIR on the Textpresso for Arabidopsis site. This category contains additional plant-specific terms, additional plural forms of cellular component terms, and terms for macromolecular complexes. We will need to re-run the search using this category. The path for the new category is below. --K.
/data2/data-processing/data/arabidopsis/Data/ontology/lexica/CCC_TAIR.0-gram
3) CCC verbs
/data2/data-processing/data/arabidopsis/Data/ontology/lexica/localization_verbs_082208
4) genes (arabidopsis)
/data2/data-processing/data/arabidopsis/Data/ontology/lexica/genes_arabidopsis.0-gram
Juancarlos, can you confirm the path names we used for CCC assay terms and CCC verbs? --K
Filtering by Paper Section
Sectioning (i.e., Introduction, Materials and Methods, Results, etc.) of the Arabidopsis corpus was scheduled to be included with the next update in November 2010. Yuling - status?
The check boxes for filtering by paper section are now available on the Textpresso for Arabidopsis dev site, but it looked like the sentences are still prefaced by MATCH. --K.
Filtering by Year
For the initial round of searches for TAIR, the results will be filtered by year.
Year: 2008
Year information for Arabidopsis papers is found here:
/data2/data-processing/data/arabidopsis/Data/processedfiles/year/
Curation Details
Source Files
Juancarlos, just to confirm, will TAIR curators need the username and password to access the form? --K
Since we need to perform a new search with the latest CCC TAIR category, we could name the resulting sentence files something like:
results_CCC_TAIR_2008_all
results_CCC_TAIR_2008_geneassociation
results_CCC_TAIR_2008_not_geneassociation
The procedure for generating these files is the same as before: the latter two files are produced by filtering on paper IDs in the TAIR gene association file, as described below. --K.
The sentences from the previous search would still be available as three separate files:
1) results_2008_ccc_genesarabidopsis - this file contains all of the sentences returned by the search
2) results_2008_in_geneassociation - this file contains all of the sentences from papers that are already in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.
3) results_2008_not_geneassociation - this file contains all of the sentences from papers that are NOT in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.
The format of the source files is a tab-delimited text file with information presented in each column as follows:
1) Name of file
2) Sentence number in file
3) Sentence identifier in the form: S <sentence score> P <paper id> S s<sentence number in document> E (e.g. S 7 P 43065 S s3 E)
4) gene names as matched in the genes (arabidopsis) category
5) component terms as matched in the CCC TAIR category
6) matching sentence
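The six-column layout above could be read with something like the following sketch. It assumes one record per line, exactly six tab-separated columns, and that column 3 always looks like the sample identifier S 7 P 43065 S s3 E. The names SentenceRecord and parse_source_line are made up for illustration and are not part of the existing pipeline. Columns 4 and 5 are kept as raw strings because their internal delimiter is not specified here.

```python
import re
from typing import NamedTuple

# Assumed shape of column 3, based on the sample "S 7 P 43065 S s3 E".
ID_PATTERN = re.compile(r"S\s+(\S+)\s+P\s+(\S+)\s+S\s+s(\d+)\s+E")


class SentenceRecord(NamedTuple):
    file_name: str
    sentence_in_file: int
    score: float
    paper_id: str
    sentence_in_doc: int
    genes: str        # raw column 4, left unsplit
    components: str   # raw column 5, left unsplit
    sentence: str


def parse_source_line(line: str) -> SentenceRecord:
    """Split one tab-delimited source line into its six documented columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(cols)}")
    match = ID_PATTERN.fullmatch(cols[2].strip())
    if match is None:
        raise ValueError(f"unrecognized identifier column: {cols[2]!r}")
    score, paper_id, sent_no = match.groups()
    return SentenceRecord(
        cols[0], int(cols[1]), float(score), paper_id, int(sent_no),
        cols[3], cols[4], cols[5],
    )
```

Raising on malformed lines, rather than skipping them, makes it easier to spot a change in the file format early.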
The content of each of the files from the first round of searches can be seen here:
Files 2 and 3 above are generated by filtering File 1 using the reference identifiers (column 6) and ontology aspect (column 9) found in the TAIR gene_association file available here:
ftp://ftp.geneontology.org/pub/go/gene-associations/
The file is zipped and named: gene_association.tair.gz
The GO FTP site can be accessed anonymously using the username anonymous and your email address as the password. Command-line FTP clients can use the instruction
ftp ftp.geneontology.org:/pub/go/
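As a rough illustration of how Files 2 and 3 could be derived from File 1, the sketch below collects paper IDs from the gene_association file and partitions the sentence lines. It is not the script actually used. It assumes that references in column 6 appear as TAIR:<paper id> (pipe-separated when there are several) and that this ID is the same one that follows P in column 3 of the sentence file. The function names are hypothetical.

```python
import re


def curated_paper_ids(gaf_lines, aspect="C"):
    """Collect TAIR paper IDs that have at least one annotation with the given aspect."""
    ids = set()
    for line in gaf_lines:
        if line.startswith("!"):          # header / comment lines in a GAF file
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[8] != aspect:   # column 9 = aspect
            continue
        for ref in cols[5].split("|"):           # column 6 = DB reference(s)
            if ref.startswith("TAIR:"):
                ids.add(ref[len("TAIR:"):])
    return ids


# Paper ID as it appears in column 3 of the sentence file, e.g. "S 7 P 43065 S s3 E".
PAPER_ID = re.compile(r"\bP\s+(\S+)\s+S\b")


def partition_sentences(sentence_lines, curated_ids):
    """Return (lines from curated papers, lines from all other papers)."""
    in_ga, not_in_ga = [], []
    for line in sentence_lines:
        cols = line.split("\t")
        match = PAPER_ID.search(cols[2]) if len(cols) > 2 else None
        if match and match.group(1) in curated_ids:
            in_ga.append(line)
        else:
            not_in_ga.append(line)
    return in_ga, not_in_ga
```

The zipped file can be read directly with gzip.open("gene_association.tair.gz", "rt"). Note that lines whose identifier column cannot be parsed end up in the second list, which may or may not be the desired behaviour.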
Curation Form
The curation form for TAIR can be found here: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi
Features of the Curation Form
1) The Title and Abstract for each paper is displayed on the form
This information is found here:
/data2/data-processing/data/arabidopsis/Data/processedfiles/title/
/data2/data-processing/data/arabidopsis/Data/processedfiles/abstract/
Juancarlos, please confirm that this is where this information is found. --K
2) The three boxes on the left side of the form are labeled:
First: Gene/Protein Name
Second: Component Term in Sentence
Third: CC Term in GO
3) There is a color-coded key to category terms above the sentences:
blue = gene product, green = verb, orange = assay term, and red/brown = component term
4) The display of Gene/Protein name includes each individual symbol in the sentence as well as each symbol mapped to a TAIR locus name. The mappings are taken from the gene_aliases file on the TAIR ftp site:
ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027
PAP1:AT2G27190 PAP1:AT3G16500
At the moment, this file is generated at TAIR as needed, but after talking with Tanya we agreed that refreshing the file we use should be done monthly and we could schedule a cronjob to do this on the 5th of each month. --K
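The symbol-to-locus display in item 4 could be produced along these lines. This is a sketch only. It assumes that gene_aliases is tab-delimited with the locus name in column 1 and the symbol in column 2, and that any line whose first column does not start with AT (such as a header) can be skipped. The function names are illustrative.

```python
from collections import defaultdict


def load_symbol_to_loci(lines):
    """Map each gene symbol to the list of locus names it is an alias for."""
    mapping = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip a header or malformed line (assumption: locus names start with AT).
        if len(cols) < 2 or not cols[0].upper().startswith("AT"):
            continue
        locus, symbol = cols[0].strip(), cols[1].strip()
        if symbol and locus not in mapping[symbol]:
            mapping[symbol].append(locus)
    return dict(mapping)


def display_labels(symbol, mapping):
    """Build the SYMBOL:LOCUS labels, one per locus the symbol maps to."""
    return [f"{symbol}:{locus}" for locus in mapping.get(symbol, [])]
```

A symbol such as PAP1 that maps to more than one locus yields one label per locus, so the curator can pick the correct one on the form.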
5) The paper object ID above the matching sentence links out to TAIR's paper object in their curation database. The curator would need to be logged into the database in order for the link to work.
Base URL: http://germany.tairgroup.org:8090/pub/DisplayArticle?article_id=19905
Making an Annotation
To make a new GO annotation, curators need to select, or create, an entry in each of the three boxes on the left-hand side of the form. Once a selection is made, the region will be highlighted in blue. For the last column, CC term in GO, curators can either select one of the suggested GO terms in the list (where is this relationship file located, i.e. its path --K), or enter a new one. Then select curate from the list of radio buttons above the sentence and, if you are ready to enter your annotations, click on Make connections:Submit at the top of the page.
If there is no list of GO terms in the third box, that means that a GO annotation using the term in the second box has not yet been made. In this case, the curator will need to enter the new GO term manually. For the next iteration of the form, we will work on adding an autocomplete and drop-down feature to this step.
Tanya did some test curation before the holidays. Is there data currently saved in the postgres tables? --K.
Marking Sentences Not Used for Curation
If an annotation cannot be made from a sentence, then curators may record the reason that an annotation was not made. Keeping track of these sentences will help build up a training set for improving search results.
The reasons are described below:
Already curated: if a curator does not wish to make another annotation for information previously curated, they can select this radio button. In elegans curation, we are starting to handle these cases a bit differently with some common markers not showing up in the curation boxes, since they arise very frequently.
Scrambled sentence: if, during the pdf-to-text conversion, a sentence has become scrambled, you can mark it as such here. These are becoming less frequent as the conversion improves.
False positive: if a returned sentence has nothing to do with subcellular localization, then it is marked as a false positive. For example:
In contrast, PP2AA3 rescues root tip organization weakly even when expression is driven by the RCN1 promoter, demonstrating a more stringent requirement for A subunit function in the root apical meristem.
Not curatable: This is intended to mark sentences that may describe subcellular localization, but the information contained in them would not normally be curated for GO. For example, the localization is for a mutant protein, or the localization is for the wild-type protein in a mutant background. An example sentence:
No alteration in expression levels of soluble GFP or GFP::RAB-3 was observed in the synapses, cell body or axon in uba-1 animals (Figure 1A: d1-d6, f1-f6, 1C, 1D, Figure S4F, S4G, S4H).
Already Done
On the elegans CCC form, we implemented functionality for showing annotations that had been made previously. These were displayed after the red already done text below each returned sentence. I believe this data was collected specifically by curators selecting a value from each of the three curation boxes and selecting the already curated radio button. The resulting associations would have been stored in a table and, when the protein and component term appeared again in any sentence, the terms were shown in red after the already done text. The idea was that curators could see that the information was there in the sentence, but the protein and component terms would not be included in the potentially curatable list. Is anything like this enabled for the TAIR form? --K.
Dumping Annotation Files
The annotation files can be dumped in two different formats by selecting the desired format from the drop-down menu. For TAIR, the two options are a three-column tab-delimited format or a 17-column gene_association file (GAF 2.0) format.
For now, we only need the 3-column format for TAIR. --K.
Here are examples of what would be contained in the files using the sample sentence below:
SentenceID 9 -- S 7 P 43065 S s3 E The first enzyme, gamma-glutamate cysteine ligase (GSH1), responsible for synthesis of gamma-glutamylcysteine (gamma-EC), is, in Arabidopsis, exclusively located in the plastids, whereas the second enzyme, glutathione synthetase (GSH2), is located in both plastids and cytosol.
1) Three-column, tab-delimited user submission file
Locus Name | GO ID | Paper ID |
Column 1 of ftp mapping file | GO:0009536 | PMID:nnnnnnnn or TAIR:43065 |
AT4G23100 GO:0009536 TAIR:43065
AT5G27380 GO:0009536 TAIR:43065
AT5G27380 GO:0005829 TAIR:43065
For column 1, map the gene symbol (second column) to the locus name (first column) in the gene_aliases file:
ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027
For column 2, use the geneontology.obo file to map the GO term to the GO ID:
http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo
For column 3, use the PMID and the TAIR paper ID, pipe-separated and prefixed with PMID: and TAIR: respectively.
/data2/data-processing/data/arabidopsis/Data/processedfiles/accession/
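The three column mappings above could be combined roughly as follows. This is a sketch under several assumptions: the OBO file is read as plain [Term] stanzas with id: and name: lines, the curator has already resolved each symbol to a single locus on the form, and the paper reference string is supplied by the caller. Both function names are hypothetical.

```python
def load_go_name_to_id(obo_lines):
    """Map GO term names to GO IDs using the id:/name: lines of [Term] stanzas."""
    name_to_id = {}
    current_id = None
    in_term = False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):              # start of a new stanza
            in_term = (line == "[Term]")
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[4:]
        elif in_term and line.startswith("name: ") and current_id:
            name_to_id[line[6:]] = current_id
    return name_to_id


def three_column_rows(annotations, symbol_to_locus, go_name_to_id, paper_ref):
    """Build tab-delimited 'locus, GO ID, paper' rows from (symbol, GO term) pairs."""
    rows = []
    for symbol, go_term in annotations:
        locus = symbol_to_locus[symbol]       # KeyError if the symbol is unmapped
        go_id = go_name_to_id[go_term]        # KeyError if the term is unknown
        rows.append("\t".join((locus, go_id, paper_ref)))
    return rows
```

For the sample sentence, the pairs (GSH1, plastid), (GSH2, plastid) and (GSH2, cytosol) with paper reference TAIR:43065 would give the three rows shown in the example. Letting unmapped symbols or terms raise an error is a deliberate choice here, so that incomplete rows never reach the output file.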
2) 17-column, tab-delimited GAF 2.0 - NOT NEEDED RIGHT NOW --K.
Column | Content | Required | Cardinality | TAIR Entry |
1 | DB | Required | 1 | TAIR |
2 | DB Object ID | Required | 1 | Column 1 of gene_aliases file |
3 | DB Object Symbol | Required | 1 | Column 2 gene_aliases file |
4 | Qualifier | Optional | 0 or greater | NULL |
5 | GO ID | Required | 1 | GO:0005654 |
6 | DB Reference | Required | 1 or greater | PMID:21074051 or, if no PMID, TAIR:42184 |
7 | Evidence Code | Required | 1 | IDA |
8 | With or From | Optional | 0 or greater | NULL |
9 | Aspect | Required | 1 | C |
10 | DB Object Name | Optional | 0 or 1 | Column 3 of gene_aliases file |
11 | DB Object Synonym | Optional | 0 or greater | ASK TANYA |
12 | DB Object Type | Required | 1 | protein |
13 | Taxon | Required | 1 or 2 | taxon:3702 |
14 | Date | Required | 1 | Date annotation is made |
15 | Assigned By | Required | 1 | TAIR |
16 | Annotation Extension | Optional | 0 or greater | NULL |
17 | Gene Product Form ID | Optional | 0 or greater | NULL |
Adding New Terms to the Categories
It will immediately be apparent that there are terms missing from the Arabidopsis categories, particularly component. (See sentence above - no plastids in worms!). For now, please keep track of the terms that need to be added in a text file and we'll give that to the Textpresso team to add for the next mark-up.
For future iterations of the form, I'd like to be able to add the missing term to the second box of the curation form and have the term then automatically be added to the Textpresso component category.
Checking Data in Postgres
The data generated by the form, i.e. sentence categorization and annotations, is stored in a postgres database.
For TAIR, the data is stored in the table: ccc_tair_gene_comp_go
To check data that has been entered, log on to tazendra, start postgres (type psql testdb) and run the following query:
SELECT * FROM ccc_tair_gene_comp_go ORDER BY ccc_timestamp DESC;
Issues - Future Development
Implement Sectioning
Using sectioning will help reduce the number of false positives in the search returns.
Sectioning is available on the Textpresso for Arabidopsis web sites.
The next search can include sectioning - can we implement the feature of allowing users to search on the web site and then send those results to the curation form?
Editing Categories
We will need an interface to edit categories. We could create one ourselves, adapt OBO-Edit, other possibilities?
In the meantime, should TAIR continue to maintain flat files and edit those as needed?
For example, remove the list of ambiguous gene names:
AN AND BP CAN CO ER FED FL FOR IMP LAB LAN LC LD LY MER MIN NOT PLU PMD PRE QC QUI RE SDS SIM SUB SUL TIP
Also, for TAIR, remove these terms from the cellular component (CCC TAIR) category:
lateral chromosome (Keep for now) chromatin (Keep for now) process
Improved Gene Name Category
Case sensitivity of Arabidopsis gene names is an issue. We can either modify the category to try to accommodate all variations, or use the gene name list as a keyword list where the search is not case-sensitive, or something else?
1. If accommodating variations in the gene list, what are the variations? Case-sensitivity is one type of variation, but there are others (see below).
Here is the example that Donghui sent:
DH: We provided you with a gene name list where almost all names are capitalized (except a few cases for example AtCXE5). The current search setting is case sensitive. As a result, some genes in the paper are not picked up. One example here:
SentenceID 495 -- S 11 P 43508 S s107 E GFP fluorescence was mainly detected in the nuclei of root it issue ( Fig 1 C ) and also leaves (data not shown ) , which indicates that the HsfA3 protein is accumulated in nuclei under HS condition .
Name spelling in paper: HsfA3
Spelling in tair gene name list: HSFA3
Kimberly: sentence was returned with HS as the matching gene name
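One way to explore the case-sensitivity problem in this example is a case-insensitive lookup that still treats the ambiguous gene names listed under Editing Categories strictly. The sketch below is an illustration only, not how the Textpresso category search works. It assumes the sentence is already tokenized on whitespace, and the function names are invented.

```python
# Ambiguous names copied from the list under "Editing Categories".
AMBIGUOUS = frozenset(
    "AN AND BP CAN CO ER FED FL FOR IMP LAB LAN LC LD LY MER MIN NOT PLU PMD "
    "PRE QC QUI RE SDS SIM SUB SUL TIP".split()
)


def build_index(gene_names):
    """Index the gene list by upper-cased name for case-insensitive lookup."""
    return {name.upper(): name for name in gene_names}


def match_genes(tokens, index, ambiguous=AMBIGUOUS):
    """Return (token, listed name) pairs, ignoring case except for ambiguous names."""
    hits = []
    for token in tokens:
        key = token.upper()
        if key not in index:
            continue
        # Ambiguous symbols are accepted only when the case matches exactly.
        if key in ambiguous and token != index[key]:
            continue
        hits.append((token, index[key]))
    return hits
```

With a gene list containing HSFA3 and HS, the token HsfA3 is now matched to HSFA3, but HS in "HS condition" is still returned as well, so this approach improves recall without removing that kind of false positive.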
What is the precision/recall using the current gene category?
Add complete list of locus names to the gene list, e.g. at5g35390 or AT5G35390?
Can we strip the 'At' from in front of some of the gene names and also add the remaining text to the gene list, e.g. AtRABA4b?
From Tanya:
...here's the regexp of gene names for which we should NOT strip the AT from the beginning:
[Aa][Tt][1-5CcMm][Gg][0-9]{5}(\.[0-9]{1,2})?
so At1g01010 or At1g01010.1 or At1g01010.10 would be excluded from stripping but AtHsp70 would become Hsp70.
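The stripping rule could be tried out with a short sketch like the one below. The pattern is written in Python syntax with [0-9]{5} for the five locus digits (so that loci such as AT5G27380 are covered) and an optional one- or two-digit version suffix. The function name strip_at_prefix is hypothetical.

```python
import re

# Locus names (optionally with a .N or .NN suffix) must keep their AT prefix.
LOCUS = re.compile(r"[Aa][Tt][1-5CcMm][Gg][0-9]{5}(\.[0-9]{1,2})?")


def strip_at_prefix(name):
    """Drop a leading 'At'/'AT' unless the whole name is a locus identifier."""
    if LOCUS.fullmatch(name):
        return name
    if len(name) > 2 and name[:2].lower() == "at":
        return name[2:]
    return name
```

This strips every non-locus name that begins with the letters "at", which is broader than "some of the gene names". In practice the stripped form would probably be added to the gene list alongside the original rather than replacing it, and the candidates reviewed before use.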
2. Is it possible to use a gene list as a long list of keywords for a search? If the search was neither exact match, nor case sensitive would it help to find more gene name variations? How much would it increase the false positive rate?
Here are some examples I could find (the bold gene names are examples of gene names that are modified in the text and would be missed by the category search):
Doc ID:53141
Although the plasma membrane localization of CBL9n : : GFP further conrms the dominant targeting function of the N-terminal domain of lipid-modied CBL proteins like CBL1 and CBL9 , the observed differences between the localization of the CBL4 , CBL5 and CBL8 proteins and their N-terminal domains suggest a more complex regulation of the localization for the three latter proteins . [Field: results, subscore: 20.00]
A nal deduction , resulting from our localization studies of CBL / CIPK complexes in the presence of the Sar1H74L protein , is that the cellular targeting of the investigated CBL1 / CIPK24 , CBL5 / CIPK24 , CBL8 / CIPK14 , CBL2 / CIPK24 and CBL10 / CIPK24 complexes does not involve COPII-mediated vesicle transport via the Golgi compartment ( Figure 8 ) . [Field: discussion, subscore: 19.00]
Note: In C. elegans, many genes in the published literature are designated with a sequence of letters, a hyphen or dash, and then a number, e.g. rde-4. Textpresso for C. elegans doesn't split on hyphens, to maintain that nomenclature. Further, reporter fusions are expressed using a capitalized version of the gene name (to indicate the corresponding protein) and two colons, e.g. RDE-4::GFP.
In Arabidopsis, gene names are a series of letters and possibly also numbers that are strung together without punctuation (is that generally true?). Is there a standard nomenclature for GFP, YFP, etc. fusion proteins? Are we missing localization statements because we're not splitting on hyphens, e.g. CRN-GFP colocalized with CLV1-mCherry at the plasma membrane in the presence of CLV2 . [Field: references, subscore: 5.00]?
Introduce a Weighting Scheme
Some terms in the verb category, e.g. localized and its variations, are more often associated with curatable sentences than others. What type of weighting scheme could be employed to take advantage of this information?
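One possible scheme, offered only as a starting point for discussion, is to weight each verb by the smoothed fraction of sentences containing it that curators actually marked as curate on the form, and to score a sentence by its best verb. The sketch assumes the form data can be exported as pairs of (verbs in the sentence, whether it was curated). The function names and the choice of smoothing are assumptions.

```python
from collections import Counter


def verb_weights(labelled, smoothing=1.0):
    """Estimate a weight per verb from (verbs_in_sentence, was_curated) pairs.

    The weight is the smoothed fraction of sentences containing the verb
    that were curated: (curated + s) / (seen + 2 * s).
    """
    seen, curated = Counter(), Counter()
    for verbs, was_curated in labelled:
        for verb in set(verbs):           # count each verb once per sentence
            seen[verb] += 1
            if was_curated:
                curated[verb] += 1
    return {
        verb: (curated[verb] + smoothing) / (seen[verb] + 2 * smoothing)
        for verb in seen
    }


def sentence_score(verbs, weights, default=0.5):
    """Score a sentence by its highest-weighted verb (0.0 if it has none)."""
    if not verbs:
        return 0.0
    return max(weights.get(verb, default) for verb in verbs)
```

The smoothing keeps rarely seen verbs near 0.5 until enough curation decisions accumulate. Taking the maximum rather than the sum avoids favouring long sentences, although other combinations may work better.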
Curation Form Changes, New Features
- Change display so that only gene symbols with matching locus IDs are displayed in column 1
- Select more than one gene symbol at a time if all genes selected will get the same annotation
- Manually add a term to column 2, if needed. Added terms go to category in Textpresso.
- Autocomplete for GO terms in column 3 (like the OA)
- Is it possible to query the comments?
- Add a timestamp to each entry of the output file
- Add a javascript feature to navigate through directories of sentences
Hyphenated Localization Terms
Hyphenated terms that describe localization, e.g. nuclear-localized or nucleolus-associated, are not marked up in the sentences because terms in sentences are not split based upon hyphens. Some possible solutions are: 1) Construct and add these types of terms to the lexicon, 2) Post-process the source files to include matching words in hyphenated terms for possible curation, 3) Adapt the mark-up protocol for organisms whose gene names are not typically designated with a hyphen, like Arabidopsis or budding yeast.
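Option 2 (post-processing) might look something like the sketch below, which splits hyphenated tokens and checks each part against a lexicon. It assumes whitespace tokenization and a lexicon supplied as a plain collection of terms. The function name is invented, and the real category files may need different handling.

```python
def hyphen_part_matches(sentence, lexicon):
    """Return (hyphenated token, matching part) pairs for parts found in the lexicon."""
    lowered = {term.lower() for term in lexicon}
    matches = []
    for token in sentence.split():
        token = token.strip(".,;:()[]")   # drop surrounding punctuation
        if "-" not in token:
            continue
        for part in token.split("-"):
            if part.lower() in lowered:
                matches.append((token, part))
    return matches
```

The same approach would pick up CRN from a token such as CRN-GFP if CRN is in the gene list, which may help with the reporter fusion question raised under Improved Gene Name Category.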