
Specifications for Curation Pipeline


This document is an outline of the Arabidopsis Textpresso for CCC pipeline for the initial trial run.

The trial run will be a search on all papers in the Textpresso for Arabidopsis corpus published in 2008.

After the initial trial run, Tanya realized that the corpus includes papers describing experiments in other organisms, which are unlikely to lead to Arabidopsis experimental annotations. The search should exclude these 1243 papers if possible; otherwise, the results from these papers (i.e., sentences) should be filtered out of the sentence files used on the curation form. I have the list of papers. --K.

Search results will be stored in three files:

1) all sentences returned by the search

2) sentences from papers already curated by TAIR for GO Cellular Component

3) sentences from papers not curated by TAIR for GO Cellular Component

Annotations can be made using an on-line curation form:


with two different outputs:

1) a three-column 'user submission' output

For now, the three-column output is the only output that TAIR requires. See below for specifications. --K.

2) a standard GO Gene Association File (GAF) format

Not needed right now, but should be an option for future implementations of the pipeline. --K.

Search Details

Paper Acquisition

  • TAIR corpus averages ~2500 papers/year
  • TAIR curator sends PDFs of papers to be included in the corpus to the Textpresso team approximately every six months
  • Any additional papers for which TAIR does not have a PDF are downloaded by Textpresso team, if possible

Textpresso Search

  • Search Arabidopsis corpus on:


Using these four categories and the script getting_TAIR_CCC_data.pl (name?):

1) CCC assay terms



There is a new cellular component category called CCC TAIR on the Textpresso for Arabidopsis site. This category contains additional plant-specific terms, additional plural forms of cellular component terms, and terms for macromolecular complexes. We will need to re-run the search using this category. The path for the new category is below. --K.


3) CCC verbs


4) genes (arabidopsis)


Juancarlos, can you confirm the path names we used for CCC assay terms and CCC verbs? --K

Filtering by Paper Section

Sectioning (i.e., Introduction, Materials and Methods, Results, etc.) of the Arabidopsis corpus was scheduled to be included with the next update in November 2010. Yuling - status?

The check boxes for filtering by paper section are now available on the Textpresso for Arabidopsis dev site, but it looked like the sentences are still prefaced by MATCH. --K.

Filtering by Year

For the initial round of searches for TAIR, the results will be filtered by year.

Year: 2008

Year information for Arabidopsis papers is found here:


Curation Details

Source Files

Juancarlos, just to confirm, will TAIR curators need the username and password to access the form? --K

Since we need to perform a new search with the latest CCC TAIR category, we could name the resulting sentence files something like:




The procedure for generating these files is the same as before: the latter two files are produced by filtering on paper IDs in the TAIR gene association file, as described below. --K.

The sentences from the previous search would still be available as three separate files:

1) results_2008_ccc_genesarabidopsis - this file contains all of the sentences returned by the search

2) results_2008_in_geneassociation - this file contains all of the sentences from papers that are already in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.

3) results_2008_not_geneassociation - this file contains all of the sentences from papers that are NOT in the TAIR gene_association file with a corresponding Cellular Component (C) annotation.

Each source file is a tab-delimited text file with the following columns:

1) Name of file

2) Sentence number in file

3) Sentence identifier of the form: S <sentence score> P <paper id> S s<sentence number in document> E (for example, S 7 P 43065 S s3 E)

4) gene names as matched in the genes (arabidopsis) category

5) component terms as matched in the CCC TAIR category

6) matching sentence
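
As a rough illustration of the column layout above, a source-file line could be parsed along these lines. This is only a sketch: the function and field names are made up for the example, the pattern for column 3 is inferred from the sample identifier S 7 P 43065 S s3 E, and the comma separator inside the gene and component columns is an assumption that should be checked against the real files.

```python
import re
from typing import NamedTuple

# Assumed layout of column 3, inferred from the sample "S 7 P 43065 S s3 E".
ID_PATTERN = re.compile(r"^S\s+(\S+)\s+P\s+(\S+)\s+S\s+s(\d+)\s+E$")


class SentenceRecord(NamedTuple):
    file_name: str
    sentence_id: int
    score: float
    paper_id: str
    sentence_in_doc: int
    genes: list
    components: list
    sentence: str


def parse_source_line(line: str, sep: str = ", ") -> SentenceRecord:
    """Split one tab-delimited source line into its six documented columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 6:
        raise ValueError(f"expected 6 columns, got {len(cols)}")
    match = ID_PATTERN.match(cols[2].strip())
    if match is None:
        raise ValueError(f"unrecognized identifier: {cols[2]!r}")
    score, paper_id, sent_no = match.groups()
    return SentenceRecord(
        file_name=cols[0],
        sentence_id=int(cols[1]),
        score=float(score),
        paper_id=paper_id,
        sentence_in_doc=int(sent_no),
        # sep is a guess at how multiple matches are separated within a column
        genes=[g for g in cols[3].split(sep) if g],
        components=[c for c in cols[4].split(sep) if c],
        sentence=cols[5],
    )
```

Keeping the paper ID as a string avoids losing information if the identifiers ever contain non-numeric characters.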

Note that the source files have been moved to /home/acedb/kimberly/ccc_tair/tair_ccc_datafiles/

The source files are grouped by the date they were generated and are collected in subdirectories labeled results_nnnn, with nnnn corresponding to the month and day the files were generated.

The curation form was modified to look in the subdirectories of tair_ccc_datafiles for sentence source files.

The content of each of the files from the first round of searches can still be seen here:




Files 2 and 3 above are generated by sorting the sentences in File 1 according to the reference identifiers (column 6) and ontology aspect (column 9) found in the TAIR gene_association file available here:


The file is zipped and named: gene_association.tair.gz

The GO FTP site can be accessed anonymously using the username anonymous and your email address as the password. Command-line FTP clients can use the instruction

ftp ftp.geneontology.org:/pub/go/
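
The split of File 1 into Files 2 and 3 could be sketched roughly as follows. This is an illustration, not the script actually used: the function names are invented, and ref_for_paper is a hypothetical helper that converts a Textpresso paper ID into whatever reference string the gene_association file uses, since that mapping is not specified on this page.

```python
import gzip


def curated_cc_papers(gaf_path):
    """Collect reference IDs (column 6) of rows whose aspect (column 9) is C."""
    papers = set()
    opener = gzip.open if gaf_path.endswith(".gz") else open
    with opener(gaf_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!"):  # header/comment lines in the GAF
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 9 or cols[8] != "C":
                continue
            # Column 6 may hold several pipe-separated references.
            papers.update(ref for ref in cols[5].split("|") if ref)
    return papers


def partition_sentences(records, curated_refs, ref_for_paper):
    """Split (paper_id, sentence) records into curated / not-curated lists.

    ref_for_paper is a hypothetical callable that turns a Textpresso paper
    ID into the reference string used in the gene_association file.
    """
    in_ga, not_in_ga = [], []
    for paper_id, sentence in records:
        target = in_ga if ref_for_paper(paper_id) in curated_refs else not_in_ga
        target.append((paper_id, sentence))
    return in_ga, not_in_ga
```

The same partitioning step could also be used to drop sentences from a list of excluded papers before the files are written.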

Curation Form

The curation form for TAIR can be found here: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

Features of the Curation Form

1) The Title and Abstract for each paper are displayed on the form

This information is found here:

/data2/data-processing/data/arabidopsis/Data/processedfiles/title/

/data2/data-processing/data/arabidopsis/Data/processedfiles/abstract/

Juancarlos, please confirm that this is where this information is found. --K

2) The three boxes on the left side of the form are labeled:

First: Gene/Protein Name

Second: Component Term in Sentence

Third: CC Term in GO

3) There is a color-coded key to category terms above the sentences:

blue = gene product, green = verb, orange = assay term, and red/brown = component term

4) The display of Gene/Protein name includes each individual symbol in the sentence as well as each symbol mapped to a TAIR locus name. The mappings are taken from the gene_aliases file on the TAIR ftp site:


and stored on tazendra at: /home/acedb/kimberly/ccc_tair/tair_ccc_datafiles

PAP1:AT2G27190 PAP1:AT3G16500

At the moment, this file is generated at TAIR as needed, but after talking with Tanya we agreed that refreshing the file we use should be done monthly and we could schedule a cronjob to do this on the 5th of each month. --K
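
For illustration, the symbol-to-locus display described in item 4 could be built from the gene_aliases file roughly like this. The sketch assumes a tab-delimited file with the locus name in column 1 and the symbol in column 2 (as described in the output specifications on this page) and no header row; the function names are invented for the example.

```python
from collections import defaultdict


def load_symbol_to_loci(lines):
    """Map each gene symbol (column 2) to the locus names (column 1) it aliases."""
    symbol_to_loci = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2 or not cols[0].strip() or not cols[1].strip():
            continue  # skip blank or malformed rows
        locus, symbol = cols[0].strip(), cols[1].strip()
        if locus not in symbol_to_loci[symbol]:
            symbol_to_loci[symbol].append(locus)
    return dict(symbol_to_loci)


def display_entries(symbol, symbol_to_loci):
    """Return the symbol:locus strings to show in the Gene/Protein Name box."""
    return [f"{symbol}:{locus}" for locus in symbol_to_loci.get(symbol, [])]
```

A symbol that maps to more than one locus, such as PAP1 above, yields one entry per locus, and a symbol with no mapping yields an empty list.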

5) The paper object ID above the matching sentence links out to TAIR's paper object in their curation database. The curator would need to be logged into the database in order for the link to work.

Base URL: http://germany.tairgroup.org:8090/pub/DisplayArticle?article_id=19905

Making an Annotation

To make a new GO annotation, curators need to select, or create, an entry in each of the three boxes on the left-hand side of the form. Once a selection is made, the region will be highlighted in blue. For the last column, CC term in GO, curators can either select one of the suggested GO terms in the list (where is this relationship file located, i.e. its path --K), or enter a new one. Then select curate from the list of radio buttons above the sentence and, if you are ready to enter your annotations, click on Make connections:Submit at the top of the page.

If there is no list of GO terms in the third box, that means that a GO annotation using the term in the second box has not yet been made. In this case, the curator will need to enter the new GO term manually. For the next iteration of the form, we will work on adding an autocomplete and drop-down feature to this step.

Tanya did some test curation before the holidays. Is there data currently saved in the postgres tables? --K.

Marking Sentences Not Used for Curation

If an annotation cannot be made from a sentence, then curators may record the reason that an annotation was not made. Keeping track of these sentences will help build up a training set for improving search results.

The reasons are described below:

Already curated: if a curator does not wish to make another annotation for information previously curated, they can select this radio button. As with making an annotation, the curator will need to select an entry in each of the boxes so that we can store the already curated information in the database for future look-up. In elegans curation, we are starting to handle these cases a bit differently with some common markers not showing up in the curation boxes, since they arise very frequently.

Scrambled sentence: if, during the pdf-to-text conversion, a sentence has become scrambled, you can mark it as such here. These are becoming less frequent as the conversion improves.

False positive: if a returned sentence has nothing to do with subcellular localization, then it is marked as a false positive. For example:

In contrast, PP2AA3 rescues root tip organization weakly even when expression is driven by the RCN1 promoter, demonstrating a more stringent requirement for A subunit function in the root apical meristem.

Not curatable: This is intended to mark sentences that may describe subcellular localization, but the information contained in them would not normally be curated for GO. For example, the localization is for a mutant protein, or the localization is for the wild-type protein in a mutant background. An example sentence:

No alteration in expression levels of soluble GFP or GFP::RAB-3 was observed in the synapses, cell body or axon in uba-1 animals (Figure 1A: d1-d6, f1-f6, 1C, 1D, Figure S4F, S4G, S4H).

Already Done

On the elegans CCC form, we implemented functionality for showing annotations that had been made previously. These were displayed after the red already done text below each returned sentence. I believe this data was collected specifically by curators selecting a value from each of the three curation boxes and selecting the already curated radio button. The resulting associations would have been stored in a table, and when the protein and component term appear again in any sentence, the terms are shown in red after the already done text. The idea was that curators could see that the information was there in the sentence, but the protein and component terms would not be included in the potentially curatable list. Is anything like this enabled for the TAIR form? --K.

Dumping Annotation Files

The annotation files can be dumped in two different formats by selecting the desired format from the drop-down menu. For TAIR, the two options are a three-column, tab-delimited format or a 17-column gene_association file (GAF 2.0) format.

For now, we only need the 3-column format for TAIR. --K.

Here are examples of what would be contained in the files using the sample sentence below:

SentenceID 9 -- S 7 P 43065 S s3 E The first enzyme, gamma-glutamate cysteine ligase (GSH1), responsible for synthesis of gamma-glutamylcysteine (gamma-EC), is, in Arabidopsis, exclusively located in the plastids, whereas the second enzyme, glutathione synthetase (GSH2), is located in both plastids and cytosol.

1) Three-column, tab-delimited user submission file

Locus Name                      GO ID         Paper ID
Column 1 of ftp mapping file    GO:0009536    PMID:nnnnnnnn or TAIR:43065

AT4G23100                       GO:0009536    TAIR:43065
AT5G27380                       GO:0009536    TAIR:43065
AT5G27380                       GO:0005829    TAIR:43065

For column 1, map the gene symbol (second column) to the locus name (first column) in the gene_aliases file:


For column 2, use the geneontology.obo file to map the GO term to the GO ID:


For column 3, use the PMID and TAIR paper ID, pipe separated, prefaced PMID: and TAIR: respectively.
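
The three mapping rules above could be combined along these lines. This is a sketch under assumptions: symbol_to_loci and term_to_go_id are hypothetical dictionaries that would be built beforehand from the gene_aliases and geneontology.obo files, and the function names are not taken from the existing form code.

```python
def paper_reference(tair_id=None, pmid=None):
    """Build column 3: PMID and TAIR paper ID, pipe separated."""
    parts = []
    if pmid:
        parts.append(f"PMID:{pmid}")
    if tair_id:
        parts.append(f"TAIR:{tair_id}")
    if not parts:
        raise ValueError("at least one paper identifier is required")
    return "|".join(parts)


def three_column_rows(symbol, go_term, symbol_to_loci, term_to_go_id,
                      tair_id=None, pmid=None):
    """Return one tab-delimited line per locus mapped to the gene symbol."""
    go_id = term_to_go_id[go_term]  # column 2: GO term -> GO ID
    ref = paper_reference(tair_id, pmid)  # column 3
    # column 1: every locus name that the symbol maps to
    return [f"{locus}\t{go_id}\t{ref}" for locus in symbol_to_loci[symbol]]
```

A symbol that maps to several loci produces several rows, so a curator may still need to choose the correct locus before the file is submitted.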


2) 17-column, tab-delimited GAF 2.0 - NOT NEEDED RIGHT NOW --K.

Column  Content                Required  Cardinality   TAIR Entry
1       DB                     Required  1             TAIR
2       DB Object ID           Required  1             Column 1 of gene_aliases file
3       DB Object Symbol       Required  1             Column 2 of gene_aliases file
4       Qualifier              Optional  0 or greater  NULL
5       GO ID                  Required  1             GO:0005654
6       DB Reference           Required  1 or greater  PMID:21074051 or, if no PMID, TAIR:42184
7       Evidence Code          Required  1             IDA
8       With or From           Optional  0 or greater  NULL
9       Aspect                 Required  1             C
10      DB Object Name         Optional  0 or 1        Column 3 of gene_aliases file
11      DB Object Synonym      Optional  0 or greater  ASK TANYA
12      DB Object Type         Required  1             protein
13      Taxon(|taxon)          Required  1 or 2        taxon:3702
14      Date                   Required  1             Date annotation is made
15      Assigned By            Required  1             TAIR
16      Annotation Extension   Optional  0 or greater  NULL
17      Gene Product Form ID   Optional  0 or greater  NULL
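
If the GAF output is implemented later, one row could be assembled from the table above roughly as follows. This is only a sketch: the function name is invented, NULL entries are written as empty fields, the date is written as YYYYMMDD, and the synonym column is left to the caller because its content is still undecided.

```python
from datetime import date


def gaf_line(locus, symbol, go_id, reference, full_name="", synonyms=(),
             evidence="IDA", annotation_date=None):
    """Assemble one 17-column GAF 2.0 row using the TAIR entries in the table."""
    when = annotation_date or date.today()
    columns = [
        "TAIR",                   # 1  DB
        locus,                    # 2  DB Object ID
        symbol,                   # 3  DB Object Symbol
        "",                       # 4  Qualifier (NULL)
        go_id,                    # 5  GO ID
        reference,                # 6  DB Reference
        evidence,                 # 7  Evidence Code
        "",                       # 8  With or From (NULL)
        "C",                      # 9  Aspect
        full_name,                # 10 DB Object Name
        "|".join(synonyms),       # 11 DB Object Synonym (content undecided)
        "protein",                # 12 DB Object Type
        "taxon:3702",             # 13 Taxon
        when.strftime("%Y%m%d"),  # 14 Date
        "TAIR",                   # 15 Assigned By
        "",                       # 16 Annotation Extension (NULL)
        "",                       # 17 Gene Product Form ID (NULL)
    ]
    return "\t".join(columns)
```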




Adding New Terms to the Categories

It will immediately be apparent that there are terms missing from the Arabidopsis categories, particularly component. (See sentence above - no plastids in worms!). For now, please keep track of the terms that need to be added in a text file and we'll give that to the Textpresso team to add for the next mark-up.

For future iterations of the form, I'd like to be able to add the missing term to the second box of the curation form and have the term then automatically be added to the Textpresso component category.

Checking Data in Postgres

The data generated by the form, i.e. sentence categorization and annotations, is stored in a postgres database.

For TAIR, the data is stored in the table: ccc_tair_gene_comp_go

To check data that has been entered, log on to tazendra, start postgres (type psql testdb), and run the following query:

SELECT * FROM ccc_tair_gene_comp_go ORDER BY ccc_timestamp DESC;

Issues - Future Development

Implement Sectioning

Using sectioning will help reduce the number of false positives in the search returns.

Sectioning is available on the Textpresso for Arabidopsis web sites.

The next search can include sectioning - can we implement the feature of allowing users to search on the web site and then send those results to the curation form?

Editing Categories

We will need an interface to edit categories. We could create one ourselves, adapt OBO-Edit, other possibilities?

In the meantime, should TAIR continue to maintain flat files and edit those as needed?

For example, remove the list of ambiguous gene names:


Also, for TAIR, remove these terms from the cellular component (CCC TAIR) category:

 chromosome (Keep for now)
 chromatin (Keep for now)

Improved Gene Name Category

Case sensitivity of Arabidopsis gene names is an issue. We can either modify the category to try to accommodate all variations, or use the gene name list as a keyword list where the search is not case-sensitive, or something else?

1. If accommodating variations in the gene list, what are the variations? Case-sensitivity is one type of variation, but there are others (see below).

Here is the example that Donghui sent:

 DH: We provided you with a gene name list where almost all names are
 capitalized (except a few cases for example AtCXE5).  The current
 search setting is case sensitive.  As a result, some genes in the
 paper are not picked up.  One example here:
 SentenceID 495 -- S 11 P 43508 S s107 E
 GFP fluorescence was mainly detected in the nuclei of root it issue
 ( Fig 1 C ) and also leaves (data not shown ) , which indicates that
 the HsfA3 protein is accumulated in nuclei under HS condition .
 Name spelling in paper: HsfA3
 Spelling in tair gene name list: HSFA3
 Kimberly: sentence was returned with HS as the matching gene name
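
To make the difference concrete, here is a small sketch of case-sensitive versus case-insensitive lookup against a gene name list. It is not how Textpresso performs its category matching; the tokenization on letters and digits and the function name are assumptions made for the example.

```python
import re


def match_gene_names(sentence, gene_names, case_sensitive=True):
    """Return gene names from the list that occur as whole tokens in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    if case_sensitive:
        lookup = {name: name for name in gene_names}
        keys = tokens
    else:
        # Compare upper-cased forms so HsfA3 in the text finds HSFA3 in the list.
        lookup = {name.upper(): name for name in gene_names}
        keys = [token.upper() for token in tokens]
    found = []
    for key in keys:
        name = lookup.get(key)
        if name is not None and name not in found:
            found.append(name)
    return found
```

With a list containing HSFA3 and HS, the case-sensitive lookup on the sentence above returns only HS, while the case-insensitive lookup also returns HSFA3. The cost is that short names would then also match ordinary lower-case words, which is one source of the false positives asked about below.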

What is the precision/recall using the current gene category?

Add complete list of locus names to the gene list, e.g. at5g35390 or AT5G35390?

Can we strip the 'At' from in front of some of the gene names and also add the remaining text to the gene list, e.g. AtRABA4b?

From Tanya:

...here's the regexp of gene names for which we should NOT strip the AT from the beginning:


so At1g01010 or At1g01010.1 or At1g01010.10 would be excluded from stripping but AtHsp70 would become Hsp70.
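
The stripping rule could be sketched as follows. Note that LOCUS_ID below is a hypothetical stand-in written only to reproduce the behaviour of the examples just given; it is not Tanya's regexp. The extra condition that the character after At must be upper case is also an assumption, added so that ordinary words are left alone.

```python
import re

# Hypothetical stand-in for the exclusion pattern: locus identifiers such as
# At1g01010, optionally followed by a gene-model suffix such as .1 or .10.
LOCUS_ID = re.compile(r"^AT[1-5CM]G\d{5}(\.\d+)?$", re.IGNORECASE)


def strip_at_prefix(name):
    """Return the name without a leading 'At', or None if nothing should be stripped."""
    if LOCUS_ID.match(name):
        return None  # locus identifiers keep their AT prefix
    if name.startswith("At") and len(name) > 2 and name[2].isupper():
        return name[2:]
    return None
```

Names for which a stripped form is returned could then be added to the gene list alongside the original spelling.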

2. Is it possible to use a gene list as a long list of keywords for a search? If the search was neither exact match, nor case sensitive would it help to find more gene name variations? How much would it increase the false positive rate?

Here are some examples I could find (the bold gene names are examples of gene names that are modified in the text and would be missed by the category search):

Doc ID:53141

Although the plasma membrane localization of CBL9n : : GFP further conrms the dominant targeting function of the N-terminal domain of lipid-modied CBL proteins like CBL1 and CBL9 , the observed differences between the localization of the CBL4 , CBL5 and CBL8 proteins and their N-terminal domains suggest a more complex regulation of the localization for the three latter proteins . [Field: results, subscore: 20.00]

A nal deduction , resulting from our localization studies of CBL / CIPK complexes in the presence of the Sar1H74L protein , is that the cellular targeting of the investigated CBL1 / CIPK24 , CBL5 / CIPK24 , CBL8 / CIPK14 , CBL2 / CIPK24 and CBL10 / CIPK24 complexes does not involve COPII-mediated vesicle transport via the Golgi compartment ( Figure 8 ) . [Field: discussion, subscore: 19.00]

Note: In C. elegans, many genes in the published literature are designated with a sequence of letters, a hyphen or dash, and then a number, e.g. rde-4. Textpresso for C. elegans doesn't split on hyphens, to maintain that nomenclature. Further, reporter fusions are expressed using a capitalized version of the gene name (to indicate the corresponding protein) and two colons, e.g. RDE-4::GFP.

In Arabidopsis, gene names are a series of letters and possibly also numbers that are strung together without punctuation (is that generally true?). Is there a standard nomenclature for GFP, YFP, etc. fusion proteins? Are we missing localization statements because we're not splitting on hyphens, e.g. CRN-GFP colocalized with CLV1-mCherry at the plasma membrane in the presence of CLV2 . [Field: references, subscore: 5.00]?

Introduce a Weighting Scheme

Some terms in the verb category, e.g. localized and its variations, are more often associated with curatable sentences than others. What type of weighting scheme could be employed to take advantage of this information?
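
One simple possibility, sketched here purely for discussion, is to give each verb a weight and rank sentences by the sum of the weights of the verbs they contain. The weights below are invented placeholders; in practice they might be estimated from how often sentences containing each verb were marked curate versus false positive on the form.

```python
# Hypothetical weights: higher values for verbs that more often mark
# curatable sentences. These numbers are placeholders, not measured values.
VERB_WEIGHTS = {"localized": 3.0, "localizes": 3.0, "detected": 1.5, "expressed": 1.0}


def weighted_score(matched_verbs, weights=VERB_WEIGHTS, default=0.5):
    """Sum the weights of the verb-category terms matched in one sentence."""
    return sum(weights.get(verb.lower(), default) for verb in matched_verbs)


def rank_sentences(sentences):
    """Sort (sentence, matched_verbs) pairs from highest to lowest score."""
    return sorted(sentences, key=lambda item: weighted_score(item[1]), reverse=True)
```

Verbs without an explicit weight fall back to a small default, so they still contribute to the score without dominating the ranking.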

Curation Form Changes, New Features

General Organization

  • Remove the SQL query from the top of the page. Could include on the bottom, but it's perhaps not necessary for curators to see this.
  • Move the Source File drop down to the top of the page, since this is the first action curators will perform.
  • For easier visualization, organize the form into more clearly delineated sections.
  • Put all action buttons in one place.


  • Change display so that only gene symbols with matching locus IDs are displayed in column 1 - FIXED for TAIR 2011-06-22
 Hold down control to select individual ones; hold down shift to select all.
  • Select more than one gene symbol at a time if all genes selected will get the same annotation - ADDED for TAIR 2011-06-22
  • Manually add a term to column 2, if needed. - ADDED for TAIR 2011-06-22
  • Re-format list of curation options so they are one-per-line for easier selection - ADDED for TAIR 2011-06-22
  • Implement functionality to have newly added terms in column 2 incorporated into Textpresso category - STILL to DO
 Note: ccc_component_go_index is the table that holds the index information.  Use information in this table to automatically update the Textpresso category?
  • Autocomplete for GO terms in column 3 (like the OA) - STILL to DO
  • Add a check upon submission for missing data, e.g. if no GO term is entered or selected, user should get an error message - STILL to DO
  • Update already curated information in real-time; this would allow indexing information between a component term in a sentence and a GO term used for annotation to be incorporated in the curation pipeline immediately - STILL to DO
 Note: we would need to change the way we're storing information, i.e. need to index data for each sentence so we could compare the indices for each sentence.
  • String match from terms in column 2 to GO terms and their synonyms in Column 3 for additional auto-suggest - STILL to DO
  • Add submit button next to each sentence - STILL to DO
 Note: Currently the submit button functions on a per-paper basis.
  • Add ability to make NOT annotations.

Source Files

  • Add a javascript feature to navigate through directories of source file sentences - STILL to DO

Comments Box

  • Is it possible to query the comments? - STILL to DO
  • Can the comments be added on a per sentence basis? Perhaps a box could open up when needed, as opposed to having a box attached to each sentence.
Note: Will need to add source file information as a column to the ccc_comments table 

Annotation File

  • Add a timestamp to each entry of the output file - ADDED for TAIR 2011-06-22, MODIFIED to EXCLUDE TIME 2011-06-05

Hyphenated Localization Terms

Hyphenated terms that describe localization, e.g. nuclear-localized or nucleolus-associated, are not marked up in the sentences because terms in sentences are not split on hyphens. Some possible solutions are:

1) Construct and add these types of terms to the lexicon

2) Post-process the source files to include matching words in hyphenated terms for possible curation

3) Adapt the mark up protocol for organisms whose gene names are not typically designated with a hyphen, like Arabidopsis or budding yeast
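
Option 2 could look something like the sketch below, which scans a sentence for hyphenated tokens and reports any part that matches a term in a lexicon. The function name and the lexicon passed in are hypothetical; the real category term lists would be used in practice.

```python
import re


def hyphen_part_matches(sentence, lexicon):
    """Find lexicon terms hidden inside hyphenated tokens.

    Returns (token, matching_part) pairs, e.g. ("nuclear-localized", "nuclear").
    """
    lowered = {term.lower() for term in lexicon}
    hits = []
    # A hyphenated token is two or more word pieces joined by hyphens.
    for token in re.findall(r"\w+(?:-\w+)+", sentence):
        for part in token.split("-"):
            if part.lower() in lowered:
                hits.append((token, part))
    return hits
```

The same pass would also pick up gene names in fusion constructs such as CRN-GFP, provided the gene list is included in the lexicon.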
