WormBase


1 Search frequency
2 Search according to SVM classification
3 Search Categories
4 Search Filters
5 File Format - File Name
6 File Format - Sentence Format
7 Sentence File Location
8 Mapping Files Location
9 Updates to form code
10 Form Documentation
11 Feedback on Form - C. elegans
12 Feedback on Form - dictyBase
13 Feedback on Form - TAIR
14 Archived Info - Old script explanation - Juancarlos
15 Archived Info - Old Curation Form

Search frequency

The search is run each time the WormBase (WB) SVM classification is performed.

Search according to SVM classification

Search all predicted positives (low, medium, and high) from the Other_expr pattern SVM.

Search Categories

Four categories:
- localization_cell_components_2011-02-11
- protein_celegans
- localization_verbs_082208
- localization_experimental_082208

Search Filters

Remove all sentences with a Textpresso sentence score of 30 or higher.
Other filtering steps may be introduced in the future, e.g., filtering on specific proteins such as DAF-16, or on sentences that also contain words such as mutant or RNAi.
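A minimal sketch of the score filter follows. It assumes that the SSC:N prefix at the start of each sentence line (see File Format - Sentence Format) is the Textpresso sentence score; the page does not state this explicitly. The function names are illustrative and are not taken from the ccc scripts.

```python
import re

MAX_SCORE = 30  # sentences scoring 30 or higher are removed

# Assumption: every sentence line starts with "SSC:<integer>" and that
# integer is the Textpresso sentence score.
SCORE_RE = re.compile(r"^SSC:(\d+)\b")


def keep_sentence(line, max_score=MAX_SCORE):
    """Return True if the line carries a parseable score below max_score."""
    match = SCORE_RE.match(line)
    if match is None:
        return False  # no recognizable score: drop the line rather than guess
    return int(match.group(1)) < max_score


def filter_sentences(lines):
    """Keep only the sentence lines that pass the score filter."""
    return [line for line in lines if keep_sentence(line)]
```

Dropping lines without a recognizable score is a design choice of this sketch; the real script may treat them differently.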

File Format - File Name

Format: [date script was run]_[MOD]_[type of Textpresso search], with the date written as YYYYMMDD.

For example: 20130801_WB_ccc
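The naming convention can be sketched as a pair of helper functions. The YYYYMMDD date layout is inferred from the example name above, and the function names are hypothetical.

```python
from datetime import date, datetime


def build_file_name(run_date, mod, search_type):
    """Join run date, MOD and search type as <YYYYMMDD>_<MOD>_<type>."""
    return f"{run_date:%Y%m%d}_{mod}_{search_type}"


def parse_file_name(name):
    """Split a file name back into (run date, MOD, search type)."""
    # maxsplit=2 keeps any underscores inside the search type intact.
    date_part, mod, search_type = name.split("_", 2)
    run_date = datetime.strptime(date_part, "%Y%m%d").date()
    return run_date, mod, search_type


name = build_file_name(date(2013, 8, 1), "WB", "ccc")
```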

File Format - Sentence Format

SSC:7 PMID:23263989:references:276 ZFP-1 chromosomes 771 FIG 4 <protein_celegans>ZFP-1</protein_celegans> : : <localization_experimental_082208>GFP</localization_experimental_082208> <localization_verbs_082208>localizes</localization_verbs_082208> to <localization_cell_components_2011-02-11>chromosomes</localization_cell_components_2011-02-11> in maturing oocytes and is <localization_experimental_082208>widely</localization_experimental_082208> <localization_verbs_082208>expressed</localization_verbs_082208> 772 at all developmental <localization_experimental_082208>stages</localization_experimental_082208> .
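A sketch of how a line in this format might be parsed. It reads the SSC value and the PMID:paper:section:number triple from the start of the line and collects the tagged terms by category. Treating the last number as the sentence number is an assumption, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Leading fields: "SSC:<n> PMID:<id>:<section>:<n>".
HEADER_RE = re.compile(r"^SSC:(\d+)\s+PMID:(\d+):([^:\s]+):(\d+)\s")
# Innermost tags only: <category>text</category> with no tags inside the text.
TAG_RE = re.compile(r"<([^<>/]+)>([^<>]*)</\1>")


def parse_sentence_line(line):
    """Return the header fields and the tagged terms grouped by category."""
    header = HEADER_RE.match(line)
    if header is None:
        raise ValueError("expected 'SSC:<n> PMID:<id>:<section>:<n>' at line start")
    terms = defaultdict(list)
    for category, text in TAG_RE.findall(line):
        terms[category].append(text.strip())
    return {
        "ssc": int(header.group(1)),
        "pmid": header.group(2),
        "section": header.group(3),
        "sentence_number": int(header.group(4)),  # assumed meaning
        "terms": dict(terms),
    }
```

Because the pattern matches innermost tags only, an outer category that wraps another tag (as in the CCC_TAIR example further down this page) is not listed in the result.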

Sentence File Location

On textpresso-dev:
- http://textpresso-dev.caltech.edu/ccc_results/celegans/

On mangolassi:

/home/acedb/kimberly/ccc/ccc_source/worm

On tazendra (when live):

/home/azurebrd/public_html/cgi_bin/forms/ccc/source/worm

Note: this location has changed. Still to be confirmed: is it now the same as the mangolassi path above?

Mapping Files Location

PMID to MOD Accession: http://textpresso-dev.caltech.edu/ccc_results/accession
- This file maps PubMed identifiers to MOD paper IDs. A script runs alongside the ccc scripts each time to regenerate this universal mapping file.

gpi files: /home/azurebrd/public_html/cgi-bin/forms/ccc/scripts
- The gpi files map MOD protein names to MOD and UniProtKB identifiers.
- Current WB gpi file is named ws234_gpi.

Updates to form code

In script: /home/azurebrd/public_html/cgi-bin/forms/ccc/scripts/populate_ccc_pg_indices.pl

Update name of worm gpi file
- Script currently expects: $gpi_files{'worm'} = 'ws234_gpi';
- Note: this will be changed so that the script always looks for a file named WB_gpi.

In ccc.cgi (located here: /home/azurebrd/public_html/cgi-bin/forms/ccc):

When the curation form goes live, the userid used to send annotations to Protein2GO must be updated by replacing the "test:" prefix with "production:".
- From my notes: line 338 of the code will need to change to remove the test prefix from the Protein2GO userid. The current line is:
- push @ptgoFields, "userid=test:$ptgoUser";
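The prefix switch can be illustrated with a short Python sketch (ccc.cgi itself is Perl). The live flag is a hypothetical parameter; the page only says that the prefix is edited by hand at line 338.

```python
def protein2go_userid_field(user, live=False):
    """Build the userid field with a test: or production: prefix."""
    # "live" is a hypothetical switch; the page describes a manual code edit.
    prefix = "production" if live else "test"
    return f"userid={prefix}:{user}"


test_field = protein2go_userid_field("curator1")
live_field = protein2go_userid_field("curator1", live=True)
```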

Form Documentation

User Guide for Curators

Detailed Documentation of Form and Scripts

CCC Workflow

Feedback on Form - C. elegans

Examples for color coding and underlining in curation form:

In this first case, the brown category (CCC_TAIR) contains another category, localization_experimental_082208, within its tags. When this happens, we want to underline the term or phrase that falls within both markups while keeping the color brown. So protein complex would be brown, and protein would also be underlined.

SSC:12 PMID:22247249:introduction:42 AP1|APETALA1|FD|FRUITFULL|FT|FUL|bZIP protein complex <genes_arabidopsis>FT</genes_arabidopsis> <localization_verbs_082208>forms</localization_verbs_082208> a <CCC_TAIR><localization_experimental_082208>protein</localization_experimental_082208> complex</CCC_TAIR> involving a <genes_arabidopsis>bZIP</genes_arabidopsis> transcription factor <genes_arabidopsis>FD</genes_arabidopsis> , and directly activates <genes_arabidopsis>APETALA1</genes_arabidopsis> ( <genes_arabidopsis>AP1</genes_arabidopsis> ) <localization_experimental_082208>expression</localization_experimental_082208> ( Abe et al . 2005 , Wigge et al . 2005 ) , and either directly or indirectly activates <genes_arabidopsis>FRUITFULL</genes_arabidopsis> ( <genes_arabidopsis>FUL</genes_arabidopsis> ) <localization_experimental_082208>expression</localization_experimental_082208> ( Teper-Bamnolker and Samach 2005 ) .

In this second case, no other tags are nested within the brown category (localization_cell_components_2011-02-11), so the term body would simply be brown.

SSC:6 PMID:24120942:abstract:6 MOD-1|NHR-76 body|nuclear We <localization_verbs_082208>show</localization_verbs_082208> that the serotonergic chloride channel <protein_celegans>MOD-1</protein_celegans> relays a long-range endocrine <localization_experimental_082208>signal</localization_experimental_082208> from C . elegans <localization_cell_components_2011-02-11>body</localization_cell_components_2011-02-11> cavity neurons to control distal ATGL-1 function , via the <localization_cell_components_2011-02-11>nuclear</localization_cell_components_2011-02-11> receptor <protein_celegans>NHR-76</protein_celegans> .
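The two cases above can be sketched as a small renderer that keeps a stack of open tags: text takes the color of the outermost open category and is underlined when it sits inside more than one category. The category-to-color mapping is an assumption based on the color key on this page (in particular, mapping localization_experimental_082208 to the orange assay color is a guess), and the function names are hypothetical.

```python
import re
from html import escape

# Assumed mapping of Textpresso categories to the colors in the color key.
COLORS = {
    "protein_celegans": "blue",
    "genes_arabidopsis": "blue",
    "localization_cell_components_2011-02-11": "brown",
    "CCC_TAIR": "brown",
    "localization_experimental_082208": "orange",
    "localization_verbs_082208": "green",
}

TOKEN_RE = re.compile(r"<(/?)([^<>/]+)>")  # opening or closing category tag


def _span(text, stack):
    """Wrap text in a styled span according to the currently open tags."""
    if not text:
        return ""
    if not stack:
        return escape(text)
    style = f"color:{COLORS.get(stack[0], 'black')}"  # outermost category wins
    if len(stack) > 1:  # text lies inside more than one category
        style += ";text-decoration:underline"
    return f'<span style="{style}">{escape(text)}</span>'


def render_html(tagged):
    """Convert a tagged sentence into HTML with colors and underlines."""
    out, stack, pos = [], [], 0
    for token in TOKEN_RE.finditer(tagged):
        out.append(_span(tagged[pos:token.start()], stack))
        closing, name = token.group(1), token.group(2)
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
        pos = token.end()
    out.append(_span(tagged[pos:], stack))
    return "".join(out)
```

Applied to the first example, protein comes out brown and underlined while complex is brown only; in the second example, body is brown with no underline.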

Feedback on Form - dictyBase

1) For search functionality - can we autocomplete for GO terms and for CC terms?

a) For autocomplete search of GO terms, would need to use the GO CC ontology.
b) For autocomplete search of CC terms, could use either the terms in the component-GO term index, or the entire CC category. If the latter, we would need to get it from Michael and update it on tazendra whenever it changed.

Searches - lines 413, 451+, 530

To get the GO terms:
SELECT * FROM ccc_component_go_index;
Autocomplete from the second column, named ccc_goterm, and allow only values from that list.
To get just the list:
SELECT DISTINCT(ccc_goterm) FROM ccc_component_go_index ORDER BY ccc_goterm;
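A sketch of the autocomplete idea: load the distinct GO terms once, suggest matches for what the curator has typed, and accept only values from the list. An in-memory SQLite database stands in for the Postgres database on tazendra. The first column name (ccc_component) and the sample rows are made up for the demonstration; only the table name and ccc_goterm come from this page.

```python
import sqlite3


def load_goterms(conn):
    """Return the sorted list of distinct GO terms from the index table."""
    rows = conn.execute(
        "SELECT DISTINCT(ccc_goterm) FROM ccc_component_go_index ORDER BY ccc_goterm"
    )
    return [row[0] for row in rows]


def suggest(goterms, typed, limit=10):
    """Case-insensitive prefix match against the allowed GO terms."""
    typed = typed.lower()
    return [term for term in goterms if term.lower().startswith(typed)][:limit]


def is_allowed(goterms, value):
    """Only values present in the list may be submitted."""
    return value in goterms


# Demonstration with made-up rows in a stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ccc_component_go_index (ccc_component TEXT, ccc_goterm TEXT)")
conn.executemany(
    "INSERT INTO ccc_component_go_index VALUES (?, ?)",
    [
        ("nuclear", "nucleus"),
        ("nuclei", "nucleus"),
        ("chromosomes", "chromosome"),
        ("nucleolar", "nucleolus"),
    ],
)
goterms = load_goterms(conn)
```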

2) Include a key to the term color coding - DONE, will see if curators want this removed though

a) This is on the old form (see http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi)
We have some white space to the right of the sentence classification section; could put the color key there.

Color Key:
Gene Product in Blue
Textpresso Cellular Component in Red/Brown
Assay Term in Orange
Verb in Green
An underline indicates that terms are represented in multiple Textpresso categories.

3) Make the section headings listed for each sentence more prominent, i.e. bold

a) Section heading display - line 267? - DONE, bold, red, font size 14

4) Link out gene products IDs to MODs and UniProtKB

Will display beneath the color key for anything that is colored blue, i.e. a C. elegans, dicty, or TAIR gene/gene product.

a) From where should we do this? From within the sentence?
b) Appropriate UniProt URL:
   http://www.uniprot.org/uniprot/sixcharacteruniprotid
   http://www.uniprot.org/uniprot/Q21253
c) Each MOD would then need a URL for its identifiers:
   WormBase: http://www.wormbase.org/species/c_elegans/gene/WBGeneidentifier
   WormBase: http://www.wormbase.org/species/c_elegans/gene/WBGene00010593
   dictyBase: http://dictybase.org/gene/DDB_dictygeneid
   dictyBase: http://dictybase.org/gene/DDB_G0283903
   TAIR: ??
   TAIR: http://www.arabidopsis.org/servlets/TairObject?id=26899&type=locus

5) Link out paper IDs in the display to PubMed or MOD

a) PubMed (PMID): http://www.ncbi.nlm.nih.gov/pubmed/paperidentifier
   PubMed (PMID): http://www.ncbi.nlm.nih.gov/pubmed/24146615  
b) WormBase: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?action=Search+!&data_number=identifier
   WormBase: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?action=Search+!&data_number=00037685
c) dictyBase: 
d) TAIR: http://lu:8080/pubsearch/DisplayArticle?article_id=xxxxx
   TAIR: http://lu:8080/pubsearch/DisplayArticle?article_id=61657
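The URL patterns listed under items 4 and 5 can be gathered into lookup tables, sketched below. The TAIR gene template is inferred from the single example URL (the page marks the pattern as unknown), and no dictyBase paper URL is given on this page, so that entry is left out. The function and table names are hypothetical.

```python
# URL templates copied from the patterns on this page; {id} marks the identifier.
GENE_PRODUCT_URLS = {
    "UniProtKB": "http://www.uniprot.org/uniprot/{id}",
    "WormBase": "http://www.wormbase.org/species/c_elegans/gene/{id}",
    "dictyBase": "http://dictybase.org/gene/{id}",
    # Assumption: generalized from the one TAIR example given.
    "TAIR": "http://www.arabidopsis.org/servlets/TairObject?id={id}&type=locus",
}

PAPER_URLS = {
    "PubMed": "http://www.ncbi.nlm.nih.gov/pubmed/{id}",
    "WormBase": (
        "http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/"
        "paper_display.cgi?action=Search+!&data_number={id}"
    ),
    "TAIR": "http://lu:8080/pubsearch/DisplayArticle?article_id={id}",
    # dictyBase: no paper URL is listed on this page.
}


def link_out(templates, source, identifier):
    """Return the link-out URL, or None when no template is known."""
    template = templates.get(source)
    if template is None:
        return None
    return template.format(id=identifier)


gene_url = link_out(GENE_PRODUCT_URLS, "WormBase", "WBGene00010593")
paper_url = link_out(PAPER_URLS, "PubMed", "24146615")
```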

Populating synonyms

Protein Name in Sentence	Synonym in Column 4 of gpi File	Gene Product Symbol in Column 2 of gpi File	Displays on CCC Form
p25	p25	dynE	Yes
Rab7	rab7	rab7A	Yes
Hsp32	hsp32	hspC	Yes
Snf12	snf12	snf12-1	Yes
WASH	WASH	wshA	No
Rh50	Rh50	rhgA	No
CBP4a	CBP4a (not certain this is in the synonyms column)	cbpD1	No
NumA1	numA1	numA	No
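One possible way to resolve a protein name from a sentence to its gene product symbol is a case-insensitive lookup over the synonyms, sketched below. This is not a description of what the form currently does, and it is not offered as an explanation of the "No" rows above. The sketch assumes tab-separated gpi lines with the symbol in column 2 and pipe-separated synonyms in column 4, as the table header suggests; the identifiers in column 1 of the demo rows are placeholders.

```python
def build_synonym_index(gpi_lines):
    """Map lowercased symbols and synonyms to the gene product symbol."""
    index = {}
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4:
            continue  # not enough columns to hold symbol and synonyms
        symbol = cols[1]  # column 2: gene product symbol
        synonyms = [s for s in cols[3].split("|") if s]  # column 4 (assumed "|"-separated)
        for name in [symbol] + synonyms:
            index.setdefault(name.lower(), symbol)  # first entry wins on conflicts
    return index


def symbol_for(index, name_in_sentence):
    """Look up a protein name as it appears in a sentence; None if unknown."""
    return index.get(name_in_sentence.lower())


# Demo rows: symbols and synonyms from the table above, placeholder IDs.
demo_lines = ["ID1\tdynE\t\tp25", "ID2\tnumA\t\tnumA1", "ID3\twshA\t\tWASH"]
demo_index = build_synonym_index(demo_lines)
```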

Feedback on Form - TAIR

Include link out to papers for TAIR. - See above.
We had a link out in the old form, /home/azurebrd/public_html/cgi-bin/forms/tair, see line 438
New link for new form: http://lu:8080/pubsearch/DisplayArticle?article_id=xxxxx where xxxxx = the article id.

Add a Qualifier column to the annotation row of the curation form.
Add this column between the evidence code column and the already curated column.
The Qualifier column would have two fixed values, NOT and colocalizes_with.
For the curation table, add the ccc_qualifier column:

ccc_sentenceannotation
  ccc_mod text,
  ccc_file text,
  ccc_paper text,
  ccc_section text,
  ccc_sentnum text,
  ccc_geneproduct text,
  ccc_component text,
  ccc_goterm text,
  ccc_evidencecode text,
  ccc_with text,
  ccc_qualifier text,
  ccc_alreadycurated text,
  ccc_comment text,
  ccc_valid text,
  ccc_ptgoid text,
  ccc_curator text,
  ccc_timestamp

The Qualifier value, when present, will also be sent to Protein2GO via web services.
The relevant Protein2GO code is at line 344 of ccc.cgi.
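A sketch of how the two fixed Qualifier values might be validated before an annotation is stored or sent. The "qualifier=" field name is hypothetical: this page does not give the parameter name that Protein2GO expects. The sketch is in Python, whereas ccc.cgi is Perl.

```python
ALLOWED_QUALIFIERS = ("NOT", "colocalizes_with")


def normalize_qualifier(value):
    """Return a valid qualifier, or None when no qualifier was chosen."""
    value = (value or "").strip()
    if not value:
        return None
    if value not in ALLOWED_QUALIFIERS:
        raise ValueError(f"qualifier must be one of {ALLOWED_QUALIFIERS}, got {value!r}")
    return value


def with_qualifier(fields, qualifier):
    """Return a copy of the field list, adding the qualifier only when present."""
    result = list(fields)
    normalized = normalize_qualifier(qualifier)
    if normalized is not None:
        result.append(f"qualifier={normalized}")  # hypothetical field name
    return result
```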

Archived Info - Old script explanation - Juancarlos

The script runs on Wednesdays at 2am and looks for a new SVM file. If a new SVM file is available, the Textpresso search is performed; if not, the search is skipped for that week.
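The weekly decision can be sketched as follows. How the old script recognized a "new" SVM file is not documented here, so comparing file names against a list of already processed files is an assumption, and all names are hypothetical. A Wednesday 2am schedule corresponds to the cron expression 0 2 * * 3.

```python
def new_svm_files(available, already_processed):
    """Return the SVM result files that have not been processed yet."""
    return sorted(set(available) - set(already_processed))


def weekly_run(available, already_processed, run_search):
    """Run the Textpresso search once per new SVM file; skip if none."""
    new_files = new_svm_files(available, already_processed)
    for name in new_files:
        run_search(name)  # caller supplies the function that runs the search
    return new_files
```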

The script that transfers results from textpresso to tazendra is:

 /home/postgres/public_html/cgi-bin/data/ccc_gocuration/get_newset.pl

called by :

 /home/postgres/work/pgpopulation/textpresso/wrapper.sh

On textpresso there have been no new matches since Oct 12

The full results are on textpresso-dev at :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/good_sentences_file.*

The new matches are at :

 /data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_sentences_file.*

They're comparing to svm results from :

 http://caprica.caltech.edu/celegans/svm_results/Juancarlos/otherexpr

You can check the recent good_sentences_file.* files and see if any of those sentences should be in SVM, or you can look at the full text of the SVM results and see if any of those should be in the good_sentences_file.* files. If you don't have a textpresso-dev account, you can ask Michael, and he'll ask the ITS people. I log on with my ITS account.

If something should be in the good_sentences_file.* files and isn't, check that the categories contain what they should, and let me know which category isn't matching what it should.

The good_sentences_file.* is generated by :

 /home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_go_gene_component.pl

Now located here?

/home/postgres/work/textpresso/kimberly/get_go_gene_component.pl

The category files being used are in :

 /data2/data-processing/data/celegans/Data/indices/body/semantic/categories

files :

 protein_celegans 
 localization_cell_components_082208 
 localization_verbs_082208 
 localization_other_120107

Archived Info - Old Curation Form

Relevant postgres tables:

ccc_gene_comp_go
