

Search frequency

  • Each time the WormBase (WB) SVM classification is run

Search according to SVM classification

  • Search all predicted positives (low, medium, and high confidence) from the Other_expr pattern SVM

Search Categories

  • Four categories:
    • localization_cell_components_2011-02-11
    • protein_celegans
    • localization_verbs_082208
    • localization_experimental_082208

Search Filters

  • Remove all sentences with a Textpresso sentence score of 30 or higher.
  • Other filtering steps may be introduced in the future (e.g., filtering on specific proteins such as DAF-16, or on sentences that also contain words such as mutant or RNAi).
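
The score filter could be sketched as follows. This assumes that the leading SSC:<n> token on each sentence line (as in the Sentence Format example on this page) is the Textpresso sentence score; the function name and the cutoff parameter are illustrative and are not taken from the production scripts.

```python
import re

# Assumption: every sentence line starts with "SSC:<integer>",
# and that integer is the Textpresso sentence score.
SCORE_RE = re.compile(r"^SSC:(\d+)\b")


def keep_low_scoring(lines, cutoff=30):
    """Keep sentence lines scoring below cutoff; drop lines scoring cutoff or higher."""
    kept = []
    for line in lines:
        match = SCORE_RE.match(line.strip())
        if match is None:
            continue  # not a recognizable sentence line; skipped in this sketch
        if int(match.group(1)) < cutoff:
            kept.append(line)
    return kept
```

Any future filters (specific proteins, words such as mutant or RNAi) could be added as further checks inside the same loop.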

File Format - File Name

  • Format: [date script was run, YYYYMMDD]_[MOD]_[type of Textpresso search]
  • For example: 20130801_WB_ccc
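
A minimal sketch of building a file name in this format. The function name is hypothetical; only the three underscore-joined parts and the example value come from this page.

```python
from datetime import date


def sentence_file_name(run_date, mod, search_type):
    """Join the run date (YYYYMMDD), MOD abbreviation, and search type with underscores."""
    return f"{run_date.strftime('%Y%m%d')}_{mod}_{search_type}"


print(sentence_file_name(date(2013, 8, 1), "WB", "ccc"))  # prints 20130801_WB_ccc
```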

File Format - Sentence Format

SSC:7 PMID:23263989:references:276 ZFP-1 chromosomes 771 FIG 4 <protein_celegans>ZFP-1</protein_celegans> : : <localization_experimental_082208>GFP</localization_experimental_082208> <localization_verbs_082208>localizes</localization_verbs_082208> to <localization_cell_components_2011-02-11>chromosomes</localization_cell_components_2011-02-11> in maturing oocytes and is <localization_experimental_082208>widely</localization_experimental_082208> <localization_verbs_082208>expressed</localization_verbs_082208> 772 at all developmental <localization_experimental_082208>stages</localization_experimental_082208> .
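
A sentence line like the one above could be taken apart roughly as follows. This is a sketch under assumptions: the first token is SSC:<score>, the second is PMID:<id>:<section>:<sentence number>, and category markup is written as <category>term</category>. Only innermost tags are collected, so an outer tag that wraps another tag is not reported. The function and key names are illustrative.

```python
import re
from collections import defaultdict

# Assumed header layout: "SSC:<score> PMID:<id>:<section>:<sentence number> <rest>"
HEADER_RE = re.compile(r"^SSC:(\d+)\s+PMID:(\d+):([^:\s]+):(\d+)\s+(.*)$", re.S)
# Innermost category tags only: the tagged text may not contain further tags.
TAG_RE = re.compile(r"<([A-Za-z0-9_\-]+)>([^<>]*)</\1>")


def parse_sentence(line):
    """Split a sentence line into score, paper, section, sentence number, and tagged terms."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized sentence line")
    score, pmid, section, sentnum, rest = match.groups()
    terms = defaultdict(list)
    for category, text in TAG_RE.findall(rest):
        terms[category].append(text.strip())
    return {
        "score": int(score),
        "pmid": pmid,
        "section": section,
        "sentence_number": int(sentnum),
        "terms": dict(terms),
    }
```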

Sentence File Location

  • On mangolassi:


  • On tazendra (when live):

/home/azurebrd/public_html/cgi_bin/forms/ccc/source/worm - this has changed.

Same as above?

Mapping Files Location

  • gpi files: /home/azurebrd/public_html/cgi-bin/forms/ccc/scripts
    • The gpi files map MOD protein names to MOD and UniProtKB identifiers.
    • Current WB gpi file is named ws234_gpi.

Updates to form code

  • In script: /home/azurebrd/public_html/cgi-bin/forms/ccc/scripts/populate_ccc_pg_indices.pl
    • Update the name of the worm gpi file.
    • The script currently expects: $gpi_files{'worm'} = 'ws234_gpi';
    • Note: this will be changed so that the script always looks for a file named WB_gpi.
  • In ccc.cgi (located in /home/azurebrd/public_html/cgi-bin/forms/ccc):
    • When the curation form is made live, the userid used to send annotations to Protein2GO must be updated: replace the "test:" prefix with "production:".
    • From my notes: line 338 of the code will need to change to remove the test prefix from the Protein2GO userid:
    • push @ptgoFields, "userid=test:$ptgoUser";
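
One way to avoid hand-editing the prefix when the form goes live is to derive it from a single flag. The sketch below is in Python for illustration only (ccc.cgi itself is Perl); the function name and the live flag are hypothetical, while the "userid=", "test:" and "production:" strings come from the notes above.

```python
def protein2go_userid_field(user, live=False):
    """Build the userid field, using the production prefix only when the form is live."""
    prefix = "production" if live else "test"
    return f"userid={prefix}:{user}"
```

With this shape, switching from test to production is a one-line configuration change rather than an edit to the string itself.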

Form Documentation

User Guide for Curators

Detailed Documentation of Form and Scripts

CCC Workflow

Feedback on Form - C. elegans

  • Examples for color coding and underlining in curation form:
  • In this first case, the brown category (CCC_TAIR) contains another category, localization_experimental_082208, within its tags. When this happens, we want to underline the term or phrase that falls within both markups while keeping the color brown. So "protein complex" would be brown, and "protein" would also be underlined.

SSC:12 PMID:22247249:introduction:42 AP1|APETALA1|FD|FRUITFULL|FT|FUL|bZIP protein complex <genes_arabidopsis>FT</genes_arabidopsis> <localization_verbs_082208>forms</localization_verbs_082208> a <CCC_TAIR><localization_experimental_082208>protein</localization_experimental_082208> complex</CCC_TAIR> involving a <genes_arabidopsis>bZIP</genes_arabidopsis> transcription factor <genes_arabidopsis>FD</genes_arabidopsis> , and directly activates <genes_arabidopsis>APETALA1</genes_arabidopsis> ( <genes_arabidopsis>AP1</genes_arabidopsis> ) <localization_experimental_082208>expression</localization_experimental_082208> ( Abe et al . 2005 , Wigge et al . 2005 ) , and either directly or indirectly activates <genes_arabidopsis>FRUITFULL</genes_arabidopsis> ( <genes_arabidopsis>FUL</genes_arabidopsis> ) <localization_experimental_082208>expression</localization_experimental_082208> ( Teper-Bamnolker and Samach 2005 ) .

  • In this case, there are no other tags within the brown (localization_cell_components_2011-02-11), so the term body would just be brown.

SSC:6 PMID:24120942:abstract:6 MOD-1|NHR-76 body|nuclear We <localization_verbs_082208>show</localization_verbs_082208> that the serotonergic chloride channel <protein_celegans>MOD-1</protein_celegans> relays a long-range endocrine <localization_experimental_082208>signal</localization_experimental_082208> from C . elegans <localization_cell_components_2011-02-11>body</localization_cell_components_2011-02-11> cavity neurons to control distal ATGL-1 function , via the <localization_cell_components_2011-02-11>nuclear</localization_cell_components_2011-02-11> receptor <protein_celegans>NHR-76</protein_celegans> .
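
The coloring and underlining rule described in these two examples could be sketched as below. The category-to-color mapping is an assumption based on the color key on this page (for instance, that localization_experimental_082208 is the orange "assay term" category), and the sketch assumes the markup is well formed. An outermost tag sets the color; any tag nested inside another is rendered as an underline.

```python
import html
import re

# Assumed mapping from Textpresso category to display color.
CATEGORY_COLORS = {
    "protein_celegans": "blue",
    "genes_arabidopsis": "blue",
    "localization_cell_components_2011-02-11": "brown",
    "CCC_TAIR": "brown",
    "localization_experimental_082208": "orange",
    "localization_verbs_082208": "green",
}
TOKEN_RE = re.compile(r"<(/?)([A-Za-z0-9_\-]+)>")


def render_sentence(tagged):
    """Turn category markup into colored spans; nested categories become underlines."""
    out = []
    depth = 0  # number of category tags currently open
    pos = 0    # end of the last consumed tag
    for match in TOKEN_RE.finditer(tagged):
        closing, category = match.group(1), match.group(2)
        if category not in CATEGORY_COLORS:
            continue  # unknown markup is left in place and escaped as text
        out.append(html.escape(tagged[pos:match.start()]))
        pos = match.end()
        if closing:
            depth -= 1
            out.append("</u>" if depth > 0 else "</span>")
        else:
            color = CATEGORY_COLORS[category]
            out.append("<u>" if depth > 0 else f'<span style="color:{color}">')
            depth += 1
    out.append(html.escape(tagged[pos:]))
    return "".join(out)
```

For the first example, the CCC_TAIR phrase comes out as a brown span around "protein complex" with "protein" underlined inside it.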

Feedback on Form - dictyBase

1) For search functionality - can we autocomplete for GO terms and for CC terms?

a) For autocomplete search of GO terms, would need to use the GO CC ontology.
b) For autocomplete search of CC terms, could use either the terms in the component-GO term index, or the entire CC category. If the latter, we would need to get it from Michael and update it on tazendra whenever it changed.
Searches - lines 413, 451+, 530
To get the GO terms:
SELECT * FROM ccc_component_go_index ;
Autocomplete from the 2nd column, named ccc_goterm, and only allow values from that list.
To get just the list:
SELECT DISTINCT(ccc_goterm) FROM ccc_component_go_index ORDER BY ccc_goterm;
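
A rough sketch of the autocomplete idea, using an in-memory SQLite table as a stand-in for the postgres table so that the example is self-contained. The first column name (ccc_component) and the sample rows are made up for illustration; only the table name, the ccc_goterm column, and the DISTINCT query come from the notes above. Matching here is a case-insensitive substring match, which is a design choice of the sketch.

```python
import sqlite3


def load_go_terms(conn):
    """Return the distinct GO terms from the component-GO index, sorted."""
    rows = conn.execute(
        "SELECT DISTINCT ccc_goterm FROM ccc_component_go_index ORDER BY ccc_goterm"
    ).fetchall()
    return [row[0] for row in rows]


def autocomplete(terms, typed, limit=10):
    """Suggest terms containing the typed text (case-insensitive)."""
    typed = typed.strip().lower()
    if not typed:
        return []
    return [term for term in terms if typed in term.lower()][:limit]


def is_allowed(terms, value):
    """Only values that appear in the term list are accepted."""
    return value in terms


# Stand-in table with illustrative rows (not real index content).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ccc_component_go_index (ccc_component TEXT, ccc_goterm TEXT)")
conn.executemany(
    "INSERT INTO ccc_component_go_index VALUES (?, ?)",
    [("nuclear", "nucleus"), ("nuclei", "nucleus"), ("chromosomes", "chromosome")],
)
terms = load_go_terms(conn)
```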

2) Include a key to the term color coding - DONE, will see if curators want this removed though

a) This is on the old form (see http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi)
We have some white space to the right of the sentence classification section; could put the color key there.

Color Key:
Gene Product in Blue
Textpresso Cellular Component in Red/Brown
Assay Term in Orange
Verb in Green
An underline indicates that terms are represented in multiple Textpresso categories.

3) Make the section headings listed for each sentence more prominent, i.e. bold

a) Section heading display - line 267? - DONE, bold, red, font size 14

4) Link out gene products IDs to MODs and UniProtKB

Will display beneath the color key for anything that is colored blue, i.e. a C. elegans, dicty, or TAIR gene/gene product.
a) From where should we do this? From within the sentence?
b) Appropriate UniProt URL:
c) Each MOD would then need a URL for its identifiers:
   WormBase (pattern): http://www.wormbase.org/species/c_elegans/gene/WBGeneidentifier
   WormBase (example): http://www.wormbase.org/species/c_elegans/gene/WBGene00010593
   dictyBase (pattern): http://dictybase.org/gene/DDB_dictygeneid
   dictyBase (example): http://dictybase.org/gene/DDB_G0283903
   TAIR (pattern): ??
   TAIR (example): http://www.arabidopsis.org/servlets/TairObject?id=26899&type=locus
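
The gene link-outs could be generated from per-MOD templates, as in this sketch. The WormBase and dictyBase templates follow the URLs listed above. The TAIR template is generalized from the single example URL and is an assumption, since the TAIR pattern is still marked "??". No UniProt template is included because that URL is not given here. The MOD keys and the function name are illustrative.

```python
from urllib.parse import quote

GENE_URL_TEMPLATES = {
    "WB": "http://www.wormbase.org/species/c_elegans/gene/{id}",
    "dictyBase": "http://dictybase.org/gene/{id}",
    # Assumption: generalized from the one TAIR example URL on this page.
    "TAIR": "http://www.arabidopsis.org/servlets/TairObject?id={id}&type=locus",
}


def gene_url(mod, identifier):
    """Return the link-out URL for a gene identifier, or None for an unknown MOD."""
    template = GENE_URL_TEMPLATES.get(mod)
    if template is None:
        return None
    return template.format(id=quote(identifier, safe=""))
```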

5) Link out paper IDs in the display to PubMed or MOD

a) PubMed (pattern): http://www.ncbi.nlm.nih.gov/pubmed/paperidentifier
   PubMed (example): http://www.ncbi.nlm.nih.gov/pubmed/24146615
b) WormBase (pattern): http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?action=Search+!&data_number=identifier
   WormBase (example): http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?action=Search+!&data_number=00037685
c) dictyBase:
d) TAIR (pattern): http://lu:8080/pubsearch/DisplayArticle?article_id=xxxxx
   TAIR (example): http://lu:8080/pubsearch/DisplayArticle?article_id=61657
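
For papers, the PubMed link could be derived directly from the paper token that starts each sentence line (for example PMID:24120942:abstract:6). This is a sketch: the token layout is assumed from the sentence examples on this page, and the function names are hypothetical. The base URLs are the ones listed above.

```python
PUBMED_BASE = "http://www.ncbi.nlm.nih.gov/pubmed/"
WB_PAPER_BASE = (
    "http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/"
    "paper_display.cgi?action=Search+!&data_number="
)


def pmid_from_token(token):
    """Extract the numeric PMID from a token such as 'PMID:24120942:abstract:6'."""
    parts = token.split(":")
    if len(parts) < 2 or parts[0] != "PMID" or not parts[1].isdigit():
        return None
    return parts[1]


def pubmed_url(pmid):
    """PubMed link for a numeric PMID string."""
    return PUBMED_BASE + pmid


def wormbase_paper_url(data_number):
    """WormBase paper display link for a paper number such as '00037685'."""
    return WB_PAPER_BASE + data_number
```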

Populating synonyms

Protein Name in Sentence | Synonym in Column 4 of gpi File | Gene Product Symbol in Column 2 of gpi File | Displays on CCC Form
p25 | p25 | dynE | Yes
Rab7 | rab7 | rab7A | Yes
Hsp32 | hsp32 | hspC | Yes
Snf12 | snf12 | snf12-1 | Yes
Rh50 | Rh50 | rhgA | No
CBP4a | CBP4a (not certain this is in the synonyms column) | cbpD1 | No
NumA1 | numA1 | numA | No
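
To help check cases like these, a name lookup over a gpi file could be sketched as below. Assumptions, none of which are confirmed by this page: the file is tab-delimited, lines starting with "!" are header comments, synonyms in column 4 are separated by "|", and matching is case-insensitive. This is not the form's actual lookup logic, and it does not explain why some names fail to display; it only gives a reference result to compare against.

```python
def load_gpi_names(lines):
    """Map each symbol (column 2) and synonym (column 4), lowercased, to the symbol."""
    names = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if line.startswith("!") or len(row) < 4:
            continue  # header comment or short line
        symbol = row[1]
        names.setdefault(symbol.lower(), symbol)
        for synonym in row[3].split("|"):
            if synonym:
                names.setdefault(synonym.lower(), symbol)
    return names


def lookup_symbol(names, protein_name):
    """Return the gene product symbol for a name seen in a sentence, or None."""
    return names.get(protein_name.lower())
```

Running each "No" row from the table through such a lookup would show whether the name is reachable from the gpi file at all, which separates data problems from form problems.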

Feedback on Form - TAIR

  • Add a Qualifier column to the annotation row of the curation form.
  • Add this column between the evidence code column and the already curated column.
  • The Qualifier column would have two fixed values, NOT and colocalizes_with.
  • For the curation table, add the ccc_qualifier column (shown in bold on the original page):
  ccc_mod text,
  ccc_file text,
  ccc_paper text,
  ccc_section text,
  ccc_sentnum text,
  ccc_geneproduct text,
  ccc_component text,
  ccc_goterm text,
  ccc_evidencecode text,
  ccc_with text,
  ccc_qualifier text,
  ccc_alreadycurated text,
  ccc_comment text,
  ccc_valid text,
  ccc_ptgoid text,
  ccc_curator text,
  • The Qualifier value, when present, will also be sent to Protein2GO via web services.
  • The relevant Protein2GO code is in line 344 in ccc.cgi
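
A sketch of how the two fixed Qualifier values might be validated before being stored or sent on. The "qualifier=" field name is hypothetical (the actual Protein2GO parameter name is not given on this page); the key=value string shape mirrors the userid field shown in the form code notes.

```python
ALLOWED_QUALIFIERS = ("NOT", "colocalizes_with")


def normalize_qualifier(value):
    """Return a valid qualifier, None when empty, and raise on anything else."""
    if value is None or value.strip() == "":
        return None
    value = value.strip()
    if value not in ALLOWED_QUALIFIERS:
        raise ValueError(f"unsupported qualifier: {value!r}")
    return value


def add_qualifier_field(fields, qualifier):
    """Append the qualifier to the outgoing field list only when one is present."""
    normalized = normalize_qualifier(qualifier)
    if normalized is not None:
        fields.append(f"qualifier={normalized}")  # hypothetical parameter name
    return fields
```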

Archived Info - Old script explanation - Juancarlos

The script runs on Wednesdays at 2am and looks for a new SVM file. If a new SVM file is available, the Textpresso search is performed; if not, the search is skipped for that week.
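
The weekly check could look roughly like the sketch below. How the real script decides that an SVM file is "new" is not described here, so this version simply compares file names in a directory against a set of names already processed; the function names, the directory argument, and the run_search callback are all hypothetical. In standard cron syntax, a Wednesday 2am schedule is written 0 2 * * 3.

```python
from pathlib import Path


def new_svm_files(svm_dir, processed):
    """Return files in svm_dir whose names are not yet in processed, sorted by name."""
    return sorted(
        path
        for path in Path(svm_dir).iterdir()
        if path.is_file() and path.name not in processed
    )


def weekly_run(svm_dir, processed, run_search):
    """Run the search for each new SVM file; return False when the week is skipped."""
    pending = new_svm_files(svm_dir, processed)
    if not pending:
        return False  # no new SVM file, so the search is skipped this week
    for path in pending:
        run_search(path)
        processed.add(path.name)
    return True
```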

The script that gets stuff to tazendra from textpresso is :


called by :


On textpresso there have been no new matches since Oct 12

The full results are on textpresso-dev at :


The new matches are at :


They're comparing to svm results from :


You can check the recent good_sentences_file.* and see if any of those should be in SVM, or you can look at the full text of the SVM results and see if any of those should be in the good_sentences_file.*. If you don't have a textpresso-dev account, you can ask Michael, and he'll ask the ITS people. I log on with my ITS account.

If stuff should be in the good_sentences_file.* and isn't, check that the categories have what they should have, and let me know which category isn't matching what it should.

The good_sentences_file.* is generated by :


Now located here?


The category files being used are in :


files :


Archived Info - Old Curation Form

Relevant postgres tables:

