WormBase
Contents
- 1 Search frequency
- 2 Search according to SVM classification
- 3 Search Categories
- 4 Search Filters
- 5 File Format - File Name
- 6 File Format - Sentence Format
- 7 Sentence File Location
- 8 Mapping Files Location
- 9 Updates to form code
- 10 Form Documentation
- 11 Feedback on Form - C. elegans
- 12 Feedback on Form - dictyBase
- 13 Feedback on Form - TAIR
- 14 Archived Info - Old script explanation - Juancarlos
- 15 Archived Info - Old Curation Form
Search frequency
- Whenever the WB SVM is run
Search according to SVM classification
- Search all predicted positives (low, medium, and high) from the Other_expr pattern SVM
Search Categories
- Four categories:
- localization_cell_components_2011-02-11
- protein_celegans
- localization_verbs_082208
- localization_experimental_082208
Search Filters
- Remove all sentences with a Textpresso sentence score of 30 or higher.
- Other filtering steps may be introduced in the future (e.g., specific proteins like DAF-16 or sentences that also contain words like mutant or RNAi).
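The score filter above can be sketched as follows. This is a minimal illustration, not the production script: it assumes the leading `SSC:<n>` field of each sentence line (see the sentence format section) is the Textpresso sentence score, and the function names are hypothetical.

```python
import re

# Assumption: each line starts with "SSC:<integer>", read here as the
# Textpresso sentence score. MAX_SCORE mirrors the cutoff of 30 stated above.
MAX_SCORE = 30
_SCORE_RE = re.compile(r"^SSC:(\d+)\b")


def sentence_score(line):
    """Return the integer score from a sentence line, or None if absent."""
    match = _SCORE_RE.match(line)
    return int(match.group(1)) if match else None


def filter_sentences(lines, max_score=MAX_SCORE):
    """Keep lines scoring below max_score; unscored lines are dropped too."""
    kept = []
    for line in lines:
        score = sentence_score(line)
        if score is not None and score < max_score:
            kept.append(line)
    return kept
```

Dropping lines without a recognizable score is a choice made for this sketch; a real script might log them instead.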
File Format - File Name
- Pattern: [date script was run]_[MOD]_[type of Textpresso search]
- For example: 20130801_WB_ccc
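A small sketch of building and splitting such a file name. It assumes the date part is written as YYYYMMDD, which is inferred from the example 20130801; the function names are hypothetical.

```python
from datetime import date, datetime


def build_file_name(run_date, mod, search_type):
    """Join the three parts as <YYYYMMDD>_<MOD>_<search type>."""
    return f"{run_date.strftime('%Y%m%d')}_{mod}_{search_type}"


def parse_file_name(name):
    """Split a file name back into (date, MOD, search type)."""
    # Split on the first two underscores only, so a search type that
    # itself contains an underscore is kept intact.
    date_part, mod, search_type = name.split("_", 2)
    run_date = datetime.strptime(date_part, "%Y%m%d").date()
    return run_date, mod, search_type
```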
File Format - Sentence Format
SSC:7 PMID:23263989:references:276 ZFP-1 chromosomes 771 FIG 4 <protein_celegans>ZFP-1</protein_celegans> : : <localization_experimental_082208>GFP</localization_experimental_082208> <localization_verbs_082208>localizes</localization_verbs_082208> to <localization_cell_components_2011-02-11>chromosomes</localization_cell_components_2011-02-11> in maturing oocytes and is <localization_experimental_082208>widely</localization_experimental_082208> <localization_verbs_082208>expressed</localization_verbs_082208> 772 at all developmental <localization_experimental_082208>stages</localization_experimental_082208> .
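The line above can be read programmatically along these lines. This is a hedged sketch: it assumes the second field is `PMID:<id>:<section>:<sentence number>` and only extracts the header fields and the tagged terms. The untagged protein and component lists that follow the header are left alone, because component names may contain spaces and cannot be split reliably on whitespace. For nested tags, only the innermost tag is reported.

```python
import re

# Header: "SSC:<score> PMID:<id>:<section>:<sentence number> <rest>"
_HEADER_RE = re.compile(r"^SSC:(\d+)\s+PMID:(\d+):([^:\s]+):(\d+)\s+(.*)$")
# A tag pair with no further tags inside it (the innermost level).
_TAG_RE = re.compile(r"<([^<>/]+)>([^<>]*)</\1>")


def parse_sentence_line(line):
    """Return header fields and (category, term) pairs from one line."""
    match = _HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized sentence line")
    score, pmid, section, sentnum, rest = match.groups()
    terms = [(cat, text.strip()) for cat, text in _TAG_RE.findall(rest)]
    return {
        "score": int(score),
        "pmid": pmid,
        "section": section,
        "sentence_number": int(sentnum),
        "terms": terms,
    }
```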
Sentence File Location
- On textpresso-dev:
- On mangolassi:
/home/acedb/kimberly/ccc/ccc_source/worm
- On tazendra (when live):
/home/azurebrd/public_html/cgi_bin/forms/ccc/source/worm - this has changed.
Same as above?
Mapping Files Location
- PMID to MOD Accession: http://textpresso-dev.caltech.edu/ccc_results/accession
- This file maps PubMed identifiers to MOD paper IDs. A script runs together with the ccc scripts every time to regenerate this universal mapping file.
- gpi files: /home/azurebrd/public_html/cgi-bin/forms/ccc/scripts
- The gpi files map MOD protein names to MOD and UniProtKB identifiers.
- Current WB gpi file is named ws234_gpi.
Updates to form code
- In script: /home/azurebrd/public_html/cgi-bin/forms/ccc/scripts/populate_ccc_pg_indices.pl
- Update name of worm gpi file
- Script currently expects: $gpi_files{'worm'} = 'ws234_gpi';
- Note that this will be changed so that the script always looks for a file named WB_gpi
- In ccc.cgi (located here: /home/azurebrd/public_html/cgi-bin/forms/ccc):
- When the curation form goes live, the userid used for sending annotations to Protein2GO must be updated: replace the "test:" prefix with "production:".
- From my notes: line 338 in the code will need to change to remove the test prefix from the userid for Protein2GO:
- push @ptgoFields, "userid=test:$ptgoUser";
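One way to avoid editing the line by hand at go-live is to derive the prefix from a flag. The sketch below is in Python for illustration only (ccc.cgi itself is Perl); the function name and the `live` flag are hypothetical, and only the `userid=test:` and `production:` strings come from the notes above.

```python
def protein2go_userid_field(user, live=False):
    """Build the userid field, switching prefix on a go-live flag."""
    # "test:" while the form is in testing, "production:" once it is live.
    prefix = "production" if live else "test"
    return f"userid={prefix}:{user}"
```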
Form Documentation
Detailed Documentation of Form and Scripts
Feedback on Form - C. elegans
- Examples for color coding and underlining in curation form:
- In this first case, the brown category (CCC_TAIR) contains another category, localization_experimental_082208, within its tags. When this happens, we want to underline the term or phrase that is contained within both markups, while keeping the color brown. So, "protein complex" would be brown, and "protein" would also be underlined.
SSC:12 PMID:22247249:introduction:42 AP1|APETALA1|FD|FRUITFULL|FT|FUL|bZIP protein complex <genes_arabidopsis>FT</genes_arabidopsis> <localization_verbs_082208>forms</localization_verbs_082208> a <CCC_TAIR><localization_experimental_082208>protein</localization_experimental_082208> complex</CCC_TAIR> involving a <genes_arabidopsis>bZIP</genes_arabidopsis> transcription factor <genes_arabidopsis>FD</genes_arabidopsis> , and directly activates <genes_arabidopsis>APETALA1</genes_arabidopsis> ( <genes_arabidopsis>AP1</genes_arabidopsis> ) <localization_experimental_082208>expression</localization_experimental_082208> ( Abe et al . 2005 , Wigge et al . 2005 ) , and either directly or indirectly activates <genes_arabidopsis>FRUITFULL</genes_arabidopsis> ( <genes_arabidopsis>FUL</genes_arabidopsis> ) <localization_experimental_082208>expression</localization_experimental_082208> ( Teper-Bamnolker and Samach 2005 ) .
- In this case, there are no other tags within the brown category (localization_cell_components_2011-02-11), so the term "body" would just be brown.
SSC:6 PMID:24120942:abstract:6 MOD-1|NHR-76 body|nuclear We <localization_verbs_082208>show</localization_verbs_082208> that the serotonergic chloride channel <protein_celegans>MOD-1</protein_celegans> relays a long-range endocrine <localization_experimental_082208>signal</localization_experimental_082208> from C . elegans <localization_cell_components_2011-02-11>body</localization_cell_components_2011-02-11> cavity neurons to control distal ATGL-1 function , via the <localization_cell_components_2011-02-11>nuclear</localization_cell_components_2011-02-11> receptor <protein_celegans>NHR-76</protein_celegans> .
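The two cases above can be sketched as a small renderer that keeps the outermost category's color and underlines any text that sits inside more than one tag. This is an illustration under assumptions, not the form's actual code: the prefix-to-color table follows the form's color key (gene product blue, cellular component brown, assay term orange, verb green), and the mapping of category-name prefixes to those roles is assumed.

```python
import html
import re

# Assumed mapping from category-name prefix to display color.
COLORS = [
    ("protein_", "blue"),
    ("genes_", "blue"),
    ("localization_cell_components", "brown"),
    ("CCC_", "brown"),
    ("localization_experimental", "orange"),
    ("localization_verbs", "green"),
]
_TOKEN_RE = re.compile(r"<(/?)([^<>/]+)>")


def color_for(category):
    """Return the color for a category, or black if it is unknown."""
    for prefix, color in COLORS:
        if category.startswith(prefix):
            return color
    return "black"


def _emit(out, text, stack):
    """Append text, styled according to the currently open tags."""
    if not text:
        return
    escaped = html.escape(text)
    if not stack:
        out.append(escaped)
        return
    style = f"color:{color_for(stack[0])}"  # outermost category wins
    if len(stack) > 1:  # inside two or more categories: underline
        style += ";text-decoration:underline"
    out.append(f'<span style="{style}">{escaped}</span>')


def render(sentence):
    """Convert a tagged sentence into HTML spans."""
    out, stack, pos = [], [], 0
    for match in _TOKEN_RE.finditer(sentence):
        _emit(out, sentence[pos:match.start()], stack)
        closing, category = match.groups()
        if closing:
            if stack and stack[-1] == category:
                stack.pop()
        else:
            stack.append(category)
        pos = match.end()
    _emit(out, sentence[pos:], stack)
    return "".join(out)
```

A literal "<" in sentence text could be mistaken for a tag by this tokenizer; a production version would need to guard against that.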
Feedback on Form - dictyBase
1) For search functionality - can we autocomplete for GO terms and for CC terms?
a) For autocomplete search of GO terms, we would need to use the GO CC ontology.
b) For autocomplete search of CC terms, we could use either the terms in the component-GO term index or the entire CC category. If the latter, we would need to get it from Michael and update it on tazendra whenever it changed.
Searches - lines 413, 451+, 530
To get the GO terms:
SELECT * FROM ccc_component_go_index;
Autocomplete from the 2nd column, named ccc_goterm, and only allow values from that list.
To get just the list:
SELECT DISTINCT(ccc_goterm) FROM ccc_component_go_index ORDER BY ccc_goterm;
2) Include a key to the term color coding - DONE, will see if curators want this removed though
a) This is on the old form (see http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi). We have some white space to the right of the sentence classification section; could put the color key there.
Color Key:
- Gene Product in Blue
- Textpresso Cellular Component in Red/Brown
- Assay Term in Orange
- Verb in Green
- An underline indicates that terms are represented in multiple Textpresso categories.
3) Make the section headings listed for each sentence more prominent, i.e. bold
a) Section heading display - line 267? - DONE, bold, red, font size 14
4) Link out gene product IDs to MODs and UniProtKB
Will display beneath the color key for anything that is colored blue, i.e. a C. elegans, dicty, or TAIR gene/gene product.
a) From where should we do this? From within the sentence?
b) Appropriate UniProt URL: http://www.uniprot.org/uniprot/sixcharacteruniprotid
- Example: http://www.uniprot.org/uniprot/Q21253
c) Each MOD would then need a URL for its identifiers:
- WormBase: http://www.wormbase.org/species/c_elegans/gene/WBGeneidentifier
- Example: http://www.wormbase.org/species/c_elegans/gene/WBGene00010593
- dictyBase: http://dictybase.org/gene/DDB_dictygeneid
- Example: http://dictybase.org/gene/DDB_G0283903
- TAIR: ??
- Example: http://www.arabidopsis.org/servlets/TairObject?id=26899&type=locus
5) Link out paper IDs in the display to PubMed or MOD
a) PubMed (PMID): http://www.ncbi.nlm.nih.gov/pubmed/paperidentifier
- Example: http://www.ncbi.nlm.nih.gov/pubmed/24146615
b) WormBase: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?action=Search+!&data_number=identifier
- Example: http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/paper_display.cgi?action=Search+!&data_number=00037685
c) dictyBase:
d) TAIR: http://lu:8080/pubsearch/DisplayArticle?article_id=xxxxx
- Example: http://lu:8080/pubsearch/DisplayArticle?article_id=61657
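The link-outs in items 4 and 5 could be driven by a table of URL templates, sketched below. The URLs are the ones listed above; the dictionary keys and function names are hypothetical, the TAIR gene template is inferred from the single example URL, and no dictyBase paper URL is given, so that lookup returns None.

```python
# URL templates taken from the notes above; "{id}" marks the identifier.
GENE_URLS = {
    "UniProtKB": "http://www.uniprot.org/uniprot/{id}",
    "WB": "http://www.wormbase.org/species/c_elegans/gene/{id}",
    "dictyBase": "http://dictybase.org/gene/{id}",
    # Assumption: inferred from the one TAIR example URL.
    "TAIR": "http://www.arabidopsis.org/servlets/TairObject?id={id}&type=locus",
}
PAPER_URLS = {
    "PMID": "http://www.ncbi.nlm.nih.gov/pubmed/{id}",
    "WB": ("http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/"
           "paper_display.cgi?action=Search+!&data_number={id}"),
    "TAIR": "http://lu:8080/pubsearch/DisplayArticle?article_id={id}",
}


def _link(templates, source, identifier):
    """Fill in a template, or return None if the source has none."""
    template = templates.get(source)
    return template.format(id=identifier) if template else None


def gene_url(source, identifier):
    """Return the gene or gene product link-out URL, if known."""
    return _link(GENE_URLS, source, identifier)


def paper_url(source, identifier):
    """Return the paper link-out URL, if known."""
    return _link(PAPER_URLS, source, identifier)
```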
Populating synonyms
Protein Name in Sentence | Synonym in Column 4 of gpi File | Gene Product Symbol in Column 2 of gpi File | Displays on CCC Form |
---|---|---|---|
p25 | p25 | dynE | Yes |
Rab7 | rab7 | rab7A | Yes |
Hsp32 | hsp32 | hspC | Yes |
Snf12 | snf12 | snf12-1 | Yes |
WASH | WASH | wshA | No |
Rh50 | Rh50 | rhgA | No |
CBP4a | CBP4a (not certain this is in the synonyms column) | cbpD1 | No |
NumA1 | numA1 | numA | No |
Feedback on Form - TAIR
- Include link out to papers for TAIR. - See above.
- We had a link out in the old form, /home/azurebrd/public_html/cgi-bin/forms/tair, see line 438
- New link for new form: http://lu:8080/pubsearch/DisplayArticle?article_id=xxxxx where xxxxx = the article id.
- Add a Qualifier column to the annotation row of the curation form.
- Add this column between the evidence code column and the already curated column.
- The Qualifier column would have two fixed values, NOT and colocalizes_with.
- For the curation table ccc_sentenceannotation, add the ccc_qualifier column between ccc_with and ccc_alreadycurated:
ccc_mod text, ccc_file text, ccc_paper text, ccc_section text, ccc_sentnum text, ccc_geneproduct text, ccc_component text, ccc_goterm text, ccc_evidencecode text, ccc_with text, ccc_qualifier text, ccc_alreadycurated text, ccc_comment text, ccc_valid text, ccc_ptgoid text, ccc_curator text, ccc_timestamp
- The Qualifier value, when present, will also be sent to Protein2GO via web services.
- The relevant Protein2GO code is in line 344 in ccc.cgi
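A sketch of how the two fixed Qualifier values might be validated before being added to the fields sent to Protein2GO. It is written in Python for illustration (ccc.cgi is Perl), and the `qualifier=` parameter name is a placeholder assumption; the actual Protein2GO field name is not stated here.

```python
# The two fixed values named above.
ALLOWED_QUALIFIERS = ("NOT", "colocalizes_with")


def normalize_qualifier(value):
    """Return a valid qualifier, or None when none was selected."""
    if value is None or value.strip() == "":
        return None
    value = value.strip()
    if value not in ALLOWED_QUALIFIERS:
        raise ValueError(f"unsupported qualifier: {value!r}")
    return value


def add_qualifier_field(fields, qualifier):
    """Append the qualifier to the outgoing fields only when present."""
    checked = normalize_qualifier(qualifier)
    if checked is not None:
        # Assumption: "qualifier" is a hypothetical parameter name.
        fields.append(f"qualifier={checked}")
    return fields
```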
Archived Info - Old script explanation - Juancarlos
The script runs on Wednesdays at 2am and looks for a new SVM file. If a new SVM file is available, the Textpresso search is performed; if not, the search is skipped for that week.
The script that gets stuff to tazendra from textpresso is :
/home/postgres/public_html/cgi-bin/data/ccc_gocuration/get_newset.pl
called by :
/home/postgres/work/pgpopulation/textpresso/wrapper.sh
On textpresso there have been no new matches since Oct 12
The full results are on textpresso-dev at :
/data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/good_sentences_file.*
The new matches are at :
/data2/srv/textpresso-dev.caltech.edu/www/docroot/azurebrd/ccc_datafiles/recent_sentences_file.*
They're comparing to svm results from :
http://caprica.caltech.edu/celegans/svm_results/Juancarlos/otherexpr
You can check the recent good_sentences_file.* and see if any of those should be in SVM, or you can look at the full text of the SVM results and see if any of those should be in the good_sentences_file.* If you don't have a textpresso-dev account, you can ask Michael, and he'll ask the its people. I log on with my its account.
If stuff should be in the good_sentences_file.* and isn't, check that the categories have what they should have, and let me know which category isn't matching what it should.
The good_sentences_file.* is generated by :
/home/azurebrd/work/get_kimberly_go_gene_component_verb_localization/get_go_gene_component.pl
Now located here?
/home/postgres/work/textpresso/kimberly/get_go_gene_component.pl
The category files being used are in :
/data2/data-processing/data/celegans/Data/indices/body/semantic/categories
files :
protein_celegans localization_cell_components_082208 localization_verbs_082208 localization_other_120107
Archived Info - Old Curation Form
Relevant postgres tables:
ccc_gene_comp_go