Difference between revisions of "Specifications for CCC Curation from Textpresso Search Page"

From WormBaseWiki
 
(37 intermediate revisions by 2 users not shown)
Line 1: Line 1:
===Requirements for Using Textpresso Search Results in General CCC Curation===
+
==Summary==
  
These specifications are for allowing a curator to search any Textpresso implementation using the CCC categories, submit the resulting sentences to a curation form, make annotations, and download the annotations in a gene_association file format.
+
These specifications are for allowing a curator to search any Textpresso implementation using the CCC categories, submit the resulting sentences to a curation form, make annotations, and download the annotations in either a simple three-column format or the GO's gene_association file (GAF) format.
  
This pipeline would make use of the XML format of a returned sentence.  A partial XML version of a sample sentence from WBPaper00037859:
+
==Searches==
  
<textpresso_article>
+
Curators would be able to search Textpresso using their chosen criteria and export the sentences to the CCC curation form.
  <bibliography>
+
The XML mark-up of each returned sentence will be used to populate the curation boxes on the form and color-code the search results.
    <literature>C. elegans</literature>
 
    <doc_id>WBPaper00037859</doc_id>
 
    <year>2010-11-29 </year>
 
    <type>Journal_article </type>
 
    <title>C . elegans galectins LEC-6 and LEC-10 interact with similar glycoconjugates in the intestine . </title>
 
    <journal>J Biol Chem </journal>
 
    <volume></volume>
 
    <page></page>
 
    <author>MaduziaLL</author>
 
    <author>YuE</author>
 
    <author>ZhangY</author>
 
    <accession>PMID 21115491 </accession>
 
    <abstract>Galectins are a family of metazoan proteins that show binding to various -galactoside-containing glycans . Due to a lack of proper tools , the interaction of galectins with their specific glycan ligands in the cells and it issues are largely unknown . We have investigated the localization of galectin ligands in Caenorhabditis elegans using a novel technology that relies on the high binding specificity between galectins and their endogenous ligands . Fluorescently-labeled recombinant galectin fusions are found to bind to ligands located in diverse it issues including the intestine , pharynx , and the rectal valve . Consistent with their role as galactoside-binding proteins , the interaction with their ligands is inhibited by galactose or lactose . Two of the galectins , LEC-6 and LEC-10 , recognize ligands that co-localize along the intestinal lumen . The ligands for LEC-6 and LEC-10 are absent in three glycosylation mutants bre-1 , fut-8 and galt-1 , which have been shown to be required to synthesize the Gal-1 , 4-Fuc modifications of the core N-glycans unique to C . elegans and several other invertebrates . Both galectins pull down the same set of glycoproteins in a manner dependant on the presence of these carbohydrate modifications . Endogenous LEC-6 and LEC-10 are expressed in the intestinal cells , but they are localized to different subcellular compartments that do not appear to overlap with each other or with the location of their glycan targets . An altered subcellular distribution of these ligands is found in mutants lacking both galectins . These results suggest a model where LEC-6 and LEC-10 interact with glycoproteins through specific glycans to regulate their cellular fate . </abstract>
 
  </bibliography>
 
  
  <matching_sentences>
+
Curators will need an option to send results to a curation form with a way to name the search so they can select it in the list of source files on the curation form.
    <field_references>
 
  
    <sentence id="315" subscore="48">
+
''Future options for filtering search results may include filtering papers that are in the corpus but not appropriate for curation (e.g., WB's 'functional annotation papers' or TAIR's black list of papers on other organisms - Tanya sent me a list of 1243 papers like this) or restricting searches to a particular level of SVM classification (e.g., high-confidence SVM papers only). For the former, would we need a consistent tag across all implementations?  For the latter, what would be the best way to integrate the SVM classification results? Store them in postgres? Store them in Textpresso?''
      <content>The intestinal lumen border is partially outlined in white . at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from Figure 1 A B C SNAP-LEC fusion protein expressed in E . coli SNAP LEC Single site labeling with fluorescent substrate Immobilizing to SNAP capture resin In situ detection Protein pull down 6xHis 1 5 pMoles 17 23 30 58 46 175 80 7 KDa 2 3 1 5 4 6 7 Coomassie Infra-Red 800 Nickel resin purification at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from Figure 2 Fluorescence Brightfield Intestine Rectal valve Spermatheca Buccal cavityGrinder Coelomocyte 25 um * SNAP-LEC-1 SNAP-LEC-11 SNAP-LEC-10 SNAP-LEC-9 SNAP-LEC-6 A B C D E F SNAP-CGL2 G at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from SNAP-LEC-6 IFB-2 ( MH33 ) Merge with DIC A C SNAP-LEC-10 IFB-2 ( MH33 ) Merge with DIC B Figure 3 SNAP-LEC-10 SNAP-LEC-6 Merge with DIC 10 um Nucleus IFB-2 Glycocalyx LEC-6 / -10 staining signal D Lumen at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from wt bre-1 fut-8 galt-1 Figure 4 A B No sugar Fucose Galactose Lactose 100 um bre-1 fut-8 galt-1 Fucose Mannose Galactose GlcNAc C Core N-Glycan Gal-1 , 4-Fuc at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from 175 80 KDa + - + - + - wt bre-1 fut-8 wt br e-1 fut-8 Lactose Input Pulldown with SNAP-LEC-6 A Figure 5 B Gene model Number of peptides identified by Mass-spec Predicted protein properties Bead LEC-6 LEC-6 LEC-10 LEC-10 Protein Domains Amino acids N- glycosylation sites - + - + - F28B4 . 3 0 3 40 0 27 CE07153 MD super family , C-type lectin , VWF type A 2229 12 F40F4 . 6 0 0 53 0 15 CE04536 MD super family , C-type lectin , VWF type A 2214 12 T25C12 . 3 0 0 26 0 12 CE40820 MD super family , C-type lectin , VWF type A 2103 F57F4 . 3 / 4 0 0 14 0 9 CE11342 CE11344 21 ET modules 2153 21 C F28B4 . 
3 1 2229 aa 2000 1500 1000 500 MD MD vWFA vWFA CLEC MW MW 1 2 3 4 5 6 7 8 9 10 11 6 at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from 40 um * * wt lec-6lec-6 lec-10wt lec-10lec-6 lec-10 A Figure 6 Anti-LEC-6 IFB-2 ( MH33 ) Merge ( LEC-6 = green ) B D IFB-2 ( MH33 ) C Anti-LEC-6 Anti-LEC-10 LEC-6 : : GFP LEC-10 : : GFP 100 um 17 25 30 58 46 80 7 KDa E LEC-10 : : GFP DIC Merge ( GFP = Green ) Merge with DIC Merge with DIC Anti-LEC-10 Merge ( LEC-10 = green ) at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from lec-6 lec-6 ; lec-10 lec-10 wt SNAP-LEC-6 wt lec-6 ; lec-10 A B 40 um SNAP -LEC-6 SNAP -LEC-6 ( Green ) MH33 ( Red ) Figure 7 100 um MH33 at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from </content>
 
      <anatomy_celegans start_pos="3" end_pos="4">intestinal lumen</anatomy_celegans>
 
      <localization_experimental_082208 start_pos="7" end_pos="7">partially</localization_experimental_082208>
 
      <spatial_relation start_pos="12" end_pos="12">at</spatial_relation>
 
      <tables_and_figures start_pos="30" end_pos="30">Figure</tables_and_figures>
 
      <anatomy_celegans start_pos="33" end_pos="33">B</anatomy_celegans>
 
      <anatomy_celegans start_pos="34" end_pos="34">C</anatomy_celegans>
 
      <consort start_pos="36" end_pos="36">fusion</consort>
 
      <localization_experimental_082208 start_pos="36" end_pos="36">fusion</localization_experimental_082208>
 
      <mf_int_assay_2009-11-25 start_pos="36" end_pos="36">fusion</mf_int_assay_2009-11-25>
 
      <sequence start_pos="36" end_pos="36">fusion</sequence>
 
      <localization_experimental_082208 start_pos="37" end_pos="37">protein</localization_experimental_082208>
 
      <mf_int_assay_2009-11-25 start_pos="37" end_pos="37">protein</mf_int_assay_2009-11-25>
 
      <molecular_function start_pos="37" end_pos="37">protein</molecular_function>
 
      <sequence start_pos="37" end_pos="37">protein</sequence>
 
      <biological_process start_pos="38" end_pos="38">expressed</biological_process>
 
      <localization_verbs_082208 start_pos="38" end_pos="38">expressed</localization_verbs_082208>
 
      <anatomy_celegans start_pos="40" end_pos="40">E</anatomy_celegans>
 
      <organism start_pos="40" end_pos="42">E _PRD_ coli</organism>
 
      <transporter_activity start_pos="43" end_pos="43">SNAP</transporter_activity>
 
      <sequence start_pos="45" end_pos="45">Single</sequence>
 
      <localization start_pos="46" end_pos="46">site</localization>
 
      <mf_int_assay_2009-11-25 start_pos="46" end_pos="46">site</mf_int_assay_2009-11-25>
 
      <sequence start_pos="46" end_pos="46">site</sequence>
 
      <localization_verbs_082208 start_pos="47" end_pos="47">labeling</localization_verbs_082208>
 
      <method start_pos="47" end_pos="47">labeling</method>
 
      <localization_experimental_082208 start_pos="49" end_pos="49">fluorescent</localization_experimental_082208>
 
      <transporter_activity start_pos="53" end_pos="53">SNAP</transporter_activity>
 
      <localization_experimental_082208 start_pos="58" end_pos="58">detection</localization_experimental_082208>
 
      <localization_experimental_082208 start_pos="59" end_pos="59">Protein</localization_experimental_082208>
 
      <mf_int_assay_2009-11-25 start_pos="59" end_pos="59">Protein</mf_int_assay_2009-11-25>
 
      <molecular_function start_pos="59" end_pos="59">Protein</molecular_function>
 
      <sequence start_pos="59" end_pos="59">Protein</sequence>
 
      <mf_int_verbs_2009-11-25 start_pos="60" end_pos="60">pull</mf_int_verbs_2009-11-25>
 
      <mf_int_assay_2009-11-25 start_pos="61" end_pos="61">down</mf_int_assay_2009-11-25>
 
      <mf_int_assay_2009-11-25 start_pos="62" end_pos="62">6xHis</mf_int_assay_2009-11-25>
 
      <chemical_drug_2010-03-19 start_pos="85" end_pos="85">Nickel</chemical_drug_2010-03-19>
 
      <method start_pos="87" end_pos="87">purification</method>
 
      <spatial_relation start_pos="88" end_pos="88">at</spatial_relation>
 
      <tables_and_figures start_pos="106" end_pos="106">Figure</tables_and_figures>
 
      <localization_experimental_082208 start_pos="108" end_pos="108">Fluorescence</localization_experimental_082208>
 
      <anatomy_celegans start_pos="110" end_pos="110">Intestine</anatomy_celegans>
 
      <anatomy_dmelanogaster start_pos="110" end_pos="110">Intestine</anatomy_dmelanogaster>
 
      <anatomy_dmelanogaster start_pos="111" end_pos="112">Rectal valve</anatomy_dmelanogaster>
 
      <anatomy_celegans start_pos="113" end_pos="113">Spermatheca</anatomy_celegans>
 
      <anatomy_celegans start_pos="116" end_pos="116">Coelomocyte</anatomy_celegans>
 
      <anatomy_celegans start_pos="126" end_pos="126">B</anatomy_celegans>
 
      <anatomy_celegans start_pos="127" end_pos="127">C</anatomy_celegans>
 
      <anatomy_celegans start_pos="128" end_pos="128">D</anatomy_celegans>
 
      <anatomy_celegans start_pos="129" end_pos="129">E</anatomy_celegans>
 
      <anatomy_celegans start_pos="130" end_pos="130">F</anatomy_celegans>
 
      <chemical_drug_2010-03-19 start_pos="130" end_pos="130">F</chemical_drug_2010-03-19>
 
      <spatial_relation start_pos="133" end_pos="133">at</spatial_relation>
 
      <molecular_function start_pos="152" end_pos="152">IFB-2</molecular_function>
 
      <protein_celegans start_pos="152" end_pos="152">IFB-2</protein_celegans>
 
      <anatomy_celegans start_pos="160" end_pos="160">C</anatomy_celegans>
 
      <molecular_function start_pos="162" end_pos="162">IFB-2</molecular_function>
 
      <protein_celegans start_pos="162" end_pos="162">IFB-2</protein_celegans>
 
      <anatomy_celegans start_pos="169" end_pos="169">B</anatomy_celegans>
 
      <tables_and_figures start_pos="170" end_pos="170">Figure</tables_and_figures>
 
      <anatomy_celegans start_pos="179" end_pos="179">Nucleus</anatomy_celegans>
 
      <cell_part start_pos="179" end_pos="179">Nucleus</cell_part>
 
      <localization_cell_components_082208 start_pos="179" end_pos="179">Nucleus</localization_cell_components_082208>
 
      <organelle start_pos="179" end_pos="179">Nucleus</organelle>
 
      <molecular_function start_pos="180" end_pos="180">IFB-2</molecular_function>
 
      <protein_celegans start_pos="180" end_pos="180">IFB-2</protein_celegans>
 
      <cell_part start_pos="181" end_pos="181">Glycocalyx</cell_part>
 
      <molecular_function start_pos="182" end_pos="182">LEC-6</molecular_function>
 
      <protein_celegans start_pos="182" end_pos="182">LEC-6</protein_celegans>
 
      <localization_experimental_082208 start_pos="185" end_pos="185">staining</localization_experimental_082208>
 
      <localization_verbs_082208 start_pos="185" end_pos="185">staining</localization_verbs_082208>
 
      <method start_pos="185" end_pos="185">staining</method>
 
      <biological_process start_pos="186" end_pos="186">signal</biological_process>
 
      <localization_experimental_082208 start_pos="186" end_pos="186">signal</localization_experimental_082208>
 
      <anatomy_celegans start_pos="187" end_pos="187">D</anatomy_celegans>
 
      <spatial_relation start_pos="189" end_pos="189">at</spatial_relation>
 
      <phenotype_celegans start_pos="207" end_pos="207">wt</phenotype_celegans>
 
      <gene_celegans start_pos="208" end_pos="208">bre-1</gene_celegans>
 
      <gene_celegans start_pos="209" end_pos="209">fut-8</gene_celegans>
 
      <gene_celegans start_pos="210" end_pos="210">galt-1</gene_celegans>
 
      <tables_and_figures start_pos="211" end_pos="211">Figure</tables_and_figures>
 
      <anatomy_celegans start_pos="214" end_pos="214">B</anatomy_celegans>
 
      <gene_celegans start_pos="222" end_pos="222">bre-1</gene_celegans>
 
      <gene_celegans start_pos="223" end_pos="223">fut-8</gene_celegans>
 
      <gene_celegans start_pos="224" end_pos="224">galt-1</gene_celegans>
 
      <anatomy_celegans start_pos="229" end_pos="229">C</anatomy_celegans>
 
      <localization start_pos="230" end_pos="230">Core</localization>
 
      <virion_part start_pos="230" end_pos="230">Core</virion_part>
 
      <spatial_relation start_pos="235" end_pos="235">at</spatial_relation>
 
      <phenotype_celegans start_pos="262" end_pos="262">wt</phenotype_celegans>
 
      <gene_celegans start_pos="263" end_pos="263">bre-1</gene_celegans>
 
      <gene_celegans start_pos="264" end_pos="264">fut-8</gene_celegans>
 
      <phenotype_celegans start_pos="265" end_pos="265">wt</phenotype_celegans>
 
      <gene_celegans start_pos="268" end_pos="268">fut-8</gene_celegans>
 
      <mf_int_assay_2009-11-25 start_pos="271" end_pos="271">Pulldown</mf_int_assay_2009-11-25>
 
      <tables_and_figures start_pos="275" end_pos="275">Figure</tables_and_figures>
 
      <anatomy_celegans start_pos="277" end_pos="277">B</anatomy_celegans>
 
      <gene start_pos="278" end_pos="278">Gene</gene>
 
      <sequence start_pos="278" end_pos="278">Gene</sequence>
 
      <localization_verbs_082208 start_pos="283" end_pos="283">identified</localization_verbs_082208>
 
      <mf_int_verbs_2009-11-25 start_pos="283" end_pos="283">identified</mf_int_verbs_2009-11-25>
 
      <mf_int_verbs_2009-11-25 start_pos="286" end_pos="286">Predicted</mf_int_verbs_2009-11-25>
 
      <sequence start_pos="286" end_pos="286">Predicted</sequence>
 
      <localization_experimental_082208 start_pos="287" end_pos="287">protein</localization_experimental_082208>
 
      <mf_int_assay_2009-11-25 start_pos="287" end_pos="287">protein</mf_int_assay_2009-11-25>
 
      <molecular_function start_pos="287" end_pos="287">protein</molecular_function>
 
      <sequence start_pos="287" end_pos="287">protein</sequence>
 
      <molecular_function start_pos="290" end_pos="290">LEC-6</molecular_function>
 
      <protein_celegans start_pos="290" end_pos="290">LEC-6</protein_celegans>
 
      <molecular_function start_pos="291" end_pos="291">LEC-6</molecular_function>
 
      <protein_celegans start_pos="291" end_pos="291">LEC-6</protein_celegans>
 
      <protein_celegans start_pos="292" end_pos="292">LEC-10</protein_celegans>
 
      <protein_celegans start_pos="293" end_pos="293">LEC-10</protein_celegans>
 
      <localization_experimental_082208 start_pos="294" end_pos="294">Protein</localization_experimental_082208>
 
      <mf_int_assay_2009-11-25 start_pos="294" end_pos="294">Protein</mf_int_assay_2009-11-25>
 
      <molecular_function start_pos="294" end_pos="294">Protein</molecular_function>
 
      <sequence start_pos="294" end_pos="294">Protein</sequence>
 
      <entity_feature start_pos="295" end_pos="295">Domains</entity_feature>
 
      <mf_int_assay_2009-11-25 start_pos="295" end_pos="295">Domains</mf_int_assay_2009-11-25>
 
      <mf_int_assay_2009-11-25 start_pos="296" end_pos="296">Amino</mf_int_assay_2009-11-25>
 
      <entity_feature start_pos="296" end_pos="297">Amino acids</entity_feature>
 
      <localization start_pos="300" end_pos="300">sites</localization>
 
      <mf_int_assay_2009-11-25 start_pos="300" end_pos="300">sites</mf_int_assay_2009-11-25>
 
      <sequence start_pos="300" end_pos="300">sites</sequence>
 
      <gene_celegans start_pos="306" end_pos="308">F28B4 _PRD_ 3</gene_celegans>
 
      <protein_celegans start_pos="314" end_pos="314">CE07153</protein_celegans>
 
      <binding start_pos="320" end_pos="320">lectin</binding>
 
      <biological_adhesion start_pos="320" end_pos="320">lectin</biological_adhesion>
 
      <cellular_process start_pos="320" end_pos="320">lectin</cellular_process>
 
      <gene_celegans start_pos="327" end_pos="329">F40F4 _PRD_ 6</gene_celegans>
 
      <protein_celegans start_pos="335" end_pos="335">CE04536</protein_celegans>
 
      <binding start_pos="341" end_pos="341">lectin</binding>
 
      <biological_adhesion start_pos="341" end_pos="341">lectin</biological_adhesion>
 
      <cellular_process start_pos="341" end_pos="341">lectin</cellular_process>
 
      <gene_celegans start_pos="348" end_pos="350">T25C12 _PRD_ 3</gene_celegans>
 
      <protein_celegans start_pos="356" end_pos="356">CE40820</protein_celegans>
 
      <binding start_pos="362" end_pos="362">lectin</binding>
 
      <biological_adhesion start_pos="362" end_pos="362">lectin</biological_adhesion>
 
      <cellular_process start_pos="362" end_pos="362">lectin</cellular_process>
 
      <gene_celegans start_pos="368" end_pos="370">F57F4 _PRD_ 3</gene_celegans>
 
      <protein_celegans start_pos="378" end_pos="378">CE11342</protein_celegans>
 
      <protein_celegans start_pos="379" end_pos="379">CE11344</protein_celegans>
 
      <anatomy_celegans start_pos="385" end_pos="385">C</anatomy_celegans>
 
      <gene_celegans start_pos="386" end_pos="388">F28B4 _PRD_ 3</gene_celegans>
 
      <spatial_relation start_pos="415" end_pos="415">at</spatial_relation>
 
      <phenotype_celegans start_pos="437" end_pos="437">wt</phenotype_celegans>
 
      <gene_celegans start_pos="441" end_pos="441">lec-10</gene_celegans>
 
      <tables_and_figures start_pos="443" end_pos="443">Figure</tables_and_figures>
 
      <molecular_function start_pos="446" end_pos="446">IFB-2</molecular_function>
 
      <protein_celegans start_pos="446" end_pos="446">IFB-2</protein_celegans>
 
      <molecular_function start_pos="452" end_pos="452">LEC-6</molecular_function>
 
      <protein_celegans start_pos="452" end_pos="452">LEC-6</protein_celegans>
 
      <anatomy_celegans start_pos="456" end_pos="456">B</anatomy_celegans>
 
      <anatomy_celegans start_pos="457" end_pos="457">D</anatomy_celegans>
 
      <molecular_function start_pos="458" end_pos="458">IFB-2</molecular_function>
 
      <protein_celegans start_pos="458" end_pos="458">IFB-2</protein_celegans>
 
      <anatomy_celegans start_pos="462" end_pos="462">C</anatomy_celegans>
 
      <molecular_function start_pos="465" end_pos="465">LEC-6</molecular_function>
 
      <protein_celegans start_pos="465" end_pos="465">LEC-6</protein_celegans>
 
      <localization_experimental_082208 start_pos="468" end_pos="468">GFP</localization_experimental_082208>
 
      <reporter_gene_celegans start_pos="468" end_pos="468">GFP</reporter_gene_celegans>
 
      <protein_celegans start_pos="469" end_pos="469">LEC-10</protein_celegans>
 
      <localization_experimental_082208 start_pos="472" end_pos="472">GFP</localization_experimental_082208>
 
      <reporter_gene_celegans start_pos="472" end_pos="472">GFP</reporter_gene_celegans>
 
      <anatomy_celegans start_pos="483" end_pos="483">E</anatomy_celegans>
 
      <protein_celegans start_pos="484" end_pos="484">LEC-10</protein_celegans>
 
      <localization_experimental_082208 start_pos="487" end_pos="487">GFP</localization_experimental_082208>
 
      <reporter_gene_celegans start_pos="487" end_pos="487">GFP</reporter_gene_celegans>
 
      <localization_experimental_082208 start_pos="491" end_pos="491">GFP</localization_experimental_082208>
 
      <reporter_gene_celegans start_pos="491" end_pos="491">GFP</reporter_gene_celegans>
 
      <protein_celegans start_pos="504" end_pos="504">LEC-10</protein_celegans>
 
      <spatial_relation start_pos="508" end_pos="508">at</spatial_relation>
 
      <gene_celegans start_pos="526" end_pos="526">lec-6</gene_celegans>
 
      <gene_celegans start_pos="527" end_pos="527">lec-6</gene_celegans>
 
      <gene_celegans start_pos="529" end_pos="529">lec-10</gene_celegans>
 
      <gene_celegans start_pos="530" end_pos="530">lec-10</gene_celegans>
 
      <phenotype_celegans start_pos="531" end_pos="531">wt</phenotype_celegans>
 
      <phenotype_celegans start_pos="533" end_pos="533">wt</phenotype_celegans>
 
      <gene_celegans start_pos="534" end_pos="534">lec-6</gene_celegans>
 
      <gene_celegans start_pos="536" end_pos="536">lec-10</gene_celegans>
 
      <anatomy_celegans start_pos="538" end_pos="538">B</anatomy_celegans>
 
      <transporter_activity start_pos="541" end_pos="541">SNAP</transporter_activity>
 
      <transporter_activity start_pos="543" end_pos="543">SNAP</transporter_activity>
 
      <localization_experimental_082208 start_pos="550" end_pos="550">Red</localization_experimental_082208>
 
      <tables_and_figures start_pos="552" end_pos="552">Figure</tables_and_figures>
 
      <spatial_relation start_pos="557" end_pos="557">at</spatial_relation>
 
    </sentence>
 
    </field_references>
 
    <field_results>
 
  
    <sentence id="78" subscore="15">
+
HMM: If you don't want to store it anywhere, there should be a text field (to be filled with a URL, a filename on a local computer, or the actual list of papers) that indicates a list of papers (one ID per line, perhaps) that should be excluded. That way the filtering is highly customizable. There might be cases where each curator wants to exclude a different set of papers or change that set on a daily basis, and then it can be very tedious to keep track of everybody's needs.
      <content>Closer examination reveals that some LEC-10 : : GFP at CALIFORNIA INSTITUTE OF TECHNOLOGY , on December 9 , 2010 www . jbc . org Downloaded from 8 signal is concentrated in vesicular structures in the apical cytoplasm of the intestinal cells ( Figure 6C ) while LEC-6 : : GFP signal is mostly diffuse in the cytoplasm . </content>
+
 
      <spatial_relation start_pos="2" end_pos="2">Closer</spatial_relation>
+
KMV: Yes, I agree that the filtering should be as customizable as possible since curators will likely have different needs for their searches.  With respect to the list of 1243 papers that Tanya sent me, I think these papers are akin to WormBase papers that are marked for 'functional annotation' only in that they (TAIR) currently don't want them included in any of their searches.  For this, it seems that a tag or field that marks them appropriately in the database would be best, as opposed to always having a curator select or de-select a button on the search form. 
      <localization_experimental_082208 start_pos="3" end_pos="3">examination</localization_experimental_082208>
+
 
      <localization_verbs_082208 start_pos="4" end_pos="4">reveals</localization_verbs_082208>
+
HMM: OK, then we have both ways: some filtering happens using information from a database, and there is an option to use a local list (file) customized by a curator.
      <protein_celegans start_pos="7" end_pos="7">LEC-10</protein_celegans>
+
 
      <localization_experimental_082208 start_pos="10" end_pos="10">GFP</localization_experimental_082208>
+
'''KMV:  Okay, that will be good.'''
      <reporter_gene_celegans start_pos="10" end_pos="10">GFP</reporter_gene_celegans>
+
 
      <spatial_relation start_pos="11" end_pos="11">at</spatial_relation>
+
'''For the SVM results, it seems that the information should also be stored, so curators could have the option to select what level of confidence they might want for their Textpresso searches, and even the option to search only the predicted negatives.  Closer integration of Textpresso searches with SVM results will be very helpful.  This gets done programmatically for CCC right now and manually (by me) for the macromolecular interactions.  For the long term, having the option to combine the SVMs with both Textpresso searches and with the HMMs would be ideal.'''
      <biological_process start_pos="30" end_pos="30">signal</biological_process>
+
 
      <localization_experimental_082208 start_pos="30" end_pos="30">signal</localization_experimental_082208>
+
HMM: I'll put that on Textpresso's to-do list.
      <localization_verbs_082208 start_pos="32" end_pos="32">concentrated</localization_verbs_082208>
+
 
      <localization_cell_components_082208 start_pos="34" end_pos="34">vesicular</localization_cell_components_082208>
+
'''KMV:  Great.  I'm still thinking about how best to handle the already long, and continuously growing, list of SVM results.  The confidence level and data type lists are fixed and relatively small, respectively, but we'll need a way to deal with the individual SVM results:'''
      <localization start_pos="38" end_pos="38">apical</localization>
+
 
      <localization_cell_components_082208 start_pos="38" end_pos="38">apical</localization_cell_components_082208>
+
http://caprica.caltech.edu/celegans/svm_results/
      <cell_part start_pos="39" end_pos="39">cytoplasm</cell_part>
+
 
      <localization_cell_components_082208 start_pos="39" end_pos="39">cytoplasm</localization_cell_components_082208>
+
'''In practice, how would curators select which results to search?  For example, suppose I wanted to perform a Textpresso search on only those papers that were high confidence for other_expr in the last six months of 2010.'''
      <anatomy_celegans start_pos="42" end_pos="43">intestinal cells</anatomy_celegans>
+
 
      <anatomy start_pos="43" end_pos="43">cells</anatomy>
+
 
      <cell start_pos="43" end_pos="43">cells</cell>
+
HMM: Still need to figure out where the filtering based on database information should happen. Is Juancarlos filtering them out in the CCC form?
      <tables_and_figures start_pos="45" end_pos="45">Figure</tables_and_figures>
+
 
      <molecular_function start_pos="49" end_pos="49">LEC-6</molecular_function>
+
'''KMV: For the next TAIR search, perhaps we could just filter the sentences before they get included into the source files used for the curation form.''' 
      <protein_celegans start_pos="49" end_pos="49">LEC-6</protein_celegans>
+
 
      <localization_experimental_082208 start_pos="52" end_pos="52">GFP</localization_experimental_082208>
+
'''This issue is not such a big deal for the current elegans CCC curation, since the number of 'functional annotation'-only papers in postgres has plummeted.  This issue is a bigger deal, though, when WB curators want to search the whole WB corpus for something.'''
      <reporter_gene_celegans start_pos="52" end_pos="52">GFP</reporter_gene_celegans>
+
 
      <biological_process start_pos="53" end_pos="53">signal</biological_process>
+
==From Searches to the Curation Form==
      <localization_experimental_082208 start_pos="53" end_pos="53">signal</localization_experimental_082208>
+
 
      <biological_process start_pos="56" end_pos="56">diffuse</biological_process>
+
This pipeline would make use of the XML format of returned sentences to construct a version of the current CCC curation form; a version for TAIR is shown here:
      <localization_experimental_082208 start_pos="56" end_pos="56">diffuse</localization_experimental_082208>
+
 
      <cell_part start_pos="59" end_pos="59">cytoplasm</cell_part>
+
http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi
      <localization_cell_components_082208 start_pos="59" end_pos="59">cytoplasm</localization_cell_components_082208>
+
 
    </sentence>
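The curator-supplied exclusion list proposed in the Searches discussion (one paper ID per line), applied before sentences reach the source files for the curation form, could look roughly like the sketch below. It is only an illustration: the function names and the two excluded paper IDs are hypothetical, and where the list comes from (URL, local file, or pasted text) is left open.

```python
def load_excluded_ids(lines):
    """Return the set of paper IDs in an exclusion list (one ID per line)."""
    return {line.strip() for line in lines if line.strip()}


def filter_hits(hits, excluded_ids):
    """Keep only (paper_id, sentence) hits whose paper is not excluded."""
    return [(paper_id, text) for paper_id, text in hits if paper_id not in excluded_ids]


# Hypothetical exclusion list; blank lines are ignored.
excluded = load_excluded_ids(["WBPaper00000001\n", "\n", "WBPaper00000002\n"])

hits = [
    ("WBPaper00037859", "LEC-6 : : GFP signal is mostly diffuse in the cytoplasm ."),
    ("WBPaper00000001", "placeholder sentence from an excluded paper"),
]
kept = filter_hits(hits, excluded)
```

Database-driven filtering (for example, papers tagged for 'functional annotation' only) could feed the same set of IDs, so both filtering routes would share one code path.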
+
1) Keep three action buttons on the top - Submit, Search for Sentence, Search for Paper
 +
 
 +
2) Display all information from within the <bibliography> tags.
 +
 
 +
''This will display more information than is currently shown, but seemed easier than just picking a few tags from within the bibliography section.  The additional information may also be helpful to curators.''
 +
 
 +
3) Do not display information within any <field_references> tags.
 +
 
 +
''Is this how scrambled sentences are typically identified?''
 +
 
 +
HMM: No. Arun wrote a function that uses heuristics such as sentence length and repetitive patterns within the sentence.
 +
 
 +
KMV:  Okay, got it.  I had forgotten how that worked.  Thanks.
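Items 2 and 3 could be sketched as follows, using Python's standard xml.etree.ElementTree. This is a minimal sketch, not the form's actual code: the XML is abridged from the sample sentence for WBPaper00037859, the function names are hypothetical, and it assumes <field_references> and <field_results> both sit directly under <matching_sentences>, as in that sample.

```python
import xml.etree.ElementTree as ET

# Abridged from the WBPaper00037859 sample; content is shortened for illustration.
SAMPLE = """<textpresso_article>
  <bibliography>
    <doc_id>WBPaper00037859</doc_id>
    <journal>J Biol Chem </journal>
    <author>MaduziaLL</author>
    <author>YuE</author>
  </bibliography>
  <matching_sentences>
    <field_references>
      <sentence id="315" subscore="48">
        <content>The intestinal lumen border is partially outlined in white .</content>
      </sentence>
    </field_references>
    <field_results>
      <sentence id="78" subscore="15">
        <content>LEC-6 : : GFP signal is mostly diffuse in the cytoplasm . </content>
      </sentence>
    </field_results>
  </matching_sentences>
</textpresso_article>"""


def bibliography_fields(article):
    """Return every (tag, text) pair inside <bibliography>, in document order."""
    bib = article.find("bibliography")
    return [(child.tag, (child.text or "").strip()) for child in bib]


def curatable_sentences(article):
    """Return sentences under <field_results> only; <field_references> is skipped."""
    path = "./matching_sentences/field_results/sentence"
    return [
        {"id": s.get("id"), "content": (s.findtext("content") or "").strip()}
        for s in article.findall(path)
    ]


article = ET.fromstring(SAMPLE)
fields = bibliography_fields(article)
sentences = curatable_sentences(article)
```

Returning all bibliography children rather than a fixed subset matches the note under item 2 that displaying everything is simpler than picking individual tags.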
 +
 
 +
4) Potentially curatable sentences are found within the <field_results> tags.  Working from left to right on the curation form, here is how the information from the XML would translate to the curation form.
 +
 
 +
 
 +
First box, Gene/Protein Name
 +
 
 +
This box lists all entities within the species-specific protein or gene tag.  Right now, the category name for this will be different for each implementation, for example:
 +
 
 +
protein_celegans
 +
 
 +
genes_arabidopsis
 +
 
 +
dicty_genes
 +
 
 +
''For TAIR we displayed both the name of the gene as presented in the sentence, as well as the name in the sentence mapped to a specific TAIR gene ID.  This is because some Arabidopsis gene names are used for more than one locus and the TAIR curators wanted a way to ensure they'd be making the annotation to the correct locus.  It would be fine to implement this universally, I think.  This requires a mapping file from each group and a regular pipeline for updating the gene or protein names file.''
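One way the first box might be filled is sketched below. The category tag names are the three listed above. The implementation keys ("celegans", "arabidopsis", "dicty"), the function name, and the mapped IDs are hypothetical placeholders, and the name-to-ID mapping stands in for the mapping file each group would supply.

```python
import xml.etree.ElementTree as ET

# Gene/protein category tag per implementation (tag names from the list above;
# the dictionary keys are illustrative labels only).
GENE_CATEGORY = {
    "celegans": "protein_celegans",
    "arabidopsis": "genes_arabidopsis",
    "dicty": "dicty_genes",
}


def gene_box_entries(sentence, implementation, name_to_ids=None):
    """List each distinct entity name in the sentence with any IDs it maps to."""
    tag = GENE_CATEGORY[implementation]
    name_to_ids = name_to_ids or {}
    entries, seen = [], set()
    for element in sentence.findall(tag):
        name = (element.text or "").strip()
        if name and name not in seen:
            seen.add(name)
            entries.append((name, name_to_ids.get(name, [])))
    return entries


# Mark-up abridged from sentence 78 of the sample.
SENTENCE = ET.fromstring("""<sentence id="78" subscore="15">
  <protein_celegans start_pos="7" end_pos="7">LEC-10</protein_celegans>
  <protein_celegans start_pos="49" end_pos="49">LEC-6</protein_celegans>
  <cell_part start_pos="59" end_pos="59">cytoplasm</cell_part>
</sentence>""")

# Hypothetical mapping; a name used for more than one locus would list several IDs.
entries = gene_box_entries(SENTENCE, "celegans", {"LEC-6": ["HYPOTHETICAL_ID_1"]})
```

Showing every ID a name maps to, rather than picking one, leaves the choice of locus to the curator, which is the behavior the TAIR curators asked for.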
 +
 
 +
 
 +
Second box, Component Term in Sentence
 +
 
 +
This box lists the component term as identified by the component category.
 +
 
 +
For the various implementations there are currently several component categories.
 +
 
 +
For elegans and Arabidopsis:
 +
 
 +
localization_cell_components_082208
 +
CCC_TAIR
 +
 
 +
For dicty:
 +
 
 +
localization_cell_components_050808
 +
localization_cell_components_082208
 +
 
 +
''If more than one cellular component category is used in the mark-up, we will need to figure out how to determine which one the curator wants to display on the curation form.  Alternatively, we could restrict each implementation to only one component category at a time.''
 +
 
 +
HMM: This shouldn't be a problem unless I don't understand what you are saying.
 +
 
 +
'''KMV:  Okay, I was referring to a situation where there might be a category localization_cell_components_082208 as well as CCC_TAIR and whether or not it'd be difficult to discern which one the curator wanted to display in the curation form.  Could the information in the category selection boxes be used for this?'''
 +
 
 +
 
 +
'''HMM: Selection box should work.'''
 +
 
 +
''Currently, this box only displays the component terms in sentences that matched the category.  It might be helpful, though, if in the future curators could enter new component terms displayed in the sentence but not in the category and have this entry somehow be added to the category for the next mark-up.''
 +
 
 +
 
 +
Third box, CC Term in GO
 +
 
 +
This box displays, if available, any GO terms that have already been curated from component terms in sentences, or allows curators to enter a new GO term, if needed.
 +
 
 +
''The suggested GO terms come from the relationship index that has been built up over time from elegans curation.  I think it'd be good to use one relationship index for all implementations to leverage community efforts.''
 +
 
 +
==Curation Actions and Sentence Display==
 +
 
 +
The color-key displayed at the top of the sentence can stay.
 +
 
 +
''Additional options for sentence display include being able to see surrounding sentences and being able to link out to the corresponding PDF with the appropriate sentences marked-up in the PDF.''
 +
 
 +
The curator options are:
 +
 
 +
Curating
 +
 
 +
To make a new GO annotation, curators need to select, or create, an entry in each of the three boxes on the left-hand side of the form. Once a selection is made, the region will be highlighted in blue. For the last column, CC term in GO, curators can either select one of the suggested GO terms in the list (where is this relationship file located, i.e. its path --K), or enter a new one. Then select curate from the list of radio buttons above the sentence and, if you are ready to enter your annotations, click on Make connections:Submit at the top of the page.
 +
 
 +
If there is no list of GO terms in the third box, that means that a GO annotation using the term in the second box has not yet been made. In this case, the curator will need to enter the new GO term manually.
 +
 
 +
''An autocomplete function for adding new GO terms would be good here.  Also, we need a more efficient way to handle having to make duplicate annotations from one sentence.  Right now I reload the sentence using the search sentence button, but it'd be good if curators could make more than one annotation from a sentence more easily.''
 +
 
 +
''The way the form currently works is that once a sentence has been annotated and the information submitted, the sentence is removed from the display.  Curators may want to be able to see what they've annotated though, so we might need to come up with a different strategy.  Also, since some of these source files can be quite large, curators may want to have a way to just pick up where they left off if we decide not to remove sentences after curation.''
 +
 
 +
 
 +
Marking Sentences Not Used for Curation
 +
 
 +
If an annotation cannot be made from a sentence, then curators may record the reason that an annotation was not made. Keeping track of these sentences will help build up a training set for improving search results.  The different options for not curating are described below:
 +
 
 +
 
 +
Already Curated/Already Done
 +
 
 +
If a curator does not wish to make another annotation for information previously curated, they can select the appropriate entity from each curation box and the Already Curated radio button.
 +
 
 +
''The way this was supposed to work for elegans was that the information about previous curation would be entered by curators using this form, stored in a postgres table and then, when next presented, displayed at the bottom after the Already Done heading in red.  I'd imagine that this feature is something groups may want to specify from the outset of their searches with options perhaps to include them in the curation boxes, show the information as Already Done but not include them in the curation boxes, or not show this information at all.''
 +
 
 +
 
 +
Scrambled Sentence
 +
 
 +
If, during the pdf-to-text conversion, a sentence has become scrambled, curators can mark it as such here. These are becoming much less frequent.
 +
 
 +
''Should we label this Scrambled Sentence/Run-on Sentence to collect both, split these two or not worry about the run-on sentences?  Having a run-on sentence does not necessarily preclude making an annotation, though, so these options might have to become check boxes instead of radio buttons.''
 +
 
 +
HMM: In Textpresso, scrambled sentences are identified by an algorithm. This is not a machine learning algorithm, so Textpresso has no use for the sentences marked scrambled other than to use the actual information in the interfaces.
 +
 
 +
'''KMV:  Okay, then it sounds like specifically marking sentences as scrambled doesn't help Textpresso.  I still think it'd be good for curators to distinguish scrambled sentences from non-scrambled false positives, though, as the non-scrambled false positives may be a good resource for category updating or finding terms to use to exclude sentences from the results.'''
 +
 
 +
'''HMM: It doesn't help in Textpresso's current implementation. However, if curators get annoyed with the current state of the art, we might start implementing a new algorithm, which most likely will be machine-learning based. For that we would need the scrambled sentences marked. So for future development, I'd like to continue to collect this information. (I actually don't know where they are stored right now; Juancarlos would need to help me on that.)'''
 +
 
 +
False Positive
 +
 
 +
If a returned sentence has nothing to do with subcellular localization, then it is marked as a false positive. For example:
 +
 
 +
In contrast, PP2AA3 rescues root tip organization weakly even when expression is driven by the RCN1 promoter, demonstrating a more stringent requirement for A subunit function in the root apical meristem.
 +
 
 +
''These sentences could be saved in a flat file possibly for future category refinement.''
 +
 
 +
 
 +
Not GO Curatable
 +
 
 +
This is intended to mark sentences that may describe subcellular localization, but the information contained in them would not normally be curated for GO. For example, the localization described is for a mutant protein, or the localization is for the wild-type protein in a mutant background. An example sentence:
 +
 
 +
No alteration in expression levels of soluble GFP or GFP::RAB-3 was observed in the synapses, cell body or axon in uba-1 animals (Figure 1A: d1-d6, f1-f6, 1C, 1D, Figure S4F, S4G, S4H).
 +
 
 +
''Like the false positives, these sentences could be used to refine searches, e.g. to develop lists of terms or HMMs, that could be used to filter potentially uncuratable sentences from the search results.''
 +
 
 +
==Dumping Annotations==
 +
 
 +
Annotations will be stored in postgres tables with two options available for dumping them out into tab-delimited text files.  These options would be presented as buttons on the curation form.
 +
 
 +
Here are examples of what would be contained in the files using the sample sentence below:
 +
 
 +
SentenceID 9 -- S 7 P 43065 S s3 E
 +
The first enzyme, gamma-glutamate cysteine ligase (GSH1), responsible for synthesis of gamma-glutamylcysteine (gamma-EC), is, in Arabidopsis, exclusively located in the plastids, whereas the second enzyme, glutathione synthetase (GSH2), is located in both plastids and cytosol.
 +
 
 +
 
 +
1) Three-column, tab-delimited format
 +
 
 +
'''1)''' 3-column user submission file
 +
 
 +
{| {{table border ="1"}}
 +
| align="center" style="background:#f0f0f0;"|Locus Name
 +
| align="center" style="background:#f0f0f0;"|GO ID
 +
| align="center" style="background:#f0f0f0;"|Paper ID
 +
|-
 +
|Column 1 of ftp mapping file||GO:0009536||PMID:nnnnnnnn or TAIR:43065
 +
|-
 +
|}
 +
 
 +
AT4G23100 GO:0009536 TAIR:43065
 +
 
 +
AT5G27380 GO:0009536 TAIR:43065
 +
 
 +
AT5G27380 GO:0005829 TAIR:43065
 +
 
 +
For column 1, map the gene symbol (second column) to the locus name (first column) in the gene_aliases file:
 +
 
 +
ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027
 +
 
 +
For column 2, use the geneontology.obo file to map the GO term to the GO ID:
 +
 
 +
http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo
 +
 
 +
For column 3, use the PMID (preferred) or the TAIR paper ID, prefaced with PMID: or TAIR: respectively. If both are given, separate them with a pipe.
 +
 
 +
This information could be taken from the bibliography XML.
 +
 
 +
 
 +
 
 +
2) 17-column, tab-delimited GAF 2.0
 +
 
 +
{| {{table border ="1"}}
 +
| align="center" style="background:#f0f0f0;"|'''Column'''
 +
| align="center" style="background:#f0f0f0;"|'''Content'''
 +
| align="center" style="background:#f0f0f0;"|'''Required'''
 +
| align="center" style="background:#f0f0f0;"|'''Cardinality'''
 +
| align="center" style="background:#f0f0f0;"|'''TAIR Entry'''
 +
|-
 +
|1||DB||Required||1||TAIR
 +
|-
 +
|2||DB Object ID||Required||1||Column 1 of gene_aliases file
 +
|-
 +
|3||DB Object Symbol||Required||1||Column 2 gene_aliases file
 +
|-
 +
|4||Qualifier||Optional||0 or greater||NULL
 +
|-
 +
|5||GO ID||Required||1||GO:0005654
 +
|-
 +
|6||DB Reference||Required||1 or greater||PMID:21074051 or, if no PMID, TAIR:42184
 +
|-
 +
|7||Evidence Code||Required||1||IDA
 +
|-
 +
|8||With or From||Optional||0 or greater||NULL
 +
|-
 +
|9||Aspect||Required||1||C
 +
|-
 +
|10||DB Object Name||Optional||0 or 1||Column 3 of gene_aliases file
 +
|-
 +
|11||DB Object Synonym||Optional||0 or greater||'''ASK TANYA'''
 +
|-
 +
|12||DB Object Type||Required||1||protein
 +
|-
 +
|13||Taxon(<nowiki>|</nowiki>taxon)||Required||1 or 2||taxon:3702
 +
|-
 +
|14||Date||Required||1||Date annotation is made
 +
|-
 +
|15||Assigned By||Required||1||TAIR
 +
|-
 +
|16||Annotation Extension||Optional||0 or greater||NULL
 +
|-
 +
|17||Gene Product Form ID||Optional||0 or 1||NULL
 +
|-
 +
|}
 +
 
 +
 
 +
''Right now, TAIR only needs the simple, three-column format.  Some elements of the Gene Association File Format, such as Taxon ID, will need to be hard-coded for each database, so we will have to store that information somewhere.''
  
    <sentence id="23" subscore="10">
 
      <content>CGL2 shows intestinal lumen staining as well as a diffuse cytoplasmic staining throughout the intestinal cells ( Figure 2F ) similar to that of LEC-9 . </content>
 
      <localization_verbs_082208 start_pos="3" end_pos="3">shows</localization_verbs_082208>
 
      <mf_int_verbs_2009-11-25 start_pos="3" end_pos="3">shows</mf_int_verbs_2009-11-25>
 
      <anatomy_celegans start_pos="4" end_pos="5">intestinal lumen</anatomy_celegans>
 
      <localization_experimental_082208 start_pos="6" end_pos="6">staining</localization_experimental_082208>
 
      <localization_verbs_082208 start_pos="6" end_pos="6">staining</localization_verbs_082208>
 
      <method start_pos="6" end_pos="6">staining</method>
 
      <biological_process start_pos="11" end_pos="11">diffuse</biological_process>
 
      <localization_experimental_082208 start_pos="11" end_pos="11">diffuse</localization_experimental_082208>
 
      <localization_cell_components_082208 start_pos="12" end_pos="12">cytoplasmic</localization_cell_components_082208>
 
      <localization_experimental_082208 start_pos="13" end_pos="13">staining</localization_experimental_082208>
 
      <localization_verbs_082208 start_pos="13" end_pos="13">staining</localization_verbs_082208>
 
      <method start_pos="13" end_pos="13">staining</method>
 
      <localization_experimental_082208 start_pos="14" end_pos="14">throughout</localization_experimental_082208>
 
      <anatomy_celegans start_pos="16" end_pos="17">intestinal cells</anatomy_celegans>
 
      <anatomy start_pos="17" end_pos="17">cells</anatomy>
 
      <cell start_pos="17" end_pos="17">cells</cell>
 
      <tables_and_figures start_pos="19" end_pos="19">Figure</tables_and_figures>
 
      <comparison start_pos="22" end_pos="22">similar</comparison>
 
      <molecular_function start_pos="26" end_pos="26">LEC-9</molecular_function>
 
      <protein_celegans start_pos="26" end_pos="26">LEC-9</protein_celegans>
 
    </sentence>
 
  
    <sentence id="22" subscore="9">
 
      <content>LEC-9 and LEC-11 show relatively abundant but diffuse signal in the cytoplasm of the intestinal cells ( Figure 2C and E ) as compared to the other fusions . </content>
 
      <molecular_function start_pos="2" end_pos="2">LEC-9</molecular_function>
 
      <protein_celegans start_pos="2" end_pos="2">LEC-9</protein_celegans>
 
      <protein_celegans start_pos="4" end_pos="4">LEC-11</protein_celegans>
 
      <localization_verbs_082208 start_pos="5" end_pos="5">show</localization_verbs_082208>
 
      <mf_int_verbs_2009-11-25 start_pos="5" end_pos="5">show</mf_int_verbs_2009-11-25>
 
      <biological_process start_pos="9" end_pos="9">diffuse</biological_process>
 
      <localization_experimental_082208 start_pos="9" end_pos="9">diffuse</localization_experimental_082208>
 
      <biological_process start_pos="10" end_pos="10">signal</biological_process>
 
      <localization_experimental_082208 start_pos="10" end_pos="10">signal</localization_experimental_082208>
 
      <cell_part start_pos="13" end_pos="13">cytoplasm</cell_part>
 
      <localization_cell_components_082208 start_pos="13" end_pos="13">cytoplasm</localization_cell_components_082208>
 
      <anatomy_celegans start_pos="16" end_pos="17">intestinal cells</anatomy_celegans>
 
      <anatomy start_pos="17" end_pos="17">cells</anatomy>
 
      <cell start_pos="17" end_pos="17">cells</cell>
 
      <tables_and_figures start_pos="19" end_pos="19">Figure</tables_and_figures>
 
      <anatomy_celegans start_pos="22" end_pos="22">E</anatomy_celegans>
 
      <comparison start_pos="25" end_pos="25">compared</comparison>
 
      <localization_verbs_082208 start_pos="25" end_pos="25">compared</localization_verbs_082208>
 
      <mf_int_verbs_2009-11-25 start_pos="25" end_pos="25">compared</mf_int_verbs_2009-11-25>
 
      <consort start_pos="29" end_pos="29">fusions</consort>
 
      <localization_experimental_082208 start_pos="29" end_pos="29">fusions</localization_experimental_082208>
 
      <sequence start_pos="29" end_pos="29">fusions</sequence>
 
    </sentence>
 
  
    <sentence id="87" subscore="7">
 
      <content>This LEC-10 containing region occupy a similar subcellular region where the LEC-10 : : GFP-containing vesicular structures are found ( compare Figure 6C and E ) . </content>
 
      <protein_celegans start_pos="3" end_pos="3">LEC-10</protein_celegans>
 
      <characterization start_pos="4" end_pos="4">containing</characterization>
 
      <localization_verbs_082208 start_pos="4" end_pos="4">containing</localization_verbs_082208>
 
      <mf_int_verbs_2009-11-25 start_pos="4" end_pos="4">containing</mf_int_verbs_2009-11-25>
 
      <localization start_pos="5" end_pos="5">region</localization>
 
      <mf_int_assay_2009-11-25 start_pos="5" end_pos="5">region</mf_int_assay_2009-11-25>
 
      <sequence start_pos="5" end_pos="5">region</sequence>
 
      <comparison start_pos="8" end_pos="8">similar</comparison>
 
      <localization_experimental_082208 start_pos="9" end_pos="9">subcellular</localization_experimental_082208>
 
      <localization start_pos="10" end_pos="10">region</localization>
 
      <mf_int_assay_2009-11-25 start_pos="10" end_pos="10">region</mf_int_assay_2009-11-25>
 
      <sequence start_pos="10" end_pos="10">region</sequence>
 
      <protein_celegans start_pos="13" end_pos="13">LEC-10</protein_celegans>
 
      <localization_cell_components_082208 start_pos="17" end_pos="17">vesicular</localization_cell_components_082208>
 
      <localization_verbs_082208 start_pos="20" end_pos="20">found</localization_verbs_082208>
 
      <mf_int_verbs_2009-11-25 start_pos="20" end_pos="20">found</mf_int_verbs_2009-11-25>
 
      <comparison start_pos="22" end_pos="22">compare</comparison>
 
      <mf_int_verbs_2009-11-25 start_pos="22" end_pos="22">compare</mf_int_verbs_2009-11-25>
 
      <tables_and_figures start_pos="23" end_pos="23">Figure</tables_and_figures>
 
      <anatomy_celegans start_pos="26" end_pos="26">E</anatomy_celegans>
 
    </sentence>
 
    </field_results>
 
  </matching_sentences>
 
  
  </textpresso_article>
 
</singleresult>
 
</textpresso_output>
 
<!-- This XML file failed validation test. exit status of RXP XML parser = 512. If you are concerned, please contact the Textpresso group. -->
 
  
  
  
 
Back to [[Gene Ontology]]
 

Latest revision as of 20:04, 21 January 2011

Summary

These specifications are for allowing a curator to search any Textpresso implementation using the CCC categories, submit the resulting sentences to a curation form, make annotations, and download the annotations in either a simple three-column format or the GO's gene_association file (GAF) format.

Searches

Curators would be able to search Textpresso using their chosen criteria and export the sentences to the CCC curation form. The XML mark-up of each returned sentence will be used to populate the curation boxes on the form and color-code the search results.

Curators will need an option to send results to a curation form with a way to name the search so they can select it in the list of source files on the curation form.

Future options for filtering search results may include filtering papers that are in the corpus but not appropriate for curation (e.g., WB's 'functional annotation papers' or TAIR's black list of papers on other organisms - Tanya sent me a list of 1243 papers like this) or restricting searches to a particular level of SVM classification (e.g., high-confidence SVM papers only). For the former, would we need a consistent tag across all implementations? For the latter, what would be the best way to integrate the SVM classification results? Store them in postgres? Store them in Textpresso?

HMM: If you don't want to store it anywhere, there should be a text field (to be filled with a URL, a filename on a local computer, or the actual list of papers) that indicates a list of papers (one ID per line, perhaps) that should be excluded. That way the filtering is highly customizable. There might be cases where each curator wants to exclude a different set of papers or change that set on a daily basis, and then it can be very tedious to keep track of everybody's needs.

KMV: Yes, I agree that the filtering should be as customizable as possible since curators will likely have different needs for their searches. With respect to the list of 1243 papers that Tanya sent me, I think these papers are akin to WormBase papers that are marked for 'functional annotation' only in that they (TAIR) currently don't want them included in any of their searches. For this, it seems that a tag or field that marks them appropriately in the database would be best, as opposed to always having a curator select or de-select a button on the search form.

HMM: OK, then we have both ways: some filtering happens using information from a database, and there is an option to use a local list (file) customized by a curator.

KMV: Okay, that will be good.
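The two filtering routes agreed on above could be combined along these lines. This is only a sketch: the function names are hypothetical, the paper IDs other than WBPaper00037859 are made up, and skipping blank lines and '#' comment lines in the curator's list is an assumption, not something decided in this discussion.

```python
from typing import Iterable, List, Set


def parse_exclusion_list(text: str) -> Set[str]:
    """Parse a curator-supplied exclusion list with one paper ID per line."""
    ids = set()
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


def filter_papers(paper_ids: Iterable[str],
                  database_excluded: Set[str],
                  curator_excluded: Set[str]) -> List[str]:
    """Drop papers flagged in the database or listed by the curator."""
    excluded = database_excluded | curator_excluded
    return [paper for paper in paper_ids if paper not in excluded]


# Hypothetical data: one paper flagged in the database, two in a local list.
db_flagged = {"WBPaper00000002"}
curator_list = parse_exclusion_list("WBPaper00000003\n\n  WBPaper00000004  \n")
kept = filter_papers(
    ["WBPaper00000001", "WBPaper00000002", "WBPaper00000003", "WBPaper00037859"],
    db_flagged,
    curator_list,
)
print(kept)
```

Keeping the two sets separate until the last step means the database flags can stay fixed while each curator changes a local list as often as needed.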

For the SVM results, it seems that the information should also be stored, so curators could have the option to select what level of confidence they might want for their Textpresso searches, and even the option to only search the predicted negatives. Closer integration of Textpresso searches with SVM results will be very helpful. This gets done programmatically for CCC right now and manually (by me) for the macromolecular interactions. For the long term, having the option to combine the SVMs with both Textpresso searches and with the HMMs would be ideal.

HMM: I'll put that on Textpresso's to-do list.

KMV: Great. I'm still thinking about how best to handle the already long, and continuously growing, list of SVM results. The confidence level and data type lists are fixed and relatively small, respectively, but we'll need a way to deal with the individual SVM results:

http://caprica.caltech.edu/celegans/svm_results/

In practice, how would curators select which results to search? For example, suppose I wanted to perform a Textpresso search on only those papers that were high confidence for other_expr in the last six months of 2010.
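One possible way to answer the question above, assuming each SVM result were stored as a record with a paper ID, data type, confidence level and date. The record layout, the confidence labels, the "rnai" data type and the paper IDs are all hypothetical; the actual layout of the SVM results is not specified here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class SvmResult:
    paper_id: str
    data_type: str    # e.g. "other_expr"
    confidence: str   # assumed labels: "high", "medium", "low"
    run_date: date    # date the SVM result was produced


def select_papers(results: Iterable[SvmResult], data_type: str,
                  confidence: str, start: date, end: date) -> List[str]:
    """Return sorted paper IDs matching data type, confidence and date range."""
    return sorted({
        r.paper_id
        for r in results
        if r.data_type == data_type
        and r.confidence == confidence
        and start <= r.run_date <= end
    })


# Made-up records for illustration only.
results = [
    SvmResult("WBPaper00000001", "other_expr", "high", date(2010, 8, 15)),
    SvmResult("WBPaper00000002", "other_expr", "high", date(2010, 3, 1)),
    SvmResult("WBPaper00000003", "other_expr", "low", date(2010, 10, 10)),
    SvmResult("WBPaper00000004", "rnai", "high", date(2010, 11, 11)),
    SvmResult("WBPaper00000005", "other_expr", "high", date(2010, 12, 31)),
]

# High-confidence other_expr papers from the last six months of 2010.
selected = select_papers(results, "other_expr", "high",
                         date(2010, 7, 1), date(2010, 12, 31))
print(selected)
```

The resulting list of paper IDs could then be used to restrict a Textpresso search, in the same way as the exclusion lists discussed earlier.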


HMM: Still need to figure out where the filtering based on database information should happen. Is Juancarlos filtering them out in the CCC form?

KMV: For the next TAIR search, perhaps we could just filter the sentences before they get included into the source files used for the curation form.

This issue is not such a big deal for the current elegans CCC curation, since the number of 'functional annotation'-only papers in postgres has plummeted. It is a bigger deal, though, when WB curators want to search the whole WB corpus for something.

From Searches to the Curation Form

This pipeline would make use of the XML format of returned sentences to construct a version of the current CCC curation form; a version for TAIR is shown here:

http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/tair/tair_ccc.cgi

1) Keep three action buttons on the top - Submit, Search for Sentence, Search for Paper

2) Display all information from within the <bibliography> tags.

This will display more information than is currently shown, but seemed easier than just picking a few tags from within the bibliography section. The additional information may also be helpful to curators.

3) Do not display information within any <field_references> tags.

Is this how scrambled sentences are typically identified?

HMM: No. Arun wrote a function that uses heuristics such as sentence length and repetitious patterns within the sentence.

KMV: Okay, got it. I had forgotten how that worked. Thanks.
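To make the idea of such heuristics concrete, here is a rough illustration. It is not Arun's function; the token-count limit, the repetition threshold and the function name are all assumptions chosen only for the example.

```python
from collections import Counter


def looks_scrambled(sentence: str, max_tokens: int = 150,
                    max_repeat_fraction: float = 0.5) -> bool:
    """Flag a sentence as possibly scrambled using two simple heuristics.

    Both thresholds are illustrative assumptions, not Textpresso's values.
    """
    tokens = sentence.split()
    if not tokens:
        return False
    # Heuristic 1: an unusually long sentence.
    if len(tokens) > max_tokens:
        return True
    # Heuristic 2: one token makes up a large share of a non-trivial sentence.
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return len(tokens) >= 6 and most_common_count / len(tokens) > max_repeat_fraction


print(looks_scrambled("a b a b a a a a"))
print(looks_scrambled("LEC-9 and LEC-11 show relatively abundant but diffuse signal in the cytoplasm"))
```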

4) Potentially curatable sentences are found within the <field_results> tags. Working from left to right on the curation form, here is how the information from the XML would translate to the curation form.


First box, Gene/Protein Name

This box lists all entities within the species-specific protein or gene tag. Right now, the category name for this will be different for each implementation, for example:

protein_celegans

genes_arabidopsis

dicty_genes

For TAIR we displayed both the name of the gene as presented in the sentence, as well as the name in the sentence mapped to a specific TAIR gene ID. This is because some Arabidopsis gene names are used for more than one locus and the TAIR curators wanted a way to ensure they'd be making the annotation to the correct locus. It would be fine to implement this universally, I think. This requires a mapping file from each group and a regular pipeline for updating the gene or protein names file.
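As a sketch of how the sentence XML could populate the first two boxes, the snippet below reads an abridged copy of sentence 22 from the sample output and collects the terms tagged with a chosen category. The function name is hypothetical, and the category names are passed in as parameters because they differ between implementations. The full sample output carries a note that it failed XML validation, so real output may need cleaning before it can be parsed this way.

```python
import xml.etree.ElementTree as ET

# Abridged from sentence 22 of the sample output for WBPaper00037859.
SENTENCE_XML = """
<sentence id="22" subscore="9">
  <content>LEC-9 and LEC-11 show relatively abundant but diffuse signal in the cytoplasm of the intestinal cells . </content>
  <protein_celegans start_pos="2" end_pos="2">LEC-9</protein_celegans>
  <protein_celegans start_pos="4" end_pos="4">LEC-11</protein_celegans>
  <localization_cell_components_082208 start_pos="13" end_pos="13">cytoplasm</localization_cell_components_082208>
</sentence>
"""


def terms_for_category(sentence: ET.Element, category: str) -> list:
    """Return the distinct terms tagged with `category`, in sentence order."""
    terms = []
    for element in sentence.findall(category):
        term = (element.text or "").strip()
        if term and term not in terms:
            terms.append(term)
    return terms


sentence = ET.fromstring(SENTENCE_XML)
# First box: the species-specific gene/protein category for this implementation.
gene_box = terms_for_category(sentence, "protein_celegans")
# Second box: the component category chosen by the curator.
component_box = terms_for_category(sentence, "localization_cell_components_082208")
print(gene_box, component_box)
```

For TAIR or dicty, only the category arguments would change (for example genes_arabidopsis or dicty_genes for the first box).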


Second box, Component Term in Sentence

This box lists the component term as identified by the component category.

For the various implementations there are currently several component categories.

For elegans and Arabidopsis:

localization_cell_components_082208
CCC_TAIR

For dicty:

localization_cell_components_050808
localization_cell_components_082208

If more than one cellular component category is used in the mark-up, we will need to figure out how to determine which one the curator wants to display on the curation form. Alternatively, we could restrict each implementation to only one component category at a time.

HMM: This shouldn't be a problem unless I don't understand what you are saying.

KMV: Okay, I was referring to a situation where there might be a category localization_cell_components_082208 as well as CCC_TAIR and whether or not it'd be difficult to discern which one the curator wanted to display in the curation form. Could the information in the category selection boxes be used for this?


HMM: Selection box should work.

Currently, this box only displays the component terms in sentences that matched the category. It might be helpful, though, if in the future curators could enter new component terms displayed in the sentence but not in the category and have this entry somehow be added to the category for the next mark-up.


Third box, CC Term in GO

This box displays, if available, any GO terms that have already been curated from component terms in sentences, or allows curators to enter a new GO term, if needed.

The suggested GO terms come from the relationship index that has been built up over time from elegans curation. I think it'd be good to use one relationship index for all implementations to leverage community efforts.
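A shared relationship index could be as simple as a mapping from component terms seen in sentences to the GO terms curators have previously chosen for them. The sketch below is an assumption about how such an index might behave, not a description of the existing file; the class name is hypothetical and matching terms case-insensitively is a design choice made for the example.

```python
from collections import defaultdict


class RelationshipIndex:
    """Maps a component term in a sentence to previously curated GO terms."""

    def __init__(self):
        self._index = defaultdict(list)

    def record(self, component_term: str, go_term: str) -> None:
        """Remember that a curator annotated `component_term` with `go_term`."""
        key = component_term.lower()  # assumption: case-insensitive matching
        if go_term not in self._index[key]:
            self._index[key].append(go_term)

    def suggest(self, component_term: str) -> list:
        """Return suggested GO terms; an empty list means none curated yet."""
        return list(self._index.get(component_term.lower(), []))


index = RelationshipIndex()
index.record("plastids", "plastid")
index.record("cytoplasmic", "cytoplasm")
index.record("Cytoplasmic", "cytoplasm")  # duplicate, ignored

print(index.suggest("cytoplasmic"))
print(index.suggest("vesicular"))
```

An empty suggestion list corresponds to the case where the third box shows no GO terms and the curator has to enter one manually.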

Curation Actions and Sentence Display

The color-key displayed at the top of the sentence can stay.

Additional options for sentence display include being able to see surrounding sentences and being able to link out to the corresponding PDF with the appropriate sentences marked-up in the PDF.

The curator options are:

Curating

To make a new GO annotation, curators need to select, or create, an entry in each of the three boxes on the left-hand side of the form. Once a selection is made, the region will be highlighted in blue. For the last column, CC term in GO, curators can either select one of the suggested GO terms in the list (where is this relationship file located, i.e. its path --K), or enter a new one. Then select curate from the list of radio buttons above the sentence and, if you are ready to enter your annotations, click on Make connections:Submit at the top of the page.

If there is no list of GO terms in the third box, that means that a GO annotation using the term in the second box has not yet been made. In this case, the curator will need to enter the new GO term manually.

An autocomplete function for adding new GO terms would be good here. Also, we need a more efficient way to handle having to make duplicate annotations from one sentence. Right now I reload the sentence using the search sentence button, but it'd be good if curators could make more than one annotation from a sentence more easily.

The way the form currently works is that once a sentence has been annotated and the information submitted, the sentence is removed from the display. Curators may want to be able to see what they've annotated though, so we might need to come up with a different strategy. Also, since some of these source files can be quite large, curators may want to have a way to just pick up where they left off if we decide not to remove sentences after curation.


Marking Sentences Not Used for Curation

If an annotation cannot be made from a sentence, then curators may record the reason that an annotation was not made. Keeping track of these sentences will help build up a training set for improving search results. The different options for not curating are described below:


Already Curated/Already Done

If a curator does not wish to make another annotation for information previously curated, they can select the appropriate entity from each curation box and the Already Curated radio button.

The way this was supposed to work for elegans was that the information about previous curation would be entered by curators using this form, stored in a postgres table and then, when next presented, displayed at the bottom after the Already Done heading in red. I'd imagine that this feature is something groups may want to specify from the outset of their searches with options perhaps to include them in the curation boxes, show the information as Already Done but not include them in the curation boxes, or not show this information at all.


Scrambled Sentence

If, during the pdf-to-text conversion, a sentence has become scrambled, curators can mark it as such here. These are becoming much less frequent.

Should we label this Scrambled Sentence/Run-on Sentence to collect both, split these two or not worry about the run-on sentences? Having a run-on sentence does not necessarily preclude making an annotation, though, so these options might have to become check boxes instead of radio buttons.

HMM: In Textpresso, scrambled sentences are identified by an algorithm. This is not a machine learning algorithm, so Textpresso has no use for the sentences marked scrambled other than to use the actual information in the interfaces.

KMV: Okay, then it sounds like specifically marking sentences as scrambled doesn't help Textpresso. I still think it'd be good for curators to distinguish scrambled sentences from non-scrambled false positives, though, as the non-scrambled false positives may be a good resource for category updating or finding terms to use to exclude sentences from the results.

HMM: It doesn't help in Textpresso's current implementation. However, if curators get annoyed with the current state of the art, we might start implementing a new algorithm, which most likely will be machine-learning based. For that we would need the scrambled sentences marked. So for future development, I'd like to continue to collect this information. (I actually don't know where they are stored right now; Juancarlos would need to help me on that.)

False Positive

If a returned sentence has nothing to do with subcellular localization, then it is marked as a false positive. For example:

In contrast, PP2AA3 rescues root tip organization weakly even when expression is driven by the RCN1 promoter, demonstrating a more stringent requirement for A subunit function in the root apical meristem.

These sentences could be saved in a flat file possibly for future category refinement.


Not GO Curatable

This is intended to mark sentences that may describe subcellular localization, but the information contained in them would not normally be curated for GO. For example, the localization described is for a mutant protein, or the localization is for the wild-type protein in a mutant background. An example sentence:

No alteration in expression levels of soluble GFP or GFP::RAB-3 was observed in the synapses, cell body or axon in uba-1 animals (Figure 1A: d1-d6, f1-f6, 1C, 1D, Figure S4F, S4G, S4H).

Like the false positives, these sentences could be used to refine searches, e.g. to develop lists of terms or HMMs, that could be used to filter potentially uncuratable sentences from the search results.
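The curator options described above could be represented as a small set of outcomes, with the non-curated sentences grouped by reason so they can later be used for category refinement or as training data. The enum, the function name and the sentence IDs in the example are hypothetical; how the form actually stores these choices is not specified here.

```python
from collections import defaultdict
from enum import Enum


class Outcome(Enum):
    CURATED = "curate"
    ALREADY_CURATED = "already curated"
    SCRAMBLED = "scrambled sentence"
    FALSE_POSITIVE = "false positive"
    NOT_GO_CURATABLE = "not GO curatable"


def group_for_refinement(records):
    """Group sentence IDs by the reason they were not curated.

    `records` is an iterable of (sentence_id, Outcome) pairs; curated
    sentences are left out because they become annotations instead.
    """
    groups = defaultdict(list)
    for sentence_id, outcome in records:
        if outcome is not Outcome.CURATED:
            groups[outcome].append(sentence_id)
    return dict(groups)


# Made-up sentence IDs for illustration.
records = [
    ("s1", Outcome.CURATED),
    ("s2", Outcome.FALSE_POSITIVE),
    ("s3", Outcome.NOT_GO_CURATABLE),
    ("s4", Outcome.FALSE_POSITIVE),
    ("s5", Outcome.SCRAMBLED),
]
groups = group_for_refinement(records)
print(groups[Outcome.FALSE_POSITIVE])
```

Keeping scrambled sentences separate from false positives preserves the distinction discussed above, even if only the false positives are used for category updates at first.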

Dumping Annotations

Annotations will be stored in postgres tables with two options available for dumping them out into tab-delimited text files. These options would be presented as buttons on the curation form.

Here are examples of what would be contained in the files using the sample sentence below:

SentenceID 9 -- S 7 P 43065 S s3 E
The first enzyme, gamma-glutamate cysteine ligase (GSH1), responsible for synthesis of gamma-glutamylcysteine (gamma-EC), is, in Arabidopsis, exclusively located in the plastids, whereas the second enzyme, glutathione synthetase (GSH2), is located in both plastids and cytosol.


1) Three-column, tab-delimited format

1) 3-column user submission file

{| border="1"
! Locus Name !! GO ID !! Paper ID
|-
| Column 1 of ftp mapping file || GO:0009536 || PMID:nnnnnnnn or TAIR:43065
|}

AT4G23100 GO:0009536 TAIR:43065

AT5G27380 GO:0009536 TAIR:43065

AT5G27380 GO:0005829 TAIR:43065

For column 1, map the gene symbol (second column) to the locus name (first column) in the gene_aliases file:

ftp://ftp.arabidopsis.org/home/tair/Genes/gene_aliases.20101027

For column 2, use the geneontology.obo file to map the GO term to the GO ID:

http://www.geneontology.org/ontology/obo_format_1_2/gene_ontology_ext.obo

For column 3, use the PMID (preferred) or the TAIR paper ID, prefaced with PMID: or TAIR: respectively. If both are given, separate them with a pipe.

This information could be taken from the bibliography XML.
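The three mapping steps above can be sketched as follows. The inline data stand in for the gene_aliases and OBO files (they are small hand-made excerpts, not the real files), the function names are hypothetical, and taking the first locus for a symbol is a simplification: as noted earlier, some Arabidopsis symbols map to more than one locus, so a curator would need to confirm the choice.

```python
from collections import defaultdict

# Stand-in for the tab-delimited gene_aliases file: locus, symbol, full name.
GENE_ALIASES = (
    "AT4G23100\tGSH1\tgamma-glutamate cysteine ligase\n"
    "AT5G27380\tGSH2\tglutathione synthetase\n"
)

# Stand-in for the GO term name to GO ID mapping taken from the OBO file.
GO_IDS = {"plastid": "GO:0009536", "cytosol": "GO:0005829"}


def load_aliases(text: str) -> dict:
    """Map each gene symbol (column 2) to its locus names (column 1)."""
    symbol_to_loci = defaultdict(list)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            symbol_to_loci[fields[1]].append(fields[0])
    return symbol_to_loci


def paper_reference(pmid, tair_id) -> str:
    """Prefer the PMID; fall back to the TAIR paper ID."""
    return "PMID:%s" % pmid if pmid else "TAIR:%s" % tair_id


def three_column_line(locus: str, go_term: str, reference: str) -> str:
    return "\t".join([locus, GO_IDS[go_term], reference])


aliases = load_aliases(GENE_ALIASES)
reference = paper_reference(None, "43065")
annotations = [("GSH1", "plastid"), ("GSH2", "plastid"), ("GSH2", "cytosol")]
# Simplification: use the first locus listed for each symbol.
lines = [three_column_line(aliases[symbol][0], term, reference)
         for symbol, term in annotations]
print("\n".join(lines))
```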


2) 17-column, tab-delimited GAF 2.0

{| border="1"
! Column !! Content !! Required !! Cardinality !! TAIR Entry
|-
| 1 || DB || Required || 1 || TAIR
|-
| 2 || DB Object ID || Required || 1 || Column 1 of gene_aliases file
|-
| 3 || DB Object Symbol || Required || 1 || Column 2 of gene_aliases file
|-
| 4 || Qualifier || Optional || 0 or greater || NULL
|-
| 5 || GO ID || Required || 1 || GO:0005654
|-
| 6 || DB Reference || Required || 1 or greater || PMID:21074051 or, if no PMID, TAIR:42184
|-
| 7 || Evidence Code || Required || 1 || IDA
|-
| 8 || With or From || Optional || 0 or greater || NULL
|-
| 9 || Aspect || Required || 1 || C
|-
| 10 || DB Object Name || Optional || 0 or 1 || Column 3 of gene_aliases file
|-
| 11 || DB Object Synonym || Optional || 0 or greater || ASK TANYA
|-
| 12 || DB Object Type || Required || 1 || protein
|-
| 13 || Taxon(<nowiki>|</nowiki>taxon) || Required || 1 or 2 || taxon:3702
|-
| 14 || Date || Required || 1 || Date annotation is made
|-
| 15 || Assigned By || Required || 1 || TAIR
|-
| 16 || Annotation Extension || Optional || 0 or greater || NULL
|-
| 17 || Gene Product Form ID || Optional || 0 or 1 || NULL
|}


Right now, TAIR only needs the simple, three-column format. Some elements of the Gene Association File Format, such as Taxon ID, will need to be hard-coded for each database, so we will have to store that information somewhere.
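One way to hold the per-database constants is a small lookup table consulted when each GAF line is written. The sketch below assumes the TAIR values from the table above; the dictionary layout and function name are hypothetical. Fields shown as NULL in the table are written as empty strings, and the date uses the YYYYMMDD form that GAF expects.

```python
from datetime import date

# Per-database constants; only the TAIR values come from the table above.
DATABASE_DEFAULTS = {
    "TAIR": {
        "db": "TAIR",
        "taxon": "taxon:3702",
        "assigned_by": "TAIR",
        "object_type": "protein",
    },
}


def gaf_line(database: str, object_id: str, symbol: str, go_id: str,
             reference: str, annotation_date: date, object_name: str = "") -> str:
    """Build one tab-delimited GAF 2.0 line for a CCC (IDA, aspect C) annotation."""
    defaults = DATABASE_DEFAULTS[database]
    columns = [
        defaults["db"],                      # 1  DB
        object_id,                           # 2  DB Object ID
        symbol,                              # 3  DB Object Symbol
        "",                                  # 4  Qualifier
        go_id,                               # 5  GO ID
        reference,                           # 6  DB Reference
        "IDA",                               # 7  Evidence Code
        "",                                  # 8  With or From
        "C",                                 # 9  Aspect
        object_name,                         # 10 DB Object Name
        "",                                  # 11 DB Object Synonym
        defaults["object_type"],             # 12 DB Object Type
        defaults["taxon"],                   # 13 Taxon
        annotation_date.strftime("%Y%m%d"),  # 14 Date
        defaults["assigned_by"],             # 15 Assigned By
        "",                                  # 16 Annotation Extension
        "",                                  # 17 Gene Product Form ID
    ]
    return "\t".join(columns)


line = gaf_line("TAIR", "AT4G23100", "GSH1", "GO:0009536",
                "TAIR:43065", date(2011, 1, 21))
fields = line.split("\t")
print(len(fields), fields[12], fields[13])
```

Adding another database would then mean adding one more entry to the lookup table rather than changing the function.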




Back to Gene Ontology