Difference between revisions of "VariationConciseDescriptions"

From WormBaseWiki
 
(77 intermediate revisions by the same user not shown)
Work in progress based on Ranjana's wiki for creating [[Generation_of_automated_descriptions | automated concise descriptions of genes]]. <br>
+
==Useful links==
 +
[[All_OA_tables#app_tables_Allele_phenotype | app tables]]<br>
 +
[https://raw.githubusercontent.com/WormBase/wormbase-pipeline/master/wspec/models.wrm variation model]<br>
 +
Ranjana's wiki for creating [[Generation_of_automated_descriptions | automated concise descriptions of genes]]. <br>
 +
[[WBGene_information_and_status_pipeline | geneace upload info]]<br>
 +
[[Source_and_maintenance_of_non-WBGene_info | geneace upload of nongene info]]<br>
  
==Goals for variation concise descriptions==
+
==Variation concise descriptions==
This project aims to create a human-readable summary of an allele that includes a description of its lesion and effect on the gene's function and resulting phenotypes.  
+
Human-readable summaries of alleles that describe each allele's lesion, its effect on the gene's function, and the resulting phenotypes. These descriptions aim to recreate summaries like those in the ''C. elegans'' I & II books and enhance them with up-to-date data. The first step is to make and display a summary for each variation; the second step is to extract information, combine the summaries of a gene's variations, and display them on the corresponding gene page.
These descriptions aim to recreate the summaries (enhanced) like those in the ''C. elegans'' I & II books, for example
+
<pre>
<pre>  
+
From C. elegans II
e51 : paralysed kinky small irregular pharyngeal pumping able to lay eggs. ES3 ME0. NA > 30  
+
e51 : paralysed kinky small irregular pharyngeal pumping able to lay eggs. ES3 ME0. NA > 30
 
(e450amber e312amber (non-null) e309 (see sup- 6) etc.; all similar to e51 or slightly weaker).
 
(e450amber e312amber (non-null) e309 (see sup- 6) etc.; all similar to e51 or slightly weaker).
See also e51, e328, e450, e973, e985, e2208, e2274 [C.elegansII] e51 : paralysed, kinky, small,  
+
See also e51, e328, e450, e973, e985, e2208, e2274 [C.elegansII] e51 : paralysed, kinky, small,
irregular pharyngeal pumping; able to lay eggs. Ric, high acetylcholine levels; variable  
+
irregular pharyngeal pumping; able to lay eggs. Ric, high acetylcholine levels; variable
neuroanatomical defects.ES3 ME0. OA>30: e450amb, e312amb (non-null),e309 (suppressed by  
+
neuroanatomical defects.ES3 ME0. OA>30: e450amb, e312amb (non-null),e309 (suppressed by
 
sup-6), s69, s178 etc. All alleles similar to e51 or slightly weaker.</pre>
 
sup-6), s69, s178 etc. All alleles similar to e51 or slightly weaker.</pre>
  
 
[http://www.informatics.jax.org MGI] produces these summaries (do not know if they are automated): for [http://www.informatics.jax.org/marker/MGI:95294 MGI:95294]
 
[http://www.informatics.jax.org MGI] produces these summaries (do not know if they are automated): for [http://www.informatics.jax.org/marker/MGI:95294 MGI:95294]
<pre>  
+
<pre>
Mutations widely affect epithelial development. Null homozygote survival is strain dependent, with  
+
Mutations widely affect epithelial development. Null homozygote survival is strain dependent, with
defects observed in skin, eye, brain, viscera, palate, tongue and other tissues. Other mutations
+
defects observed in skin, eye, brain, viscera, palate, tongue and other tissues. Other mutations
produce an open eyed, curly whisker phenotype, while a dominant hypermorph yields a thickened  
+
produce an open eyed, curly whisker phenotype, while a dominant hypermorph yields a thickened
 
epidermis.
 
epidermis.
 
</pre>
 
</pre>
  
First step -make individualized variation summaries for variation pages, then extract, combine, and shorten for corresponding gene pages.  
+
Sample allele summaries:<br>
 +
<pre>
 +
ju2 is a null allele of syd-1(F32D2.5). The ju2 lesion is a nonsense point mutation that results in a
 +
truncation of all 3 SYD-1 isoforms. ju2 results in defects in axodendritic polarity of ASI and L1 DDs,
 +
neuron morphology of ASI but not DD or VD neurons, presynaptic component localization, synaptic
 +
remodeling of VDs in adults, and backward movement resulting in coiling. ju2 animals do not show
 +
defects in neurite development or postsynaptic component localization.
 +
</pre>
  
Variation page summary
 
 
<pre>
 
<pre>
ju2 is a null allele of syd-1(F32D2.5). The ju2 lesion is a nonsense point mutation that results in a
+
e1368 is a reduction-of-function/hypomorphic allele of the insulin/IGF receptor ortholog daf-2. The
truncation of all 3 SYD-1 isoforms. ju2 results in axodendritic polarity of ASI and L1 DDs, neuron
+
e1368 lesion is a missense mutation affecting 5 of 6 coding transcripts of daf-2. e1368 affects many, but
morphology of ASI but not DD or VD neurons, presynaptic component localization, synaptic
+
not all DAF-2-activity requiring processes. Specifically, e1368 disrupts DAF-2 processes of embryonic
remodeling of VDs in adults, and backward movement causing coiling. ju2 animals do not show  
+
and larval development, formation of the developmentally arrested dauer larval stage (diapause),
defects in neurite development or postsynaptic component localization.  
+
adult longevity, fat storage, salt chemotaxis learning, and stress resistance, including response to
 +
high temperature. e1368 mutants are temperature sensitive and are dauer constitutive at 22.5 deg. In
 +
addition, e1368 animals have extended life spans. e1368 animals do not show any defects in acetylcholine
 +
esterase activity, carbon dioxide avoidance, diacetyl chemotaxis, and DMPP response.
 
</pre>
 
</pre>
  
==Prioritizing variations for concise descriptions==
+
==Source files for project==
* Tagged as allele in Variation_type (see model) (subdivide those with molecular or only genetic info)
+
*obo_name_variation  tazendra home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/obo_name_variation
* For genes with concise descriptions (see concise description wiki to find these)
+
*obo_data_variation  tazendra home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/obo_data_variation
* For polymorphisms/MMP
+
*phenotype_ontology ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_ontology.WS251.obo  (correct for latest release)
* For transposons, RFLPs
+
*phenotype_association ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_association.WS251.wb  (correct for latest release)
* For edited regions
+
*gene_association  ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/gene_association.WS251.wb.c_elegans  (correct for latest release)
* For Integrated transgenes
+
*gin_seqname tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
* Most published variations
 
  
==Querying for variation sets==
+
==Building sentences==
===Obtaining a set of variation IDs and names for automated descriptions===
+
===Building Molecular sentence 1 ===
Reference the [[WormBase_Model:Variation | Variation model]]
+
====Template for molecular sentence 1====
 +
<variation> is a <app_nature> allele of <gene>.
 +
*Ex. "ju2 is a null allele of syd-1(F32D2.5)"
 +
*Ex. "e1368 is a reduction-of-function/hypomorphic allele of daf-2."
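The template above can be sketched as a small string-filling function. This is only an illustration: the function name <code>molecular_sentence_1</code> and its parameters are hypothetical, and choosing "a" or "an" from the first letter of the nature value is an assumption that the template itself does not state.

```python
def molecular_sentence_1(variation, nature, gene, sequence_name=None):
    """Fill '<variation> is a <app_nature> allele of <gene>.'

    sequence_name is optional and is appended in parentheses, as in
    the syd-1(F32D2.5) example.
    """
    if not nature:
        raise ValueError("a nature value (e.g. 'null') is required")
    gene_label = f"{gene}({sequence_name})" if sequence_name else gene
    # Assumption: pick the article from the first letter of the nature value.
    article = "an" if nature[0].lower() in "aeiou" else "a"
    return f"{variation} is {article} {nature} allele of {gene_label}."


print(molecular_sentence_1("ju2", "null", "syd-1", "F32D2.5"))
print(molecular_sentence_1("e1368", "reduction-of-function/hypomorphic", "daf-2"))
```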
  
==Location of project-related files==
+
====Sources for molecular sentence 2====
===on Texptresso-dev===
+
* Source 1: geneace file on postgres
 +
** tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
 +
* Source 2: gin_seqname.pg
 +
** tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
 +
*Source 3: postgres table app_function
  
===on postgres===
+
===Building Molecular sentence 2A - gene features===
*The latest dump:
+
Gene features include DNA binding sites, promoters, UTRs, splice sites
*Variation concise description pipeline:
+
====Template for Molecular sentence -gene features====
*Scripts:
+
The <variation> is a <Molecular_change> in <gene feature> in <gene>.
*Output location:
+
*Ex. tm3467 is a deletion mutation in a splice acceptor site in supr-1. (WBPaper00045690)
 +
====source tags====
 +
#Molecular_change tags:
 +
<pre>
 +
                      Splice_site Donor Text #Evidence
 +
                                  Acceptor Text #Evidence
 +
                      Frameshift Text #Evidence  // added sdm
 +
</pre>
  
==Semantic categories in an Automated Description==
+
===Building Molecular sentence 2B - gene processing===
* Lesion
+
Gene processing effects include truncations and frameshifts.
* Gene feature
+
====Template for Molecular sentence -gene processing====
* Gene product
+
The <variation> is a <Molecular_change> in <gene feature> in <gene>.
* Gene molecular function
+
*Ex. n3763 is a frameshift mutation in lin-35. (WBPaper00027336)
* Sub-cellular localization
+
*Ex. n767 is a silent mutation in lin-15. (WBPaper00001182)
* Tissue expression
+
====source tags====
* Phenotypes
+
#Molecular_change tags:
* Orthology to human gene mutations related to disease
+
<pre>
 +
                      Silent Text #Evidence
 +
                      Splice_site Donor Text #Evidence
 +
                                  Acceptor Text #Evidence
 +
                      Frameshift Text #Evidence  // added sdm
 +
                      Readthrough Text #Evidence // klh WS228
 +
</pre>
  
==Lesion==
+
===Building Molecular sentence 2C - protein domains===
*Rationale: Molecular details
+
Protein domains include catalytic sites, binding domains, activation domains etc.
*Example:
+
====Template for Molecular sentence -protein domains====
*Model tags:
+
The <variation> is a <Molecular_change> in the <protein domain> of <gene>.
*Source files:
+
*Ex. gk291 is a deletion mutation in the PTB domain of dab-1. (WBPaper00031546)
*Template sentence:
 
<Variation> is a(n) <Variation_type> in the <Species> <Gene>. This variation results in a <molecular summary> <Type_of_mutation> in the gene.  
 
  
==Protein domain mutations==
 
see Caltech group meeting [[WormBase-Caltech_Weekly_Calls_May_2015 | May 7, 2015]] <br>
 
connections between mutations and protein domains, and predict effects on function
 
 
 
==Orthology/Homology==
 
see Caltech group meeting [[WormBase-Caltech_Weekly_Calls_May_2015 | May 7, 2015]] <br>
 
*to connect conserved/syntenic mutations
 
*link elegans gene variations and phenotypes to homologous human disease gene variations
 
*to link elegans mutations as a disease model ex. pdr-1 mutations used to model juvenile parkinsons
 
  
use ortholog(s) of gene defined by [[Generation_of_automated_descriptions#Orthology.2FHomology]]
+
====Sources for molecular sentence 2====
[[Orthology, Homology and Paralog data in WormBase]]
+
* Source 1: geneace file on postgres
 +
**tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
 +
* Source 2: TBD for associated domain
  
====Template of an Orthology sentence====
+
Variation description tags needed for the above text:
*<Worm Gene> is an ortholog of <human gene>.
+
#Molecular_change tags:
*Similar or identical mutations in the <human gene> have been associated with <disease>.
+
<pre>
 +
                      Nonsense UNIQUE Amber_UAG Text #Evidence
 +
                                      Ochre_UAA Text #Evidence
 +
                                      Opal_UGA Text #Evidence
 +
                      Missense Text #Evidence                // text fields stored details of codon change
 +
</pre>
 +
=====problems=====
 +
*coding vs noncoding genes
 +
*mutation in noncoding area of gene
 +
*mutation affects isoforms differently
 +
*only use variations mapped onto human proteins, rather than the other way around - keep in sync with other projects; also limits sentence to a subset of isoforms?
 +
*..."that is or is not conserved in humans." can only use under limited situations, orthology needs to be done.
  
====Rules for orthology sentence construction====
+
===Building gene process summary sentences===
 +
====Template for a process sentence====
 +
<Variation> affects <Gene> function in <GO>.<br>
 +
Example:
 +
*''lf29'' affects ''polk-1'' function in error-prone translesion synthesis.
 +
*''bp501'' affects ''atg-4.1'' function in autophagy.
 +
*''ns260'' affects ''ttx-1'' function in embryo development ending in birth or egg hatching
 +
*''ns235'' affects ''ttx-1'' function in the regulation of transcription from RNA polymerase II promoter
 +
*''gg91'' affects ''nrde-2'' function in chromatin silencing by small RNA
  
====How to pick orthologs for the description====
+
====Source files for process sentences====
use human gene that is associated with disease and has mutation information
+
*obo_name_variation  tazendra home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/obo_name_variation
 +
*gene_association  ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/gene_association.WS251.wb.c_elegans  (correct for latest release)
 +
*gin_seqname tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
  
==Process==
+
Gene association file:
*Rationale:  
+
<pre>
*Template for a process sentence
+
WB WBGene00006831 unc-104 GO:0048490 WB_REF:WBPaper00045884|PMID:25329901 IMP WB:WBVar02141295 P C52E12.2|klp-1 gene taxon:6239 20141212 WB
<Variation> affects <Gene>'s role in <process>;
+
WB WBGene00019126 sam-4 GO:1903744 WB_REF:WBPaper00045884|PMID:25329901 IMP WB:WBVar02125688 P F59E12.11 gene taxon:6239 20141212 WB
 +
WB WBGene00017696 polk-1 GO:0042276 WB_REF:WBPaper00041255|PMID:22761594 IMP WB:WBVar01473736 P F22B7.6 gene taxon:6239 20150611 WB
 +
WB WBGene00013595 atg-4.1 GO:0006914 WB_REF:WBPaper00041282|PMID:22767594 IMP WB:WBVar01473704 P Y87G2A.3 gene taxon:6239 20140724 WB
 +
WB WBGene00006652 ttx-1 GO:0009792 WB_REF:WBPaper00040681|PMID:22298710 IMP WB:WBVar00603928 P Y113G7A.6 gene taxon:6239 20140408 WB
 +
WB WBGene00006652 ttx-1 GO:0045944 WB_REF:WBPaper00040681|PMID:22298710 IMP WB:WBVar00603924 P Y113G7A.6 gene taxon:6239 20140408 WB has_regulation_target<WB:WBGene00006894>,occurs_in<WBbt:0006754>,happens_during<GO:0009408>
 +
WB WBGene00011333 nrde-2 GO:0031048 WB_REF:WBPaper00040602|PMID:22231482 IMP WB:WBVar00601048 P T01E8.5 gene taxon:6239 20150715 WB
 +
WB WBGene00017066 maco-1 GO:0006935 WB_REF:WBPaper00038428|PMID:21589894 IMP WB:WBVar00597666|WB:WBVar00597667 P D2092.5 gene taxon:6239 20110823 WB
 +
WB WBGene00017066 maco-1 GO:0023041 WB_REF:WBPaper00038428|PMID:21589894 IMP WB:WBVar00597666|WB:WBVar00597667 P D2092.5 gene taxon:6239 20110824 WB
 +
</pre>
  
*Source file for Process data
+
====Rules for Process sentence construction====
 +
*'''Rule 1: '''Ignore all lines that do not have "IMP" in column 7
 +
*'''Rule 2: '''Map column 8 WBVariation to <allele public name> -
 +
*use obo_name_variation.pg at tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
 +
*WBVarID is in first column of geneace file, public_name is second column of geneace file
 +
*'''Rule 3: '''When column 8 contains more than one WBVarID, map each WBVarID and create a separate summary for each variation
 +
*'''Rule 4: '''Map column 5 GO:ID to GO name -
 +
*use obo_goidprocess at tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies
 +
obo_goidprocess looks like:
 
<pre>
 
<pre>
*Source 1: gene_association file for C.elegans from the WormBase FTP site:
+
GO:ID in line id: <GO:ID#######>
**ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
+
name in line name: <GO name>
**this file is one build behind, compared to release label for the phenotype2go annotations
+
If the GO term name starts with the qualifier "negative" or "positive", replace the qualifier with "the"
*Source 2: (from WS250): phenotype2go file, these automated annotations are now 'IEAs', but will be treated like IMPs, if they have the 'WBPhenotype:XXXXXXX' in column 8 (with)
 
**this file is current with the release label for the phenotype2go annotations
 
**Source 1 and 2 will have redundant annotations but these will get resolved
 
**All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
 
**Need data from these rows:
 
***where column 9: has value 'P' (Process),
 
***column 2 (DB_Object ID): the associated genes, UniProt ID or WBGene ID,
 
***column 3 (DB_Object symbol), eg, wht-7, GO terms are from column 5.
 
***column 5: GOID, eg, GO:0000346
 
***column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
 
***column 7: Evidence code, eg, IMP
 
***column 8: With, eg. 'WB:WBRNAi00000785|WBPhenotype:0000050'
 
 
</pre>
 
</pre>
*Rules for process sentence construction
+
Example:<br>
 +
''js901'' affects ''unc-104'' function in anterograde synaptic vesicle transport. Based on column 8 WBVar02141295 -> js901, column 3 unc-104, column 5 GO:0048490 <br>
 +
''js415'' affects ''sam-4'' function in the regulation of anterograde synaptic vesicle transport.  Based on column 8 WBVar02125688 -> js415, column 3  sam-4, column 5 GO:1903744 -> positive regulation of anterograde synaptic vesicle transport -> Replace "positive" with "the" -> the regulation of anterograde synaptic vesicle transport<br>
 +
*'''Rule 5: '''Italicize <variation> public name and column 3 gene
 +
*'''Rule 6: '''If there are two variations listed in column 8,  make a summary for each variation each using the GO value in column 5 of the line<br>
 +
*'''Rule 7: '''If the genes and alleles are the same in each line, concatenate GO:IDs, comma separate or join with “and”<br>
 +
*Example:
 +
  *''nj21'' affects ''maco-1'' function in chemotaxis and neuronal signal transduction
 +
  *''nj34'' affects ''maco-1'' function in chemotaxis and neuronal signal transduction
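The process-sentence rules above can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: it assumes the gene association file is tab-delimited with the 15 GAF columns (column 4 empty when there is no qualifier), and that the WBVarID-to-name and GO:ID-to-name lookups have already been loaded into plain dictionaries. The names <code>process_sentences</code>, <code>go_phrase</code>, <code>join_terms</code>, <code>var_names</code> and <code>go_names</code> are hypothetical.

```python
from collections import OrderedDict


def go_phrase(name):
    """Rule 4: replace a leading 'positive'/'negative' qualifier with 'the'."""
    for qualifier in ("positive ", "negative "):
        if name.startswith(qualifier):
            return "the " + name[len(qualifier):]
    return name


def join_terms(terms):
    """Rule 7: comma separate the terms and join the last one with 'and'."""
    if len(terms) < 2:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def process_sentences(gaf_lines, var_names, go_names):
    """Build one sentence per (variation, gene) pair from IMP annotations."""
    pooled = OrderedDict()  # (variation name, gene) -> GO phrases, file order
    for line in gaf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[6] != "IMP":  # Rule 1
            continue
        phrase = go_phrase(go_names.get(cols[4], cols[4]))  # Rule 4
        for entry in cols[7].split("|"):  # Rules 3 and 6: one summary each
            var_id = entry[3:] if entry.startswith("WB:") else entry
            if not var_id.startswith("WBVar"):
                continue
            key = (var_names.get(var_id, var_id), cols[2])  # Rule 2
            phrases = pooled.setdefault(key, [])
            if phrase not in phrases:
                phrases.append(phrase)
    # Rule 5: italicize the variation and gene names with wiki markup.
    return ["''%s'' affects ''%s'' function in %s." % (var, gene, join_terms(p))
            for (var, gene), p in pooled.items()]
```

Identifiers that are missing from either dictionary are left as raw IDs in the sentence, which makes unmapped entries easy to spot.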
 +
 
 +
===Building Phenotype Observed summary sentences using phenotype_association file===
 +
 
 +
====Template for Phenotype observed sentences====
 +
<Variation> results in (defects, alterations) in <phenotype(s)>. <br>
 +
Ex: "e1368 disrupts DAF-2 processes of embryonic and larval development,
 +
formation of the developmentally arrested dauer larval stage (diapause), adult
 +
longevity, fat storage, salt chemotaxis learning, and stress resistance, including
 +
response to high temperature. In addition, e1368 animals have extended life spans. "
 +
 
 +
====Source files Phenotype Observed data====
 +
*phenotype association file ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_association.WS251.wb
 +
*phenotype ontology ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_ontology.WS251.obo (correct for latest release)
 +
*obo_name_variation tazendra home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/obo_name_variation
 +
 
 +
Source 1 phenotype association file:
 
<pre>
 
<pre>
*'''Rule 1''': Only apply source from IMP GO terms, if variation is listed
+
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
*'''Rule 4''': For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
+
WB WBGene00000898 daf-2 WBPhenotype:0001682 WB:WBVar00143949 IMP WB:WBPerson261 P Y55D5A.5 gene taxon:6239 20151027 WB
**Examples:
+
WB WBGene00000898 daf-2 NOT WBPhenotype:0001688 WB:WBVar00143949 IMP WB:WBPerson261 P Y55D5A.5 gene taxon:6239 20151027 WB
**Data: WBGene00001972,hmg-1.2,cell fate specification[IMP],WB_REF:WBPaper00003453|PMID:10197987,,WB,gonad development[IMP],Wnt signaling pathway[IGI],vulval development[IMP]
+
WB WBGene00000898 daf-2 WBPhenotype:0000190 WB_REF:WBPaper00002149 IMP WB:WBVar00088561 P Y55D5A.5 gene taxon:6239 20151027 WB
**Sentence: hmg-1.2 is involved in cell fate specification, gonad development and vulval development, based on mutant phenotypes.
+
WB WBGene00000898 daf-2 NOT WBPhenotype:0001660 WB_REF:WBPaper00006052 IMP WB:WBVar00088561 P Y55D5A.5 gene taxon:6239 20151027 WB
**Data: WBGene00002148,gon-14,muscle contraction[IMP],WB_REF:WBPaper00024213|PMID:15133127,,WB,defecation[IGI],gonad morphogenesis[IMP],multicellular organism growth[IGI],regulation of pharyngeal pumping[IMP]
+
WB WBGene00000898 daf-2 WBPhenotype:0000136 WB_REF:WBPaper00046188 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
**Sentence: gon-14 is involved in muscle contraction, gonad morphogenesis and regulation of pharyngeal pumping, based on mutant phenotypes and in defecation and growth.
+
WB WBGene00000898 daf-2 WBPhenotype:0000631 WB_REF:WBPaper00036280 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
*'''Rule 5''': No exclusions as of 07.07.2014, leave in reproduction:
+
WB WBGene00000898 daf-2 WBPhenotype:0000637 WB_REF:WBPaper00038179 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
*'''Rule 6''' If a GO term has the words 'involved in' anywhere, beginning, middle or at the end, use the words 'functions in'
+
WB WBGene00000898 daf-2 WBPhenotype:0001184 WB_REF:WBPaper00038379 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
**Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
+
WB WBGene00000898 daf-2 WBPhenotype:0001351 WB_REF:WBPaper00038379 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
**Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
+
WB WBGene00000898 daf-2 WBPhenotype:0001861 WB_REF:WBPaper00041295 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
*'''Rule 7''': For all other Process terms the sentence will be:
+
 
**<Gene> is involved in <process term>;
 
**Examples:
 
**Data: WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
 
**Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
 
*'''Rule 8''': For the term 'molting cycle, collagen and cuticulin-based cuticle' GO term, drop the comma and the words 'collagen and cuticulin-based cuticle'
 
**Example:
 
**Data: WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
 
**Sentence: vps-45 is involved in '''the''' molting cycle;
 
*'''Rule 9''': Replacement rule:
 
**Replace term 'multicellular organism growth' with 'growth'.
 
**Replace term 'embryo development ending in birth or egg hatching' with 'embryo development'
 
**Replace term 'synaptic transmission, <word>' with '<word> synaptic transmission'.
 
***Example Data: WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
 
***Sentence: nra-2 is involved in cholinergic synaptic transmission and locomotion.
 
'''Rule 10''': Granularity rule:
 
*If a GO term in the list is a parent of (is_a or part_of parent) another GO term in the list, keep the most granular child term and ignore the parent terms.
 
 
</pre>
 
</pre>
 +
====Rules for phenotype sentence construction using postgres tables====
 +
'''Rule 1:''' For lines that do not contain a value in column 4,
 +
*extract variation(column 6 or 8),
 +
*phenotype (column 5)
 +
*paper(column 6)
 +
*ex. "WB:WBVar00143947" "WBPhenotype:0001861" "WBPaper00041295"
 +
*Map WBVariationID to Variation public_name using obo_name_variation<br>
 +
*Map WBPhenotype to Phenotype name using phenotype_ontology.WS251.obo<br>
 +
'''Rule 2:''' When a line lists more than one variation, split it into one line per variation, each keeping the phenotype, the NOT qualifier (if present), and the paper of the original line<br>
 +
'''Rule 3:''' Pool all phenotypes for a given variation, comma separate<br>
 +
'''Rule 4:''' If a Phenotype term in the list is a parent of (is_a or part_of parent) another Phenotype term in the list, keep the most granular child term and ignore the parent term(s)<br>
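Rule 4 (keep the most granular child terms) can be sketched as below. This is a minimal illustration that assumes the ontology file uses standard OBO stanzas with <code>id:</code>, <code>name:</code>, <code>is_a:</code> and <code>relationship: part_of</code> lines. The function names <code>parse_obo</code>, <code>ancestors</code> and <code>most_granular</code> are hypothetical.

```python
def parse_obo(text):
    """Return (names, parents) dictionaries from OBO-formatted text."""
    names, parents = {}, {}
    term_id, in_term = None, False
    part_of = "relationship: part_of "
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):  # start of a new stanza
            in_term, term_id = (line == "[Term]"), None
        elif not in_term:
            continue  # header lines and non-Term stanzas
        elif line.startswith("id: "):
            term_id = line[4:].strip()
            parents.setdefault(term_id, set())
        elif term_id and line.startswith("name: "):
            names[term_id] = line[6:].strip()
        elif term_id and line.startswith("is_a: "):
            parents[term_id].add(line[6:].split("!")[0].strip())
        elif term_id and line.startswith(part_of):
            parents[term_id].add(line[len(part_of):].split("!")[0].strip())
    return names, parents


def ancestors(term, parents):
    """All is_a/part_of ancestors of a term, found by walking upward."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, ()))
    return seen


def most_granular(terms, parents):
    """Drop every term that is an ancestor of another term in the list."""
    drop = set()
    for term in terms:
        drop |= ancestors(term, parents)
    return [term for term in terms if term not in drop]
```

The walk follows ancestors transitively, so a grandparent term is dropped even when the intermediate parent is not in the pooled list.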
 +
 +
 +
===Building Phenotype Observed summary sentences using phenotype_association file===
  
==References==
+
====Template for Phenotype observed sentences====
WBPaper00047004 title : Comparative mapping of the 22q11.2 deletion region and the potential of simple model organisms. Guna_2015
+
<Variation> results in (defects, alterations) in <phenotype(s)>. <br>
 +
Ex: "e1368 affects embryonic and larval development,
 +
formation of the developmentally arrested dauer larval stage (diapause), adult
 +
longevity, fat storage, salt chemotaxis learning, stress resistance, and
 +
response to high temperature. In addition, mutants have extended life spans. "
  
 +
====Source files Phenotype Observed data====
 +
Source 1 postgres app tables:
 
<pre>
 
<pre>
==Molecular function/identity==
+
app_term
====Source file for molecular function data====
+
app_variation
 +
app_paper
 +
</pre>
 +
*phenotype ontology ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_ontology.WS251.obo (correct for latest release)
 +
*obo_name_variation tazendra home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/obo_name_variation
 +
 
 +
====Rules for phenotype sentence construction====
 +
'''Rule 1:'''
 +
*extract variation - app_variation
 +
*phenotype - app_term
 +
*paper - app_paper
 +
*ex. "WB:WBVar00143947" "WBPhenotype:0001861" "WBPaper00041295"
 +
*Map WBVariationID to Variation public_name using obo_name_variation<br>
 +
*Map WBPhenotype to Phenotype name using phenotype_ontology.WS251.obo<br>
 +
'''Rule 2:''' When a line lists more than one variation, split it into one line per variation, each keeping the phenotype, the NOT qualifier (if present), and the paper of the original line<br>
 +
'''Rule 3:''' Pool all phenotypes for a given variation, comma separate<br>
 +
'''Rule 4:''' If a Phenotype term in the list is a parent of (is_a or part_of parent) another Phenotype term in the list, keep the most granular child term and ignore the parent term(s)<br>
 +
'''Rule 5:''' Remove 'variant' from public names that contain it<br>
 +
 
 +
====Rules for specific phenotype terms====
 +
'''Rule 1:''' For all terms whose public names contain 'variant', remove 'variant' from the name<br>
 +
'''Rule 2:''' For the following terms:
 +
* WBPhenotype:0000061 - extended life span
 +
* WBPhenotype:0001171 - shortened life span
 +
** add an "s" to the end of the term public name
 +
** use the phrasing "In addition, mutants have extended life spans."
 +
* WBPhenotype:0001838 - drug induced gene expression variant
 +
* WBPhenotype:0001871 - drug induced life span variant
 +
* WBPhenotype:0001872 - drug induced locomotion variant
 +
** change "induced" to "influenced"
 +
* Terms that end in "early emb"
 +
** replace "early emb" in the term with "in the early embryo"
 +
** for term WBPhenotype:0001185 "embryonic developmental delay early emb", remove "embryonic"
 +
** for term WBPhenotype:0001186 "embryo delayed at pronuclear contact early emb", remove the first term "embryo"
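The term-specific rules above amount to a few string rewrites, sketched here. The function names <code>clean_term</code> and <code>life_span_sentence</code> and the constants are hypothetical, and the order in which the rules are applied is an assumption.

```python
import re

LIFE_SPAN_IDS = {"WBPhenotype:0000061", "WBPhenotype:0001171"}
DRUG_INDUCED_IDS = {"WBPhenotype:0001838", "WBPhenotype:0001871",
                    "WBPhenotype:0001872"}


def clean_term(term_id, name):
    """Apply the term-specific rewrites to a phenotype public name."""
    if term_id in DRUG_INDUCED_IDS:
        name = name.replace("induced", "influenced")
    if name.endswith("early emb"):
        name = name[:-len("early emb")] + "in the early embryo"
        if term_id == "WBPhenotype:0001185":
            name = name.replace("embryonic ", "", 1)
        elif term_id == "WBPhenotype:0001186":
            name = name.replace("embryo ", "", 1)  # first occurrence only
    # Rule 1: drop the word 'variant' wherever it appears.
    name = re.sub(r"\s*\bvariant\b", "", name).strip()
    if term_id in LIFE_SPAN_IDS:
        name += "s"  # 'extended life span' -> 'extended life spans'
    return name


def life_span_sentence(term_id, name):
    """Phrasing used for the two life span terms."""
    return "In addition, mutants have %s." % clean_term(term_id, name)


print(life_span_sentence("WBPhenotype:0000061", "extended life span"))
print(clean_term("WBPhenotype:0001185", "embryonic developmental delay early emb"))
```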
 +
 
 +
===Building Phenotype Attribute summary sentences===
 +
====Template for Phenotype Attribute summary sentences====
 +
<variation> is <app attribute>, <app attribute> <app_term>
 +
*Ex. "s1019 is cold-sensitive, maternal effect mid larval lethal;"
 +
 
 +
<variation> is <app attribute> <app_term> at <app_heat_temp> OR <variation> is <app attribute> <app term> at <
 +
*Ex. "e1368 is temperature sensitive dauer constitutive at 22.5 deg C;"
  
**gene_association file for C.elegans from the WormBase FTP site:
+
<variation> is <app attribute> for all phenotypes at <app_heat_degree> OR <variation> is <app attribute> for all phenotypes at <app_cold_degree>
**ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
+
*Ex. "oj21 is temperature sensitive for all phenotypes at 25 deg C; "
**All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
 
**Need data from these rows:
 
*** where column 9 has value 'F' (Molecular Function)
 
***column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
 
***column 3: DB_Object symbol, eg, wht-7,
 
***column 5: GOID, eg, GO:0000346
 
***column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
 
***column 7: Evidence code, eg, IMP
 
***column 8: 'With (or) From' eg., INTERPRO:IPR002293,
 
***column 15: Assigned By, eg., WB (which database created the annotation)
 
  
====Rules for molecular function sentence construction====
+
<variation> is <app_mat_effect> <app_term>
*'''Rule 1''': Ignore all GO terms with the tag 'is_obsolete: true'
+
*Ex. "oj21 is maternal effect embryonic lethal;" 
*'''Rule 2''': Exclusion list:
+
**Ignore the term 'protein binding'
+
====Source for Phenotype Attribute sentence====
**Ignore the term 'binding'
+
postgres app tables
*'''Rule 3''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
 
*'''Rule 4''': Order the experimental GO terms first in the sentence followed by ISS and IEA terms.
 
*'''Rule 5''': If the evidence code is 'IEA' and 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
 
**Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
 
*Examples:
 
**WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
 
**alh-2 is predicted to have oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, based on protein domain information.
 
*'''Rule 6''': If Evidence code is 'ISS' add the words 'based on sequence information' to the sentence
 
**WBGene00001951,hlh-4,sequence-specific DNA binding transcription factor activity[ISS],WB_REF:WBPaper00034761|PMID:19632181,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
 
**Sentence: hlh-4 is predicted to have sequence-specific DNA binding transcription factor activity based on sequence information, protein dimerization activity and DNA binding activity, based on protein domain information.
 
  
*'''Rule 7''': If a binding term is present add the word 'activity' to it.
+
====Rules for Phenotype Attribute summary sentences====
 +
'''Rule 1:''' When there is more than one phenotype attribute sentence, use the generic "animals" instead of repeating the allele name
 +
*Ex "oj21 is temperature sensitive for all phenotypes at 25 deg C; oj21 is maternal effect embryonic lethal; animals are 100% sterile at 25 degrees C;"
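One way to read Rule 1 is sketched below: the first clause names the allele and every later clause uses "animals". This is an assumption about how strictly the rule applies, since the example above still repeats the allele name in its second clause. The function name <code>attribute_clauses</code> is hypothetical, and each predicate is assumed to already be filled in from one of the attribute templates.

```python
def attribute_clauses(variation, predicates):
    """Join attribute clauses, naming the allele only in the first one."""
    clauses = []
    for index, predicate in enumerate(predicates):
        subject = "%s is" % variation if index == 0 else "animals are"
        clauses.append("%s %s;" % (subject, predicate))
    return " ".join(clauses)


print(attribute_clauses("oj21", [
    "temperature sensitive for all phenotypes at 25 deg C",
    "maternal effect embryonic lethal",
]))
```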
  
*'''Rule 8''': If the evidence code is IDA, IMP, IGI, IPI, or IEP add the words 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
+
===Building Phenotype NOT Observed summary sentences===
**IDA example:
+
====Template for Phenotype NOT observed sentences====
**WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
+
<Allele> does not show (defects, alterations) in <NOT phenotype(s)>.
**Sentence: hlh-6 exhibits RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity and is predicted to have protein dimerization activity and DNA binding activity.
+
*Ex. "e1368 animals do not show any defects in acetylcholine esterase activity, carbon dioxide avoidance, diacetyl chemotaxis, and DMPP response."
**IMP example:
 
**WBGene00009583,aagr-3,alpha-glucosidase activity[IMP],WB_REF:WBPaper00036069|PMID:20349118,,WB,hydrolase activity, hydrolyzing O-glycosyl compounds[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000322,catalytic activity[IEA],INTERPRO:IPR011013,carbohydrate binding[IEA]
 
**Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.
 
  
*'''Rule 9''': For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a 'structural constituent of <word>' in the sentence.
+
====Source files Phenotype Observed data====
*Example:
+
*phenotype association file ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_association.WS251.wb
**WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
+
*phenotype ontology ftp://ftp.wormbase.org/pub/wormbase/releases/WS251/ONTOLOGY/phenotype_ontology.WS251.obo (correct for latest release)
**Sentence: mrpl-36 is a structural constituent of ribosome, based on protein domain information.
+
*obo_name_variation tazendra home/postgres/work/pgpopulation/obo_oa_ontologies/geneace/obo_name_variation
**WBGene00010783,mrpl-36,structural constituent of ribosome[ISS]
 
**Sentence: mrpl-36 is a structural constituent of ribosome, based on sequence information.
 
**WBGene00010783,mrpl-36,structural constituent of ribosome[IMP] or [IDA]
 
**Sentence: mrpl-36 is a structural constituent of ribosome.
 
  
*'''Rule 10''' Replacement
+
Source 1 phenotype association file:
**For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')
+
<pre>
 +
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
 +
WB WBGene00000898 daf-2 WBPhenotype:0001682 WB:WBVar00143949 IMP WB:WBPerson261 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
WB WBGene00000898 daf-2 NOT WBPhenotype:0001688 WB:WBVar00143949 IMP WB:WBPerson261 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
WB WBGene00000898 daf-2 WBPhenotype:0000190 WB_REF:WBPaper00002149 IMP WB:WBVar00088561 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
WB WBGene00000898 daf-2 NOT WBPhenotype:0001660 WB_REF:WBPaper00006052 IMP WB:WBVar00088561 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
WB WBGene00000898 daf-2 WBPhenotype:0000136 WB_REF:WBPaper00046188 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
WB WBGene00000898 daf-2 WBPhenotype:0000631 WB_REF:WBPaper00036280 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
WB WBGene00000898 daf-2 WBPhenotype:0000637 WB_REF:WBPaper00038179 IMP WB:WBVar00143947 P Y55D5A.5 gene taxon:6239 20151027 WB
 +
</pre>
 +
====Rules for phenotype sentence construction====
 +
'''Rule 1:''' For each line that contains (NOT) value in column 4, extract
 +
*variation(column 6 or 8),
 +
*phenotype (column 5)
 +
*paper(column 6)
 +
*NOT(column 4)
 +
*ex. "WB:WBVar00143947" "NOT" "WBPhenotype:0000114" "WBPaper00039871"
 +
*Map WBVariationID to Variation public_name using obo_name_variation<br>
 +
* Map WBPhenotype to Phenotype name using phenotype_ontology.WS251.obo<br>
 +
'''Rule 2:''' When a line lists more than one variation, split it into one line per variation, each keeping the phenotype and paper of the original line<br>
 +
'''Rule 3:''' Pool all NOT phenotype for a given variation, separate<br>
 +
'''Rule 4:''' If a Phenotype term in the list is a parent of (is_a or part_of parent) another Phenotype term in the list, keep the most granular child term and ignore the parent term(s)
 +
'''Rule 5:''' Remove 'variant' from public names that contain it<br>
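The rules above can be sketched in Python as follows. This is only an illustration, not the project's script. It assumes the phenotype association file is tab-delimited with the 15 columns numbered in the sample above, reads Rule 5 as stripping a trailing 'variant' from phenotype names, and expects the ontology to be supplied as a precomputed term-to-ancestors mapping. All function names are hypothetical.

```python
import re
from collections import defaultdict

VAR_RE = re.compile(r"WBVar\d+")
PAPER_RE = re.compile(r"WBPaper\d+")


def parse_not_annotations(lines):
    """Rules 1-2: yield one (variation, phenotype, paper) tuple per variation on each NOT line."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[3] != "NOT":  # column 4 holds the NOT qualifier
            continue
        papers = PAPER_RE.findall(cols[5])      # column 6 may hold a paper
        paper = papers[0] if papers else None
        # the variation may sit in column 6 or column 8
        for variation in VAR_RE.findall(cols[5]) + VAR_RE.findall(cols[7]):
            yield variation, cols[4], paper


def pool_not_phenotypes(records):
    """Rule 3: collect the distinct NOT phenotypes of each variation, in input order."""
    pooled = defaultdict(list)
    for variation, phenotype, _paper in records:
        if phenotype not in pooled[variation]:
            pooled[variation].append(phenotype)
    return dict(pooled)


def drop_parent_terms(terms, ancestors):
    """Rule 4: remove any term that is an ancestor of another term in the list.

    `ancestors` maps a term to the set of all its is_a/part_of ancestors (assumed precomputed).
    """
    covered = set()
    for term in terms:
        covered |= ancestors.get(term, set())
    return [t for t in terms if t not in covered]


def strip_variant(name):
    """Rule 5 (as interpreted here): drop a trailing 'variant' from a name."""
    return re.sub(r"\s+variant$", "", name)


def not_observed_sentence(allele, phenotype_names):
    """Fill the 'Phenotype NOT observed' template for one allele."""
    names = [strip_variant(n) for n in phenotype_names]
    if len(names) > 2:
        listed = ", ".join(names[:-1]) + ", and " + names[-1]
    else:
        listed = " and ".join(names)
    return f"{allele} animals do not show any defects in {listed}."
```

Mapping WBVar IDs to public names (obo_name_variation) and WBPhenotype IDs to phenotype names (the phenotype ontology obo file) is left out here and would happen between pooling and sentence building.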
  
*'''Rule 11''': If a GO term has a comma in it, then add a comma after this GO term and the next ‘and’.
 
  
==Sub-cellular localization==
 
====Source file for Sub-cellular localization data====
 
*GO data
 
*Source 1: gene_association file for C. elegans at:http://www.geneontology.org/GO.current.annotations.shtml?all
 
**Need data from these rows:
 
*** where column 9 has value 'C' (Cellular Component)
 
***column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
 
***column 3: DB_Object symbol, eg, wht-7,
 
***column 5: GOID, eg, GO:0000346
 
***column 7: Evidence code, eg, IDA
 
  
====Rules for sub-cellular localization sentence construction====
+
=== Building Disease Orthology summary sentences===
*'''Rule 1''': Ignore GO terms with the tag 'is_obsolete: true' in the obo file
+
see Caltech group meeting [[WormBase-Caltech_Weekly_Calls_May_2015 | May 7, 2015]] <br>
*'''Rule 2''': Ignore all IEA and ISS GO terms, use only non-IEA, non-ISS GO terms
+
*to connect conserved/syntenic mutations
*'''Rule 4''': Ignore IBA and IBD GO terms (PAINT annotations)
+
*link elegans gene variations and phenotypes to homologous human disease gene variations
*'''Rule 5''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
+
*to link elegans mutations as a disease model ex. pdr-1 mutations used to model juvenile parkinsons
  
*'''Rule 6''': For 'integral component of ....' terms add the words 'is an';
+
use ortholog(s) of gene defined by [[Generation_of_automated_descriptions#Orthology.2FHomology]]
**Eg. for Rule 2: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
+
[[Orthology, Homology and Paralog data in WormBase]]
**sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
 
*Examples
 
**WBGene00023405,sor-1,nucleoplasm[IDA],WB_REF:WBPaper00027128|PMID:16501168,,WB,nuclear speck[IDA]
 
**Sentence: sor-1 is localized to the nucleoplasm and nuclear speck;
 
  
**WBGene00004681,rsd-2,nucleolus[IDA],WB_REF:WBPaper00044261|PMID:18430922,,WB,endoplasmic reticulum[IDA],cytosol[IDA]
+
==Examples for concise descriptions for variations of different types==
**Sentence: rsd-2 is localized to the nucleolus, endoplasmic reticulum and cytosol;
+
* Classic alleles
 +
notes: can add: mutagen, history of isolation
 +
**Alleles with molecular data
 +
<pre>
 +
e1368 is a reduction-of-function/hypomorphic allele of the insulin/IGF receptor ortholog daf-2. The
 +
e1368 lesion is a missense mutation affecting 5 of 6 coding transcripts of daf-2. e1368 affects many, but
 +
not all DAF-2-activity requiring processes. Specifically, e1368 disrupts DAF-2 processes of embryonic
 +
and larval development, formation of the developmentally arrested dauer larval stage (diapause),
 +
adult longevity, fat storage, salt chemotaxis learning, and stress resistance, including response to
 +
high temperature. e1368 mutants are temperature sensitive and are dauer constitutive at 22.5 deg. In
 +
addition, e1368 animals have extended life spans. e1368 animals do not show any defects in acetylcholine
 +
esterase activity, carbon dioxide avoidance, diacetyl chemotaxis, and DMPP response.
 +
</pre>
  
**WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
+
**Alleles with only genetic data
**Sentence: dnc-6 is localized to the dynactin complex;
 
  
*'''Rule 7''': For the GO term 'intracellular [IEA]' structure of the sentence will be different, use 'is intracellular'.
+
* Other variation types
**Eg.1 WBGene00089742
 
**PPA00188 is an ortholog of C. elegans T26A5.8; based on protein domain information, PPA00188 is predicted to have sequence-specific DNA binding activity and protein heterodimerization activity and is intracellular.
 
  
==Order of sentences==
+
* Engineered alleles
*Orthology
 
*Process
 
*Function/identity
 
*Component
 
  
==Tissue expression==
+
* Integrated transgenes
====Source files for Tissue expression data====
 
*Source 1: Expression data
 
*OA (exprpat), PG table names:
 
**for all these tables where exp_endogenous table value 'endogenous', grab the exp_gene, WBGeneXXXXXXXX
 
**exp_name, values look like Expr1005.
 
**exp_anatomy for anatomy terms, in the form of Wbt:0003679, translate these IDs using the anatomy ontology file
 
**anatomy ontology file from ftp site:ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
 
**exp_paper for paper
 
**exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'.
 
*Contact Person: Daniela
 
  
 +
====other categories of alleles====
 +
* Most published alleles
 +
* Alleles with most phenotypes
 +
* Alleles of genes with concise descriptions (see concise description wiki to find these)
  
====Rules for tissue expression sentence construction====
+
==Prioritizing variations for automated descriptions==
'''Rule 1''': Use only the data that has the qualifiers 'Certain' and 'Partial' and ignore all those data that have 'uncertain'; treat 'Certain' and 'Partial' equally; also use all expression data with no qualifiers as well, qualifiers were added recently.
+
*Classic alleles - Variations with Variation_type Allele (see [[WormBase_Model:Variation | Variation model]])
 +
**Alleles with molecular data => Variation_type Allele + Type_of_mutation exists
 +
**Alleles with only genetic data => Variation_type Allele and NO Type_of_mutation exists
 +
*Variations of genes with concise descriptions
 +
**Variation Affects Gene + gene exists with cns_summary table entry; acedb model Gene->Gene_info->Concise_description
 +
*Other variation types
 +
**Variation_type SNP, Confirmed_SNP
 +
**Variation_type Transposons, RFLPs
 +
*Engineered variations
 +
**Variation_type Engineered_allele
 +
* Integrated transgenes with allele name
 +
*Variations that have been most published
 +
**textpresso search results
 +
*Variations with most phenotypes
 +
**app_variation count
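The model-based categories in this list could be assigned with a small helper like the one below. It is a rough sketch: the tag strings are the ones named above (Allele, Type_of_mutation, SNP, Confirmed_SNP, Engineered_allele), while the function name, the returned labels, and the precedence when a variation carries several types are assumptions made for illustration. Transposons, RFLPs, transgenes, and the publication and phenotype counts are not handled.

```python
def priority_category(variation_types, has_type_of_mutation=False):
    """Return a coarse priority category for one variation.

    variation_types: iterable of Variation_type tags set on the variation.
    has_type_of_mutation: True when any Type_of_mutation tag is present.
    """
    types = set(variation_types)
    if "Allele" in types:
        # classic alleles are split by whether a molecular lesion is recorded
        if has_type_of_mutation:
            return "classic allele with molecular data"
        return "classic allele with only genetic data"
    if "Engineered_allele" in types:
        return "engineered variation"
    if types & {"SNP", "Confirmed_SNP"}:
        return "other variation type"
    return "unclassified"
```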
  
'''Rule 2''': Pick an anatomy term only once
+
==Orthology/ Conserved domain mutations==
*Sentence: <Gene> is expressed in the <anatomy term1, anatomy term2 and anatomy term3>;
+
see Caltech group meeting [[WormBase-Caltech_Weekly_Calls_May_2015 | May 7, 2015]] <br>
 +
connections between mutations and protein domains, and predict effects on function<br>
 +
We currently do not capture mutations in the context of affecting a conserved amino acid - how and who would do this? Can Hinxton generate these? <br>
 +
Many examples can be found with the '''Textpresso search for 'mutation conserved''''
  
*Examples:
+
Examples
*Data for alh-10:  
+
<pre>
*WBGene00000116 alh-10 Expr5583 nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525 Endogenous
+
id : WBPaper00037661
*'''Sentence''': alh-10 is expressed in the nervous system, intestine and tail neuron;
+
name : WBPaper00037661
 +
title : Sequential action of Caenorhabditis elegans Rab GTPases regulates phagolysosome formation during apoptotic cell degradation.
 +
Sequencing of rab-14 in qx18 mutants revealed a C to T transition, which resulted in substitution of the Threonine at codon 67 with Methionine (ACG > ATG; T67M). This mutation affects the phosphate/Mg2+
 +
binding domain PM3, which is conserved in all members of the Ras GTPase superfamily
  
 +
Sentences to make:
 +
*Molecular sentence 2: "qx18 encodes a C to T transition (T67M), which affects the phosphate/Mg2+ binding domain PM3"
 +
*Orthology sentence: "The phosphate/Mg2+ binding domain PM3 is conserved in all members of the Ras GTPase superfamily"
 +
</pre>
  
*Data for asp-5:
+
<pre>
*WBGene00000218 asp-5 Expr5817 intestine WBPaper00031006,WBPaper00006525 Endogenous
+
Reverse in vitro mutation analysis of elegans mutation on mammalian disease gene.
*WBGene00000218 asp-5 Expr4352 intestine WBPaper00028802 Endogenous
+
Title: Introduction of a loss-of-function point mutation from the SH3 region of the Caenorhabditis elegans sem-5 gene activates the transforming ability of c-abl in vivo and abolishes binding of proline-rich ligands in vitro .
*'''Sentence''': asp-5 is expressed in the intestine;
+
Authors: Van Etten RA ; Debnath J ; Zhou H ; Casasnovas JM
 +
Journal: Oncogene
 +
Year: 1995-05-18
 +
Doc ID: WBPaper00002191
 +
When the n1619 mutation , which confers a lethal and highly penetrant vulvaless phenotype in C . elegans , is introduced into the c-abl SH3 domain , substituting a leucine for proline at AN amino acid number 131 , the resulting mutant transforms NIH3T3 fibroblasts with an efficiency about 10 % that of SH3-deleted c-abl .
  
 +
Sentence to make:
 +
*Molecular sentence 2: "n1619 encodes a P131L substitution in the c-abl SH3 domain"
 +
</pre>
  
*Data for ccr-4:
+
<pre>
*WBGene00000376 ccr-4 Expr4479 pharynx WBPaper00027076 Endogenous
+
Title: CED-9 and mitochondrial homeostasis in C . elegans muscle .
*WBGene00000376 ccr-4 Expr11132 male,hermaphrodite,somatic cell,germ line *WBPaper00043886 Endogenous
+
Authors: Tan FJ ; Husain M ; Manlandro CM ; Koppenol M ; Fire AZ ; Hill RB
*WBGene00000376 ccr-4 Expr4479 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle WBPaper00027076 Endogenous
+
Journal: J Cell Sci
*WBGene00000376 ccr-4 Expr7174 pharynx,hypodermis,seam cell *WBPaper00031006,WBPaper00006525 Endogenous
+
Year: 2008-10-15
*WBGene00000376 ccr-4 Expr4480 pharynx,body wall musculature,head neuron,tail neuron *WBPaper00027076 Endogenous
+
Doc ID: WBPaper00032231
*WBGene00000376 ccr-4 Expr4480 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
+
SECTION: results. This allele encodes a mutation where glycine 169 in the BH3 binding pocket is replaced with glutamate ( Fig . 4C ) ( Hengartner and Horvitz , 1994a ) , which inhibits EGL-1 from binding and triggering a conformational change in CED-9 ( del Peso et al . , 2000 ; Yan et al . , 2004 )
*'''Sentence''': ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
+
SECTION: results. In the gain-of-function ced-9 ( n1950sd ) allele , glycine 169 , which resides in the CED-9 BH3 binding pocket , is mutated to glutamate ( G169E ) . [Field: results, subscore: 3.00]
 +
SECTION: results. To test whether co-expression of DRP-1 modulates CED-9 via interactions with the BH3 binding pocket , we first created a construct corresponding to the ced- 9 ( n1950gf ) allele .
  
*'''Rule 3''': Replacement Rule
+
Sentence to make:  
*Replacement 1: If the anatomy term 'cell' occurs by itself, then use the words 'widely expressed' instead of 'cell'.
+
*Molecular sentence 2: "n1950 encodes a G169E change in the BH3 binding pocket of CED-9"
**sentence: col-178 is expressed in the Cell;
+
</pre>
**Becomes: col-178 is expressed widely.
 
  
*Replacement 2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
+
<pre>
**Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
+
WBPaper00029156 Modzelewska 2007
**Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.
+
“the sy262 mutation lies within the Rac GEF Dbl domain , it is possible that the mutation acts through one or both of the GTPases .”
 +
“In conjunction with molecular modeling , our data suggest that the C . elegans mutation as well as an equivalent mutation in human SOS1 activate the MAPK pathway by disrupting an auto-inhibitory function of the Dbl domain on Ras activation “
 +
“A mutation equivalent to sy262 G322R activates hSOS1.”
 +
“...in every experiment we found that at some time point, hSOS1 C282R dis- played two- to fourfold more activity than wild-type hSOS1. In conjunction with our genetic data, these data suggest that the G322R change in C. elegans SOS-1, as well as the equivalent C282R change in hSOS1, does indeed enhance EGF-dependent MAPK activation.
  
*Replacement 3: If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
+
Sentence to make:
**Sentence: ceh-82 is expressed in the neuron;
+
*Molecular sentence 2: "sy262 encodes a G322R mutation in the Rac GEF Dbl domain and is equivalent to a C282R change in the human hSOS1."
**Becomes: ceh-82 is expressed in the nervous system;
+
*Orthology sentence: "The sy262 G322R mutation is equivalent to a C282R change in the human hSOS1."
 +
</pre>
  
 +
<pre>
 +
Title: The genetics of ivermectin resistance in Caenorhabditis elegans .
 +
Authors: Dent JA ; Smith M ; Vassilatis DK ; Avery L
 +
Journal: Proc Natl Acad Sci U S A
 +
Year: 2000-03-14
 +
Doc ID: WBPaper00003954
 +
results. The region surrounding the conserved valine ( bold ) that is mutated to a glutamate in the ad1302 allele ( resulting from a T to A mutation in the second base of the V60 codon ) is shown lined up with the corresponding region in other GluCl subunits and in the rat glycine and γ-aminobutyric acid ( GABA ) type A- channel subunits ( 29 , 30 ) . [Field: results]
  
*Replacement 4: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, except for the exceptions:
+
Sentence to make:  
**Exceptions:  
+
*Molecular sentence 2: "ad1302 encodes a Val to Glu change."
**I3 neuron
+
*Orthology sentence: "ad1302 affects a region that is conserved in GluCl subunits as well as in rat glycine and GABA type A-channel subunits."
**I4 neuron
+
</pre>
**I5 neuron
 
**I6 neuron
 
**M1 neuron
 
**M4 neuron
 
**M5 neuron
 
**MI neuron
 
**Sentence: nhr-194 is expressed in the amphid neuron, ciliated neuron, head neuron, and the sensory neuron;
 
**Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;
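Rule 2 and the four replacements of Rule 3 can be sketched as a single function. This is a minimal illustration, not the script used for the gene descriptions. The function name, the way list items are joined, and the choice of closing punctuation (a period for the 'cell' cases, a semicolon otherwise) are assumptions drawn from the examples above. Rule 1 (qualifier filtering) and Rule 5 (granularity) are not covered.

```python
# Neuron terms that must not be pluralized (Replacement 4 exceptions)
NEURON_EXCEPTIONS = {
    "I3 neuron", "I4 neuron", "I5 neuron", "I6 neuron",
    "M1 neuron", "M4 neuron", "M5 neuron", "MI neuron",
}


def join_terms(terms):
    """Join terms as 'a', 'a and b', or 'a, b and c'."""
    if len(terms) <= 2:
        return " and ".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def expression_sentence(gene, anatomy_terms):
    """Build a tissue expression sentence from translated anatomy term names."""
    terms = list(dict.fromkeys(anatomy_terms))          # Rule 2: each term once
    has_cell = any(t.lower() == "cell" for t in terms)
    others = [t for t in terms if t.lower() != "cell"]
    if not others:
        # Replacement 1: 'cell' on its own means widely expressed
        return f"{gene} is expressed widely." if has_cell else ""
    fixed = []
    for term in others:
        if term == "neuron":                             # Replacement 3
            term = "nervous system"
        elif term.endswith("neuron") and term not in NEURON_EXCEPTIONS:
            term += "s"                                  # Replacement 4
        if term not in fixed:
            fixed.append(term)
    if has_cell:                                         # Replacement 2
        return f"{gene} is expressed in several tissues including the {join_terms(fixed)}."
    return f"{gene} is expressed in the {join_terms(fixed)};"
```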
 
  
 +
<pre>
 +
Title: POP-1 controls axis formation during early gonadogenesis in C . elegans .
 +
Authors: Siegfried KR ; Kimble J
 +
Journal: Development
 +
Year: 2002-01
 +
Doc ID: WBPaper00005116
 +
SECTION: abstract. The pop-1 ( q624 ) allele is weakly penetrant for multiple defects and appears to be a partial loss-of-function mutation ; pop-1 ( q624 ) alters a conserved amino acid in the HMG-box DNA binding domain . [Field: abstract]
 +
SECTION: discussion. This mutation alters a conserved amino acid in the HMG box DNA binding domain which is conserved specifically in TCF / LEF- 1 type HMG proteins ( Laudet et al . , 1993 ) , suggesting that the pop-1 ( q624 ) mutation may affect either recognition of the TCF / LEF-1 consensus sequence or DNA binding affinity , thereby lowering POP-1 activity . [Field: discussion]
 +
SECTION: discussion. As the only mutation isolated thus far in a developmental system that changes a highly conserved amino acid in the β-catenin binding domain , the pop-1 ( q645 ) missense mutation may shed new light on TCF / LEF-1 function during development . [Field: discussion]
 +
SECTION: introduction. The pop-1 ( q624 ) allele is weakly penetrant for multiple defects and appears to be a partial loss-of function mutation ; pop-1 ( q624 ) alters a conserved amino acid in the HMG-box DNA binding domain . [Field: introduction]
 +
SECTION: results. The pop-1 ( q645 ) mutation carries a nucleotide substitution predicted to change an aspartic acid ( D ) to a glutamic acid ( E ) ( Fig . 2B ) ; this mutation resides within the pop-1 β-catenin binding domain and alters an amino acid conserved in all known TCF / LEF-1 proteins , including nematode , fly , and vertebrate homologues . [Field: results]
 +
SECTION: results. The pop-1 ( q624 ) mutation possesses a nucleotide change in the region encoding the HMG box ; the predicted amino acid change in this case also affects a conserved amino acid ( Fig . 2C ) . [Field: results]
  
'''Rule 5''': Granularity Rule: When a string of cell names are present, check to see if any parent terms are present in any ExprXXX for that gene, keep the highest common parent term available.
+
Sentence to make:  
 +
*Molecular sentence 2: "q645 encodes a D to E change in the β-catenin binding domain"
 +
*Orthology sentence: "q624 affects an amino acid in the HMG-box DNA binding domain that is conserved specifically in TCF/LEF-1 type HMG proteins"
 +
</pre>
  
==Preliminary results==
+
==Location of project-related files==
These descriptions are based on Homology predictions and the GO annotations for Process, Component and Function:
+
===on Textpresso-dev===
<pre style="white-space: pre-wrap;
+
The directory structure should be something like
white-space: -moz-pre-wrap;
+
*descriptions/ (descriptions and stats per release)
white-space: -pre-wrap;
+
*source_files/ (common input files, including variation, gene, obos, etc.)
white-space: -o-pre-wrap;
+
*molecular/ (phrases generated from molecular info input)
word-wrap: break-word">
+
*process/ (phrases generated from go input)
 +
*phenotype/ (phrases generated from phenotype input)
  
*alh-2
+
===on postgres===
alh-2 encodes an ortholog of human dehydrogenase aldehyde dehydrogenase family 1 member dehydrogenase; alh-2 is predicted to have oxidoreductase activity and oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor based on protein domain information.
+
*The latest dump:
 +
*Variation concise description pipeline:
 +
*Scripts:
 +
*Output location:
  
*asp-5
+
==Order of sentences==
asp-5 encodes an ortholog of human cathepsin d; asp-5 is involved in cell death and locomotion; asp-5 is predicted to have aspartic-type endopeptidase activity based on protein domain information.
+
#Molecular sentence 1
 +
#Molecular sentence 2
 +
#Disease/Orthology
 +
#Process
 +
#Phenotype observed
 +
#Phenotype attribute
 +
#Phenotype NOT observed
  
*cng-1
+
==Postgres sources ==
cng-1 encodes an ortholog of human cyclic nucleotide gated channel alpha 3; cng-1 is predicted to have ion channel activity based on protein domain information; cng-1 is localized to the neuronal cell body.
+
<pre>
 +
Source files for phenotype attribute data for variation concise description
 +
*OA (app) tables:
 +
**app_variation
 +
**app_term
 +
app_paper, app_easescore, app_mmateff, app_hmateff, app_molecule, app_anatomy, app_lifestage, app_penetrance, app_mat_effect, app_temperature, app_cold_degree, app_heat_degree, app_pat_effect
 +
app_haplo, app_cold_sens, app_heat_sens
 +
<pre>
 +
need to adapt for variation concise description
 +
**for all these tables where exp_endogenous table value 'endogenous', grab the exp_gene, WBGeneXXXXXXXX
 +
**exp_name, values look like Expr1005.
 +
**exp_anatomy for anatomy terms, in the form of Wbt:0003679, translate these IDs using the anatomy ontology file
 +
**anatomy ontology file from ftp site:ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
 +
**exp_paper for paper
 +
**exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'.
  
 
</pre>
 
</pre>
 +
 +
==Preliminary results==
  
 
==Mapping of automated variation concise description data to OA fields==
 
==Mapping of automated variation concise description data to OA fields==
Line 345: Line 519:
 
|1 ||WBVariation|| Variation|| WBVar00145853 OR gk448 || Required || vcd_variation
 
|1 ||WBVariation|| Variation|| WBVar00145853 OR gk448 || Required || vcd_variation
 
|-
 
|-
|2 ||Species|| Species||Onchocerca volvulus||Required||vcd_species  
+
|2 ||Species|| Species||Onchocerca volvulus||Required||vcd_species
 
|-
 
|-
 
|3 ||Curator||Name of Curator || James Done(first then replace with) Karen Yook<br/> (insert for all rows) ||Required||vcd_curator
 
|3 ||Curator||Name of Curator || James Done(first then replace with) Karen Yook<br/> (insert for all rows) ||Required||vcd_curator
 
|-
 
|-
|4 ||Curator History|| Name of Curator ||same as pgid<br/>(insert for all rows)||Required||vcd_curhistory  
+
|4 ||Curator History|| Name of Curator ||same as pgid<br/>(insert for all rows)||Required||vcd_curhistory
 
|-
 
|-
 
|5 ||Description Type|| Automated_concise_description<br/>(insert for all rows)||Automated_concise_description||Required||vcd_desctype
 
|5 ||Description Type|| Automated_concise_description<br/>(insert for all rows)||Automated_concise_description||Required||vcd_desctype
 
|-
 
|-
|6 ||Description Text|| the automated concise description ||asp-19 encodes an ortholog...||Required||vcd_desctext  
+
|6 ||Description Text|| the automated concise description ||asp-19 encodes an ortholog...||Required||vcd_desctext
 
|-
 
|-
 
|7 ||Reference||WBPaper||WBPaper00026979||Required||vcd_paper
 
|7 ||Reference||WBPaper||WBPaper00026979||Required||vcd_paper
 
|-
 
|-
|8 ||Accession Evidence||For Homology, for elegans, use ENSEMBL Gene ID, for non-elegans species use WBGeneID<br/>For Process, Function, use InterPro ID||For elegans: ENSEMBL:ENSG00000103257 (previously used the protein ENSEMBL protein ids) and INTERPRO:IPR002293<br/>For non-elegans species: WBGene00007443 and INTERPRO:IPR002293 <br/>(comma-separate multiple values)||Not required||con_accession
+
|8 ||Last Updated||Date when the descriptions<br/>were last generated||2014-09-11||Required||vcd_lastupdate
 
|-
 
|-
|9 ||Last Updated||Date when the descriptions<br/>were last generated||2014-09-11||Required||con_lastupdate
+
|9 || pgid||pgid||1149<br/>(Postgres will generate)||Required||
|-
 
|10 || pgid||pgid||1149<br/>(Postgres will generate)||Required||
 
 
|-
 
|-
 
|}
 
|}
  
 
==Tab-delimited file for OA insert==
 
==Tab-delimited file for OA insert==
 +
(for gene concise desc, for reference)
 
*One tab-delimited file per species
 
*One tab-delimited file per species
 
*Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
 
*Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
Line 373: Line 546:
  
 
==Directory structure for project==
 
==Directory structure for project==
 +
Use same structure as for gene concise descriptions (which follows)
 
*http://textpresso-dev.caltech.edu/concise_descriptions/  Top level parent directory for project
 
*http://textpresso-dev.caltech.edu/concise_descriptions/  Top level parent directory for project
*http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  Indicates what release the file corresponds to  
+
*http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  Indicates what release the file corresponds to
*http://textpresso-dev.caltech.edu/concise_descriptions/species.txt  Indicates the different species we are producing description files for  
+
*http://textpresso-dev.caltech.edu/concise_descriptions/species.txt  Indicates the different species we are producing description files for
 
*http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt  WS247 elegans file for import into OA
 
*http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt  WS247 elegans file for import into OA
  
 
==Inserting automated descriptions into postgres==
 
==Inserting automated descriptions into postgres==
 
====Populating script====
 
====Populating script====
Run the script to populate from here:
+
Scripts for automated concise descriptions (for reference)
/home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl
+
*/home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl -> /home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl <br>
 
+
which look at  
Script actually at:
+
*http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  for release number
/home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl
+
*http://textpresso-dev.caltech.edu/concise_descriptions/species.txt  for the different species
 
 
Script looks at  
 
http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  for release number
 
and
 
 
 
http://textpresso-dev.caltech.edu/concise_descriptions/species.txt  for the different species
 
 
 
Script gets data from the following URL for each of the species:
 
For elegans: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt
 
 
 
For briggsae:
 
http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_briggsae/descriptions/OA_concise_descriptions.WS247.txt
 
 
 
and so on.
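The per-species URLs follow one pattern, which can be sketched as below. This is an illustration in Python, not the populating script itself. The helper names are hypothetical, and the assumption that production_release.txt and species.txt hold one value per line is not confirmed here.

```python
BASE_URL = "http://textpresso-dev.caltech.edu/concise_descriptions"


def parse_list_file(text):
    """Return the non-empty lines of a small text file (assumed one value per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def oa_file_url(release, species):
    """Build the URL of the OA import file for one release and species directory."""
    return (
        f"{BASE_URL}/release/{release}/{species}"
        f"/descriptions/OA_concise_descriptions.{release}.txt"
    )


def oa_file_urls(release, species_list):
    """Map each species directory name (e.g. 'c_elegans') to its OA file URL."""
    return {species: oa_file_url(release, species) for species in species_list}
```

For example, oa_file_url("WS247", "c_briggsae") reproduces the briggsae URL listed above.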
 
 
 
When populating for each OA row that has con_desctype set to 'Automated_concise_description', it will delete from these tables :
 
*con_wbgene
 
*con_species
 
*con_curator
 
*con_curhistory
 
*con_desctext
 
*con_paper
 
*con_accession
 
*con_inferredauto
 
*con_lastupdate
 
 
 
Meaning that it's keeping the 'Automated_concise_description' value, so that future runs of the script will reuse existing pgids.  When running out of pre-existing pgids, it will create a new one and assign 'Automated_concise_description' to con_desctype
 
  
 
====For testing on Mangolassi====
 
====For testing on Mangolassi====
Both populating and dumping scripts at:/home/acedb/ranjana/concise_testing
 
 
For the populating script always redirect output to a file: populate_automated_concise_descriptions.pg.<date>
 
 
==Dumping the automated, concise and provisional descriptions==
 
====Dumping script====
 
*Run the dumping script manually, on tazendra: /home/acedb/kimberly/citace_upload/concise/wrapper.pl
 
*The file, concise_dump_new.ace, can be scp-ed for testing from:scp acedb@tazendra.caltech.edu:/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
 
*Concise descriptions dumper at /home/postgres/work/citace_upload/concise/dump_concise.pl
 
*/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
 
 
====Script that finds genes with concise descriptions that also have an automated description====
 
On Tazendra and Mangolassi: /home/acedb/ranjana/concise_testing/find_concise_with_automated.pl
 
 
====Script that compares citace test file and Postgres for automated descriptions====
 
Location:/home/acedb/ranjana/concise_testing/compare_concise_postgres_vs_acefile.pl
 
*Needs a 'citace_genes_with_automated.ace' file which is the .ace export of all genes in citace with the 'Automated_description' tag (I use Query Builder to do this query and then export the 'Names' to a file 'citace_genes_with_automated.ace').
 
*Compares the output from citace (from testing concise_dump_new.ace in empty citace database on Maya) of genes with automated tag, to postgres data, also outputs if a gene has a concise description (which would not be dumped so not in .ace file, but in Postgres).
 
*For the WS247 upload, 1 gene not accounted for, WBGene00020108, which has been merged into a dead gene?
 
 
====Discontinued from WS246 upload====
 
*concise cronjob is on the acedb account : 0 2 * * thu /home/acedb/kimberly/citace_upload/concise/wrapper.pl (turned off for now)
 
*It creates a file at: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
 
*which you can see on the web at:http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
 
*Then on spica, login, go to the Data_for_citace/Data_from_Kimberly directory, remove the existing file, and upload the latest file using: wget http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
 
 
====Text for Automatically_inferred tag====
 
Tag looks like: Evidence Automatically_inferred ?Text
 
 
Text will be: This description was automatically generated by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations from the <WS246> version of WormBase.
 
 
====Rules for dumping the different types of descriptions in the OA====
 
====.ace format====
 
List of tags to be dumped:
 
*Automated_description
 
*Paper_evidence
 
*Accession_evidence
 
*Date_last_updated
 
*Inferred_automatically
 
 
<pre style="white-space: pre-wrap;
 
white-space: -moz-pre-wrap;
 
white-space: -pre-wrap;
 
white-space: -o-pre-wrap;
 
word-wrap: break-word">
 
Gene : "WBGene00009585"
 
Automated_description "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information." Date_last_updated "2012-07-24"
 
Automated_description "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information." Paper_evidence "WBPaper000045688"
 
Automated_description "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information." Paper_evidence "WBPaper000045689"
 
Automated_description "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information." Accession_evidence "ENSEMBL" "ENSP00000419081"

Automated_description "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information." Accession_evidence "INTERPRO" "IPR002048"
 
Automated_description "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information." Inferred_automatically "This description was generated automatically by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations, from the WS243 version of WormBase."
 
</pre>
 
 
====Numbers from citace testing====
 
*concise_dump_new.ace was scped from: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace and tested
 
*Tested concise_dump_new.ace on empty citaceminus mirror on local machine with the WS246 models file saved, read in fine!
 
*WS246 numbers: Total genes: 9,529 genes, Genes with automated descriptions: 3,362; 85,039 lines
 
 
==Reporting numbers==
 
--Currently the automated descriptions are generated for genes without a concise description
 
 
--Generate a report for numbers and place on Textpresso-dev, http://textpresso-dev.caltech.edu/concise_descriptions/
 
*Report for WS246 upload, Sept/Oct, 2014:
 
*Total number of automated descriptions = 3,364
 
*Number of automated descriptions with homology = 2,353
 
*Number of automated descriptions with process information = 1,206
 
*Number of automated descriptions with function information = 2,183
 
*Number of automated descriptions with component information = 244
 
 
==Publications related to Text-mining methods==
 
*Automatically generating gene summaries from biomedical literature.
 
 
Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.
 
 
Pac Symp Biocomput. 2006:40-51.
 
 
PMID:17094226
 
  
*Generating gene summaries from biomedical literature: A study of semi-structured summarization

Xu Ling *, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz

Information Processing and Management 43 (2007) 1777–1791

==Dumping to .ace==

==Tracking progress==

Generate a report for numbers and place on Textpresso-dev

*Report for each upload:

*Total number of automated variation descriptions =

*Number of automated descriptions with molecular details =

*Number of automated descriptions with gene function/GO information =

*Number of automated descriptions with phenotype information =

*Number of automated descriptions with human disease reference =

==Changes/Updates for each release==
 
==Changes/Updates for WS248==
 
*Changes to Orthology,
 
 
 
==Automated descriptions for ''C. briggsae''==
 
[[Automated descriptions for C. briggsae]]
 
  
 
==Issues to address==
*1. For the non-elegans species, need to use the WBGeneID as Accession_evidence, so can we use our own database, and do: Accession_evidence "WormBase" "WBGene00005678" (not done)
 
*2. For non-elegans species, need to add in elegans gene experimental data, but with multiple orthologs, which do we pick?  (done, using only the common GO terms for orthologs)
 
*3. Gene names should be pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (will be addressed for WS250)
 
*4. For WS250, check that the gene names are (somewhat) current; corrected mel-47 to tofu-6 by hand for WS249.
 
  
 
==Automated descriptions software==
[[Documentation for workflow and scripts]]  
+
Follow gene concise description pipeline [[Documentation for workflow and scripts]]
  
</pre>
+
==Publications related to Text-mining methods==
 +
*Automatically generating gene summaries from biomedical literature. Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.  Pac Symp Biocomput. 2006:40-51. PMID:17094226
 +
*Generating gene summaries from biomedical literature: A study of semi-structured summarization. Xu Ling *, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz

Latest revision as of 01:38, 2 March 2016


Useful links

app tables
variation model
Ranjana's wiki for creating automated concise descriptions of genes.
geneace upload info
geneace upload of nongene info

Variation concise descriptions

Human-readable summaries of alleles, each describing the allele's lesion, its effect on the gene's function, and the resulting phenotypes. These descriptions aim to recreate summaries like those in the C. elegans I & II books and enhance them with up-to-date data. The first step is to make and display a summary for each variation; the second step is to extract information from the summaries of a gene's variations, combine it, and display the result on the corresponding gene page.

From C. elegans II
e51 : paralysed kinky small irregular pharyngeal pumping able to lay eggs. ES3 ME0. NA > 30
(e450amber e312amber (non-null) e309 (see sup- 6) etc.; all similar to e51 or slightly weaker).
See also e51, e328, e450, e973, e985, e2208, e2274 [C.elegansII] e51 : paralysed, kinky, small,
irregular pharyngeal pumping; able to lay eggs. Ric, high acetylcholine levels; variable
neuroanatomical defects.ES3 ME0. OA>30: e450amb, e312amb (non-null),e309 (suppressed by
sup-6), s69, s178 etc. All alleles similar to e51 or slightly weaker.

MGI produces similar summaries (it is not known whether they are automated), for example for MGI:95294:

Mutations widely affect epithelial development. Null homozygote survival is strain dependent, with
defects observed in skin, eye, brain, viscera, palate, tongue and other tissues. Other mutations
produce an open eyed, curly whisker phenotype, while a dominant hypermorph yields a thickened
epidermis.

Sample allele summaries:

ju2 is a null allele of syd-1(F32D2.5). The ju2 lesion is a nonsense point mutation that results in a
truncation of all 3 SYD-1 isoforms. ju2 results in defects in axodendritic polarity of ASI and L1 DDs,
neuron morphology of ASI but not DD or VD neurons, presynaptic component localization, synaptic
remodeling of VDs in adults, and backward movement resulting in coiling. ju2 animals do not show
defects in neurite development or postsynaptic component localization.
e1368 is a reduction-of-function/hypomorphic allele of the insulin/IGF receptor ortholog daf-2. The
e1368 lesion is a missense mutation affecting 5 of 6 coding transcripts of daf-2. e1368 affects many, but
not all DAF-2-activity requiring processes. Specifically, e1368 disrupts DAF-2 processes of embryonic
and larval development, formation of the developmentally arrested dauer larval stage (diapause),
adult longevity, fat storage, salt chemotaxis learning, and stress resistance, including response to
high temperature. e1368 mutants are temperature sensitive and are dauer constitutive at 22.5 deg. In
addition, e1368 animals have extended life spans. e1368 animals do not show any defects in acetylcholine
esterase activity, carbon dioxide avoidance, diacetyl chemotaxis, and DMPP response.

Source files for project

Building sentences

Building Molecular sentence 1

Template for molecular sentence 1

<variation> is a <app_nature> allele of <gene>.

  • Ex. "ju2 is a null allele of syd-1(F32D2.5)"
  • Ex. "e1368 is a reduction-of-function/hypomorphic allele of daf-2."
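
The template can be filled mechanically once the three values are known. A minimal Python sketch, assuming the allele nature has already been read from the OA and converted to display text; the function name `molecular_sentence_1` is hypothetical and is not part of any existing pipeline script:

```python
def molecular_sentence_1(variation: str, nature: str, gene: str) -> str:
    """Fill the template '<variation> is a <app_nature> allele of <gene>.'"""
    # Use "an" before a vowel-initial nature (e.g. "amorph"); otherwise "a".
    article = "an" if nature[:1].lower() in tuple("aeiou") else "a"
    return f"{variation} is {article} {nature} allele of {gene}."


print(molecular_sentence_1("ju2", "null", "syd-1(F32D2.5)"))
print(molecular_sentence_1("e1368", "reduction-of-function/hypomorphic", "daf-2"))
```

The article switch is an addition for readability; the template itself only shows "a".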

Sources for molecular sentence 1

  • Source 1: geneace file on postgres
    • tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
  • Source 2: gin_seqname.pg
    • tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
  • Source 3: postgres table app_function

Building Molecular sentence 2A - gene features

Gene features include DNA binding sites, promoters, UTRs, splice sites

Template for Molecular sentence -gene features

The <variation> is a <Molecular_change> in <gene feature> in <gene>.

  • Ex. tm3467 is a deletion mutation in a splice acceptor site in supr-1. (WBPaper00045690)

source tags

  1. Molecular_change tags:
                       Splice_site Donor Text #Evidence
                                   Acceptor Text #Evidence
                       Frameshift Text #Evidence  // added sdm

Building Molecular sentence 2B - gene processing

Gene processing effects include truncation and frameshifts.

Template for Molecular sentence -gene processing

The <variation> is a <Molecular_change> in <gene feature> in <gene>.

  • Ex. n3763 is a frameshift mutation in lin-35. (WBPaper00027336)
  • Ex. n767 is a silent mutation in lin-15. (WBPaper00001182)

source tags

  1. Molecular_change tags:
                       Silent Text #Evidence
                       Splice_site Donor Text #Evidence
                                   Acceptor Text #Evidence
                       Frameshift Text #Evidence  // added sdm
                       Readthrough Text #Evidence // klh WS228

Building Molecular sentence 2C - protein domains

Protein domains include catalytic sites, binding domains, activation domains etc.

Template for Molecular sentence -protein domains

The <variation> is a <Molecular_change> in the <protein domain> of <gene>.

  • Ex. gk291 is a deletion mutation in the PTB domain of dab-1. (WBPaper00031546)


Sources for molecular sentence 2

  • Source 1: geneace file on postgres
    • tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
  • Source 2: TBD for associated domain

Variation description tags needed for the above text:

  1. Molecular_change tags:
                       Nonsense UNIQUE Amber_UAG Text #Evidence
                                       Ochre_UAA Text #Evidence
                                       Opal_UGA Text #Evidence
                       Missense Text #Evidence                // text fields stored details of codon change
problems
  • coding vs noncoding genes
  • mutation in noncoding area of gene
  • mutation affects isoforms differently
  • only use variations mapped onto human proteins, rather than the other way around - keep in sync with other projects, also limits sentence to a subset of isoforms?
  • ..."that is or is not conserved in humans." can only use under limited situations, orthology needs to be done.

Building gene process summary sentences

Template for a process sentence

<Variation> affects <Gene> function in <GO>.
Example:

  • lf29 affects polk-1 function in error-prone translesion synthesis.
  • bp501 affects atg-4.1 function in autophagy.
  • ns260 affects ttx-1 function in embryo development ending in birth or egg hatching
  • ns235 affects ttx-1 function in the regulation of transcription from RNA polymerase II promoter
  • gg91 affects nrde-2 function in chromatin silencing by small RNA

Source files for process sentences

Gene association file:

WB	WBGene00006831	unc-104		GO:0048490	WB_REF:WBPaper00045884|PMID:25329901	IMP	WB:WBVar02141295	P		C52E12.2|klp-1	gene	taxon:6239	20141212	WB
WB	WBGene00019126	sam-4		GO:1903744	WB_REF:WBPaper00045884|PMID:25329901	IMP	WB:WBVar02125688	P		F59E12.11	gene	taxon:6239	20141212	WB
WB	WBGene00017696	polk-1		GO:0042276	WB_REF:WBPaper00041255|PMID:22761594	IMP	WB:WBVar01473736	P		F22B7.6	gene	taxon:6239	20150611	WB
WB	WBGene00013595	atg-4.1		GO:0006914	WB_REF:WBPaper00041282|PMID:22767594	IMP	WB:WBVar01473704	P		Y87G2A.3	gene	taxon:6239	20140724	WB
WB	WBGene00006652	ttx-1		GO:0009792	WB_REF:WBPaper00040681|PMID:22298710	IMP	WB:WBVar00603928	P		Y113G7A.6	gene	taxon:6239	20140408	WB
WB	WBGene00006652	ttx-1		GO:0045944	WB_REF:WBPaper00040681|PMID:22298710	IMP	WB:WBVar00603924	P		Y113G7A.6	gene	taxon:6239	20140408	WB	has_regulation_target<WB:WBGene00006894>,occurs_in<WBbt:0006754>,happens_during<GO:0009408>
WB	WBGene00011333	nrde-2		GO:0031048	WB_REF:WBPaper00040602|PMID:22231482	IMP	WB:WBVar00601048	P		T01E8.5	gene	taxon:6239	20150715	WB
WB	WBGene00017066	maco-1		GO:0006935	WB_REF:WBPaper00038428|PMID:21589894	IMP	WB:WBVar00597666|WB:WBVar00597667	P		D2092.5	gene	taxon:6239	20110823	WB
WB	WBGene00017066	maco-1		GO:0023041	WB_REF:WBPaper00038428|PMID:21589894	IMP	WB:WBVar00597666|WB:WBVar00597667	P		D2092.5	gene	taxon:6239	20110824	WB

Rules for Process sentence construction

  • Rule 1: Ignore all lines that do not have "IMP" in column 7
  • Rule 2: Map column 8 WBVariation to <allele public name> -
*use obo_name_variation.pg at tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies/geneace
*WBVarID is in first column of geneace file, public_name is second column of geneace file
  • Rule 3: When there is more than one WBVarID in column 8, map the other WBVarIDs as well and create a separate summary for each of those objects
  • Rule 4: Map column 5 GO:ID to GO name -
*use obo_goidprocess at tazendra /home/postgres/work/pgpopulation/obo_oa_ontologies

obo_goidprocess looks like:

GO:ID in line id: <GO:ID#######>
name in line name: <GO name>
If the GO term name starts with the qualifier "negative" or "positive", replace the qualifier with "the"

Example:
js901 affects unc-104 function in anterograde synaptic vesicle transport. Based on column 8 WBVar02141295 -> js901, column 3 unc-104, column 5 GO:0048490
js415 affects sam-4 function in the regulation of anterograde synaptic vesicle transport. Based on column 8 WBVar02125688 -> js415, column 3 sam-4, column 5 GO:1903744 -> positive regulation of anterograde synaptic vesicle transport -> Replace "positive" with "the" -> the regulation of anterograde synaptic vesicle transport

  • Rule 5: Italicize <variation> public name and column 3 gene
  • Rule 6: If there are two variations listed in column 8, make a summary for each variation each using the GO value in column 5 of the line
  • Rule 7: If the genes and alleles are the same in each line, concatenate the GO term names, comma separated or joined with “and”
*Example: 
 *nj21 affects maco-1 function in chemotaxis and neuronal signal transduction
 *nj34 affects maco-1 function in chemotaxis and neuronal signal transduction
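
Rules 1 to 4, 6 and 7 above could be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline script: `process_sentences`, `join_terms` and the two lookup dictionaries are hypothetical names, and the dictionaries are assumed to have been loaded beforehand from obo_name_variation.pg and obo_goidprocess. Rule 5 (italics) is left to the display layer.

```python
import re
from collections import defaultdict


def join_terms(terms):
    """Join names as 'a', 'a and b', or 'a, b, and c'."""
    terms = list(terms)
    if len(terms) <= 2:
        return " and ".join(terms)
    return ", ".join(terms[:-1]) + ", and " + terms[-1]


def process_sentences(gaf_lines, var_names, go_names):
    """Return one process sentence per (variation, gene) pair, in input order.

    var_names maps WBVar IDs to public names; go_names maps GO IDs to names.
    """
    pooled = defaultdict(list)  # (public name, gene) -> GO term names
    for line in gaf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[6] != "IMP":      # Rule 1: IMP lines only
            continue
        go_name = go_names.get(cols[4])            # Rule 4: GO ID -> GO name
        if go_name is None:
            continue
        # Replace a leading "negative"/"positive" qualifier with "the".
        go_name = re.sub(r"^(?:negative|positive) ", "the ", go_name)
        for ref in cols[7].split("|"):             # Rules 3 and 6: each WBVarID
            var_id = ref.split(":", 1)[-1]         # drop the "WB:" prefix
            public = var_names.get(var_id)         # Rule 2: ID -> public name
            if public is None:
                continue
            key = (public, cols[2])
            if go_name not in pooled[key]:         # Rule 7: pool per pair
                pooled[key].append(go_name)
    return [
        f"{var} affects {gene} function in {join_terms(names)}."
        for (var, gene), names in pooled.items()
    ]
```

Lines whose GO ID or WBVarID cannot be mapped are skipped silently here; a real script would probably log them instead.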

Building Phenotype Observed summary sentences using phenotype_association file

Template for Phenotype observed sentences

<Variation> results in (defects, alterations) in <phenotype(s)>.
Ex: "e1368 disrupts DAF-2 processes of embryonic and larval development, formation of the developmentally arrested dauer larval stage (diapause), adult longevity, fat storage, salt chemotaxis learning, and stress resistance, including response to high temperature. In addition, e1368 animals have extended life spans. "

Source files Phenotype Observed data

Source 1 phenotype association file:

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
WB	WBGene00000898	daf-2		WBPhenotype:0001682	WB:WBVar00143949	IMP	WB:WBPerson261	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2	NOT	WBPhenotype:0001688	WB:WBVar00143949	IMP	WB:WBPerson261	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000190	WB_REF:WBPaper00002149	IMP	WB:WBVar00088561	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2	NOT	WBPhenotype:0001660	WB_REF:WBPaper00006052	IMP	WB:WBVar00088561	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000136	WB_REF:WBPaper00046188	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000631	WB_REF:WBPaper00036280	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000637	WB_REF:WBPaper00038179	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0001184	WB_REF:WBPaper00038379	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0001351	WB_REF:WBPaper00038379	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0001861	WB_REF:WBPaper00041295	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB

Rules for phenotype sentence construction using the phenotype_association file

Rule 1: For lines that do not contain a value in column 4,

  • extract variation(column 6 or 8),
  • phenotype (column 5)
  • paper(column 6)
  • ex. "WB:WBVar00143947" "WBPhenotype:0001861" "WBPaper00041295"
  • Map WBVariationID to Variation public_name using obo_name_variation
  • Map WBPhenotype to Phenotype name using phenotype_ontology.WS251.obo

Rule 2: When more than one variation exists in a line create a new line and use only one variation, with same phenotype, NOT(if exists), and paper as original line
Rule 3: Pool all phenotypes for a given variation, comma separate
Rule 4: If a Phenotype term in the list is a parent of (is_a or part_of parent) another Phenotype term in the list, keep the most granular child term and ignore the parent term(s)
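
Rules 1 to 3 could be sketched in Python as below. The sketch assumes a tab-separated file laid out like the sample above, where the variation appears in column 6 or column 8 as a `WB:WBVar...` token; `pool_phenotypes` is a hypothetical name. Mapping IDs to public names and phenotype names is a separate dictionary lookup and is not shown.

```python
from collections import defaultdict


def pool_phenotypes(lines, negated=False):
    """Pool phenotype IDs per variation ID.

    negated=False keeps lines with an empty column 4 (observed phenotypes);
    negated=True keeps lines with NOT in column 4.
    """
    pooled = defaultdict(list)  # WBVar ID -> phenotype IDs in input order
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue
        if (cols[3] == "NOT") != negated:          # Rule 1: column 4 filter
            continue
        # The variation sits in column 6 or column 8 depending on the line.
        refs = cols[5].split("|") + cols[7].split("|")
        variations = [r.split(":", 1)[1] for r in refs if r.startswith("WB:WBVar")]
        for var_id in variations:                  # Rule 2: one entry per variation
            if cols[4] not in pooled[var_id]:      # Rule 3: pool, no duplicates
                pooled[var_id].append(cols[4])
    return dict(pooled)
```

Calling the same function with `negated=True` gives the input for the "NOT observed" sentences.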


Building Phenotype Observed summary sentences using postgres app tables

Template for Phenotype observed sentences

<Variation> results in (defects, alterations) in <phenotype(s)>.
Ex: "e1368 affects embryonic and larval development, formation of the developmentally arrested dauer larval stage (diapause), adult longevity, fat storage, salt chemotaxis learning, stress resistance, and response to high temperature. In addition, mutants have extended life spans."

Source files Phenotype Observed data

Source 1 postgres app tables:

app_term
app_variation
app_paper

Rules for phenotype sentence construction

Rule 1:

  • extract variation - app_variation
  • phenotype - app_term
  • paper - app_paper
  • ex. "WB:WBVar00143947" "WBPhenotype:0001861" "WBPaper00041295"
  • Map WBVariationID to Variation public_name using obo_name_variation
  • Map WBPhenotype to Phenotype name using phenotype_ontology.WS251.obo

Rule 2: When more than one variation exists in a line create a new line and use only one variation, with same phenotype, NOT(if exists), and paper as original line
Rule 3: Pool all phenotypes for a given variation, comma separate
Rule 4: If a Phenotype term in the list is a parent of (is_a or part_of parent) another Phenotype term in the list, keep the most granular child term and ignore the parent term(s)
Rule 5: Remove 'variant' from phenotype term names that contain it
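
Rule 4 amounts to dropping every term that is an ancestor of another term in the pooled list. A minimal Python sketch, assuming a `parents` dictionary (term ID to its direct is_a/part_of parents) has already been built from the phenotype ontology obo file; `ancestors` and `most_granular` are hypothetical names, and the IDs in the usage example are placeholders, not real ontology terms:

```python
def ancestors(term, parents):
    """Return all direct and indirect parents of a term."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_granular(terms, parents):
    """Keep only terms that are not an ancestor of another term in the list."""
    terms = list(dict.fromkeys(terms))  # de-duplicate, keep input order
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents)
    return [term for term in terms if term not in covered]


# Placeholder hierarchy: CHILD is_a MIDDLE is_a ROOT; OTHER is unrelated.
parents = {"CHILD": {"MIDDLE"}, "MIDDLE": {"ROOT"}}
print(most_granular(["ROOT", "CHILD", "OTHER"], parents))
```

Unrelated terms are always kept, so a variation annotated to two separate branches of the ontology retains one term per branch.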

Rules for specific phenotype terms

Rule 1: For all terms that contain 'variant', remove 'variant' from the term's public name
Rule 2: For the following terms:

  • WBPhenotype:0000061 - extended life span
  • WBPhenotype:0001171 - shortened life span
    • add an "s" to the end of the term public name
    • use the phrasing "In addition, mutants have extended life spans."
  • WBPhenotype:0001838 - drug induced gene expression variant
  • WBPhenotype:0001871 - drug induced life span variant
  • WBPhenotype:0001872 - drug induced locomotion variant
    • change "induced" to "influenced"
  • Terms that end in "early emb"
    • replace "early emb" in the term with "in the early embryo"
    • for term WBPhenotype:0001185 "embryonic developmental delay early emb", remove "embryonic"
    • for term WBPhenotype:0001186 "embryo delayed at pronuclear contact early emb", remove the first term "embryo"
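
The term-specific rules above could be collected into one renaming function. The Python sketch below is an assumption about how that might look; `display_name` and the three lookup tables are hypothetical names, and only the term IDs and rewrites listed above are covered:

```python
import re

LIFE_SPAN_TERMS = {"WBPhenotype:0000061", "WBPhenotype:0001171"}
DRUG_TERMS = {"WBPhenotype:0001838", "WBPhenotype:0001871", "WBPhenotype:0001872"}
LEADING_WORD_TO_DROP = {
    "WBPhenotype:0001185": "embryonic ",
    "WBPhenotype:0001186": "embryo ",
}


def display_name(term_id, name):
    """Rewrite a phenotype term name for use inside a sentence."""
    # Drop the redundant leading word for the two listed "early emb" terms.
    prefix = LEADING_WORD_TO_DROP.get(term_id)
    if prefix and name.startswith(prefix):
        name = name[len(prefix):]
    # Expand the "early emb" suffix.
    if name.endswith("early emb"):
        name = name[: -len("early emb")] + "in the early embryo"
    # "drug induced ..." becomes "drug influenced ..." for the listed terms.
    if term_id in DRUG_TERMS:
        name = name.replace("induced", "influenced")
    # Rule 1: remove the word "variant".
    name = re.sub(r"\s*\bvariant\b", "", name).strip()
    # Life span terms are pluralised.
    if term_id in LIFE_SPAN_TERMS:
        name += "s"
    return name


print(display_name("WBPhenotype:0001185", "embryonic developmental delay early emb"))
print(display_name("WBPhenotype:0001871", "drug induced life span variant"))
```

The life span phrasing ("In addition, mutants have extended life spans.") would then be built from the pluralised name in a separate step.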

Building Phenotype Attribute summary sentences

Template for Phenotype Attribute summary sentences

<variation> is <app attribute>, <app attribute> <app_term>

  • Ex. "s1019 is cold-sensitive, maternal effect mid larval lethal;"

<variation> is <app attribute> <app_term> at <app_heat_temp> OR <variation> is <app attribute> <app term> at <

  • Ex. "e1368 is temperature sensitive dauer constitutive at 22.5 deg C;"

<variation> is <app attribute> for all phenotypes at <app_heat_degree> OR <variation> is <app attribute> for all phenotypes at <app_cold_degree>

  • Ex. "oj21 is temperature sensitive for all phenotypes at 25 deg C; "

<variation> is <app_mat_effect> <app_term>

  • Ex. "oj21 is maternal effect embryonic lethal;"

Source for Phenotype Attribute sentence

postgres app tables

Rules for Phenotype Attribute summary sentences

Rule 1: When there is more than one phenotype attribute sentence, use the generic "animals" instead of repeating the allele name

  • Ex "oj21 is temperature sensitive for all phenotypes at 25 deg C; oj21 is maternal effect embryonic lethal; animals are 100% sterile at 25 degrees C;"

Building Phenotype NOT Observed summary sentences

Template for Phenotype NOT observed sentences

<Allele> does not show (defects, alterations) in <NOT phenotype(s)>.

  • Ex. "e1368 animals do not show any defects in acetylcholine esterase activity, carbon dioxide avoidance, diacetyl chemotaxis, and DMPP response."

Source files Phenotype Observed data

Source 1 phenotype association file:

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
WB	WBGene00000898	daf-2		WBPhenotype:0001682	WB:WBVar00143949	IMP	WB:WBPerson261	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2	NOT	WBPhenotype:0001688	WB:WBVar00143949	IMP	WB:WBPerson261	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000190	WB_REF:WBPaper00002149	IMP	WB:WBVar00088561	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2	NOT	WBPhenotype:0001660	WB_REF:WBPaper00006052	IMP	WB:WBVar00088561	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000136	WB_REF:WBPaper00046188	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000631	WB_REF:WBPaper00036280	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB
WB	WBGene00000898	daf-2		WBPhenotype:0000637	WB_REF:WBPaper00038179	IMP	WB:WBVar00143947	P		Y55D5A.5	gene	taxon:6239	20151027	WB

Rules for phenotype sentence construction

Rule 1: For each line that contains (NOT) value in column 4, extract

  • variation(column 6 or 8),
  • phenotype (column 5)
  • paper(column 6)
  • NOT(column 4)
  • ex. "WB:WBVar00143947" "NOT" "WBPhenotype:0000114" "WBPaper00039871"
  • Map WBVariationID to Variation public_name using obo_name_variation
  • Map WBPhenotype to Phenotype name using phenotype_ontology.WS251.obo

Rule 2: When more than one variation exists in a line create a new line and use only one variation, with same phenotype, and paper as original line
Rule 3: Pool all NOT phenotypes for a given variation, comma separate
Rule 4: If a Phenotype term in the list is a parent of (is_a or part_of parent) another Phenotype term in the list, keep the most granular child term and ignore the parent term(s)
Rule 5: Remove 'variant' from public names that have them
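
Once the NOT phenotypes for a variation have been pooled and mapped to names, the sentence itself is a simple join. A minimal Python sketch, following the wording of the e1368 example above; `join_terms` and `not_observed_sentence` are hypothetical names:

```python
def join_terms(terms):
    """Join names as 'a', 'a and b', or 'a, b, and c'."""
    terms = list(terms)
    if len(terms) <= 2:
        return " and ".join(terms)
    return ", ".join(terms[:-1]) + ", and " + terms[-1]


def not_observed_sentence(variation, phenotype_names):
    """Build the 'Phenotype NOT observed' sentence; empty string if no terms."""
    if not phenotype_names:
        return ""
    return f"{variation} animals do not show any defects in {join_terms(phenotype_names)}."


print(not_observed_sentence(
    "e1368",
    ["acetylcholine esterase activity", "carbon dioxide avoidance",
     "diacetyl chemotaxis", "DMPP response"],
))
```

Returning an empty string for a variation without NOT annotations lets the caller skip the sentence without special-casing.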


Building Disease Orthology summary sentences

see Caltech group meeting May 7, 2015

  • to connect conserved/syntenic mutations
  • link elegans gene variations and phenotypes to homologous human disease gene variations
  • to link elegans mutations used as disease models, e.g. pdr-1 mutations used to model juvenile Parkinson's disease

use ortholog(s) of gene defined by Generation_of_automated_descriptions#Orthology.2FHomology Orthology, Homology and Paralog data in WormBase

Examples for concise descriptions for variations of different types

  • Classic alleles

notes: can add: mutagen, history of isolation

    • Alleles with molecular data
e1368 is a reduction-of-function/hypomorphic allele of the insulin/IGF receptor ortholog daf-2. The
e1368 lesion is a missense mutation affecting 5 of 6 coding transcripts of daf-2. e1368 affects many, but
not all DAF-2-activity requiring processes. Specifically, e1368 disrupts DAF-2 processes of embryonic
and larval development, formation of the developmentally arrested dauer larval stage (diapause),
adult longevity, fat storage, salt chemotaxis learning, and stress resistance, including response to
high temperature. e1368 mutants are temperature sensitive and are dauer constitutive at 22.5 deg. In
addition, e1368 animals have extended life spans. e1368 animals do not show any defects in acetylcholine
esterase activity, carbon dioxide avoidance, diacetyl chemotaxis, and DMPP response.
    • Alleles with only genetic data
  • Other variation types
  • Engineered alleles
  • Integrated transgenes

other categories of alleles

  • Most published alleles
  • Alleles with most phenotypes
  • Alleles of genes with concise descriptions (see concise description wiki to find these)

Prioritizing variations for automated descriptions

  • Classic alleles - Variations with Variation_type Allele (see Variation model)
    • Alleles with molecular data => Variation_type Allele + Type_of_mutation exists
    • Alleles with only genetic data => Variation_type Allele and NO Type_of_mutation exists
  • Variations of genes with concise descriptions
    • Variation Affects Gene + gene exists with cns_summary table entry; acedb model Gene->Gene_info->Concise_description
  • Other variation types
    • Variation_type SNP, Confirmed_SNP
    • Variation_type Transposons, RFLPs
  • Engineered variations
    • Variation_type Engineered_allele
  • Integrated transgenes with allele name
  • Variations that have been most published
    • textpresso search results
  • Variations with most phenotypes
    • app_variation count

Orthology/ Conserved domain mutations

see Caltech group meeting May 7, 2015
connections between mutations and protein domains, and predict effects on function
We currently do not capture mutations in the context of affecting a conserved amino acid - how and who would do this? Can Hinxton generate these?
Many examples can be found with the Textpresso search for 'mutation conserved'

Examples

id : WBPaper00037661
name : WBPaper00037661
title : Sequential action of Caenorhabditis elegans Rab GTPases regulates phagolysosome formation during apoptotic cell degradation.
Sequencing of rab-14 in qx18 mutants revealed a C to T transition, which resulted in substitution of the Threonine at codon 67 with Methionine (ACG > ATG; T67M). This mutation affects the phosphate/Mg2+
binding domain PM3, which is conserved in all members of the Ras GTPase superfamily

Sentences to make: 
*Molecular sentence 2: "qx18 encodes a C to T transition, T67M which affects the phosphate/Mg2+ binding domain PM3" 
*Orthology sentence: "The phosphate/Mg2+ binding domain PM3 is conserved in all members of the Ras GTPase superfamily"
Reverse in vitro mutation analysis of elegans mutation on mammalian disease gene.
Title: Introduction of a loss-of-function point mutation from the SH3 region of the Caenorhabditis elegans sem-5 gene activates the transforming ability of c-abl in vivo and abolishes binding of proline-rich ligands in vitro .
Authors: Van Etten RA ; Debnath J ; Zhou H ; Casasnovas JM
Journal: Oncogene
Year: 1995-05-18
Doc ID: WBPaper00002191
When the n1619 mutation , which confers a lethal and highly penetrant vulvaless phenotype in C . elegans , is introduced into the c-abl SH3 domain , substituting a leucine for proline at AN amino acid number 131 , the resulting mutant transforms NIH3T3 fibroblasts with an efficiency about 10 % that of SH3-deleted c-abl .

Sentence to make: 
*Molecular sentence 2: "n1619 encodes a P131L substitution in the c-abl SH3 domain"
Title: CED-9 and mitochondrial homeostasis in C . elegans muscle .
Authors: Tan FJ ; Husain M ; Manlandro CM ; Koppenol M ; Fire AZ ; Hill RB
Journal: J Cell Sci
Year: 2008-10-15
Doc ID: WBPaper00032231
SECTION: results. This allele encodes a mutation where glycine 169 in the BH3 binding pocket is replaced with glutamate ( Fig . 4C ) ( Hengartner and Horvitz , 1994a ) , which inhibits EGL-1 from binding and triggering a conformational change in CED-9 ( del Peso et al . , 2000 ; Yan et al . , 2004 )
SECTION: results. In the gain-of-function ced-9 ( n1950sd ) allele , glycine 169 , which resides in the CED-9 BH3 binding pocket , is mutated to glutamate ( G169E ) . [Field: results, subscore: 3.00]
SECTION: results. To test whether co-expression of DRP-1 modulates CED-9 via interactions with the BH3 binding pocket , we first created a construct corresponding to the ced- 9 ( n1950gf ) allele .

Sentence to make: 
*Molecular sentence 2: "n1950 encodes a G169E change in the BH3 binding pocket of CED-9"
WBPaper00029156 Modzelewska 2007
“the sy262 mutation lies within the Rac GEF Dbl domain , it is possible that the mutation acts through one or both of the GTPases .”
“In conjunction with molecular modeling , our data suggest that the C . elegans mutation as well as an equivalent mutation in human SOS1 activate the MAPK pathway by disrupting an auto-inhibitory function of the Dbl domain on Ras activation “
“A mutation equivalent to sy262 G322R activates hSOS1.”
“...in every experiment we found that at some time point, hSOS1 C282R dis- played two- to fourfold more activity than wild-type hSOS1. In conjunction with our genetic data, these data suggest that the G322R change in C. elegans SOS-1, as well as the equivalent C282R change in hSOS1, does indeed enhance EGF-dependent MAPK activation.”

Sentence to make: 
*Molecular sentence 2: "sy262 encodes a G322R mutation in the Rac GEF Dbl domain and is equivalent to a C282R change in the human hSOS1."
*Orthology sentence: "The sy262 G322R mutation is equivalent to a C282R change in the human hSOS1."
Title: The genetics of ivermectin resistance in Caenorhabditis elegans .
Authors: Dent JA ; Smith M ; Vassilatis DK ; Avery L
Journal: Proc Natl Acad Sci U S A
Year: 2000-03-14
Doc ID: WBPaper00003954
results. The region surrounding the conserved valine ( bold ) that is mutated to a glutamate in the ad1302 allele ( resulting from a T to A mutation in the second base of the V60 codon ) is shown lined up with the corresponding region in other GluCl subunits and in the rat glycine and γ-aminobutyric acid ( GABA ) type A- channel subunits ( 29 , 30 ) . [Field: results]

Sentence to make: 
*Molecular sentence 2: "ad1302 encodes a Val to Glu change."
*Orthology sentence: "ad1302 affects a region that is conserved in GluCl subunits as well as in rat glycine and GABA type A-channel subunits."
Title: POP-1 controls axis formation during early gonadogenesis in C . elegans .
Authors: Siegfried KR ; Kimble J
Journal: Development
Year: 2002-01
Doc ID: WBPaper00005116
SECTION: abstract. The pop-1 ( q624 ) allele is weakly penetrant for multiple defects and appears to be a partial loss-of-function mutation ; pop-1 ( q624 ) alters a conserved amino acid in the HMG-box DNA binding domain . [Field: abstract]
SECTION: discussion. This mutation alters a conserved amino acid in the HMG box DNA binding domain which is conserved specifically in TCF / LEF- 1 type HMG proteins ( Laudet et al . , 1993 ) , suggesting that the pop-1 ( q624 ) mutation may affect either recognition of the TCF / LEF-1 consensus sequence or DNA binding affinity , thereby lowering POP-1 activity . [Field: discussion]
SECTION: discussion. As the only mutation isolated thus far in a developmental system that changes a highly conserved amino acid in the β-catenin binding domain , the pop-1 ( q645 ) missense mutation may shed new light on TCF / LEF-1 function during development . [Field: discussion]
SECTION: introduction. The pop-1 ( q624 ) allele is weakly penetrant for multiple defects and appears to be a partial loss-of function mutation ; pop-1 ( q624 ) alters a conserved amino acid in the HMG-box DNA binding domain . [Field: introduction]
SECTION: results. The pop-1 ( q645 ) mutation carries a nucleotide substitution predicted to change an aspartic acid ( D ) to a glutamic acid ( E ) ( Fig . 2B ) ; this mutation resides within the pop-1 β-catenin binding domain and alters an amino acid conserved in all known TCF / LEF-1 proteins , including nematode , fly , and vertebrate homologues . [Field: results]
SECTION: results. The pop-1 ( q624 ) mutation possesses a nucleotide change in the region encoding the HMG box ; the predicted amino acid change in this case also affects a conserved amino acid ( Fig . 2C ) . [Field: results]

Sentence to make: 
*Molecular sentence 2: "q645 encodes a D to E change in the β-catenin binding domain."
*Orthology sentence: "q624 affects an amino acid in the HMG-box DNA binding domain that is conserved specifically in TCF/LEF-1 type HMG proteins."
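
The "Molecular sentence 2" examples follow a fixed pattern (allele, original residue, new residue, optional domain), so they could be produced from a template. The following is only a minimal sketch of that idea; the function name `molecular_sentence_2` and its parameters are hypothetical and are not part of any existing pipeline script.

```python
def molecular_sentence_2(allele, from_aa, to_aa, domain=None):
    """Build a 'Molecular sentence 2' phrase for an amino acid substitution.

    All argument names are illustrative assumptions; the values would come
    from curated or text-mined molecular data for the variation.
    """
    sentence = f"{allele} encodes a {from_aa} to {to_aa} change"
    if domain:
        # Append the affected protein domain only when it is known.
        sentence += f" in the {domain}"
    return sentence + "."


# Example using the ad1302 data quoted from WBPaper00003954.
print(molecular_sentence_2("ad1302", "Val", "Glu"))
```

Keeping the domain optional lets the same template cover alleles for which only the residue change has been reported.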

Location of project-related files

on Textpresso-dev

The directory structure should be something like

  • descriptions/ (descriptions and stats per release)
  • source_files/ (common input files, including variation, gene, obos, etc.)
  • molecular/ (phrases generated from molecular info input)
  • process/ (phrases generated from GO input)
  • phenotype/ (phrases generated from phenotype input)

on postgres

  • The latest dump:
  • Variation concise description pipeline:
  • Scripts:
  • Output location:

Order of sentences

  1. Molecular sentence 1
  2. Molecular sentence 2
  3. Disease/Orthology
  4. Process
  5. Phenotype observed
  6. Phenotype attribute
  7. Phenotype NOT observed
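
As a rough illustration of the ordering above, the sketch below joins whichever sentences are available for a variation in that fixed order. The key names in `SENTENCE_ORDER` and the function `assemble_description` are assumptions made for this example only; they do not correspond to existing script or table names.

```python
# Hypothetical keys, listed in the sentence order given above.
SENTENCE_ORDER = [
    "molecular_1",
    "molecular_2",
    "disease_orthology",
    "process",
    "phenotype_observed",
    "phenotype_attribute",
    "phenotype_not_observed",
]


def assemble_description(sentences):
    """Join the available sentences in the agreed order, skipping missing ones."""
    ordered = [
        sentences[key].strip()
        for key in SENTENCE_ORDER
        if sentences.get(key)
    ]
    return " ".join(ordered)


# Example with the two ad1302 sentences, deliberately supplied out of order.
example = {
    "disease_orthology": (
        "ad1302 affects a region that is conserved in GluCl subunits as well "
        "as in rat glycine and GABA type A-channel subunits."
    ),
    "molecular_2": "ad1302 encodes a Val to Glu change.",
}
print(assemble_description(example))
```

Driving the order from a single list means a later change to the sentence order needs only one edit.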

Postgres sources

Source files for phenotype attribute data for variation concise description
*OA (app) tables:
**app_variation
**app_term
**app_paper, app_easescore, app_mmateff, app_hmateff, app_molecule, app_anatomy, app_lifestage, app_penetrance, app_mat_effect, app_temperature, app_cold_degree, app_heat_degree, app_pat_effect
**app_haplo, app_cold_sens, app_heat_sens
<pre>
need to adapt for variation concise description
**for all these tables where exp_endogenous table value 'endogenous', grab the exp_gene, WBGeneXXXXXXXX
**exp_name, values look like Expr1005.
**exp_anatomy for anatomy terms, in the form of WBbt:0003679, translate these IDs using the anatomy ontology file
**anatomy ontology file from ftp site:ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
**exp_paper for paper
**exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'.

Preliminary results

Mapping of automated variation concise description data to OA fields

Mapping of data to data fields in the OA, using the table name prefix vcd for variation concise description:

{| class="wikitable"
! OA field number !! OA field name !! Data to be inserted !! Example of data to be inserted !! Required or Not !! OA table name
|-
| 1 || WBVariation || Variation || WBVar00145853 OR gk448 || Required || vcd_variation
|-
| 2 || Species || Species || Onchocerca volvulus || Required || vcd_species
|-
| 3 || Curator || Name of Curator || James Done (first then replace with) Karen Yook (insert for all rows) || Required || vcd_curator
|-
| 4 || Curator History || Name of Curator || same as pgid (insert for all rows) || Required || vcd_curhistory
|-
| 5 || Description Type || Automated_concise_description (insert for all rows) || Automated_concise_description || Required || vcd_desctype
|-
| 6 || Description Text || the automated concise description || asp-19 encodes an ortholog... || Required || vcd_desctext
|-
| 7 || Reference || WBPaper || WBPaper00026979 || Required || vcd_paper
|-
| 8 || Last Updated || Date when the descriptions were last generated || 2014-09-11 || Required || vcd_lastupdate
|-
| 9 || pgid || pgid || 1149 (Postgres will generate) || Required ||
|}

Tab-delimited file for OA insert

(for gene concise desc, for reference)

  • One tab-delimited file per species
  • Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
  • Format: tab-delimited file; when a field has multiple values, separate them with commas
  • Date is the last date that the script was run to generate the automated descriptions (e.g. 2014-05-28)
  • File will be placed on textpresso-dev to be picked up by a cron job by JC
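
To make the file layout above concrete, here is a minimal sketch of formatting one record as a tab-delimited line, with multiple values joined by commas. The column keys mirror the field order listed above; the function name `format_row` and the sample values marked as placeholders are assumptions for illustration, not taken from the actual gene concise description scripts.

```python
# Column order as listed above (gene concise description file, for reference).
COLUMNS = [
    "WBGene",
    "Date",
    "Paper",
    "Accession_evidence",
    "Automated_concise_description",
    "Species",
    "Inferred_automatically",
]


def format_row(record):
    """Return one tab-delimited line; list values are comma-separated."""
    cells = []
    for column in COLUMNS:
        value = record.get(column, "")
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        # Tabs or newlines inside a value would break the column layout.
        cells.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


# Placeholder record: the gene ID, second paper ID, and description text are
# invented for the example.
record = {
    "WBGene": "WBGene00000001",
    "Date": "2014-05-28",
    "Paper": ["WBPaper00026979", "WBPaper00000002"],
    "Automated_concise_description": "placeholder description text",
    "Species": "Caenorhabditis elegans",
}
print(format_row(record))
```

Missing fields are written as empty cells so that every line keeps the same number of columns.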

Directory structure for project

Use same structure as for gene concise descriptions (which follows)

Inserting automated descriptions into postgres

Populating script

Scripts for automated concise descriptions (for reference)

  • /home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl -> /home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl

which look at

For testing on Mangolassi

Dumping to .ace

Tracking progress

Generate a report for numbers and place on Textpresso-dev

  • Report for each upload:
  • Total number of automated variation descriptions =
  • Number of automated descriptions with molecular details =
  • Number of automated descriptions with gene function/GO information =
  • Number of automated descriptions with phenotype information =
  • Number of automated descriptions with human disease reference =
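
The counts in this report could be derived directly from the generated descriptions. The sketch below assumes, purely for illustration, that each description is held as a dict with a `content` set naming the kinds of information it contains; that structure, the flag names, and the function `build_report` are hypothetical.

```python
# Report labels from the list above, paired with a hypothetical content flag.
REPORT_FIELDS = [
    ("Total number of automated variation descriptions", None),
    ("Number of automated descriptions with molecular details", "molecular"),
    ("Number of automated descriptions with gene function/GO information", "process"),
    ("Number of automated descriptions with phenotype information", "phenotype"),
    ("Number of automated descriptions with human disease reference", "disease"),
]


def build_report(descriptions):
    """Return the report lines, one 'label = count' string per field."""
    lines = []
    for label, flag in REPORT_FIELDS:
        if flag is None:
            count = len(descriptions)
        else:
            count = sum(1 for d in descriptions if flag in d["content"])
        lines.append(f"{label} = {count}")
    return lines


# Placeholder input: the content flags here are invented for the example.
descriptions = [
    {"variation": "ad1302", "content": {"molecular"}},
    {"variation": "q624", "content": {"molecular", "phenotype"}},
    {"variation": "gk448", "content": set()},
]
for line in build_report(descriptions):
    print(line)
```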

Changes/Updates for each release

Issues to address

Automated descriptions software

Follow the gene concise description pipeline documentation for workflow and scripts.

Publications related to Text-mining methods

  • Automatically generating gene summaries from biomedical literature. Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B. Pac Symp Biocomput. 2006:40-51. PMID:17094226
  • Generating gene summaries from biomedical literature: A study of semi-structured summarization. Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.