Difference between revisions of "VariationConciseDescriptions"

Revision as of 20:37, 9 December 2015

Work in progress based on Ranjana's wiki for creating automated concise descriptions of genes.

Goals for variation concise descriptions

This project aims to create a human-readable summary of an allele that describes its lesion, its effect on the gene's function, and the resulting phenotypes. These descriptions aim to recreate, in enhanced form, summaries like those in the C. elegans I and II books, for example:

 
e51 : paralysed kinky small irregular pharyngeal pumping able to lay eggs. ES3 ME0. NA > 30 
(e450amber e312amber (non-null) e309 (see sup- 6) etc.; all similar to e51 or slightly weaker).
See also e51, e328, e450, e973, e985, e2208, e2274 [C.elegansII] e51 : paralysed, kinky, small, 
irregular pharyngeal pumping; able to lay eggs. Ric, high acetylcholine levels; variable 
neuroanatomical defects.ES3 ME0. OA>30: e450amb, e312amb (non-null),e309 (suppressed by 
sup-6), s69, s178 etc. All alleles similar to e51 or slightly weaker.

MGI produces similar summaries (it is not known whether they are automated), for example for MGI:95294:

 
Mutations widely affect epithelial development. Null homozygote survival is strain dependent, with 
defects observed in skin, eye, brain, viscera, palate, tongue and other tissues. Other mutations 
produce an open eyed, curly whisker phenotype, while a dominant hypermorph yields a thickened 
epidermis.

First step: make individualized variation summaries for variation pages, then extract, combine, and shorten them for the corresponding gene pages.

Sample variation summary:

ju2 is a null allele of syd-1(F32D2.5). The ju2 lesion is a nonsense point mutation that results in a
truncation of all 3 SYD-1 isoforms. ju2 results in defects in axodendritic polarity of ASI and L1 DDs, neuron morphology of ASI but not DD or VD neurons, presynaptic component localization, synaptic remodeling of VDs in adults, and backward movement resulting in coiling. ju2 animals do not show defects in neurite development or postsynaptic component localization.

Reconstruction from sources: <obo_data_variation:public_name> is a <app_nature> allele of <obo_data_variation:gene associated>. The <allele> lesion is a <Variation->Description->"Nonsense"> that results in <unsure of source> of <count of Variation->Affects->"Transcript">. <Allele> results in defects in <app_phenotype where app_not is false>. <Allele> does not show defects in <app_phenotype where app_not is set>.
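
The reconstruction above can be sketched as a small template-filling function. This is only an illustration, not the project's pipeline: the names `variation_summary` and `join_phrases` and all parameters are hypothetical, the field values are assumed to be already extracted from the sources, and the "results in" clause (whose source is still unknown) is passed in as plain text.

```python
def join_phrases(items):
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def variation_summary(public_name, nature, gene, lesion, consequence,
                      n_transcripts, phenotypes):
    """Fill the summary template from already-extracted values.

    phenotypes is a list of (term, is_not) pairs; is_not mirrors the
    app_not flag attached to an app_phenotype annotation.
    """
    observed = [term for term, is_not in phenotypes if not is_not]
    not_observed = [term for term, is_not in phenotypes if is_not]
    sentences = [
        f"{public_name} is a {nature} allele of {gene}.",
        f"The {public_name} lesion is a {lesion} that results in "
        f"{consequence} of {n_transcripts} transcript(s).",
    ]
    # Only emit the phenotype sentences when there is something to list.
    if observed:
        sentences.append(
            f"{public_name} results in defects in {join_phrases(observed)}.")
    if not_observed:
        sentences.append(
            f"{public_name} does not show defects in "
            f"{join_phrases(not_observed)}.")
    return " ".join(sentences)
```

Keeping the observed and NOT phenotypes in one list of pairs follows the way app_not qualifies app_phenotype in the template; a real implementation would still need to decide how to shorten long phenotype lists.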

Variation description tags that are needed for the above text:

Description tags:

                       Nonsense UNIQUE Amber_UAG Text #Evidence
                                       Ochre_UAA Text #Evidence
                                       Opal_UGA Text #Evidence
                       Missense Text #Evidence                // text fields stored details of codon change
                       Silent Text #Evidence
                       Splice_site Donor Text #Evidence
                                   Acceptor Text #Evidence
                       Frameshift Text #Evidence  // added sdm
                       Readthrough Text #Evidence // klh WS228

e1368 is a reduction-of-function/hypomorphic allele of the insulin/IGF receptor ortholog daf-2. The 
e1368 lesion is a missense mutation affecting 5 of 6 coding transcripts of daf-2. e1368 affects many, but 
not all, processes that require DAF-2 activity. Specifically, e1368 disrupts DAF-2 processes of embryonic 
and larval development, formation of the developmentally arrested dauer larval stage (diapause), 
adult longevity, fat storage, salt chemotaxis learning, and stress resistance, including response to 
high temperature. e1368 mutants are temperature sensitive and are dauer constitutive at 22.5 °C. In 
addition, e1368 animals have extended life spans. e1368 animals do not show any defects in 
acetylcholinesterase activity, carbon dioxide avoidance, diacetyl chemotaxis, or DMPP response.

Classes of variations for concise descriptions

  • Classic alleles
    • Alleles with molecular data
    • Alleles with only genetic data
  • Alleles of genes with concise descriptions (see concise description wiki to find these)
  • Other variation types
  • Engineered alleles
  • Integrated transgenes
  • Most published alleles
  • Alleles with most phenotypes

Querying for variation sets

Obtaining variation IDs and names for automated descriptions from geneace (?)

  • Classic alleles: variations with Variation_type Allele (see Variation model)
    • Alleles with molecular data
      • Variation_type Allele + Type_of_mutation exists
    • Alleles with only genetic data
      • Variation_type Allele and NO Type_of_mutation exists
  • Variations of genes with concise descriptions
    • Variation Affects Gene + gene exists with cns_summary table entry; acedb model Gene->Gene_info->Concise_description
  • Other variation types
    • Variation_type SNP, Confirmed_SNP
    • Variation_type Transposons, RFLPs
  • Engineered variations
    • Variation_type Engineered_allele
  • Integrated transgenes with allele name
  • Variations that have been most published
    • Textpresso search results
  • Variations with most phenotypes
    • app_variation count
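
The tag-based part of these queries could be sketched as follows. This is a rough illustration under assumptions: `classify_variation` and its arguments are hypothetical, the tag strings are copied from the list above and have not been checked against the Variation model, and the sets that need other data (concise descriptions, Textpresso counts, app_variation counts) are left out.

```python
# Tag names as written in the list above (not verified against the model).
OTHER_TYPES = {"SNP", "Confirmed_SNP", "Transposons", "RFLPs"}


def classify_variation(variation_types, type_of_mutation=None):
    """Return the query-set labels that apply to one variation.

    variation_types: iterable of Variation_type tags on the object.
    type_of_mutation: the Type_of_mutation value, or None if absent.
    """
    types = set(variation_types)
    labels = []
    if "Allele" in types:
        if type_of_mutation:
            labels.append("classic allele with molecular data")
        else:
            labels.append("classic allele with only genetic data")
    if types & OTHER_TYPES:
        labels.append("other variation type")
    if "Engineered_allele" in types:
        labels.append("engineered allele")
    return labels
```

A variation can carry more than one Variation_type tag, which is why the function returns a list of labels rather than a single class.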

Semantic categories in an Automated Description

Molecular

  • Lesion
  • Gene feature
  • Gene product
    • Gene product domain
    • Gene molecular function

Genetic

  • Gene function (null, hypomorph, etc. ->app_func)
  • Allele nature (recessive, dominant, etc. ->app_nature)
  • Phenotype
    • Affects tissue expression, subcellular localization
    • Affects gene regulation
    • Affects gene interaction
  • Orthology to human gene mutations related to disease

Lesion

  • Rationale: Molecular details
  • Example:
  • Model tags:
  • Source files:
  • Template sentence:
<Variation> is a(n) <Variation_type> in the <Species> <Gene>. This variation results in a <molecular summary> <Type_of_mutation> in the gene. 

Targeted mutations/Engineered alleles

Textpresso search:

  • "engineered allele"

Title: Cell death in C . elegans : molecular insights into mechanisms conserved between nematodes and mammals .
Doc ID: WBPaper00002592
SECTION: introduction. After a cell nuc-1 gene is needed for ced-3 missense mutations affect a ced-3 allele engineered to encode a in vitro in vitro to generate ced-3 protein harboring an amino acid ced-3 loss-of-function allele in vitro assay , Figure 3 . [Field: introduction]
SECTION: introduction. Included in this family are mec-10 , which is expressed in the touch receptor neurons and can be engineered to encode toxic degeneration-inducing substitutions ( 43 ) ; unc105 , which appears to be expressed in muscle and which can mutate to a semi-dominant allele that induces muscle hypercontraction ( 60 ) ; and unc-8 ( Tavernarakis et al . submitted ) , which can mutate to a semi-dominant allele that induces swelling and dysfunction of ventral cord neurons ( 77 ) . [Field: introduction]


Protein domain mutations

See Caltech group meeting, May 7, 2015.
Make connections between mutations and protein domains, and predict effects on function.
We currently do not capture mutations in the context of affecting a conserved amino acid. How would this be done, and by whom? Can Hinxton generate these?
Many examples can be found with the Textpresso search for 'mutation conserved'.

Examples

id : WBPaper00037661
name : WBPaper00037661
title : Sequential action of Caenorhabditis elegans Rab GTPases regulates phagolysosome formation during apoptotic cell degradation.
Sequencing of rab-14 in qx18 mutants revealed a C to T transition, which resulted in substitution of the Threonine at codon 67 with Methionine (ACG > ATG; T67M). This mutation affects the phosphate/Mg2+
binding domain PM3, which is conserved in all members of the Ras GTPase superfamily
title : SYD-1, a presynaptic protein with PDZ, C2 and rhoGAP-like domains, specifies axon identity in C. elegans.
id : WBPaper00005543
pdf : 5543_Hallam02.pdf
syd-1(GAPdeletion) mutation interferes with neurite outgrowth. 
Constructs with various missense mutations, as well as a deletion in the conserved rhoGAP domain of syd-1, were made and assessed for a phenotype in transgenic animals.
id : WBPaper00027028
name : WBPaper00027028
title : Conditional dominant mutations in the Caenorhabditis elegans gene act-2 identify cytoplasmic and muscle roles for a redundant actin isoform.
semidominant and embryonic-lethal mutations in the C. elegans act-2 gene. These mutations alter conserved amino acids in the predicted ATP binding pocket of actin and promote contractile instabilities and ectopic furrowing in early embryonic cells, implicating ACT-2 as a cytoplasmic
actin.
Title: The Caenorhabditis elegans Iodotyrosine Deiodinase Ortholog SUP-18 Functions through a Conserved Channel SC-Box to Regulate the Muscle Two-Pore Domain Potassium Channel SUP-9 .
Authors: de la Cruz IP ; Ma L ; Horvitz HR
Journal: PLoS Genet
Year: 2014-02
Doc ID: WBPaper00044940
SECTION: discussion. Five other sup-18 mutations affecting highly conserved residues in the NADH oxidase / flavin reductase domain also behave like null mutations , consistent with the hypothesis that SUP-18 enzymatic activity is essential for its function.
SECTION: discussion. While Kvb2 knockout mice have seizures and reduced lifespans , mice carrying a catalytic null mutation in Kvb2 have a wild-type phenotype , suggesting that if an enzymatic activity for Kvb2 exists , it is functionally dispensable SUP-18 Interacts with a Two-Pore Domain K + Channel PLOS Genetics | www . plosgenetics . org 11 February 2014 | Volume 10 | Issue 2 | e1004175 in vivo [ 59 ] . 
IYDs across metazoan species share a similar enzymatic activity in reductive deiodination of diiodotyrosine [51], and it seems likely that SUP-18 acts similarly in C. elegans. Like mammalian IYDs,
SUP-18 contains a presumptive N-terminal transmembrane domain that is required for full activity. Interestingly, the SUP-18 intracellular region lacking the transmembrane domain could still partially activate the SUP-9 channel, suggesting that membrane association is not absolutely required for SUP-9 activation by SUP-18. Membrane association is important for mammalian IYD enzymatic activities [5,52,53].
Reverse analysis: effect of a C. elegans mutation introduced into a mammalian disease gene.
Title: Introduction of a loss-of-function point mutation from the SH3 region of the Caenorhabditis elegans sem-5 gene activates the transforming ability of c-abl in vivo and abolishes binding of proline-rich ligands in vitro .
Authors: Van Etten RA ; Debnath J ; Zhou H ; Casasnovas JM
Journal: Oncogene
Year: 1995-05-18
Doc ID: WBPaper00002191
When the n1619 mutation , which confers a lethal and highly penetrant vulvaless phenotype in C . elegans , is introduced into the c-abl SH3 domain , substituting a leucine for proline at AN amino acid number 131 , the resulting mutant transforms NIH3T3 fibroblasts with an efficiency about 10 % that of SH3-deleted c-abl . 
Title: CED-9 and mitochondrial homeostasis in C . elegans muscle .
Authors: Tan FJ ; Husain M ; Manlandro CM ; Koppenol M ; Fire AZ ; Hill RB
Journal: J Cell Sci
Year: 2008-10-15
Doc ID: WBPaper00032231
SECTION: results. This allele encodes a mutation where glycine 169 in the BH3 binding pocket is replaced with glutamate ( Fig . 4C ) ( Hengartner and Horvitz , 1994a ) , which inhibits EGL-1 from binding and triggering a conformational change in CED-9 ( del Peso et al . , 2000 ; Yan et al . , 2004 )
SECTION: results. In the gain-of-function ced-9 ( n1950sd ) allele , glycine 169 , which resides in the CED-9 BH3 binding pocket , is mutated to glutamate ( G169E ) . [Field: results, subscore: 3.00]
SECTION: results. To test whether co-expression of DRP-1 modulates CED-9 via interactions with the BH3 binding pocket , we first created a construct corresponding to the ced- 9 ( n1950gf ) allele . 
Schwartz, 2010, WBPaper00036020
“Since the HMT-1 polypeptide of gk161 allele lacked TMD and NBD that are required for the function of ABC transporters, we used this strain in our studies.” gk161 is hypersensitive to cadmium
WBPaper0002481 Wang 1996
identifies unc-86 binding sites in mec-3 promoter region, with accompanying evaluation of phenotypes resulting from mec-3 mutations in these regions
“UNC-86 binding is blocked by certain mutations, as described above. When mec-3-lacZ fusions with UNC-86 site mutations were introduced into C. elegans, mutations in Region III had a strong effect on expression, mutations in Region II had a significant effect, and mutations in Region I had no detectable effect”
WBPaper00029156 Modzelewska 2007
“the sy262 mutation lies within the Rac GEF Dbl domain , it is possible that the mutation acts through one or both of the GTPases .”
“In conjunction with molecular modeling , our data suggest that the C . elegans mutation as well as an equivalent mutation in human SOS1 activate the MAPK pathway by disrupting an auto-inhibitory function of the Dbl domain on Ras activation”
“A mutation equivalent to sy262 G322R activates hSOS1.”
“...in every experiment we found that at some time point, hSOS1 C282R displayed two- to fourfold more activity than wild-type hSOS1. In conjunction with our genetic data, these data suggest that the G322R change in C. elegans SOS-1, as well as the equivalent C282R change in hSOS1, does indeed enhance EGF-dependent MAPK activation.”
Title: The genetics of ivermectin resistance in Caenorhabditis elegans . 
Authors: Dent JA ; Smith M ; Vassilatis DK ; Avery L 
Journal: Proc Natl Acad Sci U S A 
Year: 2000-03-14 
Doc ID: WBPaper00003954
results. The region surrounding the conserved valine ( bold ) that is mutated to a glutamate in the ad1302 allele ( resulting from a T to A mutation in the second base of the V60 codon ) is shown lined up with the corresponding region in other GluCl subunits and in the rat glycine and γ-aminobutyric acid ( GABA ) type A- channel subunits ( 29 , 30 ) . [Field: results]


Title: POP-1 controls axis formation during early gonadogenesis in C . elegans . 
Authors: Siegfried KR ; Kimble J 
Journal: Development 
Year: 2002-01 
Doc ID: WBPaper00005116
SECTION: abstract. The pop-1 ( q624 ) allele is weakly penetrant for multiple defects and appears to be a partial loss-of-function mutation ; pop-1 ( q624 ) alters a conserved amino acid in the HMG-box DNA binding domain . [Field: abstract]
SECTION: discussion. This mutation alters a conserved amino acid in the HMG box DNA binding domain which is conserved specifically in TCF / LEF- 1 type HMG proteins ( Laudet et al . , 1993 ) , suggesting that the pop-1 ( q624 ) mutation may affect either recognition of the TCF / LEF-1 consensus sequence or DNA binding affinity , thereby lowering POP-1 activity . [Field: discussion]
SECTION: discussion. As the only mutation isolated thus far in a developmental system that changes a highly conserved amino acid in the β-catenin binding domain , the pop-1 ( q645 ) missense mutation may shed new light on TCF / LEF-1 function during development . [Field: discussion]
SECTION: introduction. The pop-1 ( q624 ) allele is weakly penetrant for multiple defects and appears to be a partial loss-of-function mutation ; pop-1 ( q624 ) alters a conserved amino acid in the HMG-box DNA binding domain . [Field: introduction]
SECTION: results. The pop-1 ( q645 ) mutation carries a nucleotide substitution predicted to change an aspartic acid ( D ) to a glutamic acid ( E ) ( Fig . 2B ) ; this mutation resides within the pop-1 β-catenin binding domain and alters an amino acid conserved in all known TCF / LEF-1 proteins , including nematode , fly , and vertebrate homologues . [Field: results]
SECTION: results. The pop-1 ( q624 ) mutation possesses a nucleotide change in the region encoding the HMG box ; the predicted amino acid change in this case also affects a conserved amino acid ( Fig . 2C ) . [Field: results]

_________________________________

Location of project-related files

on Textpresso-dev

on postgres

  • The latest dump:
  • Variation concise description pipeline:
  • Scripts:
  • Output location:


Orthology/Homology

see Caltech group meeting May 7, 2015

  • to connect conserved/syntenic mutations
  • to link C. elegans gene variations and phenotypes to homologous human disease gene variations
  • to link C. elegans mutations as disease models, e.g. pdr-1 mutations used to model juvenile Parkinson's disease

Use the ortholog(s) of the gene as defined in Generation_of_automated_descriptions#Orthology.2FHomology (Orthology, Homology and Paralog data in WormBase).

Template of an Orthology sentence

  • <Worm Gene> is an ortholog of <human gene>.
  • Similar or identical mutations in the <human gene> have been associated with <disease>.

Rules for orthology sentence construction

How to pick orthologs for the description

Use a human gene that is associated with disease and has mutation information.

Process

  • Rationale:
  • Template for a process sentence

<Variation> affects <Gene>'s role in <process>;

  • Source file for Process data
*Source 1: gene_association file for C.elegans from the WormBase FTP site:
**ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
**this file is one build behind, compared to release label for the phenotype2go annotations
*Source 2: (from WS250): phenotype2go file, these automated annotations are now 'IEAs', but will be treated like IMPs, if they have the 'WBPhenotype:XXXXXXX' in column 8 (with)
**this file is current with the release label for the phenotype2go annotations
**Source 1 and 2 will have redundant annotations but these will get resolved
**All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
**Need data from these rows:
***column 9: has value 'P' (Process)
***column 2 (DB_Object ID): the associated gene, as a UniProt ID or WBGene ID
***column 3 (DB_Object symbol), eg, wht-7
***column 5: GOID, eg, GO:0000346
***column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
***column 7: Evidence code, eg, IMP
***column 8: With, eg. 'WB:WBRNAi00000785|WBPhenotype:0000050'
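
The column selection above could be sketched like this, relying on the fact that gene_association (GAF) files are tab-separated with comment lines starting with '!'. The function names `extract_rows` and `counts_as_imp` and the dictionary keys are hypothetical; translating UniProt IDs to WBGene IDs and resolving redundant rows between Source 1 and Source 2 are not attempted here.

```python
def extract_rows(gaf_text, aspect="P"):
    """Return the needed columns for rows whose column 9 equals aspect."""
    rows = []
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.split("\t")
        # Column numbers in the notes are 1-based; list indexes are 0-based.
        if len(cols) < 15 or cols[8] != aspect:
            continue
        rows.append({
            "object_id": cols[1],     # column 2: UniProt or WBGene ID
            "symbol": cols[2],        # column 3: DB_Object symbol
            "go_id": cols[4],         # column 5: GO ID
            "reference": cols[5],     # column 6: DB:Reference
            "evidence": cols[6],      # column 7: evidence code
            "with": cols[7],          # column 8: With
            "assigned_by": cols[14],  # column 15: Assigned By
        })
    return rows


def counts_as_imp(row):
    """IMP rows, plus phenotype2go IEA rows naming a WBPhenotype in 'With'."""
    if row["evidence"] == "IMP":
        return True
    return row["evidence"] == "IEA" and "WBPhenotype:" in row["with"]
```

Passing a different aspect letter selects the other GO branches from the same file.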
  • Rules for process sentence construction
*'''Rule 1''': Only apply source from IMP GO terms, if variation is listed
*'''Rule 4''': For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
**Examples:
**Data: WBGene00001972,hmg-1.2,cell fate specification[IMP],WB_REF:WBPaper00003453|PMID:10197987,,WB,gonad development[IMP],Wnt signaling pathway[IGI],vulval development[IMP]
**Sentence: hmg-1.2 is involved in cell fate specification, gonad development and vulval development, based on mutant phenotypes.
**Data: WBGene00002148,gon-14,muscle contraction[IMP],WB_REF:WBPaper00024213|PMID:15133127,,WB,defecation[IGI],gonad morphogenesis[IMP],multicellular organism growth[IGI],regulation of pharyngeal pumping[IMP]
**Sentence: gon-14 is involved in muscle contraction, gonad morphogenesis and regulation of pharyngeal pumping, based on mutant phenotypes and in defecation and growth.
*'''Rule 5''': No exclusions as of 07.07.2014, leave in reproduction:
*'''Rule 6''': If a GO term has the words 'involved in' anywhere, beginning, middle or at the end, use the words 'functions in'
**Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
**Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
*'''Rule 7''': For all other Process terms the sentence will be:
**<Gene> is involved in <process term>;
**Examples:
**Data: WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
**Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
*'''Rule 8''': For the term 'molting cycle, collagen and cuticulin-based cuticle' GO term, drop the comma and the words 'collagen and cuticulin-based cuticle'
**Example:
**Data: WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
**Sentence: vps-45 is involved in '''the''' molting cycle;
*'''Rule 9''': Replacement rule:
**Replace term 'multicellular organism growth' with 'growth'.
**Replace term 'embryo development ending in birth or egg hatching' with 'embryo development'
**Replace term 'synaptic transmission, <word>' with '<word> synaptic transmission'.
***Example Data: WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
***Sentence: nra-2 is involved in cholinergic synaptic transmission and locomotion.
*'''Rule 10''': Granularity rule:
**If a GO term in the list is a parent of (is_a or part_of parent) another GO term in the list, keep the most granular child term and ignore the parent terms.
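
A minimal sketch of Rules 4, 6, 8, 9 and 10 for a list of IMP process terms follows. All function names are hypothetical, and two simplifications are assumptions rather than stated rules: the Rule 4 suffix is always appended (the example sentences above do not all carry it), and the ancestor lookup for Rule 10 is supplied by the caller as a ready-made mapping instead of being read from the ontology file.

```python
import re

# Rules 8 and 9: fixed replacements for specific GO term names.
REPLACEMENTS = {
    "multicellular organism growth": "growth",
    "embryo development ending in birth or egg hatching": "embryo development",
    "molting cycle, collagen and cuticulin-based cuticle": "the molting cycle",
}


def normalize_term(term):
    """Apply the replacement rules to one GO process term name."""
    term = term.strip()
    if term in REPLACEMENTS:
        return REPLACEMENTS[term]
    match = re.fullmatch(r"synaptic transmission, (\w+)", term)
    if match:
        return f"{match.group(1)} synaptic transmission"
    return term


def drop_parent_terms(terms, ancestors):
    """Rule 10: drop any term that is an ancestor of another listed term.

    ancestors maps a term to the set of all its is_a/part_of ancestors.
    """
    covered = set()
    for term in terms:
        covered |= ancestors.get(term, set())
    return [term for term in terms if term not in covered]


def process_sentence(gene, imp_terms):
    """Build the process sentence for terms supported by IMP evidence."""
    terms = [normalize_term(term) for term in imp_terms]
    # Rule 6: avoid 'is involved in ... involved in ...'.
    if any("involved in" in term for term in terms):
        verb = "functions in"
    else:
        verb = "is involved in"
    if len(terms) > 1:
        listed = ", ".join(terms[:-1]) + " and " + terms[-1]
    else:
        listed = "".join(terms)
    return f"{gene} {verb} {listed}, based on mutant phenotypes."
```

For a variation description, `<Variation> affects <Gene>'s role in <process>` would replace the gene-level wording, reusing the same term handling.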

References

WBPaper00047004 title : Comparative mapping of the 22q11.2 deletion region and the potential of simple model organisms. Guna_2015

==Molecular function/identity==
====Source file for molecular function data====

**gene_association file for C.elegans from the WormBase FTP site:
**ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
**All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
**Need data from these rows:
*** where column 9 has value 'F' (Molecular Function)
***column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
***column 3: DB_Object symbol, eg, wht-7, 
***column 5: GOID, eg, GO:0000346
***column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
***column 7: Evidence code, eg, IMP
***column 8: 'With (or) From' eg., INTERPRO:IPR002293, 
***column 15: Assigned By, eg., WB (which database created the annotation)

====Rules for molecular function sentence construction====
*'''Rule 1''': Ignore all GO terms with the tag 'is_obsolete: true'
*'''Rule 2''': Exclusion list:
**Ignore the term 'protein binding'
**Ignore the term 'binding'
*'''Rule 3''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
*'''Rule 4''': Order the experimental GO terms first in the sentence followed by ISS and IEA terms.
*'''Rule 5''': If the evidence code is 'IEA' and 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
**Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
*Examples:
**WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
**alh-2 is predicted to have oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, based on protein domain information.
*'''Rule 6''': If Evidence code is 'ISS' add the words 'based on sequence information' to the sentence
**WBGene00001951,hlh-4,sequence-specific DNA binding transcription factor activity[ISS],WB_REF:WBPaper00034761|PMID:19632181,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
**Sentence: hlh-4 is predicted to have sequence-specific DNA binding transcription factor activity based on sequence information, protein dimerization activity and DNA binding activity, based on protein domain information.

*'''Rule 7''': If a binding term is present add the word 'activity' to it.

*'''Rule 8''': If the evidence code is IDA, IMP, IGI, IPI, or IEP add the words 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
**IDA example:
**WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
**Sentence: hlh-6 exhibits RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity and is predicted to have protein dimerization activity and DNA binding activity.
**IMP example:
**WBGene00009583,aagr-3,alpha-glucosidase activity[IMP],WB_REF:WBPaper00036069|PMID:20349118,,WB,hydrolase activity, hydrolyzing O-glycosyl compounds[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000322,catalytic activity[IEA],INTERPRO:IPR011013,carbohydrate binding[IEA]
**Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.

*'''Rule 9''': For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a structural constituent of <word>' in the sentence.
*Example:
**WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
**Sentence: mrpl-36 is a structural constituent of ribosome, based on protein domain information.
**WBGene00010783,mrpl-36,structural constituent of ribosome[ISS]
**Sentence: mrpl-36 is a structural constituent of ribosome, based on sequence information.
**WBGene00010783,mrpl-36,structural constituent of ribosome[IMP] or [IDA]
**Sentence: mrpl-36 is a structural constituent of ribosome.

*'''Rule 10''': Replacement
**For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')

*'''Rule 11''': If a GO term has a comma in it, then add a comma after this GO term and before the next 'and'.
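
The term selection in Rules 2, 3, 4 and 7 could be sketched as below. The name `select_function_terms` and the group labels are hypothetical; the obsolete-term check (Rule 1) and the sentence wording (Rules 5, 6, 8, 9) are not covered, and Rule 7 is interpreted here as applying to terms that end in 'binding'.

```python
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
EXCLUDED = {"protein binding", "binding"}          # Rule 2
RANK = {"experimental": 0, "ISS": 1, "IEA": 2}     # Rule 4 ordering


def select_function_terms(annotations):
    """Pick and order molecular function terms for one gene.

    annotations: list of (go_term_name, evidence_code) pairs.
    Returns a list of (term, group), experimental first, then ISS, then IEA.
    """
    best = {}
    for term, code in annotations:
        if term in EXCLUDED:
            continue
        group = "experimental" if code in EXPERIMENTAL else code
        if group not in RANK:
            continue  # evidence codes not mentioned in the rules are skipped
        if term.endswith("binding"):
            term = term + " activity"              # Rule 7
        # Rule 3: for identical terms keep the strongest evidence group.
        if term not in best or RANK[group] < RANK[best[term]]:
            best[term] = group
    # sorted() is stable, so terms keep their input order within a group.
    return sorted(best.items(), key=lambda item: RANK[item[1]])
```

The returned groups are what a sentence builder would use to choose between 'exhibits', 'based on sequence information' and 'based on protein domain information'.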

==Sub-cellular localization==
====Source file for Sub-cellular localization data====
*GO data
*Source 1: gene_association file for C. elegans at: http://www.geneontology.org/GO.current.annotations.shtml?all
**Need data from these rows:
*** where column 9 has value 'C' (Cellular Component)
***column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
***column 3: DB_Object symbol, eg, wht-7, 
***column 5: GOID, eg, GO:0000346
***column 7: Evidence code, eg, IDA

====Rules for sub-cellular localization sentence construction====
*'''Rule 1''': Ignore GO terms with the tag 'is_obsolete: true' in the obo file
*'''Rule 2''': Ignore all IEA and ISS GO terms, use only non-IEA, non-ISS GO terms
*'''Rule 4''': Ignore IBA and IBD GO terms (PAINT annotations)
*'''Rule 5''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.

*'''Rule 6''': For 'integral component of ....' terms add the words 'is an';
**Eg. for Rule 6: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
**Sentence: sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
*Examples
**WBGene00023405,sor-1,nucleoplasm[IDA],WB_REF:WBPaper00027128|PMID:16501168,,WB,nuclear speck[IDA]
**Sentence: sor-1 is localized to the nucleoplasm and nuclear speck;

**WBGene00004681,rsd-2,nucleolus[IDA],WB_REF:WBPaper00044261|PMID:18430922,,WB,endoplasmic reticulum[IDA],cytosol[IDA]
**Sentence: rsd-2 is localized to the nucleolus, endoplasmic reticulum and cytosol;

**WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
**Sentence: dnc-6 is localized to the dynactin complex;

*'''Rule 7''': For the GO term 'intracellular [IEA]', the structure of the sentence is different: use 'is intracellular'.
**Eg.1 WBGene00089742
**PPA00188 is an ortholog of C. elegans T26A5.8; based on protein domain information, PPA00188 is predicted to have sequence-specific DNA binding activity and protein heterodimerization activity and is intracellular.
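
The localization rules above could be sketched along these lines. The names are hypothetical, Rule 1 (obsolete terms) and Rule 7 ('is intracellular') are not covered, and the sketch always writes 'localized to the', which matches the sor-1, rsd-2 and dnc-6 examples but not the sup-10 example, where the article is omitted.

```python
# Sketch of the localization sentence rules; names are hypothetical.

# Evidence codes ignored by Rules 2 and 4.
SKIPPED_EVIDENCE = {"IEA", "ISS", "IBA", "IBD"}


def join_terms(terms):
    """Join as 'a', 'a and b' or 'a, b and c' (no comma before 'and')."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def localization_sentence(gene, annotations):
    """annotations is a list of (GO term, evidence code) pairs."""
    kept = []
    for term, evidence in annotations:
        # Drop computational/PAINT evidence and repeated terms.
        if evidence in SKIPPED_EVIDENCE or term in kept:
            continue
        kept.append(term)
    integral = [t for t in kept if t.startswith("integral component of ")]
    other = [t for t in kept if t not in integral]
    parts = []
    if integral:
        parts.append("is an " + join_terms(integral))  # Rule 6
    if other:
        parts.append("is localized to the " + join_terms(other))
    if not parts:
        return ""
    return f"{gene} " + " and ".join(parts) + ";"


print(localization_sentence("rsd-2", [("nucleolus", "IDA"), ("endoplasmic reticulum", "IDA"), ("cytosol", "IDA")]))
```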

==Order of sentences==
*Orthology
*Process
*Function/identity
*Component

==Tissue expression==
====Source files for Tissue expression data====
*Source 1: Expression data 
*OA (exprpat), PG table names:
**for all rows where the exp_endogenous table has the value 'endogenous', grab the gene from exp_gene, eg, WBGeneXXXXXXXX
**exp_name for the expression pattern name, values look like Expr1005
**exp_anatomy for anatomy terms, in the form WBbt:0003679; translate these IDs using the anatomy ontology file
**anatomy ontology file from ftp site: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
**exp_paper for paper
**exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'
*Contact Person: Daniela


====Rules for tissue expression sentence construction====
'''Rule 1''': Use only data with the qualifiers 'Certain' or 'Partial', and treat the two equally; ignore all data with the qualifier 'Uncertain'. Also use expression data that has no qualifier, since qualifiers were only added recently.

'''Rule 2''': Pick an anatomy term only once
*Sentence: <Gene> is expressed in the <anatomy term1, anatomy term2 and anatomy term3>;

*Examples:
*Data for alh-10: 
*WBGene00000116	alh-10	Expr5583	nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525	Endogenous
*'''Sentence''': alh-10 is expressed in the nervous system, intestine and tail neuron;


*Data for asp-5:
*WBGene00000218	asp-5	Expr5817	intestine	WBPaper00031006,WBPaper00006525	Endogenous
*WBGene00000218	asp-5	Expr4352	intestine	WBPaper00028802	Endogenous
*'''Sentence''': asp-5 is expressed in the intestine;


*Data for ccr-4:
*WBGene00000376	ccr-4	Expr4479	pharynx	WBPaper00027076	Endogenous
*WBGene00000376	ccr-4	Expr11132	male,hermaphrodite,somatic cell,germ line	WBPaper00043886	Endogenous
*WBGene00000376	ccr-4	Expr4479	hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle	WBPaper00027076	Endogenous
*WBGene00000376	ccr-4	Expr7174	pharynx,hypodermis,seam cell	WBPaper00031006,WBPaper00006525	Endogenous
*WBGene00000376	ccr-4	Expr4480	pharynx,body wall musculature,head neuron,tail neuron	WBPaper00027076	Endogenous
*WBGene00000376	ccr-4	Expr4480	hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
*'''Sentence''': ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
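
Rules 1 and 2 can be sketched as below: rows with the qualifier 'Uncertain' are skipped, rows with no qualifier are kept, and each anatomy term is used only once in order of first appearance. The function name and the (terms, qualifier) row shape are assumptions made for this example.

```python
# Sketch of tissue expression Rules 1 and 2; names and row shape are hypothetical.

def expression_sentence(gene, rows):
    """rows is a list of (anatomy terms, qualifier) pairs; qualifier may be None."""
    terms = []
    for anatomy_terms, qualifier in rows:
        # Rule 1: ignore 'Uncertain'; keep 'Certain', 'Partial' and unqualified data.
        if (qualifier or "").strip().lower() == "uncertain":
            continue
        for term in anatomy_terms:
            # Rule 2: pick each anatomy term only once.
            if term not in terms:
                terms.append(term)
    if not terms:
        return ""
    if len(terms) == 1:
        listing = terms[0]
    else:
        listing = ", ".join(terms[:-1]) + " and " + terms[-1]
    return f"{gene} is expressed in the {listing};"


print(expression_sentence("asp-5", [(["intestine"], None), (["intestine"], "Certain")]))
```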

*'''Rule 3''': Replacement Rule
*Replacement 1: If the anatomy term 'cell' occurs by itself, then use the words 'expressed widely' instead of 'expressed in the cell'.
**Sentence: col-178 is expressed in the Cell;
**Becomes: col-178 is expressed widely.

*Replacement 2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
**Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
**Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.

*Replacement 3: If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
**Sentence: ceh-82 is expressed in the neuron;
**Becomes: ceh-82 is expressed in the nervous system;


*Replacement 4: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, unless the term is one of the following:
**Exceptions: 
**I3 neuron
**I4 neuron
**I5 neuron
**I6 neuron
**M1 neuron
**M4 neuron
**M5 neuron
**MI neuron
**Sentence: nhr-194 is expressed in the amphid neuron, ciliated neuron, head neuron, and the sensory neuron;
**Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;
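
The four replacements could be sketched as follows. The names are hypothetical, 'by itself' in Replacement 3 is read here as a term that is exactly 'neuron', only terms ending in ' neuron' are pluralized, and the plain sentence uses the 'a, b and c' list style of the earlier examples rather than the comma style of the nhr-194 example.

```python
# Sketch of Replacements 1-4; names are hypothetical.

NEURON_EXCEPTIONS = {
    "I3 neuron", "I4 neuron", "I5 neuron", "I6 neuron",
    "M1 neuron", "M4 neuron", "M5 neuron", "MI neuron",
}


def rewrite_term(term):
    """Apply Replacement 3 (bare 'neuron') and Replacement 4 (pluralize)."""
    if term.lower() == "neuron":
        return "nervous system"
    if term.endswith(" neuron") and term not in NEURON_EXCEPTIONS:
        return term + "s"
    return term


def join_terms(terms):
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def expression_sentence(gene, terms):
    has_cell = any(t.lower() == "cell" for t in terms)
    others = [rewrite_term(t) for t in terms if t.lower() != "cell"]
    if has_cell and not others:
        return f"{gene} is expressed widely."  # Replacement 1
    if has_cell:
        # Replacement 2
        return f"{gene} is expressed in several tissues including the {join_terms(others)}."
    if not others:
        return ""
    return f"{gene} is expressed in the {join_terms(others)};"


print(expression_sentence("frm-1", ["intestine", "pharynx", "Cell"]))
```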


'''Rule 5''': Granularity Rule: When a string of cell names is present, check whether any of their parent terms is present in any ExprXXX for that gene, and keep the highest common parent term available.

==Preliminary results==
These descriptions are based on Homology predictions and the GO annotations for Process, Component and Function:
<pre style="white-space: pre-wrap; 
white-space: -moz-pre-wrap;
white-space: -pre-wrap;
white-space: -o-pre-wrap; 
word-wrap: break-word"> 

*alh-2
alh-2 encodes an ortholog of human dehydrogenase aldehyde dehydrogenase family 1 member dehydrogenase; alh-2 is predicted to have oxidoreductase activity and oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor based on protein domain information.

*asp-5
asp-5 encodes an ortholog of human cathepsin d; asp-5 is involved in cell death and locomotion; asp-5 is predicted to have aspartic-type endopeptidase activity based on protein domain information.

*cng-1
cng-1 encodes an ortholog of human cyclic nucleotide gated channel alpha 3; cng-1 is predicted to have ion channel activity based on protein domain information; cng-1 is localized to the neuronal cell body.

Mapping of automated variation concise description data to OA fields

Mapping of data to data fields in the OA using table name vcd for variation concise description
{| class="wikitable"
! OA field number !! OA field name !! Data to be inserted !! Example of data to be inserted !! Required or Not !! OA table name
|-
| 1 || WBVariation || Variation || WBVar00145853 OR gk448 || Required || vcd_variation
|-
| 2 || Species || Species || Onchocerca volvulus || Required || vcd_species
|-
| 3 || Curator || Name of Curator || James Done (first, then replace with) Karen Yook (insert for all rows) || Required || vcd_curator
|-
| 4 || Curator History || Name of Curator || same as pgid (insert for all rows) || Required || vcd_curhistory
|-
| 5 || Description Type || Automated_concise_description (insert for all rows) || Automated_concise_description || Required || vcd_desctype
|-
| 6 || Description Text || the automated concise description || asp-19 encodes an ortholog... || Required || vcd_desctext
|-
| 7 || Reference || WBPaper || WBPaper00026979 || Required || vcd_paper
|-
| 8 || Accession Evidence || For Homology: for elegans, use ENSEMBL Gene ID; for non-elegans species, use WBGeneID. For Process, Function: use InterPro ID || For elegans: ENSEMBL:ENSG00000103257 (previously used the ENSEMBL protein ids) and INTERPRO:IPR002293. For non-elegans species: WBGene00007443 and INTERPRO:IPR002293 (comma separate multiple values) || Not required || con_accession
|-
| 9 || Last Updated || Date when the descriptions were last generated || 2014-09-11 || Required || vcd_lastupdate
|-
| 10 || pgid || pgid || 1149 (Postgres will generate) || Required ||
|}

Tab-delimited file for OA insert

  • One tab-delimited file per species
  • Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
  • Format: tab-delimited file, comma separate the values when multiple values are present
  • Date is the last date that the script was run to generate the automated descriptions (eg. 2014-05-28)
  • File will be placed on textpresso-dev to be picked up by a cron job by JC
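
A minimal sketch of writing one line of this file in the column order listed above follows. The function name is hypothetical, and the sample values are borrowed from examples elsewhere on this page purely for illustration.

```python
# Sketch of one tab-delimited row for the OA insert; function name is hypothetical.

def oa_row(gene, date, papers, accessions, description, species, inferred_text):
    """Return one line: WBGene, Date, Paper, Accession_evidence,
    Automated_concise_description, Species, Inferred_automatically text."""
    fields = [
        gene,
        date,
        ",".join(papers),      # multiple values are comma separated
        ",".join(accessions),  # multiple values are comma separated
        description,
        species,
        inferred_text,
    ]
    for field in fields:
        # A tab or newline inside a value would break the column layout.
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline: " + repr(field))
    return "\t".join(fields)


row = oa_row(
    "WBGene00009585",
    "2014-05-28",
    ["WBPaper00026979"],
    ["ENSEMBL:ENSG00000103257", "INTERPRO:IPR002293"],
    "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4)",
    "Caenorhabditis elegans",
    "This description was automatically generated.",
)
print(row)
```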

Directory structure for project

Inserting automated descriptions into postgres

Populating script

Run the script to populate from here: /home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl

Script actually at: /home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl

Script looks at http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt for the release number and at http://textpresso-dev.caltech.edu/concise_descriptions/species.txt for the different species.

Script gets data from the following URL for each of the species: For elegans: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt

For briggsae: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_briggsae/descriptions/OA_concise_descriptions.WS247.txt

and so on.
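
The per-species URLs follow one pattern, which can be sketched as below. The function name is hypothetical; the release and species values are passed in directly because the layout of production_release.txt and species.txt is not described here.

```python
# Sketch of the per-species data URL pattern; function name is hypothetical.

BASE_URL = "http://textpresso-dev.caltech.edu/concise_descriptions"


def descriptions_url(release, species):
    """Build the OA_concise_descriptions URL for a release such as 'WS247'
    and a species directory name such as 'c_elegans'."""
    return (
        f"{BASE_URL}/release/{release}/{species}"
        f"/descriptions/OA_concise_descriptions.{release}.txt"
    )


for species in ("c_elegans", "c_briggsae"):
    print(descriptions_url("WS247", species))
```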

When populating, for each OA row that has con_desctype set to 'Automated_concise_description', the script deletes that row's data from these tables:

  • con_wbgene
  • con_species
  • con_curator
  • con_curhistory
  • con_desctext
  • con_paper
  • con_accession
  • con_inferredauto
  • con_lastupdate

This means the 'Automated_concise_description' value in con_desctype is kept, so that future runs of the script reuse the existing pgids. When the script runs out of pre-existing pgids, it creates a new one and assigns 'Automated_concise_description' to con_desctype.
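
The pgid reuse described above can be modelled in memory roughly as follows. This is not the populating script's code: the function name is hypothetical, and taking new pgids as "highest existing pgid plus one" is an assumption made only for the example.

```python
# In-memory sketch of pgid reuse; names and the new-pgid scheme are assumptions.

def assign_pgids(reusable_pgids, row_count, highest_pgid):
    """Return one pgid per incoming row.

    reusable_pgids: pgids already tagged 'Automated_concise_description'.
    highest_pgid: highest pgid currently in use in the OA (assumed input).
    """
    # Reuse the pre-existing pgids first, lowest first.
    assigned = sorted(reusable_pgids)[:row_count]
    next_new = highest_pgid + 1
    # When pre-existing pgids run out, create new ones.
    while len(assigned) < row_count:
        assigned.append(next_new)
        next_new += 1
    return assigned


print(assign_pgids([1149, 1150], 4, 2000))
```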

For testing on Mangolassi

Both populating and dumping scripts at: /home/acedb/ranjana/concise_testing

For the populating script, always redirect output to a file: populate_automated_concise_descriptions.pg.<date>

Dumping the automated, concise and provisional descriptions

Dumping script

  • Run the dumping script manually, on tazendra: /home/acedb/kimberly/citace_upload/concise/wrapper.pl
  • The file, concise_dump_new.ace, can be scp-ed for testing from: acedb@tazendra.caltech.edu:/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
  • Concise descriptions dumper at /home/postgres/work/citace_upload/concise/dump_concise.pl
  • /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

Script that finds genes with concise descriptions that also have an automated description

On Tazendra and Mangolassi: /home/acedb/ranjana/concise_testing/find_concise_with_automated.pl

Script that compares citace test file and Postgres for automated descriptions

Location: /home/acedb/ranjana/concise_testing/compare_concise_postgres_vs_acefile.pl

  • Needs a 'citace_genes_with_automated.ace' file, which is the .ace export of all genes in citace with the 'Automated_description' tag (I use Query Builder to run this query and then export the 'Names' to the file 'citace_genes_with_automated.ace').
  • Compares the citace output for genes with the automated tag (from testing concise_dump_new.ace in an empty citace database on Maya) to the Postgres data; also reports whether a gene has a concise description (which would not be dumped, so it is in Postgres but not in the .ace file).
  • For the WS247 upload, 1 gene not accounted for, WBGene00020108, which has been merged into a dead gene?

Discontinued from WS246 upload

Text for Automatically_inferred tag

Tag looks like: Evidence Automatically_inferred ?Text

Text will be: This description was automatically generated by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations from the <WS246> version of WormBase.

Rules for dumping the different types of descriptions in the OA

.ace format

List of tags to be dumped:

  • Automated_description
  • Paper_evidence
  • Accession_evidence
  • Date_last_updated
  • Inferred_automatically
Gene : "WBGene00009585"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Date_last_updated	"2012-07-24"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Paper_evidence	"WBPaper000045688"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Paper_evidence	"WBPaper000045689"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Accession_evidence "ENSEMBL" "ENSP00000419081"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Accession_evidence "INTERPRO" "IPR002048"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Inferred_automatically "This description was generated automatically by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations, from the WS243 version of WormBase."	
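
A sketch of producing an entry in the layout shown above is given here. The function names are hypothetical, and the backslash escaping of double quotes inside values is an assumption about .ace string handling, not something taken from the dumping script.

```python
# Sketch of an .ace entry writer; names and quote escaping are assumptions.

def ace_quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def ace_entry(gene_id, description, date, papers, accessions, inferred_text):
    """accessions is a list of (database, accession) pairs."""
    # Every evidence line repeats the description text, as in the example.
    prefix = "Automated_description\t" + ace_quote(description) + "\t"
    lines = ["Gene : " + ace_quote(gene_id)]
    lines.append(prefix + "Date_last_updated\t" + ace_quote(date))
    for paper in papers:
        lines.append(prefix + "Paper_evidence\t" + ace_quote(paper))
    for database, accession in accessions:
        lines.append(
            prefix + "Accession_evidence " + ace_quote(database) + " " + ace_quote(accession)
        )
    lines.append(prefix + "Inferred_automatically " + ace_quote(inferred_text))
    return "\n".join(lines) + "\n"


print(ace_entry(
    "WBGene00009585",
    "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4).",
    "2012-07-24",
    ["WBPaper00026979"],
    [("INTERPRO", "IPR002048")],
    "This description was generated automatically.",
))
```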

Numbers from citace testing

  • concise_dump_new.ace was scp-ed from /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace and tested
  • Tested concise_dump_new.ace on an empty citaceminus mirror on a local machine with the WS246 models file saved; it read in fine
  • WS246 numbers: Total genes: 9,529 genes, Genes with automated descriptions: 3,362; 85,039 lines

Reporting numbers

--Currently the automated descriptions are generated for genes without a concise description

--Generate a report for numbers and place on Textpresso-dev, http://textpresso-dev.caltech.edu/concise_descriptions/

  • Report for WS246 upload, Sept/Oct, 2014:
  • Total number of automated descriptions = 3,364
  • Number of automated descriptions with homology = 2,353
  • Number of automated descriptions with process information = 1,206
  • Number of automated descriptions with function information = 2,183
  • Number of automated descriptions with component information = 244

Publications related to Text-mining methods

  • Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B. Automatically generating gene summaries from biomedical literature. Pac Symp Biocomput. 2006:40-51. PMID:17094226

  • Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B. Generating gene summaries from biomedical literature: A study of semi-structured summarization. Information Processing and Management 43 (2007) 1777–1791.

Changes/Updates for WS248

  • Changes to Orthology,

Automated descriptions for C. briggsae


Issues to address

  • 1. For the non-elegans species, need to use the WBGeneID as Accession_evidence, so can we use our own database, and do: Accession_evidence "WormBase" "WBGene00005678" (not done)
  • 2. For non-elegans species, need to add in elegans gene experimental data, but with multiple orthologs, which do we pick? (done, using only the common GO terms for orthologs)
  • 3. Gene names should be pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (will be addressed for WS250)
  • 4. For WS250, check that the gene names are (somewhat) current; corrected mel-47 to tofu-6 by hand for WS249.

Automated descriptions software

Documentation for workflow and scripts