Difference between revisions of "Generation of automated descriptions"

From WormBaseWiki
 
(131 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Querying for gene sets==
 
 
====Set of genes with a concise description====
 
Query for all genes with a concise description from Postgres:
 
Relevant postgres table names:
 
*con_wbgene: Stores the WBGene ID and gene names
 
*con_desctype: Type of description (relevant for us: Concise_description)
 
*con_desctext: Text of the concise description
 
 
Query for all WBGenes that have a concise description (in con_desctext AND con_desctype):
 
 
SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL) AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL) ORDER BY con_wbgene;
 
 
*Number of genes with a concise description (as of 05.07.2014)=6,624
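
The logic of the query above can be tried out without access to the curation database. The following is only a sketch: it uses an in-memory SQLite database as a stand-in for Postgres, and the table layout (a joinkey column plus one data column named after the table) and the toy rows are assumptions made for illustration.

```python
import sqlite3

# Same filter as the Postgres query above: a gene counts only if it has
# both a non-NULL description text and a non-NULL description type.
QUERY = """
SELECT DISTINCT con_wbgene FROM con_wbgene
WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
  AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
ORDER BY con_wbgene;
"""


def genes_with_description(conn):
    """Return the sorted WBGene IDs that have a concise description."""
    return [row[0] for row in conn.execute(QUERY)]


def demo_connection():
    """Build a toy in-memory database (assumed layout, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    for table in ("con_wbgene", "con_desctext", "con_desctype"):
        conn.execute(f"CREATE TABLE {table} (joinkey TEXT, {table} TEXT)")
    conn.executemany(
        "INSERT INTO con_wbgene VALUES (?, ?)",
        [("1", "WBGene00000001"), ("2", "WBGene00000002")],
    )
    conn.executemany(
        "INSERT INTO con_desctext VALUES (?, ?)",
        [("1", "toy description"), ("2", None)],
    )
    conn.executemany(
        "INSERT INTO con_desctype VALUES (?, ?)",
        [("1", "Concise_description"), ("2", "Concise_description")],
    )
    return conn


if __name__ == "__main__":
    print(genes_with_description(demo_connection()))
```

In the toy data the second gene has a description type but a NULL text, so it is filtered out.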
 
 
====Set of genes with no concise description====
 
====Set of genes with no concise description and at least one published paper====
 
 
 
==Location of project-related files on Textpresso==
 
 
http://textpresso-dev.caltech.edu/concise_descriptions/
 
  
====Location of the manual concise description files:====
+
==Generate top level directory structures for project==
*For viewing the latest dump:
+
#Generate directory structure similar to current structure: http://textpresso-dev.caltech.edu/concise_descriptions/
http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
+
#Create all top level directories to file level, files and file names could be different.
*Script: /home/postgres/work/citace_upload/concise/dump_concise.pl
 
*File location: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
 
  
==Semantic categories in an Automated Description==
+
==Obtaining gene sets==
1. Orthology/Similarity/Molecular identity <br \>
+
====Obtaining a set of genes and gene names====
2. Processes <br \>
 
3. Molecular Function <br \>
 
4. Sub-cellular localization <br \>
 
5. Tissue expression  (not done yet) <br \>
 
  
==Orthology/Homology==
+
#Top level directory is 'gene_lists' under ~/release/<WSXXX>/<species>/ for all nine species
 +
#Obtain the complete list of gene names and IDs from postgres for the nine species; only gene IDs (gin_wbgene) and gene names (gin_locus) are used
 +
#Write descriptions only for live genes; discard the dead genes.
 +
#Write live genes into the 'live genes' file.
  
====WS248 upload (end of Feb, 2015)====
+
==Current semantic categories and their order in a gene description==
Orthology source files:
+
#Orthology <br \>
*Kevin will release new WS247 orthology files with Ensembl genes instead of proteins
+
#Biological Process<br \>
*ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS246.orthologs.txt.gz
+
#Molecular Function <br \>
 +
#Tissue expression <br \>
 +
#Sub-cellular localization <br \>
 +
#drug/chemical/gene regulation from large scale data summaries
  
Species list:
+
==Orthology/Homology==
*Caenorhabditis elegans  --use orthology to human
+
Orthology source files (in use since the WS248 upload in March 2015; from WS247 the files list Ensembl genes instead of proteins):
*Caenorhabditis briggsae --use orthology to elegans
+
*Orthology files are obtained from the EBI FTP site: ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases/<release>/species
*Caenorhabditis japonica --use orthology to elegans
+
**C. elegans: ~/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS262.orthologs.txt.gz
*Caenorhabditis remanei  --use orthology to elegans
+
**C. briggsae:~/c_briggsae/PRJNA10731/annotation/c_briggsae.PRJNA10731.WS262.orthologs.txt.gz
*Caenorhabditis brenneri --use orthology to elegans
+
**C. japonica:~/c_japonica/PRJNA12591/annotation/c_japonica.PRJNA12591.WS262.orthologs.txt.gz
*Brugia malayi          --use orthology to elegans, if not present use orthology to Onchocerca
+
**C. remanei:~/c_remanei/PRJNA53967/annotation/c_remanei.PRJNA53967.WS262.orthologs.txt.gz
*Onchocerca volvulus    --use orthology to elegans, if not present, then use orthology to Brugia
+
**C. brenneri:~/c_brenneri/PRJNA20035/annotation/c_brenneri.PRJNA20035.WS262.orthologs.txt.gz
*Pristionchus pacificus  --use orthology to elegans
+
**Brugia malayi:~/b_malayi/PRJNA10729/annotation/b_malayi.PRJNA10729.WS262.orthologs.txt.gz
 
+
**Pristionchus pacificus: ~/p_pacificus/PRJNA12644/annotation/p_pacificus.PRJNA12644.WS262.orthologs.txt.gz
'''Rule1'''
+
**O. volvulus:~/o_volvulus/PRJEB513/annotation/o_volvulus.PRJEB513.WS262.orthologs.txt.gz
We will use only those human genes that have been predicted by more than one orthology prediction method (from Kevin’s new WS247 orthology file).
+
**S. ratti:~/s_ratti/PRJEB125/annotation/s_ratti.PRJEB125.WS262.orthologs.txt.gz
 
 
*Example 1:
 
aat-1 (from orthology file)
 
**Homo sapiens ENSG00000013293 SLC7A14 Panther
 
**Homo sapiens ENSG00000165349 SLC7A3 Panther
 
**Homo sapiens ENSG00000139514 SLC7A1 Panther
 
**Homo sapiens ENSG00000151012 SLC7A11 Panther
 
**Homo sapiens ENSG00000103064 SLC7A6 Panther
 
**Homo sapiens ENSG00000155465 SLC7A7 Panther
 
**Homo sapiens ENSG00000130876 SLC7A10 Panther
 
**Homo sapiens ENSG00000103257 SLC7A5 Panther;Inparanoid_8
 
**Homo sapiens ENSG00000092068 SLC7A8 Inparanoid_8
 
  
 +
====Rules====
 +
#Species list and orthology rule:
 +
#*Caenorhabditis elegans (WS246) --use orthology to human
 +
#*Caenorhabditis briggsae (WS247) --use orthology to elegans
 +
#*Caenorhabditis japonica (WS248) --use orthology to elegans
 +
#*Caenorhabditis remanei  (WS248) --use orthology to elegans
 +
#*Caenorhabditis brenneri (WS248) --use orthology to elegans
 +
#*Brugia malayi          (WS248) --use orthology to elegans, if not present use orthology to Onchocerca
 +
#*Onchocerca volvulus    (WS248) --use orthology to elegans, if not present, then use orthology to Brugia
 +
#*Pristionchus pacificus  (WS248) --use orthology to elegans
 +
#*Strongyloides ratti    (WS250) --use orthology to elegans, if not present, use orthology to Brugia and Onchocerca
 +
#For orthology to human genes, use only those human genes that have been predicted by more than one orthology prediction method.
 +
#*Example 1: aat-1 (from orthology file)
 +
#**Homo sapiens ENSG00000013293 SLC7A14 Panther
 +
#**Homo sapiens ENSG00000165349 SLC7A3 Panther
 +
#**Homo sapiens ENSG00000139514 SLC7A1 Panther
 +
#**Homo sapiens ENSG00000151012 SLC7A11 Panther
 +
#**Homo sapiens ENSG00000103064 SLC7A6 Panther
 +
#**Homo sapiens ENSG00000155465 SLC7A7 Panther
 +
#**Homo sapiens ENSG00000130876 SLC7A10 Panther
 +
#**Homo sapiens ENSG00000103257 SLC7A5 Panther;Inparanoid_8
 +
#**Homo sapiens ENSG00000092068 SLC7A8 Inparanoid_8
 
In the above list, we would pick only ENSG00000103257, since it was predicted to be a human ortholog of aat-1 by more than one method, both Panther and Inparanoid_8.
 
 
+
*Example 2: nipi-4 from orthology file
*Example 2
 
If every human gene was predicted by only one method, list them all
 
*nipi-4
 
 
**Homo sapiens ENSG00000120899 PTK2B Panther
 
 
**Homo sapiens ENSG00000169398 PTK2 Panther
 
 
+
If every human gene was predicted by only one method, list them all; here, list both PTK2B and PTK2.
List both PTK2B and PTK2
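
The selection rule illustrated by the two examples above can be sketched in a few lines. This is a minimal illustration, not the project's script: the function name and the (Ensembl ID, symbol, methods) tuple layout are assumptions, with the methods field holding the semicolon-separated method names as they appear in the example rows.

```python
def select_human_orthologs(rows):
    """Keep human orthologs predicted by more than one method.

    rows: list of (ensembl_id, symbol, methods) tuples, where methods is a
    ';'-separated string such as 'Panther;Inparanoid_8' (assumed layout).
    If no ortholog is supported by more than one method, return them all.
    """
    multi = [
        row for row in rows
        if len([m for m in row[2].split(";") if m]) > 1
    ]
    return multi if multi else list(rows)


if __name__ == "__main__":
    aat_1 = [
        ("ENSG00000013293", "SLC7A14", "Panther"),
        ("ENSG00000103257", "SLC7A5", "Panther;Inparanoid_8"),
        ("ENSG00000092068", "SLC7A8", "Inparanoid_8"),
    ]
    print([symbol for _, symbol, _ in select_human_orthologs(aat_1)])
```

For the nipi-4 rows, where every ortholog has a single method, the function returns the input unchanged, so both PTK2B and PTK2 are listed.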
 
  
 
====Template of an Orthology sentence====
 
*<Gene> is an ortholog of <human gene>.
+
*<Worm Gene> is an ortholog of <human gene>.
  
*<Gene> is an ortholog of <human gene1>, <human gene2>, <human gene3> and <human gene4>.
+
*<Worm Gene> is an ortholog of <human gene1>, <human gene2>, <human gene3> and <human gene4>.
  
 
*We will use the HGNC name outside parentheses, as the gene name, and put the description inside the parentheses.
 
Line 84: Line 72:
 
Example 1
 
 
*mtp-18 is an ortholog of human <ENSG00000242114> and <ENSG00000249590>;
 
*mtp-18 is an ortholog of human MTFP1 (mitochondrial fission process 1) and <no HGNC symbol/name> (Uncharacterized protein)
+
*mtp-18 is an ortholog of human MTFP1 (mitochondrial fission process 1) and <no HGNC symbol/name> (Uncharacterized protein);
 
Resolves into
 
 
*mtp-18 is an ortholog of human MTFP1 (protein mitochondrial fission process 1);
 
Line 90: Line 78:
 
Example 2
 
 
*marc-1 is an ortholog of human <ENSG00000183654>, <ENSG00000144583>, <ENSG00000139266>, <ENSG00000145416>, and <ENSG00000278545>;
 
*marc-1 is an ortholog of human MARCH11 (   ), MARCH4 (membrane-associated ring finger (C3HC4) 4, E3 ubiquitin protein ligase), MARCH9 (membrane-associated ring finger (C3HC4) 9), MARCH1 (membrane-associated ring finger (C3HC4) 1, E3 ubiquitin protein ligase) and MARCH8 (membrane-associated ring finger (C3HC4) 8, E3 ubiquitin protein ligase);  
+
*marc-1 is an ortholog of human MARCH11 (membrane-associated ring finger (C3HC4) 11), MARCH4 (membrane-associated ring finger (C3HC4) 4, E3 ubiquitin protein ligase), MARCH9 (membrane-associated ring finger (C3HC4) 9), MARCH1 (membrane-associated ring finger (C3HC4) 1, E3 ubiquitin protein ligase) and MARCH8 (membrane-associated ring finger (C3HC4) 8, E3 ubiquitin protein ligase);  
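
The template and the two examples above can be sketched as a small formatting helper. It is a hedged illustration only: the function names and the (symbol, name) pair layout are assumptions, and entries without an HGNC symbol are dropped, as in the resolved mtp-18 example.

```python
def join_names(items):
    """Join items as 'a', 'a and b' or 'a, b and c' (no Oxford comma)."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def orthology_sentence(worm_gene, human_genes):
    """Build the orthology clause for one worm gene.

    human_genes: list of (hgnc_symbol, hgnc_name) pairs (assumed layout).
    Pairs with no HGNC symbol are skipped; a missing name leaves the
    symbol without parentheses.
    """
    named = [
        f"{symbol} ({name})" if name else symbol
        for symbol, name in human_genes
        if symbol
    ]
    if not named:
        return ""
    return f"{worm_gene} is an ortholog of human {join_names(named)}"


if __name__ == "__main__":
    print(orthology_sentence(
        "mtp-18",
        [("MTFP1", "mitochondrial fission process 1"),
         (None, "Uncharacterized protein")],
    ))
```

The helper returns the clause without a trailing ';' or '.', so the caller can decide how to chain it with the process and function sentences.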
  
 
Example 3
 
Line 147: Line 135:
 
*'''Rule 7''' : Grab the HGNC symbol and place it at the end of the sentence, within parentheses as ', (HGNC symbol:<HGNC symbol>)'
  
====Explanation of Orthology/homology information in WormBase====
+
====How to pick orthologs for the description====
'''1. Orthology, Homology and Paralog data'''
+
 
*Ace tags: ?Gene Ortholog_other, Paralog
+
Some non-elegans genes have too many C. elegans genes listed as orthologs; these will be pruned using popularity (number of publications) and gene class name:
*Contact: Michael Paulini
+
 
 +
*If there are more than 5 orthologs in the form of gene names (e.g., abu-6, abu-7, abu-8), use popularity and gene class to prune
 +
*Genes that have no other members of their gene class in the list are picked and mentioned first, ordered by popularity; if tied by popularity, order alphabetically
 +
*If there is only one gene class, use popularity to pick the top 3 genes and order them by popularity; if tied, order numerically
 +
*If there is more than one gene class, use popularity to pick the top gene in each class, meaning you would name the leading (in popularity) gene class first; if tied, order alphabetically; in both cases
 +
*If the orthologs have cosmid names (e.g., C54D10.9), list them all, as gene class cannot be used
 +
*If both genes and cosmids are present, use the gene class and popularity rules for the genes and leave the cosmids as is
 +
*If neither popularity nor gene class can be used, leave the list as is
 +
*If genes are tied by popularity, then stop at 5 genes.
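
A first step for the pruning rules above is to separate named genes by gene class from cosmid-style names. The sketch below covers only that grouping step; the regular expression for gene names (lowercase class, hyphen, number) is an assumption, and the popularity ranking is left out because it depends on publication counts not shown here.

```python
import re
from collections import defaultdict

# Assumed pattern for named genes such as fbxc-16 or hmg-1.2; anything
# else (e.g. C54D10.9) is treated as a cosmid-style name.
GENE_NAME = re.compile(r"^([a-z]+)-\d+(\.\d+)?$")


def group_by_gene_class(names):
    """Split ortholog names into {gene_class: [genes]} and a cosmid list."""
    classes = defaultdict(list)
    cosmids = []
    for name in names:
        match = GENE_NAME.match(name)
        if match:
            classes[match.group(1)].append(name)
        else:
            cosmids.append(name)
    return dict(classes), cosmids


if __name__ == "__main__":
    names = ["fbxc-16", "fbxc-15", "sdz-4", "C33E10.1"]
    print(group_by_gene_class(names))
```

A class with a single member (sdz in this call) corresponds to the genes that the rules say to mention first.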
  
From Michael Paulini:
 
 
<pre style="white-space: pre-wrap;  
 
<pre style="white-space: pre-wrap;  
 
white-space: -moz-pre-wrap;
 
white-space: -moz-pre-wrap;
Line 158: Line 153:
 
white-space: -o-pre-wrap;  
 
white-space: -o-pre-wrap;  
 
word-wrap: break-word">  
 
word-wrap: break-word">  
a.) orthology
+
Example 1:
it is in the ACeDB database on the genes as ortholog/paralog/ortholog_other, but we also dump it since a while each build (as  example, here: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/species/c_elegans/PRJNA13758/annotation
+
WBGene00023726
/c_elegans.PRJNA13758.WS243.orthologs.txt.gz).  
+
CBG00317 is an ortholog of C. elegans fbxc-16, fbxc-15, fbxc-18, sdz-4, fbxc-28, fbxc-19 and fbxc-12.
  
b.) homology
+
Resolves to:
1. protein homology)
+
CBG00317 is an ortholog of C. elegans sdz-4 and members of the fbxc gene class including fbxc-28, fbxc-15 and fbxc-18.
the blastx data is in the GFF files, as well as here: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/acedb                /Non_C_elegans_BLAST
 
for C.elegans the patch file is also loaded during the build, so you can find them as regular Homology_data on the respective  parent sequences in ACeDB.
 
  
the blastp data is as Homology_data on the proteins, as well as partially dumped into that one: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS243.best_blastp_hits.txt.gz
 
  
We also got protein clusters (mostly eggNOG based ones), which show homology and shared function and are connected to the member proteins in ACeDB through the Homology_group tag.
+
Example 2:
 +
Cbr-myo-2 is an ortholog of C. elegans myo-1, myo-2, myo-3, nmy-1, nmy-2, unc-54, myo-6, hum-9 and myo-5; based on protein domain information, Cbr-myo-2 is predicted to have ATP binding activity and motor activity and is localized to the myosin complex.
  
2. nucleotide homology)
+
Resolves to:
Mostly based on blat, but with the current release switched to star
+
Cbr-myo-2 is an ortholog of C. elegans unc-54, hum-9 and members of the C. elegans myo and nmy gene classes including myo-3 and nmy-2;
You can find them in the respective GFF files and also similar to the blastx as homology data on the parent sequences in ACeDB
 
  
We also got RNASeq, which currently lives as RNASeq features in the GFF and ACeDB, but also as expression level data in the Gene/Transcript/CDS.
+
Example 3:
 +
WBGene00052523
 +
CRE27108 is an ortholog of C. elegans E03H4.2, ZK1025.3, C13A2.9, C13A2.10, C33G8.13, F07G11.1 and F07G11.2.
  
And last, but not least, we got pairwise whole genome alignments for selected species, which currently we only show on EnsEMBL Genomes, but you can use the generic Compara API to pull the alignments from there.
+
CRE27108 is an ortholog of C. elegans E03H4.2, ZK1025.3, C13A2.9, C13A2.10, C33G8.13, F07G11.1 and F07G11.2.
 +
 
 +
Cannot be resolved, will have to stay as such!
 +
 
 +
Example 4:
 +
CBG03582 is an ortholog of C. elegans fbxb-9, fbxb-102, fbxb-117, fbxb-76, fbxb-79, fbxb-85, fbxb-1, fbxb-115, fbxb-2, fbxb-86, fbxb-22, sdz-33, fbxb-24, fbxb-101, fbxb-93, fbxb-94, fbxb-13, fbxb-98, fbxb-38, C33E10.1, fbxb-88, fbxb-95, fbxb-97, fbxb-96, fbxb-78, fbxb-80, fbxb-35, fbxb-36, fbxb-108, fbxb-105, sdz-10, sdz-9, fbxb-111, sdz-11, F18A12.7, fbxb-106, fbxb-107, F36H5.8, F36H5.9, fbxb-12, fbxb-71, fbxb-48, fbxb-28, fbxb-29, fbxb-30, fbxb-103, fbxb-25, fbxb-26, fbxb-17, fbxb-18, fbxb-19, fbxb-47, fbxb-72, fbxb-49, fbxb-50, fbxb-51, K05F6.4, fbxb-44, fbxb-52, fbxb-54, K05F6.8, fbxb-46, fbxb-39, fbxb-41, fbxb-40, fbxb-45, sdz-25, fbxb-31, fbxb-32, fbxb-74, fbxb-75, fbxb-10, fbxb-37, fbxb-81, fbxb-82, fbxb-83, fbxb-84, fbxb-34, fbxb-14, fbxb-99, fbxb-77, fbxb-42, fbxb-43, fbxb-15, fbxb-21, fbxb-20, fbxb-16, fbxb-87, fbxb-92, fbxb-33, fbxb-91, fbxb-90, W08F4.13 and F49B2.7;
 +
 
 +
 
 +
Resolves to:
 +
CBG03582 is an ortholog of members of the C. elegans fbxb  and sdz gene classes including fbxb-x and sdz-x, and C33E10.1, F18A12.7, F36H5.8, F36H5.9, K05F6.4, K05F6.8, W08F4.13 and F49B2.7;
  
As orthology + homology covers such a huge swath of very different data in WormBase, there is no unifying format, except ACeDB and to a certain extent GFF.
 
 
</pre>
 
</pre>
 +
 +
====Grouping human genes into families for the C.elegans descriptions====
 +
#Use the HGNC gene family dataset found here http://www.genenames.org/cgi-bin/statistics
 +
File: ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
 +
#Group human genes by their family names when 3 or more human orthologs are present, and mention the first member
 +
#If more than one family is present, include the first member from each family
 +
#If a gene does not fall into any human gene family, leave as is, and mention this gene first, with the word 'human' before it.
 +
#If a gene is the only member of a human gene family, mention the gene with the word 'human' before it, and not the family.
 +
#From the file, use columns 'Gene family tag' and 'Gene family description'
 +
#Write the sentence using 'Gene family tag' first with the word 'human' before it and then the 'Gene family description' in parentheses
 +
#If no 'Gene family tag' exists, use only 'Gene family description', do not put this in parentheses, and add the word 'family' after it; if the 'Gene family description' ends with 'family', do not add the word 'family' after it.
 +
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
 +
Template 1: If both gene family tag and gene family description are available:
 +
Sentence:
 +
<C. elegans gene> is an ortholog of human <gene1 that did not cluster> and members of the <human gene family tag> <(human gene family description)> family.
 +
 +
Template 2: If no gene family tag is available, do not use parentheses for the gene family description:
 +
Sentence:
 +
<C. elegans gene> is an ortholog of human <gene1 that did not cluster> and members of the <human gene family description> family.
 +
 +
Template 3: If no gene family tag is available, and if the gene family description ends in 'family', do not add 'family' after it.
 +
Sentence:
 +
<C. elegans gene> is an ortholog of human <gene1 that did not cluster> and members of the <human gene family description that ends in 'family'>.
 +
 +
 +
1. Example his-6:
 +
his-6 is an ortholog of human CENPA (centromere protein A), HIST1H3H (histone cluster 1, H3h), HIST3H3 (histone cluster 3, H3) and HIST2H3A (histone cluster 2, H3a); 
 +
 +
Resolves to:
 +
his-6 is an ortholog of human CENPA (centromere protein A) and members of the histones family including HIST1H3H (histone cluster 1, H3h);
 +
 +
2. Example hrp-1
 +
hrp-1 is an ortholog of human HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2), HNRNPA1 (heterogeneous nuclear ribonucleoprotein A1), HNRNPA3 (heterogeneous nuclear ribonucleoprotein A3) and HNRNPA2B1 (heterogeneous nuclear ribonucleoprotein A2/B1);
 +
 +
Resolves to:
 +
hrp-1 is an ortholog of members of the human RBM (RNA binding motif containing) family including HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2);
 +
</pre>
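
The three family templates above differ only in how the family is named. A minimal sketch of that choice follows; the function name is an assumption, and the inputs stand for the 'Gene family tag' and 'Gene family description' columns of the HGNC file.

```python
def family_phrase(tag, description):
    """Return the family wording used after 'members of'.

    Template 1: tag and description -> 'the TAG (description) family'
    Template 2: no tag              -> 'the description family'
    Template 3: no tag, description already ends in 'family'
                                    -> 'the description'
    """
    description = description.strip()
    if tag:
        return f"the {tag} ({description}) family"
    if description.lower().endswith("family"):
        return f"the {description}"
    return f"the {description} family"


if __name__ == "__main__":
    print(family_phrase("RBM", "RNA binding motif containing"))
    print(family_phrase(None, "histones"))
```

The word 'human' and the 'including <first member>' clause are left to the caller, since their placement depends on whether unclustered genes come first in the sentence.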
 +
 +
====Explanation of Orthology/homology information in WormBase====
 +
[[Orthology, Homology and Paralog data in WormBase]]
  
 
==Process==
 
Line 187: Line 236:
  
 
====Source file for Process data====
 
*Source: gene_association file for C.elegans from the WormBase FTP site:
+
*Only ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS252/ONTOLOGY/gene_association.WS252.wb.c_elegans needs to be used as a source file, since all of the automated annotations, InterPro2GO (all IEA) and Phenotype2GO (all IEA), are in this file.
 +
*No longer relevant: Source 1: gene_association file for C.elegans from the WormBase FTP site:
 
**ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
 
 +
**this file is one build behind, compared to release label for the phenotype2go annotations
 +
*No longer relevant: Source 2: (from WS250): phenotype2go file, these automated annotations are now 'IEAs', but will be treated like IMPs, if they have the 'WBPhenotype:XXXXXXX' in column 8 (with)
 +
**this file is current with the release label for the phenotype2go annotations
 +
**Source 1 and 2 will have redundant annotations but these will get resolved
 
**All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
 
 
**Need data from these rows:
 
Line 200: Line 254:
  
 
====Rules for process sentence construction====
 
*'''Rule 1''': Ignore all IEA and ISS process terms
+
*'''Rule 1''': Ignore all GO terms with the tag 'is_obsolete: true' in the obo file
*'''Rule 2''': For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
+
*'''Rule 2''': Ignore all IEA and ISS process terms
 +
*'''Rule 3''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
 +
*'''Rule 4''': For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
 
**Examples:
 
 
**Data: WBGene00001972,hmg-1.2,cell fate specification[IMP],WB_REF:WBPaper00003453|PMID:10197987,,WB,gonad development[IMP],Wnt signaling pathway[IGI],vulval development[IMP]
 
Line 207: Line 263:
 
**Data: WBGene00002148,gon-14,muscle contraction[IMP],WB_REF:WBPaper00024213|PMID:15133127,,WB,defecation[IGI],gonad morphogenesis[IMP],multicellular organism growth[IGI],regulation of pharyngeal pumping[IMP]
 
 
**Sentence: gon-14 is involved in muscle contraction, gonad morphogenesis and regulation of pharyngeal pumping, based on mutant phenotypes and in defecation and growth.
 
*'''Rule 3''': No exclusions as of 07.07.2014, leave in reproduction:
+
*'''Rule 5''': No exclusions as of 07.07.2014, leave in reproduction:
*'''Rule 4''' If a GO term has the words 'involved in' anywhere, beginning, middle or at the end, use the words 'functions in'
+
*'''Rule 6''' If a GO term has the words 'involved in' anywhere, beginning, middle or at the end, use the words 'functions in'
 
**Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
 
 
**Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
 
*'''Rule 5''': For all other Process terms the sentence will be:
+
*'''Rule 7''': For all other Process terms the sentence will be:
 
**<Gene> is involved in <process term>;
 
 
**Examples:
 
 
**Data: WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
 
 
**Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
 
*'''Rule 6''': For the term 'molting cycle, collagen and cuticulin-based cuticle' GO term, drop the comma and the words 'collagen and cuticulin-based cuticle'
+
*'''Rule 8''': For the term 'molting cycle, collagen and cuticulin-based cuticle' GO term, drop the comma and the words 'collagen and cuticulin-based cuticle'
 
**Example:
 
 
**Data: WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
 
 
**Sentence: vps-45 is involved in '''the''' molting cycle;
 
*'''Rule 7''': Replacement rule:
+
*'''Rule 9''': Replacement rule:
 
**Replace term 'multicellular organism growth' with 'growth'.
 
 
**Replace term 'embryo development ending in birth or egg hatching' with 'embryo development'
 
Line 226: Line 282:
 
***Example Data: WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
 
 
 
***Sentence: nra-2 is involved in cholinergic synaptic transmission and locomotion.
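
Several of the process rules above can be combined into one small function. The sketch below is an illustration under stated assumptions, not the production script: it drops IEA and ISS terms, applies the two replacements quoted in the replacement rule, and groups IMP terms before the remaining terms in the way the gon-14 sentence does. The 'functions in' rule, the molting-cycle rule and the obsolete-term check are not implemented, and the function names are hypothetical.

```python
COMPUTATIONAL = {"IEA", "ISS"}
REPLACEMENTS = {
    "multicellular organism growth": "growth",
    "embryo development ending in birth or egg hatching": "embryo development",
}


def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def process_sentence(gene, annotations):
    """annotations: list of (GO term, evidence code) pairs (assumed layout)."""
    imp, other = [], []
    for term, code in annotations:
        if code in COMPUTATIONAL:
            continue  # ignore IEA and ISS process terms
        term = REPLACEMENTS.get(term, term)
        if term in imp or term in other:
            continue  # keep each term once
        (imp if code == "IMP" else other).append(term)
    parts = []
    if imp:
        noun = "phenotypes" if len(imp) > 1 else "phenotype"
        parts.append(f"{join_terms(imp)}, based on mutant {noun}")
    if other:
        parts.append(("in " if imp else "") + join_terms(other))
    if not parts:
        return ""
    return f"{gene} is involved in " + " and ".join(parts) + "."


if __name__ == "__main__":
    print(process_sentence("gon-14", [
        ("muscle contraction", "IMP"),
        ("defecation", "IGI"),
        ("gonad morphogenesis", "IMP"),
        ("multicellular organism growth", "IGI"),
        ("regulation of pharyngeal pumping", "IMP"),
    ]))
```

Run on the gon-14 data quoted earlier, this reproduces the gon-14 sentence. Other examples on this page, such as elpc-3, follow a slightly different wording that the sketch does not attempt to match.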
'''Rule 8''': Granularity rule:
+
'''Rule 10''': Granularity rule:
 
 
*If a GO term in the list is a parent of (is_a or part_of parent) another GO term in the list, keep the most granular child term and ignore the parent terms.
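
The granularity rule can be sketched as an ancestor filter. The example assumes the is_a and part_of relations have already been read from the obo file into a dictionary that maps each term to its direct parents; that dictionary, the function name and the toy term names are all hypothetical.

```python
def drop_ancestors(terms, parents):
    """Keep only the most granular terms.

    terms:   list of GO terms annotated to one gene
    parents: dict mapping a term to the set of its direct is_a/part_of
             parents (assumed to be prepared elsewhere)
    """
    def ancestors(term):
        seen, stack = set(), list(parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(parents.get(parent, ()))
        return seen

    redundant = set()
    for term in terms:
        redundant |= ancestors(term)
    # A term is dropped if it is an ancestor of any other term in the list.
    return [term for term in terms if term not in redundant]


if __name__ == "__main__":
    toy_parents = {"term C": {"term B"}, "term B": {"term A"}}
    print(drop_ancestors(["term A", "term C", "term D"], toy_parents))
```

Here 'term A' is removed because it is a grandparent of 'term C', while the unrelated 'term D' is kept.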
 +
 +
====Adding elegans process information for non-C.elegans nematodes====
 +
*From WS249: Write process sentences based on experimental process info from elegans, only for those non-elegans genes that have no GO sentences
 +
*Use only non-IEA, non-ISS process terms from elegans
 +
*If only one ortholog exists, use all non-IEA, non-ISS, elegans process terms
 +
*If more than one ortholog exists, use only the common non-IEA, non-ISS, elegans process terms from the orthologs
 +
*Sentence construction:
 +
Cbr-ajm-1 is an ortholog of C. elegans ajm-1 which is involved in cell-cell junction organization and embryo development ending in birth or egg hatching.
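
The rule for one versus several orthologs amounts to a set intersection. A minimal sketch follows, assuming the experimental (non-IEA, non-ISS) process terms of each C. elegans ortholog have already been collected into a dictionary; the function name and the toy gene and term names are hypothetical.

```python
def shared_process_terms(ortholog_terms):
    """Return the process terms to transfer to a non-elegans gene.

    ortholog_terms: dict mapping each C. elegans ortholog to its list of
    experimental process terms (assumed to be filtered already).
    With one ortholog all of its terms are used; with several, only the
    terms common to every ortholog, in the order of the first ortholog.
    """
    term_lists = list(ortholog_terms.values())
    if not term_lists:
        return []
    common = set(term_lists[0]).intersection(*map(set, term_lists[1:]))
    return [term for term in term_lists[0] if term in common]


if __name__ == "__main__":
    toy = {
        "gene-a": ["locomotion", "growth", "defecation"],
        "gene-b": ["growth", "locomotion"],
    }
    print(shared_process_terms(toy))
```

With a single ortholog the intersection has nothing to remove, so all of its terms are returned.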
  
 
==Molecular function/identity==
 
Line 246: Line 312:
  
 
====Rules for molecular function sentence construction====
 
*'''Rule 1''': Exclusion list:
+
*'''Rule 1''': Ignore all GO terms with the tag 'is_obsolete: true'
 +
*'''Rule 2''': Exclusion list:
 
**Ignore the term 'protein binding'
 
 
**Ignore the term 'binding'
 
*'''Rule 2''': Order the IDA and IMP terms first in the sentence followed by ISS and IEA terms.
+
*'''Rule 3''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
*'''Rule 3''': If the evidence code is 'IEA' and 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
+
*'''Rule 4''': Order the experimental GO terms first in the sentence followed by ISS and IEA terms.
 +
*'''Rule 5''': If the evidence code is 'IEA' and 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
 
**Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
 
 
*Examples:
 
 
**WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
 
 
**alh-2 is predicted to have oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, based on protein domain information.
 
*'''Rule 3''': If Evidence code is 'ISS' add the words 'based on sequence information' to the sentence
+
*'''Rule 6''': If Evidence code is 'ISS' add the words 'based on sequence information' to the sentence
 
**WBGene00001951,hlh-4,sequence-specific DNA binding transcription factor activity[ISS],WB_REF:WBPaper00034761|PMID:19632181,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
 
 
**Sentence: hlh-4 is predicted to have sequence-specific DNA binding transcription factor activity based on sequence information, protein dimerization activity and DNA binding activity, based on protein domain information.
 
  
*'''Rule 4''': If a binding term is present add the word 'activity' to it.
+
*'''Rule 7''': If a binding term is present add the word 'activity' to it.
  
*'''Rule 5''': If the evidence code is IDA, IMP, IGI, IPI, or IEP add the words 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
+
*'''Rule 8''': If the evidence code is IDA, IMP, IGI, IPI, or IEP add the words 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
 
**IDA example:
 
 
**WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
 
**WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
Line 269: Line 337:
 
**Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.
 
**Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.
  
*'''Rule 6''': For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a structural constituent of <word>' in the sentence.
+
*'''Rule 9''': For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a structural constituent of <word>' in the sentence.
 
*Example:
 
*Example:
 
**WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
 
**WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
Line 278: Line 346:
 
**Sentence: mrpl-36 is a structural constituent of ribosome.
 
**Sentence: mrpl-36 is a structural constituent of ribosome.
  
*'''Rule 7''' Replacement
+
*'''Rule 10''' Replacement
 
**For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')
 
**For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')
  
*'''Rule 8''': If a GO term has a comma in it, then add a comma after this GO term and the next ‘and’.
+
*'''Rule 11''': If a GO term has a comma in it, then add a comma after this GO term and the next ‘and’.
*'''Rule 9''': Always put the GO terms with experimental evidence codes (EXP, IMP, IGI, IPI, IDA, IEP) first in the sentence, followed by the Automatic and Computational Analysis Evidence Codes (IEA, ISS, ISA, ISO, ISM, IGC, IBA, IBD, IKR, IRD and RCA).
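
The ordering rule and the 'binding' rule above can be sketched in Python. This is a minimal illustration, not the project's script: the function names are hypothetical, annotations are assumed to arrive as (GO term, evidence code) pairs, and a "binding term" is assumed to mean a term that ends in the word 'binding'.

```python
# Experimental evidence codes, as listed in the ordering rule above.
EXPERIMENTAL_CODES = {"EXP", "IMP", "IGI", "IPI", "IDA", "IEP"}


def order_go_terms(annotations):
    """Put terms with experimental evidence first, keeping the original
    order inside each group (Python's sort is stable)."""
    return sorted(annotations, key=lambda pair: pair[1] not in EXPERIMENTAL_CODES)


def add_activity_to_binding(term):
    """Append 'activity' to terms ending in 'binding' (assumed reading of the rule)."""
    if term.endswith("binding"):
        return term + " activity"
    return term


annotations = [
    ("protein dimerization activity", "IEA"),
    ("alpha-glucosidase activity", "IDA"),
    ("carbohydrate binding", "IEA"),
]
ordered = [(add_activity_to_binding(t), code) for t, code in order_go_terms(annotations)]
```

Terms that already end in 'activity', such as 'sequence-specific DNA binding transcription factor activity', are left unchanged by this reading of the rule.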
 
  
 
==Sub-cellular localization==
 
==Sub-cellular localization==
Line 296: Line 363:
  
 
====Rules for sub-cellular localization sentence construction====
 
====Rules for sub-cellular localization sentence construction====
*'''Rule 1''': Ignore all IEA and ISS GO terms, use only non-IEA GO terms
+
*'''Rule 1''': Ignore GO terms with the tag 'is_obsolete: true' in the obo file
**Sentence: <Gene> is localized to <component term>;
+
*'''Rule 2''': Ignore all IEA and ISS GO terms, use only non-IEA, non-ISS GO terms
*'''Rule 2''': For 'integral component of ....' terms add the words 'is an';
+
*'''Rule 4''': Ignore IBA and IBD GO terms (PAINT annotations)
 +
*'''Rule 5''': For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
 +
 
 +
*'''Rule 6''': For 'integral component of ....' terms add the words 'is an';
 
**Eg. for Rule 2: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
 
**Eg. for Rule 2: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
 
**sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
 
**sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
Line 310: Line 380:
 
**WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
 
**WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
 
**Sentence: dnc-6 is localized to the dynactin complex;
 
**Sentence: dnc-6 is localized to the dynactin complex;
 +
 +
*'''Rule 7''': For the GO term 'intracellular [IEA]', the sentence structure is different: use 'is intracellular'.
 +
**Eg.1 WBGene00089742
 +
**PPA00188 is an ortholog of C. elegans T26A5.8; based on protein domain information, PPA00188 is predicted to have sequence-specific DNA binding activity and protein heterodimerization activity and is intracellular.
  
 
==Order of sentences==
 
==Order of sentences==
*Orthology
+
#Orthology
*Process
+
#GO Process
*Function/identity
+
#GO Function/identity
*Component
+
#Tissue expression
 +
#GO Cell Component
 +
 
 +
When a gene has no automated description or has only orthology data, the order will be:
 +
#Orthology
 +
#Expression cluster data - gene regulation
 +
#Expression cluster data - molecule regulation
 +
#Expression cluster data - tissue enrichment (anatomy)
  
 
==Tissue expression==
 
==Tissue expression==
 
====Source files for Tissue expression data====
 
====Source files for Tissue expression data====
 +
*Source file for annotations:ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS250/ONTOLOGY/anatomy_association.WS250.wb
 +
**Column 2: Gene ID (WBGene00001386)
 +
**Column 3: Gene name (far-2 or F49E2.5)
 +
**These gene names need to be checked against the latest gene names in Postgres for the most recent names
 +
**Column 4: Qualifier (Certain, Partial, Uncertain, or NULL for no qualifier); use only Certain, Partial and data with no qualifier, ignore Uncertain
 +
**Column 5: Anatomy ID (WBbt:0005821), translate to name using anatomy ontology file from same release, eg: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS250/ONTOLOGY/anatomy_ontology.WS250.obo
 +
**Column 6: Reference (WB_REF:WBPaper00001469), take WBPaper00001469 to put into the reference column of file for OA import: eg., http://textpresso-dev.caltech.edu/concise_descriptions/release/WS251/c_elegans/descriptions/OA_concise_descriptions.WS251.txt
 +
**Do not use data from lines with the Paper Reference: WBPaper00040986
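
The column layout and exclusions listed above could be applied roughly as follows. This is a sketch under stated assumptions: the file is assumed to be tab-delimited with 1-based column numbers, the function name and dictionary keys are hypothetical, and 'NOT' is excluded along with 'Uncertain' because Rule 1 of the sentence construction rules says so.

```python
EXCLUDED_QUALIFIERS = {"Uncertain", "NOT"}
EXCLUDED_PAPERS = {"WBPaper00040986"}


def usable_anatomy_rows(lines):
    """Return the annotation rows that may be used for tissue expression sentences."""
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # skip headers, comments and malformed lines
        # Columns 2-6 of the file are indexes 1-5 here.
        gene_id, gene_name, qualifier, anatomy_id, reference = cols[1:6]
        paper = reference.replace("WB_REF:", "")
        if qualifier in EXCLUDED_QUALIFIERS or paper in EXCLUDED_PAPERS:
            continue
        rows.append({
            "gene_id": gene_id,
            "gene_name": gene_name,
            "qualifier": qualifier,
            "anatomy_id": anatomy_id,
            "paper": paper,
        })
    return rows
```

The gene name and anatomy ID still need to be resolved afterwards, against the Postgres gene names and the anatomy ontology file of the same release.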
 +
 +
====Previously: Source files for Tissue expression data====
 
*Source 1: Expression data  
 
*Source 1: Expression data  
 
*OA (exprpat), PG table names:
 
*OA (exprpat), PG table names:
Line 331: Line 422:
  
 
====Rules for tissue expression sentence construction====
 
====Rules for tissue expression sentence construction====
'''Rule 1''': Use only the data that has the qualifiers 'Certain' and 'Partial' and ignore all those data that have 'uncertain'.
+
'''Rule 1''': Ignore all data with the qualifiers 'Uncertain' or 'NOT'; use all data with the qualifiers 'Certain' or 'Partial', as well as data with no qualifier.
 +
 
 +
Explanation for Certain, Partial, Uncertain and NOT, as used in curation (Daniela Raciti)
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
 
 +
Partially expressed:
 +
Gene A was observed to be expressed in some cells of a group of cells that include Y. Example 1: "Expressed in 4-5 pairs of amphid neurons." You should select amphid neuron in the ’Partially Expressed in’ box.
 +
Example 2: "Expressed in the anterior intestine." Select Intestine in the ’Partially Expressed in’ box.
 +
 +
Uncertain or Possibly Expressed:
 +
Gene A was sometimes observed to be expressed in cell Y OR Gene A was observed to be expressed in a cell that could be Y.
 +
Example 1: "Occasional expression of DDL-2 in one adult intestinal cell." You should select intestinal cell in the ’Possibly expressed in’ box.
 +
Example 2: "Expression was observed less frequently in the PVPL/R interneurons." You should select PVPL and PVPR in the ’Possibly expressed in’ box.
 +
 
 +
The ‘Not expressed in’ field should contain information about where the gene product is Certainly NOT expressed in.
 +
 
 +
</pre>
 +
 
 +
For automated descriptions only data that has the qualifiers 'Certain' and 'Partial' as well as data with no qualifiers will be used.
  
 
'''Rule 2''': Pick an anatomy term only once
 
'''Rule 2''': Pick an anatomy term only once
Line 337: Line 450:
  
 
*Examples:
 
*Examples:
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 
*Data for alh-10:  
 
*Data for alh-10:  
 
*WBGene00000116 alh-10 Expr5583 nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525 Endogenous
 
*WBGene00000116 alh-10 Expr5583 nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525 Endogenous
*'''Sentence''': alh-10 is expressed in the nervous system, intestine and tail neuron;
+
*Sentence: alh-10 is expressed in the nervous system, intestine and tail neuron;
 
 
  
 
*Data for asp-5:
 
*Data for asp-5:
 
*WBGene00000218 asp-5 Expr5817 intestine WBPaper00031006,WBPaper00006525 Endogenous
 
*WBGene00000218 asp-5 Expr5817 intestine WBPaper00031006,WBPaper00006525 Endogenous
 
*WBGene00000218 asp-5 Expr4352 intestine WBPaper00028802 Endogenous
 
*WBGene00000218 asp-5 Expr4352 intestine WBPaper00028802 Endogenous
*'''Sentence''': asp-5 is expressed in the intestine;
+
*Sentence: asp-5 is expressed in the intestine;
 
 
  
 
*Data for ccr-4:
 
*Data for ccr-4:
Line 355: Line 471:
 
*WBGene00000376 ccr-4 Expr4480 pharynx,body wall musculature,head neuron,tail neuron *WBPaper00027076 Endogenous
 
*WBGene00000376 ccr-4 Expr4480 pharynx,body wall musculature,head neuron,tail neuron *WBPaper00027076 Endogenous
 
*WBGene00000376 ccr-4 Expr4480 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
 
*WBGene00000376 ccr-4 Expr4480 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
*'''Sentence''': ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
+
*Sentence: ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
 +
</pre>
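
Rule 2 (pick an anatomy term only once) amounts to an order-preserving de-duplication across all Expr rows of a gene, as in the asp-5 example. A minimal sketch, assuming each row's anatomy column is a comma-separated string; the function name is hypothetical:

```python
def unique_anatomy_terms(anatomy_columns):
    """Merge the anatomy columns of all Expr rows for one gene,
    keeping each term once, in first-seen order."""
    seen = set()
    ordered = []
    for column in anatomy_columns:
        for term in column.split(","):
            term = term.strip()
            if term and term not in seen:
                seen.add(term)
                ordered.append(term)
    return ordered


# asp-5 has 'intestine' in both Expr5817 and Expr4352.
asp5_terms = unique_anatomy_terms(["intestine", "intestine"])
```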
  
 
*'''Rule 3''': Replacement Rule
 
*'''Rule 3''': Replacement Rule
*Replacement 1: If the anatomy term 'cell' occurs by itself, then use the words 'widely expressed' instead of 'cell'.
+
*1: If the anatomy term 'cell' occurs by itself, then use the words 'widely expressed' instead of 'cell'.
 
**sentence: col-178 is expressed in the Cell;
 
**sentence: col-178 is expressed in the Cell;
 
**Becomes: col-178 is expressed widely.
 
**Becomes: col-178 is expressed widely.
  
*Replacement 2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
+
*2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
 
**Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
 
**Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
 
**Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.
 
**Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.
  
*Replacement 3: If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
+
*3: If the anatomy term 'cell' is present with either the term 'hermaphrodite' or the term 'male':
 +
**Sentence: <Gene> is expressed in several tissues and in the hermaphrodite.
 +
**Sentence: <Gene> is expressed in several tissues and in the male.
 +
 
 +
*4. If the anatomy terms 'cell, hermaphrodite and male' are present:
 +
**Sentence: <Gene> is expressed in several tissues and in the hermaphrodite and the male.
 +
 
 +
*5. If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
 
**Sentence: ceh-82 is expressed in the neuron;
 
**Sentence: ceh-82 is expressed in the neuron;
 
**Becomes: ceh-82 is expressed in the nervous system;
 
**Becomes: ceh-82 is expressed in the nervous system;
  
 
+
*6: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, except for the exceptions:
*Replacement 4: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, except for the exceptions:
 
 
**Exceptions:  
 
**Exceptions:  
 
**I3 neuron
 
**I3 neuron
Line 384: Line 507:
 
**Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;
 
**Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;
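
Some of the replacement rules above ('cell' alone, 'cell' with other terms, 'neuron' alone, and pluralizing '... neuron' terms) can be sketched as below. The function names are hypothetical, the hermaphrodite/male cases are not covered, and the exception set holds only the one exception visible in this section, so it is incomplete. Comma placement before 'and' varies between the examples on this page; the sketch uses no comma before 'and'.

```python
# Partial list: only the exception shown on this page.
NEURON_PLURAL_EXCEPTIONS = {"I3 neuron"}


def join_terms(terms):
    """Join terms as 'a, b and c'."""
    if len(terms) == 1:
        return terms[0]
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def pluralize_neuron(term):
    """Pluralize compound terms such as 'head neuron', but not bare 'neuron'."""
    if term.endswith(" neuron") and term not in NEURON_PLURAL_EXCEPTIONS:
        return term + "s"
    return term


def expression_sentence(gene, terms):
    has_cell = any(t.lower() == "cell" for t in terms)
    others = [pluralize_neuron(t) for t in terms if t.lower() != "cell"]
    if has_cell and not others:
        return f"{gene} is expressed widely."
    if has_cell:
        return f"{gene} is expressed in several tissues including the {join_terms(others)}."
    if others == ["neuron"]:
        return f"{gene} is expressed in the nervous system;"
    return f"{gene} is expressed in the {join_terms(others)};"
```

For example, `expression_sentence("frm-1", ["intestine", "pharynx", "Cell"])` gives the frm-1 sentence shown under replacement 2.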
  
 +
'''Rule 5''': Granularity Rule: When a string of cell names is present, check to see if any parent terms are present in any ExprXXX for that gene, keep the highest common parent term available.
 +
 +
'''Rule 6''': Clustering/Grouping: Group terms according to their immediate parent terms, and use the parent term in the sentence. If a term cannot be grouped, leave as such in sentence.
 +
**Needs an input file of anatomy term annotations and immediate parents
 +
***Will contain Gene Id, Gene, anatomy terms, immediate 'instance_of' parent terms
 +
 +
==Expression cluster data==
 +
*Contact person for data-related questions: Wen Chen
 +
*This data will be used when there is no automated description or when there is only orthology data
 +
*Order for concatenation: gene regulation, molecule regulation, enrichment in tissues (anatomy)
 +
 +
====Location of files====
 +
*ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/
 +
*Wen will update files for each WB release
 +
*Wen will separate out genes by species; currently they are mixed together.
 +
 +
===Anatomy expression cluster data===
 +
*File:ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/bmaECsummary_anatomy.WS252.txt
 +
#Include all data for all genes
 +
#Data is in 5 columns: Gene ID, Public Name, Relationship, Anatomy name, Experiment Type
 +
#Take the WBGeneID (Column 1) and then map this to the most current Brugia malayi gene name from our latest gene names list
 +
#Expand the experiment type 'RNA seq' to 'RNA sequencing'
 +
#Template: <Experiment Type> indicates that <gene> <relationship> in the <Anatomy name/s>.
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
Example 1:
 +
Data:WBGene00220481 Bm220 Enriched in bodywall proteomic study
 +
Sentence: Proteomic studies indicate that Bm220 is enriched in the body wall.
 +
 +
Example 2:
 +
Data:WBGene00222508 Bm2247 Enriched in reproductive tract proteomic study
 +
Sentence: Proteomic studies indicate that Bm2247 is enriched in the reproductive tract.
 +
 +
Example 3:
 +
Data:WBGene00222254 Bma-ckk-1 Enriched in digestive tract proteomic study
 +
Sentence: Proteomic studies indicate that Bma-ckk-1 is enriched in the digestive tract.
 +
</pre>
 +
 +
====Rules for sentence construction for C. elegans genes====
 +
*File:ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ceECsummary_anatomy.WS252.txt
 +
 +
#Use this data when no other data is available for a gene's automated description.
 +
#Use this data when the automated description has only orthology data.
 +
#Use this data when the automated description has no tissue expression data
 +
#When the data is used, use all the rules for tissue expression pattern data
 +
#Group all neurons together in the sentence constructed.
 +
#When only 'neuron' is present construct the following: <Experiment type> studies indicate that <gene> is enriched in neurons.
 +
#When neuron is present with other specific neuron types, group 'neuron' and specific neuron types together so that the following sentence can be constructed: <Experiment type> studies indicate that <gene> is enriched in neurons including <specific neuron names>.
 +
#Expand the experiment type 'RNA seq' to 'RNA sequencing'.
 +
#Pluralize 'coelomocyte' to 'coelomocytes'.
 +
#In general pluralize cell names, add the definite article 'the' where necessary. Exceptions: 'anchor cell'
 +
#For the term 'male-specific', modify to 'male-specific tissues' and place last when other anatomy terms are present.
 +
#For the term 'hermaphrodite-specific' modify to 'hermaphrodite-specific tissues' and place last when other anatomy terms are present.
 +
Template: <Experiment type> studies indicate that <gene> is enriched in <Anatomy names>.
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
 +
Example 1:
 +
Data: WBGene00022866 ZK1240.1 Enriched in neuron tiling array
 +
Sentence: Tiling array studies indicate that ZK1240.1 is enriched in neurons.
 +
 +
Example 2:
 +
Data:
 +
WBGene00000048 acr-9 Enriched in GABAergic neuron,DA neuron,VA neuron,neuron,AVE,coelomocyte,ventral nerve cord microarray,tiling array
 +
Sentence:
 +
Microarray and tiling array studies indicate that acr-9 is enriched in neurons including the GABAergic, DA, VA, and AVE neurons, ventral nerve cord and in the coelomocytes.
 +
 +
Example 3:
 +
Data: WBGene00022803 ZK688.9 Enriched in body wall muscle cell,AFD,AWB,I5 neuron,DA neuron,SAB,retrovesicular ganglion,neuron tiling array,microarray
 +
Sentence: Tiling array and microarray studies indicate that ZK688.9 is enriched in body wall muscle cells, and the neurons including the AFD, AWB, I5, DA, SAB and retrovesicular ganglion.
 +
 +
Example 4:
 +
Data: WBGene00022824 ZK813.5 Enriched in BAG,NSM tiling array,RNA-seq
 +
Sentence: Tiling array and RNA sequencing studies indicate that ZK813.5 is enriched in the BAG and NSM neurons.
 +
 +
Example 5:
 +
Data: WBGene00000675 col-101 Enriched in PVD,OLL,cephalic sheath cell,coelomocyte,dopaminergic neuron,ventral nerve cord,hypodermis,germ line microarray,tiling array,RNA-seq
 +
Sentence: Microarray, tiling array and RNA sequencing studies indicate that col-101 is enriched in PVD, OLL and dopaminergic neurons and the cephalic sheath cells, coelomocytes, ventral nerve cord, hypodermis and the germ line.
 +
 +
Example 6:
 +
Data: WBGene00022867 ZK1240.2 Enriched in intestine tiling array
 +
Sentence: Tiling array studies indicate that ZK1240.2 is enriched in the intestine.
 +
 +
Example 7:
 +
Data: WBGene00008175 C48B4.12 Enriched in male-specific microarray
 +
Sentence: Microarray studies indicate that C48B4.12 is enriched in male-specific tissues.
 +
 +
Example 8:
 +
Data: WBGene00008447 E01G4.5 Enriched in pharynx,body wall muscle cell,male-specific,muscle cell,intestine microarray,tiling array
 +
Sentence: Microarray and tiling array studies indicate that E01G4.5 is enriched in the pharynx, body wall muscle cells, muscle cells, intestine and male-specific tissues.
 +
 +
Example 9:
 +
Data: WBGene00009453 F36A2.3 Enriched in germline precursor cell,hypodermis,hermaphrodite-specific tiling array,microarray
 +
Sentence: Tiling array and microarray studies indicate that F36A2.3 is enriched in germline precursor cells, hypodermis and hermaphrodite-specific tissues.
 +
 +
</pre>
 +
 +
====Rules for sentence construction for Pristionchus pacificus (ppa) genes====
 +
File: ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ppaECsummary_anatomy.WS252.txt
 +
*Include all data for all genes
 +
*Expand the experiment type 'RNA seq' to 'RNA sequencing'
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
 +
Example 1:
 +
Data:WBGene00102297 Ppa-dnj-22 Enriched in germ line microarray
 +
Sentence: Microarray studies indicate that Ppa-dnj-22 is enriched in the germline.
 +
 +
</pre>
 +
 +
===Gene regulation expression cluster data===
 +
File: ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ceECsummary_geneReg.WS252.txt
 +
*Include this data only when a gene has no concise description or no automated description
 +
*Expand the experiment type 'RNA seq' to 'RNA sequencing'
 +
*Add the word 'expression' after the gene name in the sentence: <Experiment Type> indicate that <gene name> expression is regulated by <Regulator Gene Name 1, Regulator Gene Name 2 and Regulator Gene Name 3>.
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
 +
Example 1:
 +
Data:WBGene00220086 T23D8.12 Regulated by fbf-1,rnp-8,gld-1,aak-2,tdp-1,isp-1 microarray
 +
Sentence: Microarray studies indicate that T23D8.12 expression is regulated by fbf-1, rnp-8, gld-1, aak-2, tdp-1 and isp-1.
 +
 +
Example 2:
 +
Data: WBGene00220023 K11C4.14 Regulated by prg-1,dcr-1 RNA-seq,microarray
 +
Sentence: RNA sequencing and microarray studies indicate that K11C4.14 expression is regulated by prg-1 and dcr-1.
 +
</pre>
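
The gene regulation template and the two examples above can be reproduced with a short sketch. The function names are hypothetical; the input is assumed to be already split into a gene name, a list of regulators and a list of experiment types, and both 'RNA seq' and 'RNA-seq' spellings are expanded.

```python
EXPERIMENT_NAMES = {"RNA seq": "RNA sequencing", "RNA-seq": "RNA sequencing"}


def join_items(items):
    """Join items as 'a, b and c'."""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def regulation_sentence(gene, regulators, experiment_types):
    """Build '<Experiment types> studies indicate that <gene> expression is regulated by ...'."""
    expanded = [EXPERIMENT_NAMES.get(e, e) for e in experiment_types]
    subject = join_items(expanded)
    subject = subject[0].upper() + subject[1:]  # capitalize the first letter only
    return (f"{subject} studies indicate that {gene} expression "
            f"is regulated by {join_items(regulators)}.")


sentence = regulation_sentence("K11C4.14", ["prg-1", "dcr-1"], ["RNA-seq", "microarray"])
```

The same function would serve the molecule regulation data, since it copies the regulator names through without changing their capitalization.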
 +
 +
===Molecule regulation expression cluster data===
 +
File: ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ceECsummary_molReg.WS252.txt
 +
 +
*Include only for genes that have no concise description and no automated description
 +
*Follow the capitalization of molecules as in the data source file
 +
*Expand the experiment type 'RNA seq' to 'RNA sequencing'
 +
*Add the word 'expression' after gene name in sentence: <Experiment Type> indicate that <gene name> expression is regulated by <Chemical Name 1, Chemical Name 2 and Chemical Name 3>.
 +
*Pluralize the Chemical Name 'adsorbable organic bromine compound' to 'adsorbable organic bromine compounds'
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
 +
Example 1:
 +
Data: WBGene00235256 Y54G11A.19 Regulated by Chlorpyrifos,Diazinon microarray
 +
Sentence: Microarray studies indicate that Y54G11A.19 expression is regulated by Chlorpyrifos and Diazinon.
 +
 +
Example 2:
 +
Data: WBGene00012041 T26E3.8 Regulated by 1-methylnicotinamide,methylmercuric chloride,resveratrol,Atrazine,adsorbable organic bromine compound RNA-seq,microarray
 +
Sentence: RNA sequencing and microarray studies indicate that T26E3.8 expression is regulated by 1-methylnicotinamide, methylmercuric chloride, resveratrol, Atrazine and adsorbable organic bromine compounds.
  
'''Rule 5''': Granularity Rule: When a string of cell names are present, check to see if any parent terms are present in any ExprXXX for that gene, keep the highest common parent term available.
+
</pre>
  
 
==Preliminary results==
 
==Preliminary results==
Line 412: Line 700:
 
!OA field <br/>number !! OA field name !! Data to be inserted !! Example of data <br/>to be inserted!! Required or Not !! OA table name
 
!OA field <br/>number !! OA field name !! Data to be inserted !! Example of data <br/>to be inserted!! Required or Not !! OA table name
 
|-
 
|-
|1 ||WBGene|| WBGene||WBGene000||Required||con_wbgene
+
|1 ||WBGene|| WBGene||WBGene00000376||Required||con_wbgene
 +
|-
 +
|2 ||Species|| Species||Onchocerca volvulus||Required||con_species
 
|-
 
|-
|2 ||Curator||Name of Curator || James Done(first then replace with) Ranjana Kishore<br/> (insert for all rows) ||Required||con_curator
+
|3 ||Curator||Name of Curator || James Done(first then replace with) Ranjana Kishore<br/> (insert for all rows) ||Required||con_curator
 
|-
 
|-
|3 ||Curator History|| Name of Curator ||same as pgid<br/>(insert for all rows)||Required||con_curhistory  
+
|4 ||Curator History|| Name of Curator ||same as pgid<br/>(insert for all rows)||Required||con_curhistory  
 
|-
 
|-
|4 ||Description Type|| Automated_concise_description<br/>(insert for all rows)||Automated_concise_description||Required||con_desctype
+
|5 ||Description Type|| Automated_concise_description<br/>(insert for all rows)||Automated_concise_description||Required||con_desctype
 
|-
 
|-
|5 ||Description Text|| the automated concise description ||asp-19 encodes an ortholog...||Required||con_desctext  
+
|6 ||Description Text|| the automated concise description ||asp-19 encodes an ortholog...||Required||con_desctext  
 
|-
 
|-
|6 ||Reference||WBPaper||WBPaper00026979||Required||con_paper
+
|7 ||Reference||WBPaper||WBPaper00026979||Required||con_paper
 
|-
 
|-
|7 ||Accession Evidence||For Homology use ENSEMBL ID<br/> For Process, Function, use InterPro ID||ENSEMBL:ENSP00000431595<br/>INTERPRO:IPR002293<br/>(comma separate multiple values)||Not required||con_accession
+
|8 ||Accession Evidence||For Homology, for elegans, use ENSEMBL Gene ID, for non-elegans species use WBGeneID<br/>For Process, Function, use InterPro ID||For elegans: ENSEMBL:ENSG00000103257 (previously used the ENSEMBL protein IDs) and INTERPRO:IPR002293<br/>For non-elegans species: WBGene00007443 and INTERPRO:IPR002293 <br/>(comma separate multiple values)||Not required||con_accession
 
|-
 
|-
|8 ||Last Updated||Date when the descriptions<br/>were last generated||2014-09-11||Required||con_lastupdate
+
|9 ||Last Updated||Date when the descriptions<br/>were last generated||2014-09-11||Required||con_lastupdate
 
|-
 
|-
|9 || pgid||pgid||1149<br/>(Postgres will generate)||Required||
+
|10 || pgid||pgid||1149<br/>(Postgres will generate)||Required||
 
|-
 
|-
 
|}
 
|}
  
 
==Tab-delimited file for OA insert==
 
==Tab-delimited file for OA insert==
*Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Inferred_automatically text
+
*One tab-delimited file per species
 +
*Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
 
*Format: tab-delimited file, comma separate the values when multiple values are present
 
*Format: tab-delimited file, comma separate the values when multiple values are present
 
*Date is the last date that the script was run to generate the automated descriptions (eg. 2014-05-28)
 
*Date is the last date that the script was run to generate the automated descriptions (eg. 2014-05-28)
 
*File will be placed on textpresso-dev to be picked up by a cron job by JC
 
*File will be placed on textpresso-dev to be picked up by a cron job by JC
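
One row of the tab-delimited OA file, in the column order given above, could be assembled as follows. This is a sketch: the function name is hypothetical, and rejecting tabs or newlines inside a field is an added safeguard rather than something this page specifies.

```python
def oa_row(wbgene, date, papers, accessions, description, species, inferred_text):
    """Build one tab-delimited row: WBGene, Date, Paper, Accession_evidence,
    Automated_concise_description, Species, Inferred_automatically text."""
    fields = [
        wbgene,
        date,
        ",".join(papers),       # multiple values are comma-separated
        ",".join(accessions),
        description,
        species,
        inferred_text,
    ]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


row = oa_row(
    "WBGene00000376",
    "2014-05-28",
    ["WBPaper00026979", "WBPaper00031006"],
    ["INTERPRO:IPR002293"],
    "ccr-4 is expressed in the pharynx;",
    "Caenorhabditis elegans",
    "This description was generated automatically.",
)
```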
  
==Inserting automated descriptions into postgres==
+
==Directory structure for project==
====Location of files for OA import====
 
 
*http://textpresso-dev.caltech.edu/concise_descriptions/  Top level parent directory for project
 
*http://textpresso-dev.caltech.edu/concise_descriptions/  Top level parent directory for project
 
*http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  Indicates what release the file corresponds to  
 
*http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  Indicates what release the file corresponds to  
Line 445: Line 735:
 
*http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt  WS247 elegans file for import into OA
 
*http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt  WS247 elegans file for import into OA
  
 +
==Inserting automated descriptions into postgres==
 
====Populating script====
 
====Populating script====
Run the script to populate from here:
+
'''Run the script to populate from here:'''
 
/home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl
 
/home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl
 +
 +
Use the command 'screen' after ssh-ing into Tazendra to keep the session alive (the run takes about 2 hours)
  
 
Script actually at:
 
Script actually at:
 
/home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl
 
/home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl
  
Script gets data from:
+
Script looks at
http://textpresso-dev.caltech.edu/concise_descriptions/semantic_categories/concise_descriptions/OA_concise_descriptions.txt
+
http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt  for release number
 +
and
 +
 
 +
http://textpresso-dev.caltech.edu/concise_descriptions/species.txt  for the different species
 +
 
 +
Script gets data from the following URL for each of the species:
 +
For elegans: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt
 +
 
 +
For briggsae:
 +
http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_briggsae/descriptions/OA_concise_descriptions.WS247.txt
 +
 
 +
and so on.
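
The per-species URL pattern shown above can be expressed as a small helper. This is a sketch: the function name is hypothetical, and the release and species directory are passed in directly because the formats of production_release.txt and species.txt are not described on this page.

```python
BASE_URL = "http://textpresso-dev.caltech.edu/concise_descriptions"


def oa_file_url(release, species_dir):
    """Return the OA import file URL for one release and species directory."""
    return (f"{BASE_URL}/release/{release}/{species_dir}"
            f"/descriptions/OA_concise_descriptions.{release}.txt")


urls = [oa_file_url("WS247", species) for species in ("c_elegans", "c_briggsae")]
```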
  
 
When populating for each OA row that has con_desctype set to 'Automated_concise_description', it will delete from these tables :
 
When populating for each OA row that has con_desctype set to 'Automated_concise_description', it will delete from these tables :
Line 465: Line 769:
 
*con_inferredauto
 
*con_inferredauto
 
*con_lastupdate
 
*con_lastupdate
 +
 +
Leaves
 +
*con_desctype
 +
*pgid
  
 
Meaning that it's keeping the 'Automated_concise_description' value, so that future runs of the script will reuse existing pgids.  When running out of pre-existing pgids, it will create a new one and assign 'Automated_concise_description' to con_desctype
 
Meaning that it's keeping the 'Automated_concise_description' value, so that future runs of the script will reuse existing pgids.  When running out of pre-existing pgids, it will create a new one and assign 'Automated_concise_description' to con_desctype
Line 479: Line 787:
 
*Concise descriptions dumper at /home/postgres/work/citace_upload/concise/dump_concise.pl
 
*Concise descriptions dumper at /home/postgres/work/citace_upload/concise/dump_concise.pl
 
*/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
 
*/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
 +
 +
====Change for the WS265 data upload, March 2018====
 +
Will dump both automated and concise descriptions. The script was changed on Mangolassi; check it before making it live on Tazendra.
  
 
====Script that finds genes with concise descriptions that also have an automated description====
 
====Script that finds genes with concise descriptions that also have an automated description====
Line 494: Line 805:
 
*which you can see on the web at:http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
 
*which you can see on the web at:http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
 
*Then on spica, login, go to the Data_for_citace/Data_from_Kimberly directory, remove the existing file, and upload the latest file using: wget http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
 
*Then on spica, login, go to the Data_for_citace/Data_from_Kimberly directory, remove the existing file, and upload the latest file using: wget http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
 +
 +
====Changes to dumping script====
 +
*Jan 21st, 2016: 'Inferred_automatically' tag
 +
**Script was splitting on commas for all evidences: paper, inferredauto, person accession, exprtext, rnai microarray and lastupdate, so the text in the 'Inferred_automatically' tag was being split on the commas in the natural language sentence, 'This description was generated automatically by a Textpresso script based on homology/orthology data, Gene Ontology (GO) annotations and tissue expression data from the WS251 version of WormBase.'
  
 
====Text for Automatically_inferred tag====
 
====Text for Automatically_inferred tag====
Line 500: Line 815:
 
Text will be: This description was automatically generated by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations from the <WS246> version of WormBase.
 
Text will be: This description was automatically generated by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations from the <WS246> version of WormBase.
  
====Rules for dumping the different types of descriptions in the OA====
+
====Rules for dumping the different types of descriptions in the OA into .ace====
====.ace format====
+
 
List of tags to be dumped:
+
1. List of tags to be dumped:
 
*Automated_description
 
*Automated_description
 
*Paper_evidence
 
*Paper_evidence
Line 508: Line 823:
 
*Date_last_updated
 
*Date_last_updated
 
*Inferred_automatically
 
*Inferred_automatically
 +
 +
2. For those genes that have both automated and concise descriptions, we dump only the concise descriptions
 +
 +
3. For genes whose concise descriptions carry the 'NO DUMP' flag, we dump the automated descriptions
 +
 +
4. Dead and merged genes are not dumped (for both concise and automated descriptions)
 +
 +
====.ace format example====
  
 
<pre style="white-space: pre-wrap;  
 
<pre style="white-space: pre-wrap;  
Line 527: Line 850:
 
*Tested concise_dump_new.ace on an empty citaceminus mirror on a local machine with the WS246 models file saved; it read in fine.
 
*Tested concise_dump_new.ace on an empty citaceminus mirror on a local machine with the WS246 models file saved; it read in fine.
 
*WS246 numbers: Total genes: 9,529 genes, Genes with automated descriptions: 3,362; 85,039 lines
 
*WS246 numbers: Total genes: 9,529 genes, Genes with automated descriptions: 3,362; 85,039 lines
 +
 +
==Postgres queries==
 +
====For genes that have more than one concise description====
 +
SELECT con_wbgene, COUNT(*) AS count FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') GROUP BY con_wbgene HAVING COUNT(*) > 1;
 +
 +
WBGene00003979 | pes-5, curated as transposon by P. Davis, has 2 concise, 1 automated
 +
WBGene00007839 | C31C9.5, curated as transposon by P. Davis, has 2 concise, 1 automated
 +
WBGene00000029 | Older concise has 'NO DUMP' flag, valid one is dumpable, 1 automated
 +
WBGene00002992 | Older concise has 'NO DUMP' flag, valid one is dumpable, 1 automated
 +
WBGene00003071 |
 +
WBGene00004174 |
 +
 +
====Number of genes with automated descriptions====
 +
SELECT COUNT(*) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description');
 +
 +
====Number of genes with both concise and automated descriptions====
 +
SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description') AND con_wbgene IN (SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') );
 +
(=5241 for WS252)
 +
 +
====Genes that have a concise description but no automated description====
 +
SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') AND con_wbgene NOT IN (SELECT con_wbgene FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description')) ORDER BY con_wbgene;
 +
 +
====Genes that have a concise description, no automated description, excluding NO DUMP====
Genes with dumpable concise descriptions that have no dumpable automated descriptions. <br/>
 +
SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') AND con_wbgene NOT IN (SELECT con_wbgene FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') ) ORDER BY con_wbgene;
 +
 +
====Same as above, but excluding dead genes====
 +
SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') AND con_wbgene NOT IN (SELECT con_wbgene FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') ) AND con_wbgene NOT IN (SELECT gin_wbgene FROM gin_wbgene WHERE joinkey IN (SELECT joinkey FROM gin_dead)) ORDER BY con_wbgene;
  
 
==Reporting numbers==
 
==Reporting numbers==
Line 554: Line 904:
 
Information Processing and Management 43 (2007) 1777–1791
 
Information Processing and Management 43 (2007) 1777–1791
 
   
 
   
 +
==Project milestones by release==
 +
====WS246====
 +
*Wrote descriptions for C. elegans; these included orthology (based on BlastP) and process, function and cellular component (based on GO).
 +
*Used BlastP for human orthologs and the orthologs file
 +
 +
====WS247====
 +
*Added C. briggsae and P. pacificus
 +
 +
====WS248====
 +
*Added 5 more species: B. malayi, C. brenneri, C. japonica, C. remanei, O. volvulus and P. pacificus
 +
*Orthology: Switched from using BlastP to using all orthologs from the orthology file generated by K. Howe; used only those human orthologs confirmed by more than one method.
 +
*Used the HGNC symbol as names for the human genes and placed the full name (description) inside parentheses.
 +
 +
====WS249====
 +
*Process: Added C. elegans gene experimental process information to non-elegans species, that have no GO annotations.
 +
*Tissue expression: Began work on tissue expression for C. elegans, wrote the sentences but did not include in the descriptions.
 +
 +
====WS250====
 +
*Gene names: Pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (as previously done)
 +
*Tissue expression: Continued working on tissue expression, did not include in descriptions.
 +
 +
====WS251====
 +
*Orthology: Used C. elegans gene classes to group the C.elegans orthologs of non-elegans species
 +
 +
====WS252====
 +
*Orthology: Used HGNC human gene families to group the human orthologs for C. elegans.
 +
*Since the rule was to use the common GO terms, and the cosmid.genes did not have GO terms (they are less likely to have GO terms than named genes), we lost GO processes for the orthologs (these are C. elegans genes that we list as orthologs for non-elegans species).
 +
*Strict separation of sentences by semantic category for all species
 +
*first sentence will have orthology only, using gene classes and publication popularity to reduce the number of orthologs (for non-elegans species); the second sentence will have orthologs listed by common GO term for process.
 +
*all cosmid.gene names ignored for finding the GO terms (since the rule is to find the common GO terms for all the orthologs, including even one cosmid.gene without a GO process term would result in loss of all GO processes for the elegans gene orthologs for the process sentences for non-elegans species).
 +
*Added tissue expression
 +
 +
====WS253====
 +
*Added expression cluster data for C. elegans (~10,000 genes) and very few for Brugia and Pristionchus.
 +
*Expression cluster data is gene regulation ('regulated by gene'), molecule regulation ('regulated by molecule') and anatomy enrichment ('enriched in anatomy_term').
 +
 
==Automated descriptions for ''C. briggsae''==
 
==Automated descriptions for ''C. briggsae''==
 
[[Automated descriptions for C. briggsae]]
 
[[Automated descriptions for C. briggsae]]
 +
 +
==Issues to address by release==
 +
*1. For the non-elegans species, need to use the WBGeneID as Accession_evidence, so can we use our own database, and do: Accession_evidence "WormBase" "WBGene00005678" (not done)
 +
*2. For non-elegans species, need to add in elegans gene experimental data, but with multiple orthologs, which do we pick?  (done, using only the common GO terms for orthologs)
 +
*3. Gene names should be pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (will be addressed for WS250; done)
 +
*4. For WS250, check that the gene names are (somewhat) current; corrected mel-47 to tofu-6 by hand for WS249.
 +
*5. For WS252: Issue: When new orthologs get added and they are cosmid.gene names, since the rule was to use the common GO terms, and the cosmid.genes did not have GO terms (they are less likely to have GO terms than named genes), we lost GO processes for the orthologs (these are C. elegans genes that we list as orthologs for non-elegans species).
 +
Solution: Separate by semantic category, first sentence will have orthology only, using gene classes and publication popularity to reduce the number of orthologs; second sentence will have orthologs listed by common GO term for process; ignore all cosmid.gene names for finding the GO terms.
 +
*For WS253 or WS254: Wen needs to separate the different species into different files for expression clusters
 +
 +
<pre style="white-space: pre-wrap;
 +
white-space: -moz-pre-wrap;
 +
white-space: -pre-wrap;
 +
white-space: -o-pre-wrap;
 +
word-wrap: break-word">
 +
Example 1:WBGene00124170
 +
WS250:
 +
CJA04965 is an ortholog of C. elegans msp-10, msp-36, msp-56 and msp-76, which are involved in lipid storage.
 +
WS251:
 +
CJA04965 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and msp-36,
 +
and ZK1248.17 and Y50E8A.2.
 +
WS252:
 +
CJA04965 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and msp-36,
 +
and ZK1248.17 and Y50E8A.2; in C. elegans, msp-10, msp-36, msp-56 and msp-76 are involved in lipid storage.
 +
 +
Example 2: WBGene00127848
 +
WS250:CJA08645 is an ortholog of C. elegans msp-10, msp-36, msp-56 and msp-76, which are involved in lipid storage.
 +
WS251:CJA08645 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and
 +
msp-36, and ZK1248.17 and Y50E8A.2.
 +
WS252:CJA08645 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and
 +
msp-36, and ZK1248.17 and Y50E8A.2; in C. elegans, msp-10, msp-36, msp-56 and msp-76 are involved in lipid storage.
 +
</pre>
 +
 +
==Automated descriptions software==
 +
[[Documentation for workflow and scripts]]
 +
 +
==Phasing out the manual annotations==
 +
[[Phasing out the manual annotations]]
  
 
Back To [[Concise Descriptions]]
 
Back To [[Concise Descriptions]]

Latest revision as of 22:39, 5 July 2018

Location of project-related files on Textpresso

http://textpresso-dev.caltech.edu/concise_descriptions/

Generate top level directory structures for project

  1. Generate directory structure similar to current structure: http://textpresso-dev.caltech.edu/concise_descriptions/
  2. Create all top-level directories down to the file level; files and file names may differ.

Obtaining gene sets

Obtaining a set of genes and gene names

  1. Top level directory is 'gene_lists' under ~/release/<WSXXX>/<species>/ for all nine species
  2. Obtain the complete list of gene names and IDs from postgres for the nine species; only gene IDs (gin_wbgene) and gene names (gin_locus) are used.
  3. Write descriptions only for live genes; discard the dead genes.
  4. Write live genes into the 'live genes' file.

Current semantic categories and their order in a gene description

  1. Orthology
  2. Biological Process
  3. Molecular Function
  4. Tissue expression
  5. Sub-cellular localization
  6. Drug/chemical/gene regulation from large-scale data summaries

Orthology/Homology

Orthology source files (in use from WS248, March 2015; from WS247, switched to Ensembl genes instead of proteins):

  • Orthology files are obtained from the EBI FTP site: ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases/<release>/species
    • C. elegans: ~/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS262.orthologs.txt.gz
    • C. briggsae:~/c_briggsae/PRJNA10731/annotation/c_briggsae.PRJNA10731.WS262.orthologs.txt.gz
    • C. japonica:~/c_japonica/PRJNA12591/annotation/c_japonica.PRJNA12591.WS262.orthologs.txt.gz
    • C. remanei:~/c_remanei/PRJNA53967/annotation/c_remanei.PRJNA53967.WS262.orthologs.txt.gz
    • C. brenneri:~/c_brenneri/PRJNA20035/annotation/c_brenneri.PRJNA20035.WS262.orthologs.txt.gz
    • Brugia malayi:~/b_malayi/PRJNA10729/annotation/b_malayi.PRJNA10729.WS262.orthologs.txt.gz
    • Pristionchus pacificus: ~/p_pacificus/PRJNA12644/annotation/p_pacificus.PRJNA12644.WS262.orthologs.txt.gz
    • O. volvulus:~/o_volvulus/PRJEB513/annotation/o_volvulus.PRJEB513.WS262.orthologs.txt.gz
    • S. ratti:~/s_ratti/PRJEB125/annotation/s_ratti.PRJEB125.WS262.orthologs.txt.gz

Rules

  1. Species list and orthology rule:
    • Caenorhabditis elegans (WS246) --use orthology to human
    • Caenorhabditis briggsae (WS247) --use orthology to elegans
    • Caenorhabditis japonica (WS248) --use orthology to elegans
    • Caenorhabditis remanei (WS248) --use orthology to elegans
    • Caenorhabditis brenneri (WS248) --use orthology to elegans
    • Brugia malayi (WS248) --use orthology to elegans, if not present use orthology to Onchocerca
    • Onchocerca volvulus (WS248) --use orthology to elegans, if not present, then use orthology to Brugia
    • Pristionchus pacificus (WS248) --use orthology to elegans
    • Strongyloides ratti (WS250) --use orthology to elegans, if not present, use orthology to Brugia and Onchocerca
  2. For orthology to human genes, use only those human genes that have been predicted by more than one orthology prediction method.
    • Example 1: aat-1 (from orthology file)
      • Homo sapiens ENSG00000013293 SLC7A14 Panther
      • Homo sapiens ENSG00000165349 SLC7A3 Panther
      • Homo sapiens ENSG00000139514 SLC7A1 Panther
      • Homo sapiens ENSG00000151012 SLC7A11 Panther
      • Homo sapiens ENSG00000103064 SLC7A6 Panther
      • Homo sapiens ENSG00000155465 SLC7A7 Panther
      • Homo sapiens ENSG00000130876 SLC7A10 Panther
      • Homo sapiens ENSG00000103257 SLC7A5 Panther;Inparanoid_8
      • Homo sapiens ENSG00000092068 SLC7A8 Inparanoid_8

In the above list, we would pick only ENSG00000103257, since it was predicted to be a human ortholog of aat-1 by more than one method, both Panther and Inparanoid_8.

  • Example 2: nipi-4 from orthology file
    • Homo sapiens ENSG00000120899 PTK2B Panther
    • Homo sapiens ENSG00000169398 PTK2 Panther

If every human gene was predicted by only one method, we have to list them all; here, list both PTK2B and PTK2.
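
The selection rule illustrated by the two examples above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `select_human_orthologs` and the tuple layout are assumptions, and the only detail taken from the orthology file is that multiple prediction methods are separated by ';'.

```python
def select_human_orthologs(candidates):
    """Pick human orthologs for the orthology sentence.

    candidates: list of (ensembl_id, symbol, methods) tuples, where
    `methods` is a ';'-separated string such as 'Panther;Inparanoid_8'.
    (This tuple layout is an assumption made for the sketch.)
    """
    def method_count(methods):
        return len([m for m in methods.split(";") if m.strip()])

    # Prefer orthologs predicted by more than one method.
    multi = [c for c in candidates if method_count(c[2]) > 1]
    # If no ortholog has more than one method, list them all.
    return multi if multi else list(candidates)


aat_1 = [
    ("ENSG00000103064", "SLC7A6", "Panther"),
    ("ENSG00000103257", "SLC7A5", "Panther;Inparanoid_8"),
    ("ENSG00000092068", "SLC7A8", "Inparanoid_8"),
]
nipi_4 = [
    ("ENSG00000120899", "PTK2B", "Panther"),
    ("ENSG00000169398", "PTK2", "Panther"),
]
print([symbol for _, symbol, _ in select_human_orthologs(aat_1)])   # ['SLC7A5']
print([symbol for _, symbol, _ in select_human_orthologs(nipi_4)])  # ['PTK2B', 'PTK2']
```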

Template of an Orthology sentence

  • <Worm Gene> is an ortholog of <human gene>.
  • <Worm Gene> is an ortholog of <human gene1>, <human gene2>, <human gene3> and <human gene4>.
  • We will use the HGNC name outside parentheses, as the gene name, and put the description inside the parentheses.

Example 1

  • mtp-18 is an ortholog of human <ENSG00000242114> and <ENSG00000249590>;
  • mtp-18 is an ortholog of human MTFP1 (mitochondrial fission process 1) and <no HGNC symbol/name> (Uncharacterized protein);

Resolves into

  • mtp-18 is an ortholog of human MTFP1 (protein mitochondrial fission process 1);

Example 2

  • marc-1 is an ortholog of human <ENSG00000183654>, <ENSG00000144583>, <ENSG00000139266>, <ENSG00000145416>, and <ENSG00000278545>;
  • marc-1 is an ortholog of human MARCH11 (membrane-associated ring finger (C3HC4) 11), MARCH4 (membrane-associated ring finger (C3HC4) 4, E3 ubiquitin protein ligase), MARCH9 (membrane-associated ring finger (C3HC4) 9), MARCH1 (membrane-associated ring finger (C3HC4) 1, E3 ubiquitin protein ligase) and MARCH8 (membrane-associated ring finger (C3HC4) 8, E3 ubiquitin protein ligase);

Example 3

  • poml-3 is an ortholog of human <ENSG00000005421>, <ENSG00000105852>, <ENSG00000105854>
  • poml-3 is an ortholog of human PON1 (paraoxonase 1), PON3 (paraoxonase 3), and PON2 (paraoxonase 2);

For WS247 upload

We used the BlastP hits file; if no ortholog was found there, we used the orthologs file.

1. Best BlastP hits file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS246.best_blastp_hits.txt.gz

For WS247 we used the following order:

1. Mapping of elegans genes to proteins: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS246.xrefs.txt.gz

Rules for orthology sentence construction

  • Rule 1: If the words 'family member' occur in the description with words both before and after them, ignore 'family member'
    • Examples:
    • 'aldehyde dehydrogenase 8 family member a1' becomes 'aldehyde dehydrogenase 8a1'
    • 'aldehyde dehydrogenase 9 family member a1' becomes 'aldehyde dehydrogenase 9a1'
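
Rule 1 can be sketched as a single regular-expression substitution. This is an illustration under assumptions: the function name `drop_family_member` is hypothetical, and joining the neighbouring tokens without a space is inferred from the two examples above rather than stated in the rule.

```python
import re


def drop_family_member(description):
    """Remove 'family member' when it has text on both sides.

    The tokens on either side are joined directly, as in the examples
    ('8 family member a1' -> '8a1'). Descriptions that start or end
    with 'family member' are left unchanged.
    """
    return re.sub(r"(\S)\s+family member\s+(\S)", r"\1\2", description)


print(drop_family_member("aldehyde dehydrogenase 8 family member a1"))
# aldehyde dehydrogenase 8a1
```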

Jul 2, 2014:

  • Rule 2: If the words 'human Uncharacterized protein' occur, ignore this homology
    • Examples:
    • ctg-1 encodes an ortholog of human Uncharacterized protein;
    • mtp-18 encodes an ortholog of human Uncharacterized protein;
  • Rule 3: If 2 or more of these words occur: 'family', 'subfamily', 'group', 'member', 'polypeptide' or 'class', ignore them and resolve as in the examples:
    • olfactory receptor, family 56, subfamily B, member 1 becomes olfactory receptor 56B1
    • potassium intermediate/small conductance calcium-activated channel, subfamily N, member 2 becomes potassium intermediate/small conductance calcium-activated channel N2
    • potassium inwardly-rectifying channel, subfamily J, member 12 becomes human potassium inwardly-rectifying channel J12
    • nuclear receptor subfamily 3, group C, member 2 becomes nuclear receptor 3C2
    • nuclear receptor subfamily 5, group A, member 2 becomes nuclear receptor 5A2
    • nuclear receptor subfamily 1, group H, member 4 becomes nuclear receptor 1H4
    • cytochrome P450, family 3, subfamily A, polypeptide 5 becomes cytochrome P450 3A5
    • cytochrome P450, family 21, subfamily A, polypeptide 2 becomes cytochrome P450 21A2
    • UDP glycosyltransferase 3 family, polypeptide A1 becomes UDP glycosyltransferase 3A1
    • mannosidase, alpha, class 1B, member 1 becomes human mannosidase, alpha, 1B1
    • phosphatidylinositol glycan anchor biosynthesis, class V stays as is, because the word 'class' occurs by itself.
    • scavenger receptor class B, member 2 becomes scavenger receptor B2
  • Rule 4: If the word 'homolog' co-occurs with a species name ('Drosophila', 'S. cerevisiae' or 'yeast') inside brackets, ignore 'homolog' and move the species name, without the brackets, to the front.
    • salvador homolog 1 (Drosophila) becomes Drosophila and human salvador 1
    • SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) becomes S. cerevisiae and human SSU72 RNA polymerase II CTD phosphatase
    • translocase of outer mitochondrial membrane 22 homolog (yeast) becomes yeast and human translocase of outer mitochondrial membrane 22
    • human vacuolar protein sorting 4 homolog B (S. cerevisiae) becomes S. cerevisiae and human vacuolar protein sorting 4
    • unconventional SNARE in the ER 1 homolog (S. cerevisiae) becomes S. cerevisiae and human unconventional SNARE in the ER 1
  • Rule 5: If the description field (the human protein name) contains '(C. elegans)' (these refer to an elegans gene, making the description circular), ignore the description field and use the HGNC symbol instead (an accession-number-to-symbol lookup is required).
    • Examples:
      • Data: WBGene00017948,mth-1,ENSEMBL:ENSP00000407190,ENSG00000166979,ENST00000435323,eva-1 homolog C (C. elegans) [Source:HGNC Symbol;Acc:13239],EVA1C,EVA1C,EVA1C-005,EVA1C,EVA 1 HOMOLOG C PRECURSOR FAM176C
      • Sentence: mth-1 encodes an ortholog of human EVA1C (HGNC:EVA1C);.
      • Data:WBGene00004895,smu-1,ENSEMBL:ENSP00000380336,ENSG00000122692,ENST00000397149,smu-1 suppressor of mec-8 and unc-52 homolog (C. elegans) [Source:HGNC Symbol;Acc:18247],SMU1,SMU1,SMU1-001,SMU1,WD40 REPEAT CONTAINING SMU1 SMU 1 SUPPRESSOR OF MEC 8 AND UNC 52 HOMOLOG
      • Sentence: smu-1 encodes an ortholog of human SMU1 (HGNC:SMU1);.
      • Data: WBGene00044079,tag-241,ENSEMBL:ENSP00000287482,ENSG00000156876,ENST00000287482,spindle assembly 6 homolog (C. elegans) [Source:HGNC Symbol;Acc:25403],SASS6,SASS6,SASS6-001,SASS6,SPINDLE ASSEMBLY ABNORMAL 6 HOMOLOG
      • Sentence: tag-241 encodes an ortholog of human SASS6 (HGNC:SASS6);.
  • Rule 6: Ignore the text string '<numeral>kDa' at the end of terms, together with the comma that precedes it, including within parentheses.
    • cleavage and polyadenylation specific factor 4, 30kDa becomes cleavage and polyadenylation specific factor 4
    • cleavage stimulation factor, 3' pre-RNA, subunit 1, 50kDa becomes cleavage stimulation factor, 3' pre-RNA, subunit 1
    • nucleoporin 153kDa becomes nucleoporin
  • Rule 7: Grab the HGNC symbol and place it at the end of the sentence, within parentheses, as ', (HGNC symbol:<HGNC symbol>)'
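
Two of the rules above (Rule 4 and Rule 6) lend themselves to a short sketch. The function names and regular expressions below are assumptions made for illustration, not the project's code. The Rule 6 sketch handles only a 'kDa' suffix at the very end of the description, not the within-parentheses case, and the Rule 4 sketch handles only a bracketed species at the end of the description.

```python
import re

# Rule 6: a trailing '<numeral>kDa', with the comma before it if present.
KDA_SUFFIX = re.compile(r",?\s*\d+(?:\.\d+)?\s*kDa\s*$")

# Rule 4: 'homolog' plus one of the three listed species in brackets at the end.
SPECIES_HOMOLOG = re.compile(
    r"^(?P<name>.*\bhomolog\b.*?)\s*"
    r"\((?P<species>Drosophila|S\. cerevisiae|yeast)\)\s*$"
)


def strip_kda(description):
    """Drop a trailing molecular-weight suffix such as ', 30kDa'."""
    return KDA_SUFFIX.sub("", description)


def move_homolog_species(description):
    """Drop 'homolog' and move the bracketed species to the front."""
    match = SPECIES_HOMOLOG.match(description)
    if not match:
        return description
    name = re.sub(r"\s*\bhomolog\b", "", match.group("name")).strip()
    return f"{match.group('species')} and human {name}"


print(strip_kda("cleavage and polyadenylation specific factor 4, 30kDa"))
# cleavage and polyadenylation specific factor 4
print(move_homolog_species("salvador homolog 1 (Drosophila)"))
# Drosophila and human salvador 1
```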

How to pick orthologs for the description

When too many C. elegans genes are listed as orthologs for a non-elegans gene, they are pruned using popularity (number of publications) and gene class name:

  • If there are more than 5 orthologs in the form of gene names (e.g., abu-6, abu-7, abu-8), use popularity and gene class to prune
  • Genes that have no other members of their class are picked and mentioned first, ordered by popularity; if tied on popularity, order alphabetically
  • If there is only one gene class, use popularity to pick the top 3 genes, order by popularity, if tied, use numerical
  • If there is more than one gene class, use popularity to pick the top gene in each class, meaning you would name the leading (in popularity) gene class first; if tied, order alphabetically; in both cases
  • If the orthologs are cosmid names (e.g., C54D10.9), list them all, as gene class cannot be used
  • If both genes and cosmids are present, use the gene class and popularity rules for the genes and leave the cosmids as is
  • If both popularity and gene class cannot be used, leave as such
  • If genes are tied by popularity, then stop at 5 genes.
 
Example 1:
WBGene00023726
CBG00317 is an ortholog of C. elegans fbxc-16, fbxc-15, fbxc-18, sdz-4, fbxc-28, fbxc-19 and fbxc-12.

Resolves to:
CBG00317 is an ortholog of C. elegans sdz-4 and members of the fbxc gene class including fbxc-28, fbxc-15 and fbxc-18.


Example 2:
Cbr-myo-2 is an ortholog of C. elegans myo-1, myo-2, myo-3, nmy-1, nmy-2, unc-54, myo-6, hum-9 and myo-5; based on protein domain information, Cbr-myo-2 is predicted to have ATP binding activity and motor activity and is localized to the myosin complex.

Resolves to:
Cbr-myo-2 is an ortholog of C. elegans unc-54, hum-9 and members of the C. elegans myo and nmy gene classes including myo-3 and nmy-2;

Example 3:
WBGene00052523
CRE27108 is an ortholog of C. elegans E03H4.2, ZK1025.3, C13A2.9, C13A2.10, C33G8.13, F07G11.1 and F07G11.2.

CRE27108 is an ortholog of C. elegans E03H4.2, ZK1025.3, C13A2.9, C13A2.10, C33G8.13, F07G11.1 and F07G11.2.

Cannot be resolved, will have to stay as such!

Example 4:
CBG03582 is an ortholog of C. elegans fbxb-9, fbxb-102, fbxb-117, fbxb-76, fbxb-79, fbxb-85, fbxb-1, fbxb-115, fbxb-2, fbxb-86, fbxb-22, sdz-33, fbxb-24, fbxb-101, fbxb-93, fbxb-94, fbxb-13, fbxb-98, fbxb-38, C33E10.1, fbxb-88, fbxb-95, fbxb-97, fbxb-96, fbxb-78, fbxb-80, fbxb-35, fbxb-36, fbxb-108, fbxb-105, sdz-10, sdz-9, fbxb-111, sdz-11, F18A12.7, fbxb-106, fbxb-107, F36H5.8, F36H5.9, fbxb-12, fbxb-71, fbxb-48, fbxb-28, fbxb-29, fbxb-30, fbxb-103, fbxb-25, fbxb-26, fbxb-17, fbxb-18, fbxb-19, fbxb-47, fbxb-72, fbxb-49, fbxb-50, fbxb-51, K05F6.4, fbxb-44, fbxb-52, fbxb-54, K05F6.8, fbxb-46, fbxb-39, fbxb-41, fbxb-40, fbxb-45, sdz-25, fbxb-31, fbxb-32, fbxb-74, fbxb-75, fbxb-10, fbxb-37, fbxb-81, fbxb-82, fbxb-83, fbxb-84, fbxb-34, fbxb-14, fbxb-99, fbxb-77, fbxb-42, fbxb-43, fbxb-15, fbxb-21, fbxb-20, fbxb-16, fbxb-87, fbxb-92, fbxb-33, fbxb-91, fbxb-90, W08F4.13 and F49B2.7;


Resolves to:
CBG03582 is an ortholog of members of the C. elegans fbxb and sdz gene classes including fbxb-x and sdz-x, and C33E10.1, F18A12.7, F36H5.8, F36H5.9, K05F6.4, K05F6.8, W08F4.13 and F49B2.7;
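
The first step of the pruning described above, separating cosmid-style names from named genes and grouping the named genes by gene class, might look like the sketch below. The popularity ranking is not shown because it needs publication counts that are not given here. The function name `group_orthologs`, the cosmid-name pattern, and taking the gene class as the text before the last hyphen are all assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Assumed pattern for cosmid-style names such as C33E10.1 or ZK1248.17.
COSMID_NAME = re.compile(r"^[A-Z][A-Z0-9]*\.\d+$")


def group_orthologs(names):
    """Split ortholog names into gene classes and cosmid-style names.

    Returns (classes, cosmids): `classes` maps a gene class (e.g. 'fbxc')
    to its members in input order; `cosmids` lists the names that cannot
    be assigned to a gene class.
    """
    classes = defaultdict(list)
    cosmids = []
    for name in names:
        if COSMID_NAME.match(name):
            cosmids.append(name)
        else:
            gene_class = name.rsplit("-", 1)[0]
            classes[gene_class].append(name)
    return dict(classes), cosmids


classes, cosmids = group_orthologs(
    ["fbxc-16", "fbxc-15", "sdz-4", "C33E10.1", "fbxc-28"]
)
print(classes)  # {'fbxc': ['fbxc-16', 'fbxc-15', 'fbxc-28'], 'sdz': ['sdz-4']}
print(cosmids)  # ['C33E10.1']
```

Classes with a single member (here sdz) correspond to the genes that are mentioned first; the cosmid-style names are listed unchanged.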

Grouping human genes into families for the C.elegans descriptions

  1. Use the HGNC gene family dataset found here http://www.genenames.org/cgi-bin/statistics

File: ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt

  1. Group human genes by their family names, when 3 or more human orthologs are present, mention the first member
  2. If more than one family is present, include the first member from each family
  3. If a gene does not fall into any human gene family, leave as is, and mention this gene first, with the word 'human' before it.
  4. If a gene is the only member of a human gene family, mention the gene with the word 'human' before it, and not the family.
  5. From the file, use columns 'Gene family tag' and 'Gene family description'
  6. Write the sentence using 'Gene family tag' first with the word 'human' before it and then the 'Gene family description' in parentheses
  7. If no 'Gene family tag' exists, use only 'Gene family description', do not put this in parentheses, and add the word 'family' after it; if the 'Gene family description' ends with 'family', do not add the word 'family' after it.



Template 1: If both gene family tag and gene family description are available:
Sentence:
<C. elegans gene> is an ortholog of human <gene1 that did not cluster> and members of the <human gene family tag> <(human gene family description)> family.

Template 2: If no gene family tag is available, do not use parentheses for the gene family description:
Sentence:
<C. elegans gene> is an ortholog of human <gene1 that did not cluster> and members of the <human gene family description> family.

Template 3: If no gene family tag is available, and if the gene family description ends in 'family', do not add 'family' after it.
Sentence:
<C. elegans gene> is an ortholog of human <gene1 that did not cluster> and members of the <human gene family description that ends in 'family'>.
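
The choice between the three templates comes down to how the family phrase is built from the 'Gene family tag' and 'Gene family description' columns. A minimal sketch follows; the function name `family_phrase` is hypothetical, and treating a description that ends in the literal string 'family' as already complete is one reading of Template 3.

```python
def family_phrase(tag, description):
    """Build the '<...> family' phrase used in the orthology sentence.

    tag: value of 'Gene family tag' (empty string or None if absent).
    description: value of 'Gene family description'.
    """
    description = description.strip()
    if tag:
        # Template 1: tag first, description in parentheses.
        return f"{tag} ({description}) family"
    if description.lower().endswith("family"):
        # Template 3: do not add a second 'family'.
        return description
    # Template 2: description only, no parentheses.
    return f"{description} family"


print(family_phrase("RBM", "RNA binding motif containing"))
# RBM (RNA binding motif containing) family
```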


1. Example his-6:
his-6 is an ortholog of human CENPA (centromere protein A), HIST1H3H (histone cluster 1, H3h), HIST3H3 (histone cluster 3, H3) and HIST2H3A (histone cluster 2, H3a);  

Resolves to:
his-6 is an ortholog of human CENPA (centromere protein A) and members of the histones family including HIST1H3H (histone cluster 1, H3h);

2. Example hrp-1
hrp-1 is an ortholog of human HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2), HNRNPA1 (heterogeneous nuclear ribonucleoprotein A1), HNRNPA3 (heterogeneous nuclear ribonucleoprotein A3) and HNRNPA2B1 (heterogeneous nuclear ribonucleoprotein A2/B1);

Resolves to:
hrp-1 is an ortholog of members of the human RBM (RNA binding motif containing) family including HNRNPA1L2 (heterogeneous nuclear ribonucleoprotein A1-like 2);

Explanation of Orthology/homology information in WormBase

Orthology, Homology and Paralog data in WormBase

Process

Template for a process sentence

<Gene> is (required, functions, regulates, is involved in, is part of) <process>;

Source file for Process data

  • Only ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS252/ONTOLOGY/gene_association.WS252.wb.c_elegans needs to be used as a source file, since all of the automated annotations, InterPro2GO (all IEA) and Phenotype2GO (all IEA), are in this file.
  • No longer relevant: Source 1: gene_association file for C.elegans from the WormBase FTP site:
  • No longer relevant: Source 2: (from WS250): phenotype2go file, these automated annotations are now 'IEAs', but will be treated like IMPs, if they have the 'WBPhenotype:XXXXXXX' in column 8 (with)
    • this file is current with the release label for the phenotype2go annotations
    • Source 1 and 2 will have redundant annotations but these will get resolved
    • All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
    • Need data from these rows:
      • where column 9: has value 'P' (Process),
      • column 2 (DB_Object ID): the associated genes, UniProt ID or WBGene ID,
      • column 3 (DB_Object symbol), eg, wht-7, GO terms are from column 5.
      • column 5: GOID, eg, GO:0000346
      • column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
      • column 7: Evidence code, eg, IMP
      • column 8: With, eg. 'WB:WBRNAi00000785|WBPhenotype:0000050'
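
Reading the columns listed above from a gene_association file could be sketched as follows. The file is tab-delimited with comment lines starting with '!'; the function name `read_process_rows` and the dictionary keys are assumptions, and the sample row uses a placeholder gene ID that is not taken from any real file.

```python
def read_process_rows(lines):
    """Yield-style reader: return Process ('P') rows as dictionaries.

    lines: any iterable of text lines, e.g. an open file handle.
    Column numbers in the comments are 1-based, as in the list above.
    """
    rows = []
    for line in lines:
        if line.startswith("!"):          # header/comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 15 or fields[8] != "P":   # column 9: aspect
            continue
        rows.append({
            "gene_id": fields[1],       # column 2: DB_Object ID
            "symbol": fields[2],        # column 3: DB_Object symbol
            "go_id": fields[4],         # column 5: GO ID
            "reference": fields[5],     # column 6: DB:Reference
            "evidence": fields[6],      # column 7: evidence code
            "with": fields[7],          # column 8: With
            "assigned_by": fields[14],  # column 15: assigned by
        })
    return rows


# Illustrative row only; 'WBGene00000000' is a placeholder ID.
sample = "\t".join([
    "WB", "WBGene00000000", "wht-7", "", "GO:0000346", "PMID:12062106",
    "IMP", "", "P", "", "", "gene", "taxon:6239", "20140707", "WB",
])
print(read_process_rows(["!gaf-version: 2.0\n", sample + "\n"])[0]["go_id"])
# GO:0000346
```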

Rules for process sentence construction

  • Rule 1: Ignore all GO terms with the tag 'is_obsolete: true' in the obo file
  • Rule 2: Ignore all IEA and ISS process terms
  • Rule 3: For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
  • Rule 4: For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
    • Examples:
    • Data: WBGene00001972,hmg-1.2,cell fate specification[IMP],WB_REF:WBPaper00003453|PMID:10197987,,WB,gonad development[IMP],Wnt signaling pathway[IGI],vulval development[IMP]
    • Sentence: hmg-1.2 is involved in cell fate specification, gonad development and vulval development, based on mutant phenotypes.
    • Data: WBGene00002148,gon-14,muscle contraction[IMP],WB_REF:WBPaper00024213|PMID:15133127,,WB,defecation[IGI],gonad morphogenesis[IMP],multicellular organism growth[IGI],regulation of pharyngeal pumping[IMP]
    • Sentence: gon-14 is involved in muscle contraction, gonad morphogenesis and regulation of pharyngeal pumping, based on mutant phenotypes and in defecation and growth.
  • Rule 5: No exclusions as of 07.07.2014, leave in reproduction:
  • Rule 6: If a GO term contains the words 'involved in' anywhere (beginning, middle or end), use the words 'functions in'
    • Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
    • Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
  • Rule 7: For all other Process terms the sentence will be:
    • <Gene> is involved in <process term>;
    • Examples:
    • Data: WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
    • Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
  • Rule 8: For the GO term 'molting cycle, collagen and cuticulin-based cuticle', drop the comma and the words 'collagen and cuticulin-based cuticle'
    • Example:
    • Data: WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
    • Sentence: vps-45 is involved in the molting cycle;
  • Rule 9: Replacement rule:
    • Replace term 'multicellular organism growth' with 'growth'.
    • Replace term 'embryo development ending in birth or egg hatching' with 'embryo development'
    • Replace term 'synaptic transmission, <word>' with '<word> synaptic transmission'.
      • Example Data: WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
      • Sentence: nra-2 is involved in cholinergic synaptic transmission and locomotion.
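
The term rewrites in Rules 6 to 9 can be sketched as a small lookup plus one pattern. This is only an illustration, not the production script: the function names are hypothetical, and the sketch does not add the article 'the' that appears in the vps-45 example sentence.

```python
import re

# Fixed replacements taken from Rules 8 and 9 above.
FIXED_REPLACEMENTS = {
    "molting cycle, collagen and cuticulin-based cuticle": "molting cycle",
    "multicellular organism growth": "growth",
    "embryo development ending in birth or egg hatching": "embryo development",
}


def normalize_process_term(term):
    """Apply the Rule 8 and Rule 9 rewrites to a single GO process term."""
    term = term.strip()
    if term in FIXED_REPLACEMENTS:
        return FIXED_REPLACEMENTS[term]
    # 'synaptic transmission, <word>' becomes '<word> synaptic transmission'
    match = re.fullmatch(r"synaptic transmission, (\w+)", term)
    if match:
        return f"{match.group(1)} synaptic transmission"
    return term


def process_verb(term):
    """Rule 6 versus Rule 7: choose the phrase that precedes the term."""
    return "functions in" if "involved in" in term else "is involved in"
```

For the nra-2 data above, `normalize_process_term(" synaptic transmission, cholinergic")` gives 'cholinergic synaptic transmission'.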

Rule 10: Granularity rule:

  • If a GO term in the list is a parent of (is_a or part_of parent) another GO term in the list, keep the most granular child term and ignore the parent terms.
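
A minimal sketch of the granularity rule, assuming the is_a and part_of parents have already been read from the GO obo file into a dictionary. The dictionary layout and the function name are assumptions made for illustration.

```python
def drop_parent_terms(terms, parents):
    """Remove every term that is an ancestor of another term in the list.

    parents maps a term to the set of its direct is_a/part_of parents.
    """
    def ancestors(term):
        # Walk upwards through the parent map, collecting all ancestors.
        seen = set()
        stack = list(parents.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        return seen

    term_set = set(terms)
    redundant = set()
    for term in term_set:
        redundant |= ancestors(term) & term_set
    # Keep the original order of the surviving (most granular) terms.
    return [term for term in terms if term not in redundant]
```

With placeholder terms, `drop_parent_terms(["T_parent", "T_child"], {"T_child": {"T_parent"}})` keeps only 'T_child'.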

Adding elegans process information for non-C. elegans nematodes

  • From WS249: Write process sentences based on experimental process info from elegans, only for those non-elegans genes that have no GO sentences

  • Use only non-IEA, non-ISS process terms from elegans
  • If only one ortholog exists, use all non-IEA, non-ISS, elegans process terms
  • If more than one ortholog exists, use only the common non-IEA, non-ISS, elegans process terms from the orthologs
  • Sentence construction:

Cbr-ajm-1 is an ortholog of C. elegans ajm-1 which is involved in cell-cell junction organization and embryo development ending in birth or egg hatching.
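
The ortholog rules above reduce to a set intersection over the experimental terms of each elegans ortholog. The sketch below assumes the annotations are already grouped per ortholog as (term, evidence code) pairs; that input shape and the function name are hypothetical.

```python
NON_EXPERIMENTAL = {"IEA", "ISS"}


def transferable_process_terms(ortholog_annotations):
    """Return the elegans process terms to use for a non-elegans gene.

    ortholog_annotations maps each elegans ortholog to a list of
    (process term, evidence code) pairs.
    """
    term_sets = [
        {term for term, code in annotations if code not in NON_EXPERIMENTAL}
        for annotations in ortholog_annotations.values()
    ]
    if not term_sets:
        return set()
    # One ortholog: all of its terms. Several orthologs: only the shared terms.
    return set.intersection(*term_sets)
```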

Molecular function/identity

Source file for molecular function data

    • gene_association file for C.elegans from the WormBase FTP site:
    • ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
    • All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
    • Need data from these rows:
      • where column 9 has value 'F' (Molecular Function)
      • column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
      • column 3: DB_Object symbol, eg, wht-7,
      • column 5: GOID, eg, GO:0000346
      • column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
      • column 7: Evidence code, eg, IMP
      • column 8: 'With (or) From' eg., INTERPRO:IPR002293,
      • column 15: Assigned By, eg., WB (which database created the annotation)
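
A sketch of pulling the listed columns out of the gene_association file, assuming the file is tab-delimited with comment lines starting with '!'. The function name and dictionary keys are hypothetical. The UniProt to WormBase ID translation is not shown.

```python
import csv
import io


def molecular_function_rows(gaf_text):
    """Collect the columns needed for Molecular Function sentences."""
    rows = []
    reader = csv.reader(io.StringIO(gaf_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("!"):
            continue  # skip blank lines and header comments
        if len(fields) < 15 or fields[8] != "F":
            continue  # column 9 must be 'F' (Molecular Function)
        rows.append({
            "object_id": fields[1],     # column 2: UniProt ID or WormBase gene ID
            "symbol": fields[2],        # column 3: DB_Object symbol
            "go_id": fields[4],         # column 5: GO ID
            "reference": fields[5],     # column 6: DB:Reference
            "evidence": fields[6],      # column 7: evidence code
            "with_from": fields[7],     # column 8: With (or) From
            "assigned_by": fields[14],  # column 15: Assigned By
        })
    return rows
```

The same loop can serve the Cellular Component data by testing column 9 for 'C' instead of 'F'.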

Rules for molecular function sentence construction

  • Rule 1: Ignore all GO terms with the tag 'is_obsolete: true'
  • Rule 2: Exclusion list:
    • Ignore the term 'protein binding'
    • Ignore the term 'binding'
  • Rule 3: For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
  • Rule 4: Order the experimental GO terms first in the sentence followed by ISS and IEA terms.
  • Rule 5: If the evidence code is 'IEA' and 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
    • Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
  • Examples:
    • WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
    • alh-2 is predicted to have oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, based on protein domain information.
  • Rule 6: If Evidence code is 'ISS' add the words 'based on sequence information' to the sentence
    • WBGene00001951,hlh-4,sequence-specific DNA binding transcription factor activity[ISS],WB_REF:WBPaper00034761|PMID:19632181,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
    • Sentence: hlh-4 is predicted to have sequence-specific DNA binding transcription factor activity based on sequence information, protein dimerization activity and DNA binding activity, based on protein domain information.
  • Rule 7: If a binding term is present add the word 'activity' to it.
  • Rule 8: If the evidence code is IDA, IMP, IGI, IPI, or IEP add the words 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
    • IDA example:
    • WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
    • Sentence: hlh-6 exhibits RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity and is predicted to have protein dimerization activity and DNA binding activity.
    • IMP example:
    • WBGene00009583,aagr-3,alpha-glucosidase activity[IMP],WB_REF:WBPaper00036069|PMID:20349118,,WB,hydrolase activity, hydrolyzing O-glycosyl compounds[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000322,catalytic activity[IEA],INTERPRO:IPR011013,carbohydrate binding[IEA]
    • Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.
  • Rule 9: For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a structural constituent of <word>' in the sentence.
  • Example:
    • WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
    • Sentence: mrpl-36 is a structural constituent of ribosome, based on protein domain information.
    • WBGene00010783,mrpl-36,structural constituent of ribosome[ISS]
    • Sentence: mrpl-36 is a structural constituent of ribosome, based on sequence information.
    • WBGene00010783,mrpl-36,structural constituent of ribosome[IMP] or [IDA]
    • Sentence: mrpl-36 is a structural constituent of ribosome.
  • Rule 10: Replacement
    • For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')
  • Rule 11: If a GO term has a comma in it, then add a comma after this GO term and the next ‘and’.
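
Rules 2 to 4 above (exclusion list, preferring experimental evidence for identical terms, and ordering) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and evidence codes other than the ones named in the rules are simply sorted last.

```python
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
EXCLUDED_TERMS = {"protein binding", "binding"}


def evidence_rank(code):
    """Experimental codes sort first, then ISS, then IEA, then anything else."""
    if code in EXPERIMENTAL:
        return 0
    if code == "ISS":
        return 1
    if code == "IEA":
        return 2
    return 3


def order_function_terms(annotations):
    """Deduplicate (term, code) pairs and order them for the sentence."""
    best = {}
    for position, (term, code) in enumerate(annotations):
        if term in EXCLUDED_TERMS:
            continue
        key = (evidence_rank(code), position)
        # For identical terms keep the annotation with the strongest evidence.
        if term not in best or key < best[term][0]:
            best[term] = (key, code)
    ordered = sorted(best.items(), key=lambda item: item[1][0])
    return [(term, code) for term, (_, code) in ordered]
```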

Sub-cellular localization

Source file for Sub-cellular localization data

  • GO data
  • Source 1: gene_association file for C. elegans at:http://www.geneontology.org/GO.current.annotations.shtml?all
    • Need data from these rows:
      • where column 9 has value 'C' (Cellular Component)
      • column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
      • column 3: DB_Object symbol, eg, wht-7,
      • column 5: GOID, eg, GO:0000346
      • column 7: Evidence code, eg, IDA

Rules for sub-cellular localization sentence construction

  • Rule 1: Ignore GO terms with the tag 'is_obsolete: true' in the obo file
  • Rule 2: Ignore all IEA and ISS GO terms, use only non-IEA, non-ISS GO terms
  • Rule 4: Ignore IBA and IBD GO terms (PAINT annotations)
  • Rule 5: For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
  • Rule 6: For 'integral component of ....' terms add the words 'is an';
    • Eg. for Rule 6: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
    • sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
  • Examples
    • WBGene00023405,sor-1,nucleoplasm[IDA],WB_REF:WBPaper00027128|PMID:16501168,,WB,nuclear speck[IDA]
    • Sentence: sor-1 is localized to the nucleoplasm and nuclear speck;
    • WBGene00004681,rsd-2,nucleolus[IDA],WB_REF:WBPaper00044261|PMID:18430922,,WB,endoplasmic reticulum[IDA],cytosol[IDA]
    • Sentence: rsd-2 is localized to the nucleolus, endoplasmic reticulum and cytosol;
    • WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
    • Sentence: dnc-6 is localized to the dynactin complex;
  • Rule 7: For the GO term 'intracellular [IEA]' structure of the sentence will be different, use 'is intracellular'.
    • Eg.1 WBGene00089742
    • PPA00188 is an ortholog of C. elegans T26A5.8; based on protein domain information, PPA00188 is predicted to have sequence-specific DNA binding activity and protein heterodimerization activity and is intracellular.
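
A rough sketch of the localization sentence (Rules 2, 4 and 6 only). The function names are hypothetical and Rule 7 ('is intracellular') is not handled. Note that the sketch always writes 'localized to the', whereas the sup-10 example above has no article.

```python
EXCLUDED_CODES = {"IEA", "ISS", "IBA", "IBD"}


def join_terms(terms):
    """Join terms as 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def localization_sentence(gene, annotations):
    """Build the sub-cellular localization sentence from (term, code) pairs."""
    kept = []
    for term, code in annotations:
        if code not in EXCLUDED_CODES and term not in kept:
            kept.append(term)
    integral = [t for t in kept if t.startswith("integral component of")]
    located = [t for t in kept if not t.startswith("integral component of")]
    clauses = []
    if integral:
        clauses.append("is an " + join_terms(integral))
    if located:
        clauses.append("is localized to the " + join_terms(located))
    if not clauses:
        return ""
    return f"{gene} " + " and ".join(clauses) + ";"
```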

Order of sentences

  1. Orthology
  2. GO Process
  3. GO Function/identity
  4. Tissue expression
  5. GO Cell Component

When a gene has no automated description or has only orthology data, the order will be:

  1. Orthology
  2. Expression cluster data - gene regulation
  3. Expression cluster data - molecule regulation
  4. Expression cluster data - tissue enrichment (anatomy)

Tissue expression

Source files for Tissue expression data

Previously: Source files for Tissue expression data

  • Source 1: Expression data
  • OA (exprpat), PG table names:
    • for all these tables, where the exp_endogenous table has the value 'endogenous', grab the exp_gene, WBGeneXXXXXXXX
    • exp_name, values look like Expr1005.
    • exp_anatomy for anatomy terms, in the form of WBbt:0003679, translate these IDs using the anatomy ontology file
    • anatomy ontology file from ftp site: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
    • exp_paper for paper
    • exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'.
  • Contact Person: Daniela


Rules for tissue expression sentence construction

Rule 1: Ignore all data with the qualifier 'Uncertain' and 'NOT'; all data that has the qualifiers 'Certain' and 'Partial' as well as data with no qualifiers will be used.

Explanation for Certain, Partial, Uncertain and NOT, as used in curation (Daniela Raciti)


Partially expressed:
Gene A was observed to be expressed in some cells of a group of cells that include Y. Example 1: "Expressed in 4-5 pairs of amphid neurons." You should select amphid neuron in the ’Partially Expressed in’ box.
Example 2: "Expressed in the anterior intestine." Select Intestine in the ’Partially Expressed in’ box.
 
Uncertain or Possibly Expressed:
Gene A was sometimes observed to be expressed in cell Y OR Gene A was observed to be expressed in a cell that could be Y.
Example 1: "Occasional expression of DDL-2 in one adult intestinal cell." You should select intestinal cell in the ’Possibly expressed in’ box.
Example 2: "Expression was observed less frequently in the PVPL/R interneurons." You should select PVPL and PVPR in the ’Possibly expressed in’ box.

The ‘Not expressed in’ field should contain information about where the gene product is Certainly NOT expressed in.

For automated descriptions only data that has the qualifiers 'Certain' and 'Partial' as well as data with no qualifiers will be used.

Rule 2: Pick an anatomy term only once

  • Sentence: <Gene> is expressed in the <anatomy term1, anatomy term2 and anatomy term3>;
  • Examples:
*Data for alh-10: 
*WBGene00000116	alh-10	Expr5583	nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525	Endogenous
*Sentence: alh-10 is expressed in the nervous system, intestine and tail neuron;

*Data for asp-5:
*WBGene00000218	asp-5	Expr5817	intestine	WBPaper00031006,WBPaper00006525	Endogenous
*WBGene00000218	asp-5	Expr4352	intestine	WBPaper00028802	Endogenous
*Sentence: asp-5 is expressed in the intestine;

*Data for ccr-4:
*WBGene00000376	ccr-4	Expr4479	pharynx	WBPaper00027076	Endogenous
*WBGene00000376	ccr-4	Expr11132	male,hermaphrodite,somatic cell,germ line	WBPaper00043886	Endogenous
*WBGene00000376	ccr-4	Expr4479	hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle	WBPaper00027076	Endogenous
*WBGene00000376	ccr-4	Expr7174	pharynx,hypodermis,seam cell	WBPaper00031006,WBPaper00006525	Endogenous
*WBGene00000376	ccr-4	Expr4480	pharynx,body wall musculature,head neuron,tail neuron	WBPaper00027076	Endogenous
*WBGene00000376	ccr-4	Expr4480	hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
*Sentence: ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
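
Rules 1 and 2 can be sketched as a qualifier filter followed by an order-preserving de-duplication. The row layout (a list of anatomy terms plus a qualifier) and the function name are assumptions. The replacement, granularity and grouping rules are not applied here.

```python
USABLE_QUALIFIERS = {"certain", "partial", ""}


def expression_sentence(gene, rows):
    """Build the tissue expression sentence.

    rows is a list of (anatomy terms, qualifier) pairs, one per Expr row;
    the qualifier may be None or empty when no qualifier was curated.
    """
    terms = []
    for anatomy_terms, qualifier in rows:
        # Rule 1: drop 'Uncertain' and 'NOT' data.
        if (qualifier or "").strip().lower() not in USABLE_QUALIFIERS:
            continue
        # Rule 2: pick each anatomy term only once.
        for term in anatomy_terms:
            if term not in terms:
                terms.append(term)
    if not terms:
        return ""
    if len(terms) == 1:
        listed = terms[0]
    else:
        listed = ", ".join(terms[:-1]) + " and " + terms[-1]
    return f"{gene} is expressed in the {listed};"
```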
  • Rule 3: Replacement Rule
  • 1: If the anatomy term 'cell' occurs by itself, then use the words 'widely expressed' instead of 'cell'.
    • sentence: col-178 is expressed in the Cell;
    • Becomes: col-178 is expressed widely.
  • 2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
    • Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
    • Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.
  • 3: If the anatomy term 'cell' is present with either the term 'hermaphrodite' or the term 'male'
    • Sentence: <Gene> is expressed in several tissues and in the hermaphrodite.
    • Sentence: <Gene> is expressed in several tissues and in the male.
  • 4. If the anatomy terms 'cell, hermaphrodite and male' are present:
    • Sentence: <Gene> is expressed in several tissues and in the hermaphrodite and the male.
  • 5. If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
    • Sentence: ceh-82 is expressed in the neuron;
    • Becomes: ceh-82 is expressed in the nervous system;
  • 6: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, except for the exceptions:
    • Exceptions:
    • I3 neuron
    • I4 neuron
    • I5 neuron
    • I6 neuron
    • M1 neuron
    • M4 neuron
    • M5 neuron
    • MI neuron
    • Sentence: nhr-194 is expressed in the amphid neuron, ciliated neuron, head neuron, and the sensory neuron;
    • Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;
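
The two neuron replacements (items 5 and 6) can be sketched like this. Matching on a trailing ' neuron' is a simplifying assumption, and the function name is hypothetical.

```python
# Terms that stay singular, copied from the exception list above.
SINGULAR_NEURONS = {
    "I3 neuron", "I4 neuron", "I5 neuron", "I6 neuron",
    "M1 neuron", "M4 neuron", "M5 neuron", "MI neuron",
}


def apply_neuron_rules(terms):
    """Apply the 'neuron' replacement and pluralization rules to a term list."""
    # Item 5: 'neuron' as the only anatomy term becomes 'nervous system'.
    if terms == ["neuron"]:
        return ["nervous system"]
    result = []
    for term in terms:
        # Item 6: pluralize '<something> neuron' unless it is an exception.
        if term.endswith(" neuron") and term not in SINGULAR_NEURONS:
            result.append(term + "s")
        else:
            result.append(term)
    return result
```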

Rule 5: Granularity Rule: When a string of cell names is present, check whether any parent terms are present in any ExprXXX for that gene, and keep the highest common parent term available.

Rule 6: Clustering/Grouping: Group terms according to their immediate parent terms, and use the parent term in the sentence. If a term cannot be grouped, leave as such in sentence.

    • Needs an input file of anatomy term annotations and immediate parents
      • Will contain Gene Id, Gene, anatomy terms, immediate 'instance_of' parent terms

Expression cluster data

  • Contact person for data-related questions: Wen Chen
  • This data will be used when there is no automated description or when there is only orthology data
  • Order for concatenation: gene regulation, molecule regulation, enrichment in tissues (anatomy)

Location of files

Anatomy expression cluster data

  1. Include all data for all genes
  2. Data is in 5 columns: Gene ID, Public Name, Relationship, Anatomy name, Experiment Type
  3. Take the WBGeneID (Column 1) and then map this to the most current Brugia malayi gene name from our latest gene names list
  4. Expand the experiment type 'RNA seq' to 'RNA sequencing'
  5. Template: <Experiment Type> indicates that <gene> <relationship> in the <Anatomy name/s>.
Example 1:
Data:WBGene00220481	Bm220	Enriched in	bodywall	proteomic study
Sentence: Proteomic studies indicate that Bm220 is enriched in the body wall.

Example 2:
Data:WBGene00222508	Bm2247	Enriched in	reproductive tract	proteomic study
Sentence: Proteomic studies indicate that Bm2247 is enriched in the reproductive tract.

Example 3:
Data:WBGene00222254	Bma-ckk-1	Enriched in	digestive tract	proteomic study
Sentence: Proteomic studies indicate that Bma-ckk-1 is enriched in the digestive tract.

Rules for sentence construction for C. elegans genes

  1. Use this data when no data is present for a gene automated description.
  2. Use this data when the automated description has only orthology data.
  3. Use this data when the automated description has no tissue expression data
  4. When the data is used, use all the rules for tissue expression pattern data
  5. Group all neurons together in the sentence constructed.
  6. When only 'neuron' is present construct the following: <Experiment type> studies indicate that <gene> is enriched in neurons.
  7. When neuron is present with other specific neuron types, group 'neuron' and specific neuron types together so that the following sentence can be constructed: <Experiment type> studies indicate that <gene> is enriched in neurons including <specific neuron names>.
  8. Expand the experiment type 'RNA seq' to 'RNA sequencing'.
  9. Pluralize 'coelomocyte' to 'coelomocytes'.
  10. In general pluralize cell names, add the definite article 'the' where necessary. Exceptions: 'anchor cell'
  11. For the term 'male-specific', modify to 'male-specific tissues' and place last when other anatomy terms are present.
  12. For the term 'hermaphrodite-specific' modify to 'hermaphrodite-specific tissues' and place last when other anatomy terms are present.

Template: <Experiment type> studies indicate that <gene> is enriched in <Anatomy names>.


Example 1:
Data: WBGene00022866	ZK1240.1	Enriched in	neuron	tiling array
Sentence: Tiling array studies indicate that ZK1240.1 is enriched in neurons.

Example 2:
Data: 
WBGene00000048	acr-9	Enriched in	GABAergic neuron,DA neuron,VA neuron,neuron,AVE,coelomocyte,ventral nerve cord	microarray,tiling array
Sentence: 
Microarray and tiling array studies indicate that acr-9 is enriched in neurons including the GABAergic, DA, VA, and AVE neurons, ventral nerve cord and in the coelomocytes.

Example 3:
Data: WBGene00022803	ZK688.9	Enriched in	body wall muscle cell,AFD,AWB,I5 neuron,DA neuron,SAB,retrovesicular ganglion,neuron	tiling array,microarray
Sentence: Tiling array and microarray studies indicate that ZK688.9 is enriched in body wall muscle cells and the neurons including the AFD, AWB, I5, DA, SAB and retrovesicular ganglion.

Example 4:
Data: WBGene00022824	ZK813.5	Enriched in	BAG,NSM	tiling array,RNA-seq
Sentence: Tiling array and RNA sequencing studies indicate that ZK813.5 is enriched in the BAG and NSM neurons.

Example 5:
Data: WBGene00000675	col-101	Enriched in	PVD,OLL,cephalic sheath cell,coelomocyte,dopaminergic neuron,ventral nerve cord,hypodermis,germ line	microarray,tiling array,RNA-seq
Sentence: Microarray, tiling array and RNA sequencing studies indicate that col-101 is enriched in PVD, OLL and dopaminergic neurons and the cephalic sheath cells, coelomocytes, ventral nerve cord, hypodermis and the germ line.

Example 6:
Data: WBGene00022867	ZK1240.2	Enriched in	intestine	tiling array
Sentence: Tiling array studies indicate that ZK1240.2 is enriched in the intestine.

Example 7:
Data: WBGene00008175	C48B4.12	Enriched in	male-specific	microarray
Sentence: Microarray studies indicate that C48B4.12 is enriched in male-specific tissues.

Example 8:
Data: WBGene00008447	E01G4.5	Enriched in	pharynx,body wall muscle cell,male-specific,muscle cell,intestine	microarray,tiling array
Sentence: Microarray and tiling array studies indicate that E01G4.5 is enriched in the pharynx, body wall muscle cells, muscle cells, intestine and male-specific tissues.

Example 9:
Data: WBGene00009453	F36A2.3	Enriched in	germline precursor cell,hypodermis,hermaphrodite-specific	tiling array,microarray
Sentence: Tiling array and microarray studies indicate that F36A2.3 is enriched in germline precursor cells, hypodermis and hermaphrodite-specific tissues.

Rules for sentence construction for Pristionchus pacificus (ppa) genes

File: ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ppaECsummary_anatomy.WS252.txt

  • Include all data for all genes
  • Expand the experiment type 'RNA seq' to 'RNA sequencing'

Example 1:
Data:WBGene00102297	Ppa-dnj-22	Enriched in	germ line	microarray
Sentence: Microarray studies indicate that Ppa-dnj-22 is enriched in the germline.

Gene regulation expression cluster data

File: ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ceECsummary_geneReg.WS252.txt

  • Include this data only when a gene has no concise description or no automated description
  • Expand the experiment type 'RNA seq' to 'RNA sequencing'
  • Add the word 'expression' after the gene name in the sentence: <Experiment Type> indicate that <gene name> expression is regulated by <Regulator Gene Name 1, Regulator Gene Name 2 and Regulator Gene Name 3>.

Example 1:
Data:WBGene00220086	T23D8.12	Regulated by	fbf-1,rnp-8,gld-1,aak-2,tdp-1,isp-1	microarray
Sentence: Microarray studies indicate that T23D8.12 expression is regulated by fbf-1, rnp-8, gld-1, aak-2, tdp-1 and isp-1.

Example 2:
Data: WBGene00220023	K11C4.14	Regulated by	prg-1,dcr-1	RNA-seq,microarray
Sentence: RNA sequencing and microarray studies indicate that K11C4.14 expression is regulated by prg-1 and dcr-1.
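
A sketch of the regulation sentence template, following the rule text by expanding 'RNA-seq' to 'RNA sequencing'. The function names and the expansion table are assumptions. The same function would also fit the molecule regulation data, with chemical names in place of regulator genes.

```python
# Spellings seen in the data and the rule text, mapped to the expanded form.
EXPANSIONS = {"RNA-seq": "RNA sequencing", "RNA seq": "RNA sequencing"}


def join_with_and(items):
    """Join items as 'a, b and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def regulation_sentence(gene, regulators, experiment_types):
    """Fill the '<Experiment Type> studies indicate that ...' template."""
    methods = join_with_and(EXPANSIONS.get(m, m) for m in experiment_types)
    methods = methods[:1].upper() + methods[1:]  # capitalize the sentence start
    return (
        f"{methods} studies indicate that {gene} expression "
        f"is regulated by {join_with_and(regulators)}."
    )
```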

Molecule regulation expression cluster data

File: ftp://caltech.wormbase.org/pub/wormbase/ExprClusterSummary/WS253/ceECsummary_molReg.WS252.txt

  • Include only for genes that have no concise description and no automated description
  • Follow capitalization of molecules as in data source file
  • Expand the experiment type 'RNA seq' to 'RNA sequencing'
  • Add the word 'expression' after gene name in sentence: <Experiment Type> indicate that <gene name> expression is regulated by <Chemical Name 1, Chemical Name 2 and Chemical Name 3>.
  • Pluralize the Chemical Name 'adsorbable organic bromine compound' to 'adsorbable organic bromine compounds'

Example 1:
Data: WBGene00235256	Y54G11A.19	Regulated by	Chlorpyrifos,Diazinon	microarray
Sentence: Microarray studies indicate that Y54G11A.19 expression is regulated by Chlorpyrifos and Diazinon.

Example 2:
Data: WBGene00012041	T26E3.8	Regulated by	1-methylnicotinamide,methylmercuric chloride,resveratrol,Atrazine,adsorbable organic bromine compound	RNA-seq,microarray
Sentence: RNA sequencing and microarray studies indicate that T26E3.8 expression is regulated by 1-methylnicotinamide, methylmercuric chloride, resveratrol, Atrazine and adsorbable organic bromine compounds.

Preliminary results

These descriptions are based on Homology predictions and the GO annotations for Process, Component and Function:

 

*alh-2
alh-2 encodes an ortholog of human dehydrogenase aldehyde dehydrogenase family 1 member dehydrogenase; alh-2 is predicted to have oxidoreductase activity and oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor based on protein domain information.

*asp-5
asp-5 encodes an ortholog of human cathepsin d; asp-5 is involved in cell death and locomotion; asp-5 is predicted to have aspartic-type endopeptidase activity based on protein domain information.

*cng-1
cng-1 encodes an ortholog of human cyclic nucleotide gated channel alpha 3; cng-1 is predicted to have ion channel activity based on protein domain information; cng-1 is localized to the neuronal cell body.

Mapping of automated concise description data to OA fields

Mapping of data to data fields in the OA
OA field number | OA field name | Data to be inserted | Example of data to be inserted | Required or Not | OA table name
1 | WBGene | WBGene | WBGene00000376 | Required | con_wbgene
2 | Species | Species | Onchocerca volvulus | Required | con_species
3 | Curator | Name of Curator | James Done (first, then replace with) Ranjana Kishore (insert for all rows) | Required | con_curator
4 | Curator History | Name of Curator | same as pgid (insert for all rows) | Required | con_curhistory
5 | Description Type | Automated_concise_description (insert for all rows) | Automated_concise_description | Required | con_desctype
6 | Description Text | the automated concise description | asp-19 encodes an ortholog... | Required | con_desctext
7 | Reference | WBPaper | WBPaper00026979 | Required | con_paper
8 | Accession Evidence | For Homology: for elegans, use ENSEMBL Gene ID; for non-elegans species, use WBGeneID. For Process, Function: use InterPro ID | For elegans: ENSEMBL:ENSG00000103257 (previously used the ENSEMBL protein ids) and INTERPRO:IPR002293. For non-elegans species: WBGene00007443 and INTERPRO:IPR002293 (comma separate multiple values) | Not required | con_accession
9 | Last Updated | Date when the descriptions were last generated | 2014-09-11 | Required | con_lastupdate
10 | pgid | pgid | 1149 (Postgres will generate) | Required |

Tab-delimited file for OA insert

  • One tab-delimited file per species
  • Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
  • Format: tab-delimited file, comma separate the values when multiple values are present
  • Date is the last date that the script was run to generate the automated descriptions (eg. 2014-05-28)
  • File will be placed on textpresso-dev to be picked up by a cron job by JC
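
A minimal sketch of writing one row of the tab-delimited file in the column order listed above. The function name is hypothetical, and rejecting embedded tabs or newlines is an added safety check rather than a documented behaviour of the real script.

```python
def oa_row(wbgene, date, papers, accessions, description, species, inferred_text):
    """Return one tab-delimited line for the OA insert file."""
    fields = [
        wbgene,
        date,                  # last run date, e.g. 2014-05-28
        ",".join(papers),      # multiple values are comma separated
        ",".join(accessions),
        description,
        species,
        inferred_text,
    ]
    for field in fields:
        # A tab or newline inside a value would shift the columns.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)
```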

Directory structure for project

Inserting automated descriptions into postgres

Populating script

Run the script to populate from here: /home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl

Use the command 'screen' after ssh-ing into Tazendra, to keep screen alive (takes about 2 hrs)

Script actually at: /home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl

Script looks at http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt for release number and http://textpresso-dev.caltech.edu/concise_descriptions/species.txt for the different species

Script gets data from the following URL for each of the species: For elegans: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt

For briggsae: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_briggsae/descriptions/OA_concise_descriptions.WS247.txt

and so on.

When populating for each OA row that has con_desctype set to 'Automated_concise_description', it will delete from these tables :

  • con_wbgene
  • con_species
  • con_curator
  • con_curhistory
  • con_desctext
  • con_paper
  • con_accession
  • con_inferredauto
  • con_lastupdate

Leaves

  • con_desctype
  • pgid

Meaning that it's keeping the 'Automated_concise_description' value, so that future runs of the script will reuse existing pgids. When running out of pre-existing pgids, it will create a new one and assign 'Automated_concise_description' to con_desctype
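
The pgid reuse described above might look roughly like the sketch below. This is not taken from populate_automated_concise_descriptions.pl: the function name, the inputs and the way new pgids are numbered are all assumptions.

```python
def assign_pgids(genes, reusable_pgids, next_new_pgid):
    """Map each gene to a pgid, reusing existing automated-description pgids first.

    reusable_pgids: pgids whose con_desctype is still 'Automated_concise_description'.
    next_new_pgid: first unused pgid, used once the reusable pool runs out.
    """
    assignments = {}
    pool = sorted(reusable_pgids)
    for index, gene in enumerate(genes):
        if index < len(pool):
            assignments[gene] = pool[index]
        else:
            # New row: con_desctype would also need to be set for this pgid.
            assignments[gene] = next_new_pgid
            next_new_pgid += 1
    return assignments
```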

For testing on Mangolassi

Both populating and dumping scripts at:/home/acedb/ranjana/concise_testing

For the populating script always redirect output to a file: populate_automated_concise_descriptions.pg.<date>

Dumping the automated, concise and provisional descriptions

Dumping script

  • Run the dumping script manually, on tazendra: /home/acedb/kimberly/citace_upload/concise/wrapper.pl
  • The file, concise_dump_new.ace, can be scp-ed for testing from:scp acedb@tazendra.caltech.edu:/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
  • Concise descriptions dumper at /home/postgres/work/citace_upload/concise/dump_concise.pl
  • /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

Change for the WS265 data upload, March 2018

Will dump both automated and concise descriptions, script changed on Mangolassi, check before making live on Tazendra.

Script that finds genes with concise descriptions that also have an automated description

On Tazendra and Mangolassi: /home/acedb/ranjana/concise_testing/find_concise_with_automated.pl

Script that compares citace test file and Postgres for automated descriptions

Location:/home/acedb/ranjana/concise_testing/compare_concise_postgres_vs_acefile.pl

  • Needs a 'citace_genes_with_automated.ace' file which is the .ace export of all genes in citace with the 'Automated_description' tag (I use Query Builder to do this query and then export the 'Names' to a file 'citace_genes_with_automated.ace').
  • Compares the output from citace (from testing concise_dump_new.ace in empty citace database on Maya) of genes with automated tag, to postgres data, also outputs if a gene has a concise description (which would not be dumped so not in .ace file, but in Postgres).
  • For the WS247 upload, 1 gene not accounted for, WBGene00020108, which has been merged into a dead gene?

Discontinued from WS246 upload

Changes to dumping script

  • Jan 21st, 2016: 'Inferred_automatically' tag
    • Script was splitting on commas for all evidences: paper, inferredauto, person accession, exprtext, rnai microarray and lastupdate, so the text in the 'Inferred_automatically' tag was being split on the commas in the natural language sentence, 'This description was generated automatically by a Textpresso script based on homology/orthology data, Gene Ontology (GO) annotations and tissue expression data from the WS251 version of WormBase.'

Text for Automatically_inferred tag

Tag looks like: Evidence Automatically_inferred ?Text

Text will be: This description was automatically generated by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations from the <WS246> version of WormBase.

Rules for dumping the different types of descriptions in the OA into .ace

1. List of tags to be dumped:

  • Automated_description
  • Paper_evidence
  • Accession_evidence
  • Date_last_updated
  • Inferred_automatically

2. For those genes that have both automated and concise descriptions, we dump only the concise descriptions

3. For those genes that have concise descriptions, with the 'NO DUMP' flag, we dump the automated descriptions

4. Dead and merged genes are not dumped (for both concise and automated descriptions)
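
Rules 2 to 4 amount to a small decision function. The sketch below covers only the rules as listed here, not the later WS265 change, and the function and argument names are hypothetical.

```python
def description_to_dump(has_concise, concise_no_dump, has_automated, is_dead_or_merged):
    """Return 'concise', 'automated' or None for one gene."""
    # Rule 4: dead and merged genes are never dumped.
    if is_dead_or_merged:
        return None
    # Rule 2: a dumpable concise description wins over the automated one.
    if has_concise and not concise_no_dump:
        return "concise"
    # Rule 3: concise flagged 'NO DUMP' (or absent) falls back to automated.
    if has_automated:
        return "automated"
    return None
```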

.ace format example

Gene : "WBGene00009585"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Date_last_updated	"2012-07-24"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Paper_evidence	"WBPaper000045688"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Paper_evidence	"WBPaper000045689"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Accession_evidence "ENSEMBL" "ENSP00000419081"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Accession_evidence "INTERPRO" "IPR002048"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Inferred_automatically "This description was generated automatically by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations, from the WS243 version of WormBase."	

Numbers from citace testing

  • concise_dump_new.ace was scped from: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace and tested
  • Tested concise_dump_new.ace on empty citaceminus mirror on local machine with the WS246 models file saved, read in fine!
  • WS246 numbers: Total genes: 9,529 genes, Genes with automated descriptions: 3,362; 85,039 lines

Postgres queries

For genes that have more than one concise description

SELECT con_wbgene, COUNT(*) AS count FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') GROUP BY con_wbgene HAVING COUNT(*) > 1;

WBGene00003979 | pes-5, curated as transposon by P. Davis, has 2 concise, 1 automated 
WBGene00007839 | C31C9.5, curated as transposon by P. Davis, has 2 concise, 1 automated
WBGene00000029 | Older concise has 'NO DUMP' flag, valid one is dumpable, 1 automated
WBGene00002992 | Older concise has 'NO DUMP' flag, valid one is dumpable, 1 automated
WBGene00003071 | 
WBGene00004174 |

Number of genes with automated descriptions

SELECT COUNT(*) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description');

Number of genes with both concise and automated descriptions

SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description') AND con_wbgene IN (SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') ); (=5241 for WS252)

Genes that have a concise description but no automated description

SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') AND con_wbgene NOT IN (SELECT con_wbgene FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description')) ORDER BY con_wbgene;

Genes that have a concise description, no automated description, excluding NO DUMP

Genes with a dumpable concise description and no dumpable automated description:

SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') AND con_wbgene NOT IN (SELECT con_wbgene FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') ) ORDER BY con_wbgene;

Same as the query above, but excluding dead genes

SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') AND con_wbgene NOT IN (SELECT con_wbgene FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description') AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP') ) AND con_wbgene NOT IN (SELECT gin_wbgene FROM gin_wbgene WHERE joinkey IN (SELECT joinkey FROM gin_dead)) ORDER BY con_wbgene;
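
The nested NOT IN conditions are easy to misread, so here is a sketch that runs the dead-gene-excluding query against a toy in-memory SQLite database (not the curation Postgres database). The gene names and rows are invented, and the table layout (a joinkey column plus a column named after the table) is an assumption inferred from the queries on this page.

```python
import sqlite3

QUERY = """
SELECT DISTINCT con_wbgene FROM con_wbgene
WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Concise_description')
  AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP')
  AND con_wbgene NOT IN (
        SELECT con_wbgene FROM con_wbgene
        WHERE joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype = 'Automated_description')
          AND joinkey NOT IN (SELECT joinkey FROM con_nodump WHERE con_nodump = 'NO DUMP'))
  AND con_wbgene NOT IN (
        SELECT gin_wbgene FROM gin_wbgene
        WHERE joinkey IN (SELECT joinkey FROM gin_dead))
ORDER BY con_wbgene;
"""


def build_toy_db():
    """Create an in-memory database with invented rows (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE con_wbgene (joinkey TEXT, con_wbgene TEXT);
        CREATE TABLE con_desctype (joinkey TEXT, con_desctype TEXT);
        CREATE TABLE con_nodump (joinkey TEXT, con_nodump TEXT);
        CREATE TABLE gin_wbgene (joinkey TEXT, gin_wbgene TEXT);
        CREATE TABLE gin_dead (joinkey TEXT, gin_dead TEXT);
    """)
    concise, automated = "Concise_description", "Automated_description"
    rows = [  # (joinkey, gene, description type)
        ("1", "GENE_A", concise),    # concise only: expected in result
        ("2", "GENE_B", concise),    # has a dumpable automated row: excluded
        ("3", "GENE_B", automated),
        ("4", "GENE_C", concise),    # concise row is NO DUMP: excluded
        ("5", "GENE_D", concise),    # automated row is NO DUMP: expected
        ("6", "GENE_D", automated),
        ("7", "GENE_E", concise),    # dead gene: excluded
    ]
    db.executemany("INSERT INTO con_wbgene VALUES (?, ?)", [(k, g) for k, g, _ in rows])
    db.executemany("INSERT INTO con_desctype VALUES (?, ?)", [(k, t) for k, _, t in rows])
    db.executemany("INSERT INTO con_nodump VALUES (?, 'NO DUMP')", [("4",), ("6",)])
    db.execute("INSERT INTO gin_wbgene VALUES ('g7', 'GENE_E')")
    db.execute("INSERT INTO gin_dead VALUES ('g7', 'Dead')")
    return db


def run_query(db):
    """Return the gene names selected by QUERY, in order."""
    return [row[0] for row in db.execute(QUERY)]


if __name__ == "__main__":
    print(run_query(build_toy_db()))
```

One caveat that applies to Postgres as well: NOT IN returns no rows if the subquery yields a NULL, so the queries assume that the joinkey and gene columns are never NULL.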

Reporting numbers

  • Currently, the automated descriptions are generated for genes without a concise description.

  • Generate a report of the numbers and place it on Textpresso-dev: http://textpresso-dev.caltech.edu/concise_descriptions/

  • Report for WS246 upload, Sept/Oct, 2014:
  • Total number of automated descriptions = 3,364
  • Number of automated descriptions with homology = 2,353
  • Number of automated descriptions with process information = 1,206
  • Number of automated descriptions with function information = 2,183
  • Number of automated descriptions with component information = 244
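
For the report, the per-category counts can also be expressed as a share of the total. The sketch below uses only the WS246 numbers listed above; the function name coverage is hypothetical. The categories overlap (one description can contain several), so the shares are not expected to add up to 100%.

```python
# WS246 counts copied from the report above.
WS246_COUNTS = {"homology": 2353, "process": 1206, "function": 2183, "component": 244}
TOTAL = 3364


def coverage(counts, total):
    """Return each category count as a percentage of the total."""
    return {category: 100.0 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    for category, pct in coverage(WS246_COUNTS, TOTAL).items():
        print(f"{category}: {pct:.1f}% of {TOTAL:,} descriptions")
```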

Publications related to Text-mining methods

  • Automatically generating gene summaries from biomedical literature.

Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.

Pac Symp Biocomput. 2006:40-51.

PMID:17094226

  • Generating gene summaries from biomedical literature: A study of semi-structured summarization

Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz

Information Processing and Management 43 (2007) 1777–1791

Project milestones by release

WS246

  • Wrote descriptions for C. elegans that included orthology (based on BlastP) and process, function and cellular component (based on GO).
  • Used BlastP for human orthologs and the orthologs file

WS247

  • Added C. briggsae and P. pacificus

WS248

  • Added 5 more species: B. malayi, C. brenneri, C. japonica, C. remanei and O. volvulus
  • Orthology: Switched from BlastP to all orthologs, using only those human orthologs confirmed by more than one method, taken from the orthology file generated by K. Howe.
  • Used the HGNC symbol as names for the human genes and placed the full name (description) inside parentheses.
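
The two WS248 orthology rules (keep only human orthologs confirmed by more than one method; show the HGNC symbol with the full name in parentheses) can be sketched as follows. The record layout, the method labels and the second gene are hypothetical placeholders; only CALML4 / calmodulin-like 4 comes from this page, and the real orthology file format may differ.

```python
def confirmed_orthologs(orthologs, min_methods=2):
    """Keep orthologs supported by at least min_methods distinct methods."""
    return [o for o in orthologs if len(set(o["methods"])) >= min_methods]


def format_human_gene(symbol, full_name):
    """HGNC symbol as the name, full name (description) in parentheses."""
    return f"{symbol} ({full_name})"


if __name__ == "__main__":
    # Hypothetical records; method labels are placeholders.
    records = [
        {"symbol": "CALML4", "name": "calmodulin-like 4",
         "methods": ["method_a", "method_b"]},
        {"symbol": "GENE2", "name": "placeholder gene",
         "methods": ["method_a"]},  # single method: dropped
    ]
    for o in confirmed_orthologs(records):
        print(format_human_gene(o["symbol"], o["name"]))
```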

WS249

  • Process: Added C. elegans gene experimental process information to non-elegans species that have no GO annotations.
  • Tissue expression: Began work on tissue expression for C. elegans, wrote the sentences but did not include in the descriptions.

WS250

  • Gene names: Pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (as previously done)
  • Tissue expression: Continued working on tissue expression, did not include in descriptions.

WS251

  • Orthology: Used C. elegans gene classes to group the C. elegans orthologs of non-elegans species

WS252

  • Orthology: Used HGNC human gene families to group the human orthologs for C. elegans.
  • Issue: the rule was to use only the GO terms common to all orthologs, and the cosmid.gene orthologs did not have GO terms (they are less likely to have GO terms than named genes), so we lost GO processes for the orthologs (these are the C. elegans genes that we list as orthologs for non-elegans species).
  • Strict separation of sentences by semantic category for all species.
  • The first sentence has orthology only, using gene classes and publication popularity to reduce the number of orthologs (for non-elegans species); the second sentence lists orthologs by common GO term for process.
  • All cosmid.gene names are ignored when finding the GO terms: since the rule is to find the GO terms common to all the orthologs, including even one cosmid.gene without a GO process term would lose all GO processes for the C. elegans orthologs in the process sentences for non-elegans species.
  • Added tissue expression
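
The common-GO-term rule with cosmid.gene names ignored can be sketched like this. The regular expression for cosmid.gene names and the function name are assumptions made for illustration; the gene names and the 'lipid storage' term come from the msp examples on this page, while the extra term is invented to show the intersection.

```python
import re

# Assumed pattern for cosmid.gene (sequence) names such as ZK1248.17.
COSMID_NAME = re.compile(r"^[A-Z0-9]+\.\d+[a-z]?$")


def common_go_terms(orthologs, go_terms):
    """Return (named orthologs, GO terms shared by all of them).

    Cosmid.gene names are skipped, so an unannotated cosmid.gene cannot
    empty the intersection.
    """
    named = [g for g in orthologs if not COSMID_NAME.match(g)]
    term_sets = [set(go_terms.get(g, ())) for g in named]
    if not term_sets:
        return named, set()
    return named, set.intersection(*term_sets)


if __name__ == "__main__":
    orthologs = ["msp-10", "msp-36", "msp-56", "msp-76", "ZK1248.17", "Y50E8A.2"]
    go_terms = {
        "msp-10": {"lipid storage", "invented extra term"},
        "msp-36": {"lipid storage"},
        "msp-56": {"lipid storage"},
        "msp-76": {"lipid storage"},
        # ZK1248.17 and Y50E8A.2 have no GO process terms.
    }
    named, terms = common_go_terms(orthologs, go_terms)
    print(named, terms)
```

Without the name filter, the two unannotated cosmid.gene orthologs would contribute empty sets and the intersection would be empty, which is the loss of GO processes described above.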

WS253

  • Added expression cluster data for C. elegans (~10,000 genes) and very few for Brugia and Pristionchus.
  • Expression cluster data cover gene regulation ('regulated by gene'), molecule regulation ('regulated by molecule') and anatomy enrichment ('enriched in anatomy_term').

Automated descriptions for C. briggsae

Issues to address by release

  • 1. For the non-elegans species, we need to use the WBGene ID as Accession_evidence. Can we use our own database, e.g. Accession_evidence "WormBase" "WBGene00005678"? (not done)
  • 2. For non-elegans species, need to add in elegans gene experimental data, but with multiple orthologs, which do we pick? (done, using only the common GO terms for orthologs)
  • 3. Gene names should be pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (will be addressed for WS250; done)
  • 4. For WS250, check that the gene names are (somewhat) current; corrected mel-47 to tofu-6 by hand for WS249.
  • 5. For WS252: Issue: when new orthologs are added and they are cosmid.gene names, which are less likely to have GO terms than named genes, the rule of using only the common GO terms caused us to lose GO processes for the orthologs (these are the C. elegans genes that we list as orthologs for non-elegans species).

Solution: Separate by semantic category, first sentence will have orthology only, using gene classes and publication popularity to reduce the number of orthologs; second sentence will have orthologs listed by common GO term for process; ignore all cosmid.gene names for finding the GO terms.

  • For WS253 or WS254: Wen needs to separate the different species into different files for expression clusters

Example 1: WBGene00124170
WS250:
CJA04965 is an ortholog of C. elegans msp-10, msp-36, msp-56 and msp-76, which are involved in lipid storage.
WS251:
CJA04965 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and msp-36, 
and ZK1248.17 and Y50E8A.2.
WS252:
CJA04965 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and msp-36, 
and ZK1248.17 and Y50E8A.2; in C. elegans, msp-10, msp-36, msp-56 and msp-76 are involved in lipid storage.

Example 2: WBGene00127848
WS250:CJA08645 is an ortholog of C. elegans msp-10, msp-36, msp-56 and msp-76, which are involved in lipid storage.
WS251:CJA08645 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and 
msp-36, and ZK1248.17 and Y50E8A.2.
WS252:CJA08645 is an ortholog of the members of the C. elegans msp gene class including msp-76, msp-56 and 
msp-36, and ZK1248.17 and Y50E8A.2; in C. elegans, msp-10, msp-36, msp-56 and msp-76, are involved in lipid storage.
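
The WS251-style orthology sentence in the examples above can be reproduced with a short sketch. Everything here is illustrative: the function names are hypothetical, the gene class is guessed from the part of the name before the hyphen, and the publication counts are invented so that the ordering matches the example. Singular wording for a gene class with one member is not handled.

```python
def join_names(names):
    """Join names as 'a, b and c'."""
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def orthology_sentence(gene, orthologs, paper_counts, max_per_class=3):
    """Group orthologs by gene class; keep the most-published members."""
    classes, unclassified = {}, []
    for name in orthologs:
        if "-" in name:  # assumed: named genes look like class-number
            classes.setdefault(name.split("-")[0], []).append(name)
        else:            # cosmid.gene names have no gene class
            unclassified.append(name)
    parts = []
    for gene_class, members in classes.items():
        top = sorted(members, key=lambda n: paper_counts.get(n, 0),
                     reverse=True)[:max_per_class]
        parts.append(f"the members of the C. elegans {gene_class} gene class "
                     f"including {join_names(top)}")
    if unclassified:
        parts.append(join_names(unclassified))
    return f"{gene} is an ortholog of " + ", and ".join(parts) + "."


if __name__ == "__main__":
    # Invented publication counts, chosen to match the example ordering.
    counts = {"msp-76": 3, "msp-56": 2, "msp-36": 1, "msp-10": 0}
    print(orthology_sentence(
        "CJA04965",
        ["msp-10", "msp-36", "msp-56", "msp-76", "ZK1248.17", "Y50E8A.2"],
        counts,
    ))
```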

Automated descriptions software

Documentation for workflow and scripts

Phasing out the manual annotations
