Generation of automated descriptions

1 Querying for gene sets
- 1.1 Obtaining a set of genes and gene names for automated descriptions
- 1.2 Obtaining the set of genes with a concise description
  - 1.2.1 Set of genes with no concise description
  - 1.2.2 Set of genes with no concise description and at least one published paper
2 Location of project-related files on Textpresso
- 2.1 Location of the manual concise description files:
3 Semantic categories in an Automated Description
4 Orthology/Homology
- 4.1 WS248 upload (March 3rd to citace, 2015)
- 4.2 Template of an Orthology sentence
- 4.3 For WS247 upload
- 4.4 Rules for orthology sentence construction
- 4.5 How to pick orthologs for the description
- 4.6 Explanation of Orthology/homology information in WormBase
5 Process
- 5.1 Template for a process sentence
- 5.2 Source file for Process data
- 5.3 Rules for process sentence construction
- 5.4 Adding elegans process information for non-C.elegans nematodes
6 Molecular function/identity
- 6.1 Source file for molecular function data
- 6.2 Rules for molecular function sentence construction
7 Sub-cellular localization
- 7.1 Source file for Sub-cellular localization data
- 7.2 Rules for sub-cellular localization sentence construction
8 Order of sentences
9 Tissue expression
- 9.1 Source files for Tissue expression data
- 9.2 Rules for tissue expression sentence construction
10 Preliminary results
11 Mapping of automated concise description data to OA fields
12 Tab-delimited file for OA insert
13 Directory structure for project
14 Inserting automated descriptions into postgres
- 14.1 Populating script
- 14.2 For testing on Mangolassi
15 Dumping the automated, concise and provisional descriptions
- 15.1 Dumping script
- 15.2 Script that finds genes with concise descrips that also have an automated description
- 15.3 Script that compares citace test file and Postgres for automated descriptions
- 15.4 Discontinued from WS246 upload
- 15.5 Text for Automatically_inferred tag
- 15.6 Rules for dumping the different types of descriptions in the OA
- 15.7 .ace format
- 15.8 Numbers from citace testing
16 Reporting numbers
17 Publications related to Text-mining methods
18 Project milestones by release
19 Automated descriptions for C. briggsae
20 Issues to address
21 Automated descriptions software

Querying for gene sets

Obtaining a set of genes and gene names for automated descriptions

For any given species:

1. Generate a combined set of unique WBGenes from the orthology files and gene association files.
2. Subtract from the set in step 1 the WBGenes that already have a concise description.
3. Query the remaining WBGene identifiers against gin_wbgene to get the gin_locus, which is the gene name. Do not use gene names from the input files (GO and orthology), since these could have changed.
4. Query the gin_dead postgres table to get dead genes and subtract these from the set in step 3 to get the set of live genes for which automated descriptions need to be generated.
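
The four steps above amount to set arithmetic. A minimal sketch, assuming the WBGene lists and the contents of gin_locus and gin_dead have already been loaded into plain Python collections (the function name and argument names are hypothetical, not part of the project scripts):

```python
def genes_needing_descriptions(orthology_genes, association_genes,
                               concise_genes, gin_locus, dead_genes):
    """Return {WBGene ID: current gene name} for live genes without a concise description.

    All arguments are plain Python collections, used here as hypothetical
    stand-ins for the input files and the postgres tables.
    """
    # Step 1: combined set of unique WBGenes from both kinds of input file.
    candidates = set(orthology_genes) | set(association_genes)
    # Step 2: subtract genes that already have a concise description.
    candidates -= set(concise_genes)
    # Step 3: take the gene name from gin_locus, never from the input files.
    named = {gene: gin_locus.get(gene) for gene in candidates}
    # Step 4: subtract dead genes to leave only live genes.
    return {gene: name for gene, name in sorted(named.items())
            if gene not in dead_genes}
```

Genes with no entry in the gin_locus mapping come back with the name None in this sketch; how such genes should be named is not covered here.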

Obtaining the set of genes with a concise description

Query Postgres for all genes with a concise description. Relevant postgres table names:

con_wbgene: Stores the WBGene ID and gene names
con_desctype: Type of description (relevant for us: Concise_description)
con_desctext: Text of the concise description

Query for all WBGenes that have a concise description (in con_desctext AND con_desctype):

SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL) AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL) ORDER BY con_wbgene;

Number of genes with a concise description (as of 05.07.2014): 6,624

Set of genes with no concise description

Set of genes with no concise description and at least one published paper

Location of project-related files on Textpresso

http://textpresso-dev.caltech.edu/concise_descriptions/

Location of the manual concise description files:

For viewing the latest dump:

http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace

Script: /home/postgres/work/citace_upload/concise/dump_concise.pl
File location: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

Semantic categories in an Automated Description

1. Orthology/Similarity/Molecular identity
2. Processes
3. Molecular Function
4. Sub-cellular localization
5. Tissue expression (not done yet)

Orthology/Homology

WS248 upload (March 3rd to citace, 2015)

Orthology source files:

Kevin will release new WS247 orthology files with Ensembl genes instead of proteins
ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS246.orthologs.txt.gz

Species list:

Caenorhabditis elegans (WS246) --use orthology to human
Caenorhabditis briggsae (WS247) --use orthology to elegans
Caenorhabditis japonica (WS248) --use orthology to elegans
Caenorhabditis remanei (WS248) --use orthology to elegans
Caenorhabditis brenneri (WS248) --use orthology to elegans
Brugia malayi (WS248) --use orthology to elegans, if not present use orthology to Onchocerca
Onchocerca volvulus (WS248) --use orthology to elegans, if not present, then use orthology to Brugia
Pristionchus pacificus (WS248) --use orthology to elegans
Strongyloides ratti (WS250) --use orthology to elegans, if not present, use orthology to Brugia and Onchocerca

Rule 1: We will use only those human genes that have been predicted by more than one orthology prediction method (from Kevin's new WS247 orthology file).

Example 1: aat-1 (from orthology file)
- Homo sapiens ENSG00000013293 SLC7A14 Panther
- Homo sapiens ENSG00000165349 SLC7A3 Panther
- Homo sapiens ENSG00000139514 SLC7A1 Panther
- Homo sapiens ENSG00000151012 SLC7A11 Panther
- Homo sapiens ENSG00000103064 SLC7A6 Panther
- Homo sapiens ENSG00000155465 SLC7A7 Panther
- Homo sapiens ENSG00000130876 SLC7A10 Panther
- Homo sapiens ENSG00000103257 SLC7A5 Panther;Inparanoid_8
- Homo sapiens ENSG00000092068 SLC7A8 Inparanoid_8

In the above list, we would pick only ENSG00000103257, since it was predicted to be a human ortholog of aat-1 by more than one method, both Panther and Inparanoid_8.

Example 2: nipi-4 from orthology file
- Homo sapiens ENSG00000120899 PTK2B Panther
- Homo sapiens ENSG00000169398 PTK2 Panther

If every human gene was predicted by only one method, we have to list them all; here, list both PTK2B and PTK2.
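
A minimal sketch of this selection rule, assuming each orthology row has been parsed into an (Ensembl gene ID, symbol, methods) tuple with the methods separated by ';' as in the examples above (the function name is hypothetical):

```python
def pick_human_orthologs(rows):
    """rows: (ensembl_gene_id, symbol, methods) tuples for one worm gene,
    where methods is a ';'-separated string such as 'Panther;Inparanoid_8'."""
    def method_count(methods):
        return len({m.strip() for m in methods.split(";") if m.strip()})

    # Keep only human genes predicted by more than one method.
    supported = [row for row in rows if method_count(row[2]) > 1]
    # If no human gene has more than one method, fall back to listing them all.
    return supported if supported else list(rows)
```

With the aat-1 rows this keeps only ENSG00000103257 (SLC7A5); with the nipi-4 rows it keeps both PTK2B and PTK2.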

Template of an Orthology sentence

<Worm Gene> is an ortholog of <human gene>.

<Worm Gene> is an ortholog of <human gene1>, <human gene2>, <human gene3> and <human gene4>.

We will use the HGNC name outside parentheses, as the gene name, and put the description inside the parentheses.

Example 1

mtp-18 is an ortholog of human <ENSG00000242114> and <ENSG00000249590>;
mtp-18 is an ortholog of human MTFP1 (mitochondrial fission process 1) and <no HGNC symbol/name> (Uncharacterized protein);

Resolves into

mtp-18 is an ortholog of human MTFP1 (protein mitochondrial fission process 1);

Example 2

marc-1 is an ortholog of human <ENSG00000183654>, <ENSG00000144583>, <ENSG00000139266>, <ENSG00000145416>, and <ENSG00000278545>;
marc-1 is an ortholog of human MARCH11 (membrane-associated ring finger (C3HC4) 11), MARCH4 (membrane-associated ring finger (C3HC4) 4, E3 ubiquitin protein ligase), MARCH9 (membrane-associated ring finger (C3HC4) 9), MARCH1 (membrane-associated ring finger (C3HC4) 1, E3 ubiquitin protein ligase) and MARCH8 (membrane-associated ring finger (C3HC4) 8, E3 ubiquitin protein ligase);

Example 3

poml-3 is an ortholog of human <ENSG00000005421>, <ENSG00000105852>, <ENSG00000105854>
poml-3 is an ortholog of human PON1 (paraoxonase 1), PON3 (paraoxonase 3), and PON2 (paraoxonase 2);
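
The template and examples above can be sketched as a small sentence builder. This is an illustration only: the function names are hypothetical, orthologs with no HGNC symbol are simply dropped (as in Example 1), and no comma is placed before the final 'and', following the template rather than Example 3.

```python
def join_with_and(items):
    """Join items as 'a', 'a and b' or 'a, b and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def orthology_sentence(worm_gene, orthologs):
    """orthologs: list of (hgnc_symbol, description); symbol may be None or ''."""
    # HGNC symbol outside the parentheses, description inside.
    named = [f"{symbol} ({description})"
             for symbol, description in orthologs if symbol]
    if not named:
        return ""
    return f"{worm_gene} is an ortholog of human {join_with_and(named)};"
```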

For WS247 upload

We used the BlastP hits file; if no ortholog was found there, we used the orthologs file.

1. Best BlastP hits file: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS246.best_blastp_hits.txt.gz

For WS247 we used the following order:

1. Mapping of elegans genes to proteins: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS246/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS246.xrefs.txt.gz

Rules for orthology sentence construction

Rule 1: If the words 'family member' occur in the description with words both before and after them, ignore 'family member'
- Examples:
- 'aldehyde dehydrogenase 8 family member a1' becomes 'aldehyde dehydrogenase 8a1'
- 'aldehyde dehydrogenase 9 family member a1' becomes 'aldehyde dehydrogenase 9a1'

Jul 2, 2014:

Rule 2: If the words 'human Uncharacterized protein' occur, ignore this homology
- Examples:
- ctg-1 encodes an ortholog of human Uncharacterized protein;
- mtp-18 encodes an ortholog of human Uncharacterized protein;
Rule 3: If 2 or more of these words occur: 'family', 'subfamily', 'group', 'member', 'polypeptide' or 'class', ignore them and resolve as in the examples:
- olfactory receptor, family 56, subfamily B, member 1 becomes olfactory receptor 56B1
- potassium intermediate/small conductance calcium-activated channel, subfamily N, member 2 becomes potassium intermediate/small conductance calcium-activated channel N2
- potassium inwardly-rectifying channel, subfamily J, member 12 becomes human potassium inwardly-rectifying channel J12
- nuclear receptor subfamily 3, group C, member 2 becomes nuclear receptor 3C2
- nuclear receptor subfamily 5, group A, member 2 becomes nuclear receptor 5A2
- nuclear receptor subfamily 1, group H, member 4 becomes nuclear receptor 1H4
- cytochrome P450, family 3, subfamily A, polypeptide 5 becomes cytochrome P450 3A5
- cytochrome P450, family 21, subfamily A, polypeptide 2 becomes cytochrome P450 21A2
- UDP glycosyltransferase 3 family, polypeptide A1 becomes UDP glycosyltransferase 3A1
- mannosidase, alpha, class 1B, member 1 becomes human mannosidase, alpha, 1B1
- phosphatidylinositol glycan anchor biosynthesis, class V stays as is, because the word 'class' occurs by itself.
- scavenger receptor class B, member 2 becomes scavenger receptor B2
Rule 4: If the word 'homolog' co-occurs with a species name ('Drosophila', 'S. cerevisiae' or 'yeast') inside brackets, ignore 'homolog' and move the species name out of the brackets, as in the examples:
- salvador homolog 1 (Drosophila) becomes Drosophila and human salvador 1
- SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) becomes S. cerevisiae and human SSU72 RNA polymerase II CTD phosphatase
- translocase of outer mitochondrial membrane 22 homolog (yeast) becomes yeast and human translocase of outer mitochondrial membrane 22
- human vacuolar protein sorting 4 homolog B (S. cerevisiae) becomes S. cerevisiae and human vacuolar protein sorting 4
- unconventional SNARE in the ER 1 homolog (S. cerevisiae) becomes S. cerevisiae and human unconventional SNARE in the ER 1
Rule 5: If the description field (meaning the human protein name) has '(C.elegans)' in it (these refer to an elegans gene, making the description circular), then ignore the description field and use the HGNC symbol instead (accession number lookup to symbol required).
- Examples:
  - Data: WBGene00017948,mth-1,ENSEMBL:ENSP00000407190,ENSG00000166979,ENST00000435323,eva-1 homolog C (C. elegans) [Source:HGNC Symbol;Acc:13239],EVA1C,EVA1C,EVA1C-005,EVA1C,EVA 1 HOMOLOG C PRECURSOR FAM176C
  - Sentence: mth-1 encodes an ortholog of human EVA1C (HGNC:EVA1C);.
  - Data:WBGene00004895,smu-1,ENSEMBL:ENSP00000380336,ENSG00000122692,ENST00000397149,smu-1 suppressor of mec-8 and unc-52 homolog (C. elegans) [Source:HGNC Symbol;Acc:18247],SMU1,SMU1,SMU1-001,SMU1,WD40 REPEAT CONTAINING SMU1 SMU 1 SUPPRESSOR OF MEC 8 AND UNC 52 HOMOLOG
  - Sentence: smu-1 encodes an ortholog of human SMU1 (HGNC:SMU1);.
  - Data: WBGene00044079,tag-241,ENSEMBL:ENSP00000287482,ENSG00000156876,ENST00000287482,spindle assembly 6 homolog (C. elegans) [Source:HGNC Symbol;Acc:25403],SASS6,SASS6,SASS6-001,SASS6,SPINDLE ASSEMBLY ABNORMAL 6 HOMOLOG
  - Sentence: tag-241 encodes an ortholog of human SASS6 (HGNC:SASS6);.
Rule 6: Ignore the text string '<numeral>kDa' at the end of terms, together with the comma that precedes it, including within parentheses.
- cleavage and polyadenylation specific factor 4, 30kDa becomes cleavage and polyadenylation specific factor 4
- cleavage stimulation factor, 3' pre-RNA, subunit 1, 50kDa becomes cleavage stimulation factor, 3' pre-RNA, subunit 1
- nucleoporin 153kDa becomes nucleoporin
Rule 7: Grab the HGNC symbol and place it at the end of the sentence, within parentheses, as ', (HGNC symbol:<HGNC symbol>)'
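
Some of these rules are simple string rewrites. A minimal sketch of Rules 1, 2, 5 and 6 only, using regular expressions; the function name and the patterns are assumptions made for illustration, and Rules 3 and 4 are left out because the examples above do not reduce to a single pattern.

```python
import re

# Rule 6: '<numeral>kDa' at the end of the term (or just before a closing
# parenthesis), together with the comma in front of it.
KDA_SUFFIX = re.compile(r",?\s*\d+(?:\.\d+)?\s*kDa(?=\s*(?:\)|$))")


def clean_description(description, hgnc_symbol):
    """Return the cleaned human protein description, or None to skip the homology."""
    text = description.strip()
    # Rule 2: ignore 'Uncharacterized protein' homologies.
    if text.lower() == "uncharacterized protein":
        return None
    # Rule 5: circular '(C. elegans)' descriptions fall back to the HGNC symbol.
    if re.search(r"\(C\.\s*elegans\)", text):
        return hgnc_symbol
    # Rule 1: drop 'family member' and join the words on either side of it.
    text = re.sub(r"\s+family member\s+", "", text)
    # Rule 6: drop the trailing kDa size.
    return KDA_SUFFIX.sub("", text)
```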

How to pick orthologs for the description

We have too many C. elegans genes listed as orthologs for a non-elegans gene; these will be pruned using popularity (number of publications) and gene class name:

If there are more than 5 orthologs in the form of gene names (e.g., abu-6, abu-7, abu-8), use popularity and gene class to prune
Genes that don't have any other members of their class get picked and mentioned first, ordered by popularity; if tied by popularity, order alphabetically
If there is only one gene class, use popularity to pick the top 3 genes, ordered by popularity; if tied, order numerically
If there is more than one gene class, use popularity to pick the top gene in each class, meaning you would name the leading (in popularity) gene class first; if tied, order alphabetically; in both cases
If the orthologs are cosmid names (e.g., C54D10.9), list them all, as gene class cannot be used
If both genes and cosmids are present, use the gene class and popularity rules for the genes and leave the cosmids as is
If neither popularity nor gene class can be used, leave the list as is
If genes are tied by popularity, then stop at 5 genes.

 
Example 1:
WBGene00023726
CBG00317 is an ortholog of C. elegans fbxc-16, fbxc-15, fbxc-18, sdz-4, fbxc-28, fbxc-19 and fbxc-12.

Resolves to:
CBG00317 is an ortholog of C. elegans sdz-4 and members of the fbxc gene class including fbxc-28, fbxc-15 and fbxc-18.


Example 2:
Cbr-myo-2 is an ortholog of C. elegans myo-1, myo-2, myo-3, nmy-1, nmy-2, unc-54, myo-6, hum-9 and myo-5; based on protein domain information, Cbr-myo-2 is predicted to have ATP binding activity and motor activity and is localized to the myosin complex.

Resolves to:
Cbr-myo-2 is an ortholog of C. elegans unc-54, hum-9 and members of the C. elegans myo and nmy gene classes including myo-3 and nmy-2;

Example 3:
WBGene00052523
CRE27108 is an ortholog of C. elegans E03H4.2, ZK1025.3, C13A2.9, C13A2.10, C33G8.13, F07G11.1 and F07G11.2.

CRE27108 is an ortholog of C. elegans E03H4.2, ZK1025.3, C13A2.9, C13A2.10, C33G8.13, F07G11.1 and F07G11.2.

Cannot be resolved, will have to stay as such!

Example 4:
CBG03582 is an ortholog of C. elegans fbxb-9, fbxb-102, fbxb-117, fbxb-76, fbxb-79, fbxb-85, fbxb-1, fbxb-115, fbxb-2, fbxb-86, fbxb-22, sdz-33, fbxb-24, fbxb-101, fbxb-93, fbxb-94, fbxb-13, fbxb-98, fbxb-38, C33E10.1, fbxb-88, fbxb-95, fbxb-97, fbxb-96, fbxb-78, fbxb-80, fbxb-35, fbxb-36, fbxb-108, fbxb-105, sdz-10, sdz-9, fbxb-111, sdz-11, F18A12.7, fbxb-106, fbxb-107, F36H5.8, F36H5.9, fbxb-12, fbxb-71, fbxb-48, fbxb-28, fbxb-29, fbxb-30, fbxb-103, fbxb-25, fbxb-26, fbxb-17, fbxb-18, fbxb-19, fbxb-47, fbxb-72, fbxb-49, fbxb-50, fbxb-51, K05F6.4, fbxb-44, fbxb-52, fbxb-54, K05F6.8, fbxb-46, fbxb-39, fbxb-41, fbxb-40, fbxb-45, sdz-25, fbxb-31, fbxb-32, fbxb-74, fbxb-75, fbxb-10, fbxb-37, fbxb-81, fbxb-82, fbxb-83, fbxb-84, fbxb-34, fbxb-14, fbxb-99, fbxb-77, fbxb-42, fbxb-43, fbxb-15, fbxb-21, fbxb-20, fbxb-16, fbxb-87, fbxb-92, fbxb-33, fbxb-91, fbxb-90, W08F4.13 and F49B2.7;


Resolves to:
CBG03582 is an ortholog of members of the C. elegans fbxb and sdz gene classes including fbxb-x and sdz-x, and C33E10.1, F18A12.7, F36H5.8, F36H5.9, K05F6.4, K05F6.8, W08F4.13 and F49B2.7;
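
A rough sketch of the pruning step, under stated assumptions: gene names look like '<class>-<number>', anything else is treated as a cosmid name, and `popularity` is a hypothetical dict of publication counts per gene (the real counts would come from elsewhere). It returns the picks rather than the finished sentence and is one reading of the rules above, not the project's implementation.

```python
import re
from collections import defaultdict

# Gene names such as 'fbxc-15' or 'hmg-1.2'; cosmid names such as 'C54D10.9' do not match.
GENE_NAME = re.compile(r"^([a-z]+)-(\d+(?:\.\d+)?)$")


def prune_orthologs(names, popularity):
    """Split orthologs into singleton genes, per-class picks and cosmids."""
    genes = [n for n in names if GENE_NAME.match(n)]
    cosmids = [n for n in names if not GENE_NAME.match(n)]
    if len(genes) <= 5:
        # Five or fewer gene names: nothing to prune.
        return {"singletons": genes, "classes": {}, "cosmids": cosmids}

    def rank(gene):
        gene_class, number = GENE_NAME.match(gene).groups()
        # Most popular first; ties broken by class name, then numerically.
        return (-popularity.get(gene, 0), gene_class, float(number))

    by_class = defaultdict(list)
    for gene in genes:
        by_class[GENE_NAME.match(gene).group(1)].append(gene)

    # Genes with no other member of their class are mentioned first.
    singletons = sorted((m[0] for m in by_class.values() if len(m) == 1), key=rank)
    multi = {c: sorted(m, key=rank) for c, m in by_class.items() if len(m) > 1}
    # One multi-member class: top 3 genes; several classes: top gene of each.
    keep = 3 if len(multi) == 1 else 1
    classes = {c: members[:keep] for c, members in multi.items()}
    return {"singletons": singletons, "classes": classes, "cosmids": cosmids}
```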

Explanation of Orthology/homology information in WormBase

Orthology, Homology and Paralog data in WormBase

Process

Template for a process sentence

<Gene> is (required, functions, regulates, is involved in, is part of) <process>;

Source file for Process data

Source 1: gene_association file for C.elegans from the WormBase FTP site:
- ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
- this file is one build behind, compared to release label for the phenotype2go annotations
Source 2 (from WS250): phenotype2go file. These automated annotations are now 'IEAs', but will be treated like IMPs if they have 'WBPhenotype:XXXXXXX' in column 8 (With)
- this file is current with the release label for the phenotype2go annotations
- Source 1 and 2 will have redundant annotations but these will get resolved
- All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
- Need data from these rows:
  - where column 9: has value 'P' (Process),
  - column 2 (DB_Object ID): the associated genes, UniProt ID or WBGene ID,
  - column 3 (DB_Object symbol), eg, wht-7, GO terms are from column 5.
  - column 5: GOID, eg, GO:0000346
  - column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
  - column 7: Evidence code, eg, IMP
  - column 8: With, eg. 'WB:WBRNAi00000785|WBPhenotype:0000050'

Rules for process sentence construction

Rule 1: Ignore all GO terms with the tag 'is_obsolete: true' in the obo file
Rule 2: Ignore all IEA and ISS process terms
Rule 3: For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
Rule 4: For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
- Examples:
- Data: WBGene00001972,hmg-1.2,cell fate specification[IMP],WB_REF:WBPaper00003453|PMID:10197987,,WB,gonad development[IMP],Wnt signaling pathway[IGI],vulval development[IMP]
- Sentence: hmg-1.2 is involved in cell fate specification, gonad development and vulval development, based on mutant phenotypes.
- Data: WBGene00002148,gon-14,muscle contraction[IMP],WB_REF:WBPaper00024213|PMID:15133127,,WB,defecation[IGI],gonad morphogenesis[IMP],multicellular organism growth[IGI],regulation of pharyngeal pumping[IMP]
- Sentence: gon-14 is involved in muscle contraction, gonad morphogenesis and regulation of pharyngeal pumping, based on mutant phenotypes and in defecation and growth.
Rule 5: No exclusions as of 07.07.2014, leave in reproduction:
Rule 6: If a GO term has the words 'involved in' anywhere (beginning, middle or end), use the words 'functions in'
- Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
- Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
Rule 7: For all other Process terms the sentence will be:
- <Gene> is involved in <process term>;
- Examples:
- Data: WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
- Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
Rule 8: For the GO term 'molting cycle, collagen and cuticulin-based cuticle', drop the comma and the words 'collagen and cuticulin-based cuticle'
- Example:
- Data: WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
- Sentence: vps-45 is involved in the molting cycle;
Rule 9: Replacement rule:
- Replace term 'multicellular organism growth' with 'growth'.
- Replace term 'embryo development ending in birth or egg hatching' with 'embryo development'
- Replace term 'synaptic transmission, <word>' with '<word> synaptic transmission'.
  - Example Data: WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
  - Sentence: nra-2 is involved in cholinergic synaptic transmission and locomotion.
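
Rules 8 and 9 are term-level rewrites and can be sketched as a lookup plus one pattern. The function name is hypothetical, and the pattern assumes the word after 'synaptic transmission,' is a single word such as 'cholinergic'.

```python
import re

# Fixed replacements from Rule 8 and Rule 9.
REPLACEMENTS = {
    "multicellular organism growth": "growth",
    "embryo development ending in birth or egg hatching": "embryo development",
    "molting cycle, collagen and cuticulin-based cuticle": "molting cycle",
}


def rewrite_process_term(term):
    """Apply the Rule 8 and Rule 9 rewrites to one GO process term."""
    term = term.strip()
    if term in REPLACEMENTS:
        return REPLACEMENTS[term]
    # 'synaptic transmission, <word>' becomes '<word> synaptic transmission'.
    match = re.fullmatch(r"synaptic transmission, (\w+)", term)
    if match:
        return f"{match.group(1)} synaptic transmission"
    return term
```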

Rule 10: Granularity rule:

If a GO term in the list is a parent of (is_a or part_of parent) another GO term in the list, keep the most granular child term and ignore the parent terms.
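
A minimal sketch of the granularity rule, assuming the is_a and part_of relations from the obo file have already been loaded into a dict that maps each GO ID to the set of its direct parents (the loader is not shown and the names are hypothetical):

```python
def drop_parent_terms(terms, parents):
    """Keep only the most granular terms.

    terms: GO IDs annotated to one gene.
    parents: dict mapping a GO ID to the set of its direct is_a/part_of parents.
    """
    def ancestors(term):
        # Walk upwards through all parents, direct and indirect.
        seen, stack = set(), list(parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(parents.get(parent, ()))
        return seen

    terms = list(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term)
    # A term is dropped if it is an ancestor of any other term in the list.
    return [term for term in terms if term not in redundant]
```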

Adding elegans process information for non-C.elegans nematodes

From WS249: Write process sentences based on experimental process info from elegans, only for those non-elegans genes that have no GO sentences

Use only non-IEA, non-ISS process terms from elegans
If only one ortholog exists, use all non-IEA, non-ISS, elegans process terms
If more than one ortholog exists, use only the common non-IEA, non-ISS, elegans process terms from the orthologs
Sentence construction:

Cbr-ajm-1 is an ortholog of C. elegans ajm-1 which is involved in cell-cell junction organization and embryo development ending in birth or egg hatching.
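
The 'one ortholog versus several' rule is a set intersection over the non-IEA, non-ISS terms. A sketch, with a hypothetical function name and input shape (a dict from each elegans ortholog to its list of (process term, evidence code) pairs):

```python
EXCLUDED_CODES = {"IEA", "ISS"}


def inherited_process_terms(ortholog_annotations):
    """Return the elegans process terms to reuse for a non-elegans gene."""
    per_ortholog = [
        {term for term, code in annotations if code not in EXCLUDED_CODES}
        for annotations in ortholog_annotations.values()
    ]
    if not per_ortholog:
        return []
    # One ortholog: all of its terms; several: only the terms they share.
    return sorted(set.intersection(*per_ortholog))
```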

Molecular function/identity

Source file for molecular function data

- gene_association file for C.elegans from the WormBase FTP site:
- ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
- All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
- Need data from these rows:
  - where column 9 has value 'F' (Molecular Function)
  - column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
  - column 3: DB_Object symbol, eg, wht-7,
  - column 5: GOID, eg, GO:0000346
  - column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
  - column 7: Evidence code, eg, IMP
  - column 8: 'With (or) From' eg., INTERPRO:IPR002293,
  - column 15: Assigned By, eg., WB (which database created the annotation)

Rules for molecular function sentence construction

Rule 1: Ignore all GO terms with the tag 'is_obsolete: true'
Rule 2: Exclusion list:
- Ignore the term 'protein binding'
- Ignore the term 'binding'
Rule 3: For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.
Rule 4: Order the experimental GO terms first in the sentence followed by ISS and IEA terms.
Rule 5: If the evidence code is 'IEA' and an 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
- Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
Examples:
- WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
- alh-2 is predicted to have oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, based on protein domain information.
Rule 6: If Evidence code is 'ISS' add the words 'based on sequence information' to the sentence
- WBGene00001951,hlh-4,sequence-specific DNA binding transcription factor activity[ISS],WB_REF:WBPaper00034761|PMID:19632181,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
- Sentence: hlh-4 is predicted to have sequence-specific DNA binding transcription factor activity based on sequence information, protein dimerization activity and DNA binding activity, based on protein domain information.

Rule 7: If a binding term is present add the word 'activity' to it.

Rule 8: If the evidence code is IDA, IMP, IGI, IPI, or IEP, add the word 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
- IDA example:
- WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
- Sentence: hlh-6 exhibits RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity and is predicted to have protein dimerization activity and DNA binding activity.
- IMP example:
- WBGene00009583,aagr-3,alpha-glucosidase activity[IMP],WB_REF:WBPaper00036069|PMID:20349118,,WB,hydrolase activity, hydrolyzing O-glycosyl compounds[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000322,catalytic activity[IEA],INTERPRO:IPR011013,carbohydrate binding[IEA]
- Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.

Rule 9: For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a structural constituent of <word>' in the sentence.
Example:
- WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
- Sentence: mrpl-36 is a structural constituent of ribosome, based on protein domain information.
- WBGene00010783,mrpl-36,structural constituent of ribosome[ISS]
- Sentence: mrpl-36 is a structural constituent of ribosome, based on sequence information.
- WBGene00010783,mrpl-36,structural constituent of ribosome[IMP] or [IDA]
- Sentence: mrpl-36 is a structural constituent of ribosome.

Rule 10: Replacement
- For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')

Rule 11: If a GO term has a comma in it, then add a comma after this GO term and the next 'and'.
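
A partial sketch of how Rules 2, 3, 4, 5, 7, 8 and 11 might combine for experimental and IEA terms. Several choices here are assumptions: ISS terms are not handled, Rule 11 is read as 'put a comma before the final and when any term contains a comma', and 'based on experimental evidence' is omitted to match the IDA and IMP example sentences. The function names are hypothetical.

```python
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
EXCLUDED_TERMS = {"protein binding", "binding"}  # Rule 2


def join_terms(terms):
    """Join terms; use ', and' when any term itself contains a comma (Rule 11)."""
    if len(terms) == 1:
        return terms[0]
    last_sep = ", and " if any("," in t for t in terms) else " and "
    return ", ".join(terms[:-1]) + last_sep + terms[-1]


def function_sentence(gene, annotations):
    """annotations: list of (GO term, evidence code) for one gene."""
    experimental, predicted = [], []
    for term, code in annotations:
        if term in EXCLUDED_TERMS:
            continue
        if term.endswith("binding"):
            term += " activity"  # Rule 7
        if code in EXPERIMENTAL and term not in experimental:
            experimental.append(term)
        elif code == "IEA" and term not in predicted:
            predicted.append(term)
    # Rule 3: experimental evidence wins over computational for the same term.
    predicted = [t for t in predicted if t not in experimental]
    parts = []
    if experimental:  # Rule 4: experimental terms come first
        parts.append("exhibits " + join_terms(experimental))
    if predicted:
        parts.append("is predicted to have " + join_terms(predicted)
                     + ", based on protein domain information")
    if not parts:
        return ""
    return gene + " " + " and ".join(parts) + "."
```

Run on the aagr-3 annotations above, this sketch reproduces the aagr-3 example sentence.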

Sub-cellular localization

Source file for Sub-cellular localization data

GO data
Source 1: gene_association file for C. elegans at: http://www.geneontology.org/GO.current.annotations.shtml?all
- Need data from these rows:
  - where column 9 has value 'C' (Cellular Component)
  - column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
  - column 3: DB_Object symbol, eg, wht-7,
  - column 5: GOID, eg, GO:0000346
  - column 7: Evidence code, eg, IDA

Rules for sub-cellular localization sentence construction

Rule 1: Ignore GO terms with the tag 'is_obsolete: true' in the obo file
Rule 2: Ignore all IEA and ISS GO terms, use only non-IEA, non-ISS GO terms
Rule 4: Ignore IBA and IBD GO terms (PAINT annotations)
Rule 5: For identical GO terms with both experimental (EXP, IDA, IPI, IMP, IGI, IEP) and computational evidence codes (ISS, IEA), keep only the non-redundant GO terms from experimental evidence and ignore the computational evidence codes.

Rule 6: For 'integral component of ....' terms add the words 'is an';
- Eg. for Rule 2: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
- sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
Examples
- WBGene00023405,sor-1,nucleoplasm[IDA],WB_REF:WBPaper00027128|PMID:16501168,,WB,nuclear speck[IDA]
- Sentence: sor-1 is localized to the nucleoplasm and nuclear speck;

- WBGene00004681,rsd-2,nucleolus[IDA],WB_REF:WBPaper00044261|PMID:18430922,,WB,endoplasmic reticulum[IDA],cytosol[IDA]
- Sentence: rsd-2 is localized to the nucleolus, endoplasmic reticulum and cytosol;

- WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
- Sentence: dnc-6 is localized to the dynactin complex;

Rule 7: For the GO term 'intracellular [IEA]' the structure of the sentence will be different; use 'is intracellular'.
- Eg.1 WBGene00089742
- PPA00188 is an ortholog of C. elegans T26A5.8; based on protein domain information, PPA00188 is predicted to have sequence-specific DNA binding activity and protein heterodimerization activity and is intracellular.
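
A sketch of Rules 2, 4, 5 and 6 for localization sentences. The function names are hypothetical, Rule 7 ('is intracellular') is not covered, and the sketch always writes 'localized to the', which matches the sor-1, rsd-2 and dnc-6 examples but not the sup-10 sentence.

```python
EXCLUDED_CODES = {"IEA", "ISS", "IBA", "IBD"}  # Rules 2 and 4
INTEGRAL = "integral component of"


def join_terms(terms):
    if len(terms) == 1:
        return terms[0]
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def localization_sentence(gene, annotations):
    """annotations: list of (GO component term, evidence code) for one gene."""
    terms = []
    for term, code in annotations:
        # Skip excluded evidence codes and repeated terms.
        if code not in EXCLUDED_CODES and term not in terms:
            terms.append(term)
    integral = [t for t in terms if t.startswith(INTEGRAL)]
    other = [t for t in terms if not t.startswith(INTEGRAL)]
    parts = []
    if integral:  # Rule 6: 'is an integral component of ...'
        parts.append("is an " + join_terms(integral))
    if other:
        parts.append("is localized to the " + join_terms(other))
    if not parts:
        return ""
    return gene + " " + " and ".join(parts) + ";"
```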

Order of sentences

Orthology
Process
Function/identity
Component

Tissue expression

Source files for Tissue expression data

Source 1: Expression data
OA (exprpat), PG table names:
- for all these tables, where the exp_endogenous table value is 'endogenous', grab the exp_gene, WBGeneXXXXXXXX
- exp_name, values look like Expr1005.
- exp_anatomy for anatomy terms, in the form of WBbt:0003679; translate these IDs using the anatomy ontology file
- anatomy ontology file from the ftp site: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
- exp_paper for paper
- exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'.
Contact Person: Daniela

Rules for tissue expression sentence construction

Rule 1: Use only the data that have the qualifiers 'Certain' and 'Partial' and ignore all data that have 'Uncertain'; treat 'Certain' and 'Partial' equally. Also use all expression data with no qualifiers, since qualifiers were added only recently.

Rule 2: Pick an anatomy term only once

Sentence: <Gene> is expressed in the <anatomy term1, anatomy term2 and anatomy term3>;

Examples:
Data for alh-10:
WBGene00000116 alh-10 Expr5583 nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525 Endogenous
Sentence: alh-10 is expressed in the nervous system, intestine and tail neuron;

Data for asp-5:
WBGene00000218 asp-5 Expr5817 intestine WBPaper00031006,WBPaper00006525 Endogenous
WBGene00000218 asp-5 Expr4352 intestine WBPaper00028802 Endogenous
Sentence: asp-5 is expressed in the intestine;

Data for ccr-4:
WBGene00000376 ccr-4 Expr4479 pharynx WBPaper00027076 Endogenous
WBGene00000376 ccr-4 Expr11132 male,hermaphrodite,somatic cell,germ line *WBPaper00043886 Endogenous
WBGene00000376 ccr-4 Expr4479 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle WBPaper00027076 Endogenous
WBGene00000376 ccr-4 Expr7174 pharynx,hypodermis,seam cell *WBPaper00031006,WBPaper00006525 Endogenous
WBGene00000376 ccr-4 Expr4480 pharynx,body wall musculature,head neuron,tail neuron *WBPaper00027076 Endogenous
WBGene00000376 ccr-4 Expr4480 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
Sentence: ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
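
Rules 1 and 2 can be sketched as a filter followed by an order-preserving de-duplication. The record shape (a list of anatomy terms plus an optional qualifier per Expr row) and the function name are assumptions for illustration.

```python
def expressed_terms(records):
    """records: list of (anatomy_terms, qualifier); qualifier may be None."""
    picked = []
    for anatomy_terms, qualifier in records:
        # Rule 1: drop 'uncertain'; keep 'certain', 'partial' and unqualified data.
        if qualifier is not None and qualifier.lower() == "uncertain":
            continue
        for term in anatomy_terms:
            # Rule 2: pick each anatomy term only once, in first-seen order.
            if term not in picked:
                picked.append(term)
    return picked
```

For the two asp-5 rows above this returns a single 'intestine'.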

Rule 3: Replacement Rule
Replacement 1: If the anatomy term 'cell' occurs by itself, then use the words 'widely expressed' instead of 'cell'.
- sentence: col-178 is expressed in the Cell;
- Becomes: col-178 is expressed widely.

Replacement 2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
- Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
- Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.

Replacement 3: If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
- Sentence: ceh-82 is expressed in the neuron;
- Becomes: ceh-82 is expressed in the nervous system;

Replacement 4: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, except for the exceptions:
- Exceptions:
- I3 neuron
- I4 neuron
- I5 neuron
- I6 neuron
- M1 neuron
- M4 neuron
- M5 neuron
- MI neuron
- Sentence: nhr-194 is expressed in the amphid neuron, ciliated neuron, head neuron, and the sensory neuron;
- Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;
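
The four replacements can be sketched together. This is an illustration with hypothetical names: it always ends the sentence with ';' except for the 'expressed widely' case, and it does not add a comma or 'the' before the last term.

```python
# Replacement 4 exceptions: these neuron names are not pluralized.
NEURON_EXCEPTIONS = {"I3 neuron", "I4 neuron", "I5 neuron", "I6 neuron",
                     "M1 neuron", "M4 neuron", "M5 neuron", "MI neuron"}


def join_terms(terms):
    if len(terms) == 1:
        return terms[0]
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def expression_sentence(gene, anatomy_terms):
    has_cell = any(t.lower() == "cell" for t in anatomy_terms)
    rest = []
    for term in anatomy_terms:
        if term.lower() == "cell":
            continue
        if term == "neuron":
            term = "nervous system"  # Replacement 3
        elif term.endswith("neuron") and term not in NEURON_EXCEPTIONS:
            term += "s"              # Replacement 4
        if term not in rest:
            rest.append(term)
    if has_cell and not rest:
        return f"{gene} is expressed widely."  # Replacement 1
    if not rest:
        return ""
    # Replacement 2 when 'cell' occurs alongside other terms.
    lead = ("is expressed in several tissues including the " if has_cell
            else "is expressed in the ")
    return f"{gene} {lead}{join_terms(rest)};"
```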

Rule 5: Granularity Rule: When a string of cell names is present, check whether any parent terms are present in any ExprXXX for that gene, and keep the highest common parent term available.
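
One possible reading of the granularity rule is sketched below: if a term's parent (or a more distant ancestor) is also among the gene's anatomy terms, the child is dropped and the parent is kept. The parent map is a toy example, not the WormBase anatomy ontology, and the single-parent assumption is a simplification; the function names are hypothetical.

```python
# Toy child -> parent map, assumed for illustration only.
TOY_PARENTS = {
    "hyp1": "hypodermis",
    "hyp2": "hypodermis",
    "hyp3": "hypodermis",
    "pharyngeal neuron": "pharynx",
}


def ancestors(term, parent_of):
    """Return the chain of ancestors of a term, nearest first."""
    chain = []
    while term in parent_of:
        term = parent_of[term]
        if term in chain:  # guard against cycles in a malformed map
            break
        chain.append(term)
    return chain


def collapse_to_parents(terms, parent_of):
    """Drop every term that has an ancestor in the same term list."""
    present = set(terms)
    kept = []
    for term in terms:
        if any(a in present for a in ancestors(term, parent_of)):
            continue  # a parent term is present, so the child is redundant
        if term not in kept:
            kept.append(term)
    return kept


terms = ["pharynx", "hyp1", "hyp2", "hyp3", "pharyngeal neuron", "hypodermis"]
print(collapse_to_parents(terms, TOY_PARENTS))
```

With this reading, cell names are only collapsed when the parent term was itself annotated for the gene; no parent is introduced that does not appear in an ExprXXX record.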

Preliminary results

These descriptions are based on Homology predictions and the GO annotations for Process, Component and Function:

 

*alh-2
alh-2 encodes an ortholog of human dehydrogenase aldehyde dehydrogenase family 1 member dehydrogenase; alh-2 is predicted to have oxidoreductase activity and oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor based on protein domain information.

*asp-5
asp-5 encodes an ortholog of human cathepsin d; asp-5 is involved in cell death and locomotion; asp-5 is predicted to have aspartic-type endopeptidase activity based on protein domain information.

*cng-1
cng-1 encodes an ortholog of human cyclic nucleotide gated channel alpha 3; cng-1 is predicted to have ion channel activity based on protein domain information; cng-1 is localized to the neuronal cell body.

Mapping of automated concise description data to OA fields

Mapping of data to data fields in the OA
OA field number	OA field name	Data to be inserted	Example of data to be inserted	Required or Not	OA table name
1	WBGene	WBGene	WBGene00000376	Required	con_wbgene
2	Species	Species	Onchocerca volvulus	Required	con_species
3	Curator	Name of Curator	James Done (first, then replace with Ranjana Kishore) (insert for all rows)	Required	con_curator
4	Curator History	Name of Curator	same as pgid (insert for all rows)	Required	con_curhistory
5	Description Type	Automated_concise_description (insert for all rows)	Automated_concise_description	Required	con_desctype
6	Description Text	the automated concise description	asp-19 encodes an ortholog...	Required	con_desctext
7	Reference	WBPaper	WBPaper00026979	Required	con_paper
8	Accession Evidence	For Homology: for elegans, use ENSEMBL Gene ID; for non-elegans species, use WBGeneID. For Process and Function, use InterPro ID	For elegans: ENSEMBL:ENSG00000103257 (previously used the ENSEMBL protein IDs) and INTERPRO:IPR002293. For non-elegans species: WBGene00007443 and INTERPRO:IPR002293 (comma-separate multiple values)	Not required	con_accession
9	Last Updated	Date when the descriptions were last generated	2014-09-11	Required	con_lastupdate
10	pgid	pgid	1149 (Postgres will generate)	Required

Tab-delimited file for OA insert

One tab-delimited file per species
Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description, Species, Inferred_automatically text
Format: tab-delimited file, comma separate the values when multiple values are present
Date is the last date that the script was run to generate the automated descriptions (eg. 2014-05-28)
File will be placed on textpresso-dev to be picked up by a cron job by JC
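
As a sketch of the file layout described above, the following Python builds one tab-delimited row in the stated column order, joining multiple values with commas. The function name and the sample record are illustrative assumptions (the sample simply reuses identifiers that appear elsewhere on this page); the real generation script may differ.

```python
# Column order as listed above.
COLUMNS = [
    "WBGene", "Date", "Paper", "Accession_evidence",
    "Automated_concise_description", "Species", "Inferred_automatically",
]


def format_row(record):
    """Return one tab-delimited line; list values are comma-separated."""
    cells = []
    for column in COLUMNS:
        value = record.get(column, "")
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        if "\t" in value or "\n" in value:
            raise ValueError(f"{column} contains a tab or newline")
        cells.append(value)
    return "\t".join(cells)


# Illustrative record only; values are assembled from examples on this page.
example = {
    "WBGene": "WBGene00000376",
    "Date": "2014-05-28",
    "Paper": "WBPaper00026979",
    "Accession_evidence": ["ENSEMBL:ENSG00000103257", "INTERPRO:IPR002293"],
    "Automated_concise_description": "asp-19 encodes an ortholog...",
    "Species": "Caenorhabditis elegans",
    "Inferred_automatically": "This description was automatically generated...",
}
print(format_row(example))
```

Rejecting tabs and newlines inside a value keeps every row at exactly seven fields, which is what a downstream parser splitting on tabs would expect.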

Directory structure for project

http://textpresso-dev.caltech.edu/concise_descriptions/ Top level parent directory for project
http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt Indicates what release the file corresponds to
http://textpresso-dev.caltech.edu/concise_descriptions/species.txt Indicates the different species we are producing description files for
http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt WS247 elegans file for import into OA

Inserting automated descriptions into postgres

Populating script

Run the script to populate from here: /home/acedb/ranjana/concise_testing/populate_automated_concise_descriptions.pl

Use the command 'screen' after ssh-ing into Tazendra, to keep the session alive while the script runs

Script actually at: /home/postgres/work/pgpopulation/concise_description/20140909_automated_concise/populate_automated_concise_descriptions.pl

Script looks at http://textpresso-dev.caltech.edu/concise_descriptions/production_release.txt for release number and

http://textpresso-dev.caltech.edu/concise_descriptions/species.txt for the different species

Script gets data from the following URL for each of the species: For elegans: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_elegans/descriptions/OA_concise_descriptions.WS247.txt

For briggsae: http://textpresso-dev.caltech.edu/concise_descriptions/release/WS247/c_briggsae/descriptions/OA_concise_descriptions.WS247.txt

and so on.
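
The URL pattern above can be sketched as follows. This assumes, without confirmation from the script itself, that production_release.txt contains just the release name (e.g. WS247) and that species.txt lists one species directory name per line (e.g. c_elegans); the function names are hypothetical.

```python
BASE_URL = "http://textpresso-dev.caltech.edu/concise_descriptions"


def description_url(release, species_dir):
    """Build the per-species OA file URL following the pattern shown above."""
    return (
        f"{BASE_URL}/release/{release}/{species_dir}"
        f"/descriptions/OA_concise_descriptions.{release}.txt"
    )


def urls_for_release(release_text, species_text):
    """Turn the contents of the two control files into a list of URLs.

    Assumed formats: release_text holds one release name; species_text
    holds one species directory name per line.
    """
    release = release_text.strip()
    species_dirs = [line.strip() for line in species_text.splitlines() if line.strip()]
    return [description_url(release, s) for s in species_dirs]


for url in urls_for_release("WS247\n", "c_elegans\nc_briggsae\n"):
    print(url)
```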

When populating, for each OA row that has con_desctype set to 'Automated_concise_description', the script deletes that row's data from these tables:

con_wbgene
con_species
con_curator
con_curhistory
con_desctext
con_paper
con_accession
con_inferredauto
con_lastupdate

This means that the 'Automated_concise_description' value in con_desctype is kept, so that future runs of the script reuse the existing pgids. When the pre-existing pgids run out, the script creates a new pgid and assigns 'Automated_concise_description' to its con_desctype.
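
The pgid reuse described above can be illustrated with a small, database-free sketch. It is not the populate script: the function name is hypothetical, and how a fresh pgid is actually chosen in Postgres is not stated here, so `next_pgid` is simply passed in as an assumption.

```python
def assign_pgids(existing_pgids, new_rows, next_pgid):
    """Pair each new row with a pgid, reusing existing pgids first.

    Returns (pgid, row, created) tuples. 'created' is True when a new pgid
    had to be made, in which case con_desctype must also be set for it.
    """
    reusable = sorted(existing_pgids)
    assignments = []
    for index, row in enumerate(new_rows):
        if index < len(reusable):
            # Reuse a pgid whose con_desctype is already 'Automated_concise_description'.
            assignments.append((reusable[index], row, False))
        else:
            # Out of pre-existing pgids: allocate a new one.
            assignments.append((next_pgid, row, True))
            next_pgid += 1
    return assignments


for pgid, row, created in assign_pgids([1149, 1150], ["gene A", "gene B", "gene C"], 2000):
    print(pgid, row, "new" if created else "reused")
```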

For testing on Mangolassi

Both the populating and dumping scripts are at: /home/acedb/ranjana/concise_testing

For the populating script, always redirect output to a file: populate_automated_concise_descriptions.pg.<date>

Dumping the automated, concise and provisional descriptions

Dumping script

Run the dumping script manually, on tazendra: /home/acedb/kimberly/citace_upload/concise/wrapper.pl
The file, concise_dump_new.ace, can be scp-ed for testing from: scp acedb@tazendra.caltech.edu:/home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
Concise descriptions dumper at /home/postgres/work/citace_upload/concise/dump_concise.pl
The dump is written to: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace

Script that finds genes with concise descrips that also have an automated description

On Tazendra and Mangolassi: /home/acedb/ranjana/concise_testing/find_concise_with_automated.pl

Script that compares citace test file and Postgres for automated descriptions

Location: /home/acedb/ranjana/concise_testing/compare_concise_postgres_vs_acefile.pl

Needs a 'citace_genes_with_automated.ace' file, which is the .ace export of all genes in citace with the 'Automated_description' tag (I use Query builder to do this query and then export the 'Names' to a file 'citace_genes_with_automated.ace').
Compares the genes with the automated tag in citace (from testing concise_dump_new.ace in an empty citace database on Maya) to the Postgres data. Also reports whether a gene has a concise description (which would not be dumped, so it is in Postgres but not in the .ace file).
For the WS247 upload, 1 gene not accounted for, WBGene00020108, which has been merged into a dead gene?

Discontinued from WS246 upload

concise cronjob is on the acedb account : 0 2 * * thu /home/acedb/kimberly/citace_upload/concise/wrapper.pl (turned off for now)
It creates a file at: /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace
which you can see on the web at: http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace
Then on spica, login, go to the Data_for_citace/Data_from_Kimberly directory, remove the existing file, and upload the latest file using: wget http://tazendra.caltech.edu/~postgres/cgi-bin/data/concise_dump_new.ace

Text for Automatically_inferred tag

Tag looks like: Evidence Automatically_inferred ?Text

Text will be: This description was automatically generated by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations from the <WS246> version of WormBase.

Rules for dumping the different types of descriptions in the OA

.ace format

List of tags to be dumped:

Automated_description
Paper_evidence
Accession_evidence
Date_last_updated
Inferred_automatically

Gene : "WBGene00009585"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Date_last_updated	"2012-07-24"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Paper_evidence	"WBPaper000045688"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Paper_evidence	"WBPaper000045689"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Accession_evidence "ENSEMBL" "ENSP00000419081"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Accession_evidence "INTERPRO" "IPR002048"
Automated_description	"cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4); cal-7 is predicted to have calcium ion binding activity, based on protein domain information."	Inferred_automatically "This description was generated automatically by Textpresso scripts based on homology/orthology data and Gene Ontology (GO) annotations, from the WS243 version of WormBase."
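
A minimal sketch of producing an entry in the .ace layout shown above, one line per tag with the description text repeated on each line. The function name is hypothetical, and escaping embedded double quotes with a backslash is an assumption about .ace conventions rather than something stated on this page; the real dumper is dump_concise.pl.

```python
def ace_entry(gene_id, description, date, papers, accessions, inferred_text):
    """Return the .ace lines for one gene's automated description."""

    def quote(text):
        # Assumption: embedded double quotes are backslash-escaped in .ace files.
        return '"' + text.replace('"', '\\"') + '"'

    prefix = f"Automated_description\t{quote(description)}\t"
    lines = [f"Gene : {quote(gene_id)}"]
    lines.append(prefix + f"Date_last_updated\t{quote(date)}")
    for paper in papers:
        lines.append(prefix + f"Paper_evidence\t{quote(paper)}")
    for database, accession in accessions:
        lines.append(prefix + f"Accession_evidence\t{quote(database)} {quote(accession)}")
    lines.append(prefix + f"Inferred_automatically\t{quote(inferred_text)}")
    return "\n".join(lines)


print(ace_entry(
    "WBGene00009585",
    "cal-7 encodes an ortholog of human calmodulin-like 4 (HGNC:CALML4).",
    "2012-07-24",
    ["WBPaper00026979"],  # illustrative paper ID taken from the OA mapping table
    [("INTERPRO", "IPR002048")],
    "This description was generated automatically by Textpresso scripts.",
))
```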

Numbers from citace testing

concise_dump_new.ace was scp-ed from /home/postgres/public_html/cgi-bin/data/concise_dump_new.ace and tested.
Tested concise_dump_new.ace on an empty citaceminus mirror on a local machine with the WS246 models file saved; it read in fine.
WS246 numbers: Total genes: 9,529 genes, Genes with automated descriptions: 3,362; 85,039 lines

Reporting numbers

--Currently the automated descriptions are generated for genes without a concise description

--Generate a report for numbers and place on Textpresso-dev, http://textpresso-dev.caltech.edu/concise_descriptions/

Report for WS246 upload, Sept/Oct, 2014:
Total number of automated descriptions = 3,364
Number of automated descriptions with homology = 2,353
Number of automated descriptions with process information = 1,206
Number of automated descriptions with function information = 2,183
Number of automated descriptions with component information = 244

Publications related to Text-mining methods

Automatically generating gene summaries from biomedical literature.

Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.

Pac Symp Biocomput. 2006:40-51.

PMID:17094226

Generating gene summaries from biomedical literature: A study of semi-structured summarization

Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz

Information Processing and Management 43 (2007) 1777–1791

Project milestones by release

WS246
WS247
WS248
WS249
WS250
WS251

Automated descriptions for C. briggsae

Issues to address

1. For the non-elegans species, need to use the WBGeneID as Accession_evidence, so can we use our own database, and do: Accession_evidence "WormBase" "WBGene00005678" (not done)
2. For non-elegans species, need to add in elegans gene experimental data, but with multiple orthologs, which do we pick? (done, using only the common GO terms for orthologs)
3. Gene names should be pulled and mapped to the Gene IDs only from the gene names file, not from the source files of orthology or GO (will be addressed for WS250)
4. For WS250, check that the gene names are (somewhat) current; corrected mel-47 to tofu-6 by hand for WS249.

Automated descriptions software

Documentation for workflow and scripts
