Textpresso-based automated extraction of concise descriptions

From WormBaseWiki

Revision as of 21:09, 23 July 2014

Querying for gene sets

Set of genes with a concise description

Query Postgres for all genes with a concise description. Relevant Postgres table names:

  • con_wbgene: Stores the WBGene ID and gene names
  • con_desctype: Type of description (relevant for us: Concise_description)
  • con_desctext: Text of the concise description

Query for all WBGenes that have a concise description (in con_desctext AND con_desctype):

SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL) AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL) ORDER BY con_wbgene;

  • Number of genes with a concise description (as of 05.07.2014)=6,624

Set of genes with no concise description

Set of genes with no concise description and at least one published paper

Location of project-related files on Textpresso

http://textpresso-dev.caltech.edu/concise_descriptions/


Semantic categories in a Concise Description

1. Molecular identity
2. Orthology/Similarity
3. Mutant Phenotypes
4. Processes
5. Pathways
6. Genetic Interaction
7. Physical Interaction
8. Gene regulation data
9. Molecular Function
10. Tissue expression (may include life-stage)
11. Sub-cellular localization (may include life-stage)

Sources for mining Homology data

1. Orthology, Homology and Paralog data

  • Ace tags: ?Gene Ortholog_other, Paralog
  • Contact: Michael Paulini

From Michael Paulini:

 
a.) orthology
It is in the ACeDB database on the genes as ortholog/paralog/ortholog_other, but for a while now we have also dumped it with each build (as an example, here: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS243.orthologs.txt.gz).

b.) homology
1. protein homology)
The blastx data is in the GFF files, as well as here: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/acedb/Non_C_elegans_BLAST
For C. elegans the patch file is also loaded during the build, so you can find them as regular Homology_data on the respective parent sequences in ACeDB.

The blastp data is stored as Homology_data on the proteins, and is also partially dumped into this file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS243.best_blastp_hits.txt.gz

We also have protein clusters (mostly eggNOG-based ones), which show homology and shared function and are connected to the member proteins in ACeDB through the Homology_group tag.

2. nucleotide homology)
Mostly based on blat, but switched to star with the current release.
You can find them in the respective GFF files and also, similar to the blastx data, as homology data on the parent sequences in ACeDB.

We also have RNASeq, which currently lives as RNASeq features in the GFF and ACeDB, but also as expression level data in the Gene/Transcript/CDS.

And last, but not least, we have pairwise whole genome alignments for selected species, which we currently only show on EnsEMBL Genomes, but you can use the generic Compara API to pull the alignments from there.

As orthology + homology covers such a huge swath of very different data in WormBase, there is no unifying format, except ACeDB and to a certain extent GFF.

Sources for mining Process data

  • Caltech data
  • GO data
  • Source 1: Download the gene_association file for C.elegans from the WormBase FTP site:
    • ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
    • All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
    • Need data from these rows:
      • where column 9: has value 'P' (Process),
      • column 2 (DB_Object ID): the associated genes, UniProt ID or WBGene ID,
      • column 3 (DB_Object symbol), eg, wht-7, GO terms are from column 5.
      • column 5: GOID, eg, GO:0000346
      • column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
      • column 7: Evidence code, eg, IMP
      • column 8: With, eg. 'WB:WBRNAi00000785|WBPhenotype:0000050'
  • GO data in postgres
    • Paper -- gop_paper
    • WBGene -- gop_wbgene
    • GO -- gop_goontology
    • GO Term -- gop_goid
  • Contact Person: Kimberly, Ranjana
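The column selection described above for the gene_association file can be sketched in Python as follows. This is a minimal sketch, assuming the file is tab-delimited with header lines that start with '!'. The function name select_rows, the dictionary keys and the placeholder gene ID are illustrative and not part of the project. The sample values reuse the per-column examples listed above and do not represent a real annotation.

```python
import csv
import io

def select_rows(handle, aspect="P"):
    """Yield the columns of interest from rows whose column 9 equals `aspect`."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for cols in reader:
        # Skip blank lines, '!' header lines and rows too short to hold column 15.
        if not cols or cols[0].startswith("!") or len(cols) < 15:
            continue
        if cols[8] != aspect:  # column 9: 'P', 'F' or 'C'
            continue
        yield {
            "gene_id": cols[1],       # column 2: UniProt ID or WBGene ID
            "symbol": cols[2],        # column 3
            "go_id": cols[4],         # column 5
            "reference": cols[5],     # column 6
            "evidence": cols[6],      # column 7
            "with": cols[7],          # column 8
            "assigned_by": cols[14],  # column 15
        }

# Two made-up rows with 15 columns each; the gene ID is a placeholder.
_row_p = ["WB", "WBGene00000000", "wht-7", "", "GO:0000346", "PMID:12062106",
          "IMP", "", "P", "", "", "gene", "taxon:6239", "20140101", "WB"]
_row_f = list(_row_p)
_row_f[8] = "F"
sample = "!gaf-version: 2.0\n" + "\t".join(_row_p) + "\n" + "\t".join(_row_f) + "\n"

process_rows = list(select_rows(io.StringIO(sample)))
print(process_rows[0]["symbol"], process_rows[0]["go_id"])
```

Passing aspect="F" or aspect="C" to the same function would select Molecular Function or Cellular Component rows instead.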

Not included:

  • Source 2: Topic data
    • OA field: Gene, PG table name:pro_wbgene
    • OA field:WBPaper, PG table name:pro_paper
  • Contact Person: Karen

Sources for mining Molecular function/identity data

  • Caltech and non-caltech data?
  • Source 1: GO data
    • Download the gene_association file for C.elegans from the WormBase FTP site:
    • ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS243/ONTOLOGY/gene_association.WS243.wb.c_elegans
    • All rows with column 15 (assigned by) with 'WB' are WormBase annotations, those with 'UniProtKB' or 'InterPro' are from those databases
    • Need data from these rows:
      • where column 9 has value 'F' (Molecular Function)
      • column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
      • column 3: DB_Object symbol, eg, wht-7,
      • column 5: GOID, eg, GO:0000346
      • column 6: DB:Reference (Reference), eg.PMID:12062106, GO_REF:0000002
      • column 7: Evidence code, eg, IMP
      • column 8: 'With (or) From' eg., INTERPRO:IPR002293,
      • column 15: Assigned By, eg., WB (which database created the annotation)

Sources for mining Sub-cellular localization data

  • GO data
  • Source 1: gene_association file for C. elegans at:http://www.geneontology.org/GO.current.annotations.shtml?all
    • Need data from these rows:
      • where column 9 has value 'C' (Cellular Component)
      • column 2, associated genes, has UniProt ID or WormBase GeneID, need to translate UniProtID to WormBase ID.
      • column 3: DB_Object symbol, eg, wht-7,
      • column 5: GOID, eg, GO:0000346
      • column 7: Evidence code, eg, IDA

Sources for Tissue expression data

  • Source 1: Expression data
  • OA (exprpat), PG table names:
    • For rows where the exp_endogenous table has the value 'endogenous', grab the gene from exp_gene (WBGeneXXXXXXXX)
    • exp_name, values look like Expr1005.
    • exp_anatomy for anatomy terms, in the form of WBbt:0003679; translate these IDs using the anatomy ontology file
    • anatomy ontology file from the FTP site: ftp://ftp.sanger.ac.uk/pub/wormbase/releases/WS244/ONTOLOGY/anatomy_ontology.WS244.obo
    • exp_paper for paper
    • exp_qualifier for the qualifiers 'certain', 'uncertain' and 'partial'.
  • Contact Person: Daniela
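Translating anatomy IDs into names with the ontology file could look like the sketch below. It assumes only the basic OBO layout ([Term] stanzas containing 'id:' and 'name:' lines). The function name parse_obo_names and the sample IDs and names are made up for illustration.

```python
def parse_obo_names(lines):
    """Map term IDs to names using the [Term] stanzas of an OBO file."""
    names = {}
    term_id = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas describe anatomy terms.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[4:]
        elif in_term and line.startswith("name: ") and term_id:
            names[term_id] = line[6:]
    return names

# Made-up OBO fragment; real input would be the lines of the anatomy ontology file.
sample_obo = """format-version: 1.2

[Term]
id: WBbt:0000001
name: example tissue

[Typedef]
id: part_of
name: part of
""".splitlines()

id_to_name = parse_obo_names(sample_obo)
# Unknown IDs are passed through unchanged so that they remain visible.
translated = [id_to_name.get(t, t) for t in ["WBbt:0000001", "WBbt:9999999"]]
print(translated)
```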

Data types not included

Pathway (No database source for now?)

Mutant Phenotype

  • Caltech data
  • Source 1: Phenotype OA, PG table name:(Phenotypes are added to variation and not genes)
  • Source 2: Acedb tag: Under ?Gene, Reference_allele ?Variation and Allele ?Variation and Under ?Variation Phenotype
  • Source 3: phenotype_association file:ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
    • File name:phenotype_association.WS243.wb
    • Following rows:
      • Rows with 'NOT' should be ignored for this search.
      • column 2, associated genes, WormBase GeneID
      • column 3: DB_Object symbol, eg, aap-1,
      • column 5: Phenotype ID, eg, WBPhenotype:0000674
      • column 6: DB:Reference (Reference), eg.WB_REF:WBPaper00032243, or WB:WBVar00249743
  • Contact Person: Karen

Genetic Interaction and Physical Interaction

  • Caltech data
  • Source 1: gene_association.wb
    • ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
    • File name:gene_association.WS243.wb.c_elegans
    • Rows with 'IGI' in column 7 indicate a genetic interaction between the WBgenes in column 2/3 and column 8
    • Rows with 'IPI' in column 7 indicate a physical interaction between the WBgenes in column 2/3 and column 8
  • Source 2: Interaction OA and tables
    • "Field Name" = Postgres Table:
    • "Paper" = int_paper
    • "Interaction Type" = int_type
    • "Bait overlapping gene" = int_genebait
    • "Target overlapping gene" = int_genetarget
    • "Non-directional Gene(s)" = int_genenondir
    • "Effector Gene(s)" = int_geneone
    • "Affected Gene(s)" = int_genetwo

Example statements:

If int_type = "Physical"

<int_genebait> interacts physically with <int_genetarget> (and vice versa)


If int_type = "Genetic - Synthetic ( Synthetic )"

<int_genenondir> interacts with <other int_genenondir(s)> in a synthetic genetic interaction


If int_type = "Genetic - Suppression ( Suppression )"

<int_geneone> genetically suppresses <int_genetwo>
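The three example statements can be generated from a record keyed by the Postgres table names listed above. The following is a sketch only: the record layout (a dict, with int_genenondir held as a list) and the gene names are assumptions made for illustration.

```python
def join_terms(terms):
    """Join names as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]

def interaction_statement(rec):
    """Build one statement from a dict keyed by Postgres table name."""
    kind = rec["int_type"]
    if kind == "Physical":
        return f"{rec['int_genebait']} interacts physically with {rec['int_genetarget']}"
    if kind == "Genetic - Synthetic ( Synthetic )":
        first, *others = rec["int_genenondir"]
        return f"{first} interacts with {join_terms(others)} in a synthetic genetic interaction"
    if kind == "Genetic - Suppression ( Suppression )":
        return f"{rec['int_geneone']} genetically suppresses {rec['int_genetwo']}"
    return None  # other interaction types are not covered by the examples

# Made-up gene names.
physical = interaction_statement(
    {"int_type": "Physical", "int_genebait": "gene-a", "int_genetarget": "gene-b"})
synthetic = interaction_statement(
    {"int_type": "Genetic - Synthetic ( Synthetic )",
     "int_genenondir": ["gene-a", "gene-b", "gene-c"]})
print(physical)
print(synthetic)
```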

Gene regulation

  • Caltech data
  • Source: Gene regulation data in genereg OA
    • Positive_regulate Anatomy_term "<grg_pos_anatomy>"
    • Positive_regulate Life_stage "<grg_pos_lifestage>"
    • Positive_regulate Subcellular_localization "<grg_pos_scl>"
    • Positive_regulate Subcellular_localization_text "<grg_pos_scltext>"
    • Negative_regulate Anatomy_term "<grg_neg_anatomy>"
    • Negative_regulate Life_stage "<grg_neg_lifestage>"
    • Negative_regulate Subcellular_localization "<grg_neg_scl>"
    • Negative_regulate Subcellular_localization_text "<grg_neg_scltext>"
    • Does_not_regulate Anatomy_term "<grg_not_anatomy>"
    • Does_not_regulate Life_stage "<grg_not_lifestage>"
    • Does_not_regulate Subcellular_localization "<grg_subcellloc>"
    • Does_not_regulate Subcellular_localization_text "<grg_not_scltext>"
    • Trans_regulated_gene "<grg_transregulated>"
    • Trans_regulator_gene "<grg_transregulator>"
    • No Subdata Result "<grg_result>"

Template for a Concise Description

For the test phase, order of sentences:

  • Orthology
  • Process
  • Function/identity
  • Component


Orthology/Similarity
<Gene> encodes an ortholog of <human protein>;
Phenotypes
<Gene> mutants exhibit the following phenotypes: <phenotypes>.
Process/Pathway
<Gene> is (required, functions, regulates, is involved in, is part of) <process>;
Genetic interaction with respect to Process or Pathway
<Gene> interacts genetically with <gene1, gene2>;
Physical interaction
<Protein> physically interacts with (protein, DNA, RNA) .....;
Molecular Function
<Protein> has <molecular function>..... activity in (in vitro, in vivo);
Tissue Expression and sub-cellular localization
<Gene/Protein> is expressed in <tissue> and localizes to <GO cellular component>;

Note: Not all descriptions may follow the exact order or choice of words, see Rules below.

Rules for sentence construction: Homology

<Gene> encodes an ortholog of human <human protein name>;

  • Rule 1: If the words 'family member' occur in the description, with other words before and after them, then ignore 'family member'
    • Examples:
    • 'aldehyde dehydrogenase 8 family member a1' becomes 'aldehyde dehydrogenase 8a1'
    • 'aldehyde dehydrogenase 9 family member a1' becomes 'aldehyde dehydrogenase 9a1'

Jul 2, 2014:

  • Rule 2: If the words 'human Uncharacterized protein' occur, ignore this homology
    • Examples:
    • ctg-1 encodes an ortholog of human Uncharacterized protein;
    • mtp-18 encodes an ortholog of human Uncharacterized protein;
  • Rule 3: If 2 or more of these words occur: 'family', 'subfamily', 'group', 'member', 'polypeptide' or 'class', ignore them and resolve as in the examples:
    • olfactory receptor, family 56, subfamily B, member 1 becomes olfactory receptor 56B1
    • potassium intermediate/small conductance calcium-activated channel, subfamily N, member 2 becomes potassium intermediate/small conductance calcium-activated channel N2
    • potassium inwardly-rectifying channel, subfamily J, member 12 becomes human potassium inwardly-rectifying channel J12
    • nuclear receptor subfamily 3, group C, member 2 becomes nuclear receptor 3C2
    • nuclear receptor subfamily 5, group A, member 2 becomes nuclear receptor 5A2
    • nuclear receptor subfamily 1, group H, member 4 becomes nuclear receptor 1H4
    • cytochrome P450, family 3, subfamily A, polypeptide 5 becomes cytochrome P450 3A5
    • cytochrome P450, family 21, subfamily A, polypeptide 2 becomes cytochrome P450 21A2
    • UDP glycosyltransferase 3 family, polypeptide A1 becomes UDP glycosyltransferase 3A1
    • mannosidase, alpha, class 1B, member 1 becomes human mannosidase, alpha, 1B1
    • phosphatidylinositol glycan anchor biosynthesis, class V stays as is, because the word 'class' occurs by itself.
    • scavenger receptor class B, member 2 becomes scavenger receptor B2
  • Rule 4: If the word 'homolog' co-occurs with a species name ('Drosophila', 'S. cerevisiae' or 'yeast') inside brackets, ignore 'homolog' and move the species name to the front, without the brackets.
    • salvador homolog 1 (Drosophila) becomes Drosophila and human salvador 1
    • SSU72 RNA polymerase II CTD phosphatase homolog (S. cerevisiae) becomes S. cerevisiae and human SSU72 RNA polymerase II CTD phosphatase
    • translocase of outer mitochondrial membrane 22 homolog (yeast) becomes yeast and human translocase of outer mitochondrial membrane 22
    • human vacuolar protein sorting 4 homolog B (S. cerevisiae) becomes S. cerevisiae and human vacuolar protein sorting 4
    • unconventional SNARE in the ER 1 homolog (S. cerevisiae) becomes S. cerevisiae and human unconventional SNARE in the ER 1
  • Rule 5: Ignore the text string '<numeral>kDa' at the end of terms, together with the comma that precedes it, including within parentheses.
    • cleavage and polyadenylation specific factor 4, 30kDa becomes cleavage and polyadenylation specific factor 4
    • cleavage stimulation factor, 3' pre-RNA, subunit 1, 50kDa becomes cleavage stimulation factor, 3' pre-RNA, subunit 1
    • nucleoporin 153kDa becomes nucleoporin
  • Rule 6: Grab the HGNC symbol and place it at the end of the sentence, within parentheses, as ', (HGNC symbol:<HGNC symbol>)'
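Rules 1, 3, 4 and 5 above are string rewrites, so they can be sketched with a few small functions. The code below is one possible reading of the rules, not the project's script. The function names are made up, and it has only been checked against the examples used in the test values shown, not against the full list (the examples are not fully consistent about commas and about the word 'human').

```python
import re

KEYWORDS = {"family", "subfamily", "group", "member", "polypeptide", "class"}
SPECIES = r"Drosophila|S\. cerevisiae|yeast"

def strip_kda(name):
    """Rule 5: drop a trailing '<numeral>kDa' and the comma before it."""
    return re.sub(r",?\s*\d+\s*kDa$", "", name)

def collapse_family_terms(name):
    """Rules 1 and 3: collapse family/subfamily/member wording into a short designator."""
    tokens = name.split()
    bare = [t.rstrip(",") for t in tokens]
    hits = [i for i, t in enumerate(bare) if t.lower() in KEYWORDS]
    if len(hits) < 2:
        return name  # a lone keyword (e.g. 'class V') is left as is
    first = hits[0]
    head = tokens[:first]
    designator = "".join(t for t in bare[first:] if t.lower() not in KEYWORDS)
    # A number directly before the first keyword belongs to the designator.
    if head and head[-1].isdigit():
        designator = head.pop() + designator
    head_text = " ".join(head).rstrip(",")
    return f"{head_text} {designator}".strip()

def move_species(name):
    """Rule 4: drop 'homolog' and move a bracketed species name to the front."""
    m = re.match(rf"^(.*?)\s+homolog\b(.*?)\s*\(({SPECIES})\)$", name)
    if not m:
        return name
    pre, post, species = m.groups()
    return f"{species} and human {pre}{post}"

print(collapse_family_terms("cytochrome P450, family 3, subfamily A, polypeptide 5"))
print(move_species("salvador homolog 1 (Drosophila)"))
print(strip_kda("nucleoporin 153kDa"))
```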

Rules for sentence construction: Process

  • Rule 1: Ignore all IEA and ISS process terms
  • Rule 2: For all IMP terms, add the words 'based on mutant phenotype' at the end of the sentence.
    • Examples:
    • Data: WBGene00001972,hmg-1.2,cell fate specification[IMP],WB_REF:WBPaper00003453|PMID:10197987,,WB,gonad development[IMP],Wnt signaling pathway[IGI],vulval development[IMP]
    • Sentence: hmg-1.2 is involved in cell fate specification, gonad development and vulval development, based on mutant phenotypes.
    • Data: WBGene00002148,gon-14,muscle contraction[IMP],WB_REF:WBPaper00024213|PMID:15133127,,WB,defecation[IGI],gonad morphogenesis[IMP],multicellular organism growth[IGI],regulation of pharyngeal pumping[IMP]
    • Sentence: gon-14 is involved in muscle contraction, gonad morphogenesis and regulation of pharyngeal pumping, based on mutant phenotypes and in defecation and growth.

Rule 3: No exclusions as of 07.07.2014; leave in 'reproduction'.

  • Rule 4: If a GO term has the words 'involved in' anywhere (beginning, middle or end), use the words 'functions in'
    • Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
    • Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
  • Rule 5: For all other Process terms the sentence will be:
    • <Gene> is involved in <process term>;
    • Examples:
    • Data: WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
    • Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
  • Rule 6: For the term 'molting cycle, collagen and cuticulin-based cuticle' GO term, drop the comma and the words 'collagen and cuticulin-based cuticle'
    • Example:
    • Data: WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
    • Sentence: vps-45 is involved in the molting cycle;
  • Rule 7: Replacement rule:
    • Replace term 'multicellular organism growth' with 'growth'.
    • Replace term 'embryo development ending in birth or egg hatching' with 'embryo development'
    • Replace term 'synaptic transmission, <word>' with '<word> synaptic transmission'.
      • Example Data: WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
      • Sentence: nra-2 is involved in cholinergic synaptic transmission and locomotion.

Rule 8: Granularity rule:

  • If a GO term in the list is a parent of (is_a or part_of parent) another GO term in the list, keep the most granular child term and ignore the parent terms.
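A sketch of how Rules 1, 2, 6, 7 and 8 might be combined is given below. It is an illustration under stated assumptions, not the project's script: the function names are made up, the sentence layout follows the gon-14 example (the other examples word the IMP clause differently), Rule 4 ('functions in') is not handled, and the parent mapping passed to most_granular is a toy stand-in for the is_a/part_of relations of the GO file.

```python
import re

FIXED_REPLACEMENTS = {
    "multicellular organism growth": "growth",
    "embryo development ending in birth or egg hatching": "embryo development",
    "molting cycle, collagen and cuticulin-based cuticle": "molting cycle",
}

def replace_term(term):
    """Rules 6 and 7: rewrite selected GO process terms."""
    if term in FIXED_REPLACEMENTS:
        return FIXED_REPLACEMENTS[term]
    m = re.fullmatch(r"synaptic transmission, (\w+)", term)
    return f"{m.group(1)} synaptic transmission" if m else term

def join_terms(terms):
    """Join names as 'a', 'a and b' or 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]

def most_granular(terms, parents):
    """Rule 8: drop any term that is an ancestor of another term in the list."""
    redundant = set()
    for term in terms:
        stack = list(parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in redundant:
                redundant.add(parent)
                stack.extend(parents.get(parent, ()))
    return [t for t in terms if t not in redundant]

def process_sentence(gene, annotations):
    """annotations: list of (GO term, evidence code); IEA and ISS are ignored (Rule 1)."""
    kept = [(replace_term(t), ev) for t, ev in annotations if ev not in ("IEA", "ISS")]
    imp = [t for t, ev in kept if ev == "IMP"]
    other = [t for t, ev in kept if ev != "IMP"]
    if imp and other:
        return (f"{gene} is involved in {join_terms(imp)}, "
                f"based on mutant phenotypes and in {join_terms(other)}.")
    if imp:
        return f"{gene} is involved in {join_terms(imp)}, based on mutant phenotypes."
    if other:
        return f"{gene} is involved in {join_terms(other)}."
    return None

gon14 = process_sentence("gon-14", [
    ("muscle contraction", "IMP"), ("defecation", "IGI"),
    ("gonad morphogenesis", "IMP"), ("multicellular organism growth", "IGI"),
    ("regulation of pharyngeal pumping", "IMP"),
])
print(gon14)
```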

Rules for sentence construction: Molecular identity/function

  • Rule 1: Exclusion list:
    • Ignore the term 'protein binding'
    • Ignore the term 'binding'
  • Rule 2: Order the IDA and IMP terms first in the sentence followed by ISS and IEA terms.
  • Rule 3: If the evidence code is 'IEA' and 'activity' term is present, add the words 'predicted to have <activity term>' and add the words ', based on protein domain information' to the sentence:
    • Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
  • Examples:
    • WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
    • alh-2 is predicted to have oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor, based on protein domain information.
  • Rule 3: If the evidence code is 'ISS', add the words 'based on sequence information' to the sentence
    • WBGene00001951,hlh-4,sequence-specific DNA binding transcription factor activity[ISS],WB_REF:WBPaper00034761|PMID:19632181,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
    • Sentence: hlh-4 is predicted to have sequence-specific DNA binding transcription factor activity based on sequence information, protein dimerization activity and DNA binding activity, based on protein domain information.
  • Rule 4: If a binding term is present add the word 'activity' to it.
  • Rule 5: If the evidence code is IDA, IMP, IGI, IPI, or IEP, add the word 'exhibits' before the GO term and 'based on experimental evidence' after the GO term and/or at the end of the sentence.
    • IDA example:
    • WBGene00001952,hlh-6,RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity[IDA],WB_REF:WBPaper00032277|PMID:18927627,,WB,protein dimerization activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR011598,DNA binding[IEA],INTERPRO:IPR015660
    • Sentence: hlh-6 exhibits RNA polymerase II distal enhancer sequence-specific DNA binding transcription factor activity and is predicted to have protein dimerization activity and DNA binding activity.
    • IMP example:
    • WBGene00009583,aagr-3,alpha-glucosidase activity[IMP],WB_REF:WBPaper00036069|PMID:20349118,,WB,hydrolase activity, hydrolyzing O-glycosyl compounds[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000322,catalytic activity[IEA],INTERPRO:IPR011013,carbohydrate binding[IEA]
    • Sentence: aagr-3 exhibits alpha-glucosidase activity and is predicted to have hydrolase activity, hydrolyzing O-glycosyl compounds, catalytic activity, and carbohydrate binding activity, based on protein domain information.

Rule 6: For non-IEA terms beginning with 'structural constituent of <word>', use the words 'is a structural constituent of <word>' in the sentence.

  • Example:
    • WBGene00010783,mrpl-36,structural constituent of ribosome[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR000473,WB
    • Sentence: mrpl-36 is a structural constituent of ribosome, based on protein domain information.
    • WBGene00010783,mrpl-36,structural constituent of ribosome[ISS]
    • Sentence: mrpl-36 is a structural constituent of ribosome, based on sequence information.
    • WBGene00010783,mrpl-36,structural constituent of ribosome[IMP] or [IDA]
    • Sentence: mrpl-36 is a structural constituent of ribosome.
  • Rule 7: Replacement
    • For GO term: 'oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor' becomes 'oxidoreductase activity, acting on the aldehyde or oxo group of donors' (drop the text string 'NAD or NADP as acceptor')
  • Rule 8: If a GO term has a comma in it, then add a comma after this GO term, before the next 'and'.
  • Rule 9: Always put the GO terms with experimental evidence codes (EXP, IMP, IGI, IPI, IDA, IEP) first in the sentence, followed by the Automatic and Computational Analysis Evidence Codes (IEA, ISS, ISA, ISO, ISM, IGC, IBA, IBD, IKR, IRD and RCA).
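The filtering and ordering steps (Rule 1, Rule 4 and Rule 9) can be sketched as a small preprocessing function that runs before any sentence is built. The function name and the input layout, a list of (GO term, evidence code) tuples, are assumptions for illustration. The sample input uses the hlh-6 terms shown above plus a made-up 'protein binding' entry to exercise the exclusion list.

```python
EXPERIMENTAL = {"EXP", "IMP", "IGI", "IPI", "IDA", "IEP"}
EXCLUDED = {"protein binding", "binding"}

def prepare_function_terms(annotations):
    """Filter, rename and order (GO term, evidence code) tuples."""
    # Rule 1: drop the uninformative binding terms.
    kept = [(t, ev) for t, ev in annotations if t not in EXCLUDED]
    # Rule 4: binding terms get the word 'activity' appended.
    kept = [(t + " activity" if t.endswith("binding") else t, ev) for t, ev in kept]
    # Rule 9: experimental evidence first; sorted() is stable, so the
    # original order is otherwise preserved.
    return sorted(kept, key=lambda pair: pair[1] not in EXPERIMENTAL)

hlh6 = prepare_function_terms([
    ("protein dimerization activity", "IEA"),
    ("DNA binding", "IEA"),
    ("protein binding", "IPI"),  # made-up entry, removed by Rule 1
    ("RNA polymerase II distal enhancer sequence-specific DNA binding "
     "transcription factor activity", "IDA"),
])
for term, evidence in hlh6:
    print(evidence, term)
```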

Rules for sentence construction: Tissue expression

Rule 1: Use only the data that has the qualifiers 'Certain' and 'Partial' and ignore all those data that have 'uncertain'.

Rule 2: Pick an anatomy term only once

  • Sentence: <Gene> is expressed in the <anatomy term1, anatomy term2 and anatomy term3>;
  • Examples:
  • Data for alh-10:
  • WBGene00000116 alh-10 Expr5583 nervous system,intestine,tail neuron WBPaper00031006,WBPaper00006525 Endogenous
  • Sentence: alh-10 is expressed in the nervous system, intestine and tail neuron;


  • Data for asp-5:
  • WBGene00000218 asp-5 Expr5817 intestine WBPaper00031006,WBPaper00006525 Endogenous
  • WBGene00000218 asp-5 Expr4352 intestine WBPaper00028802 Endogenous
  • Sentence: asp-5 is expressed in the intestine;


  • Data for ccr-4:
  • WBGene00000376 ccr-4 Expr4479 pharynx WBPaper00027076 Endogenous
  • WBGene00000376 ccr-4 Expr11132 male,hermaphrodite,somatic cell,germ line *WBPaper00043886 Endogenous
  • WBGene00000376 ccr-4 Expr4479 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle WBPaper00027076 Endogenous
  • WBGene00000376 ccr-4 Expr7174 pharynx,hypodermis,seam cell *WBPaper00031006,WBPaper00006525 Endogenous
  • WBGene00000376 ccr-4 Expr4480 pharynx,body wall musculature,head neuron,tail neuron *WBPaper00027076 Endogenous
  • WBGene00000376 ccr-4 Expr4480 hyp6,hyp5,hyp4,hyp3,hyp2,hyp1,pharyngeal neuron,head muscle
  • Sentence: ccr-4 is expressed in the pharynx, somatic cell, germ line, hyp6, hyp5, hyp4, hyp3, hyp2, hyp1, pharyngeal neuron, head muscle, hypodermis, seam cell, body wall musculature and in the head and tail neurons.
  • Rule 3: Replacement Rule
  • Replacement 1: If the anatomy term 'cell' occurs by itself, then use the words 'widely expressed' instead of 'cell'.
    • sentence: col-178 is expressed in the Cell;
    • Becomes: col-178 is expressed widely.
  • Replacement 2: If the anatomy term 'cell' is present with other anatomy terms, use instead the words 'expressed in several tissues including the' <other anatomy terms>
    • Sentence: frm-1 is expressed in the intestine, pharynx, and the Cell;
    • Becomes: frm-1 is expressed in several tissues including the intestine and pharynx.
  • Replacement 3: If the anatomy term 'neuron' occurs as an anatomy term by itself, use the words 'in the nervous system' instead.
    • Sentence: ceh-82 is expressed in the neuron;
    • Becomes: ceh-82 is expressed in the nervous system;


  • Replacement 4: If the word 'neuron' occurs as part of an anatomy term, eg, 'head neuron', pluralize it, except for the exceptions:
    • Exceptions:
    • I3 neuron
    • I4 neuron
    • I5 neuron
    • I6 neuron
    • M1 neuron
    • M4 neuron
    • M5 neuron
    • MI neuron
    • Sentence: nhr-194 is expressed in the amphid neuron, ciliated neuron, head neuron, and the sensory neuron;
    • Becomes: nhr-194 is expressed in the amphid neurons, ciliated neurons, head neurons, and the sensory neurons;


Rule 5: Granularity Rule: When a string of cell names is present, check whether any parent terms are present in any ExprXXX for that gene, and keep the highest common parent term available.
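Rule 2 and the four replacements of Rule 3 can be sketched as follows. The function names are illustrative. The sketch ends every sentence with a period and joins the last two terms with 'and' only, so its punctuation is simpler than in some of the examples above. The granularity rule is not covered.

```python
NEURON_EXCEPTIONS = {
    "I3 neuron", "I4 neuron", "I5 neuron", "I6 neuron",
    "M1 neuron", "M4 neuron", "M5 neuron", "MI neuron",
}

def join_terms(terms):
    """Join names as 'a', 'a and b' or 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]

def expression_sentence(gene, anatomy_terms):
    """Build a tissue expression sentence from a list of anatomy term names."""
    # Rule 2: use each anatomy term only once, in first-seen order.
    terms = list(dict.fromkeys(anatomy_terms))
    has_cell = any(t.lower() == "cell" for t in terms)
    terms = [t for t in terms if t.lower() != "cell"]
    if has_cell and not terms:
        return f"{gene} is expressed widely."  # Replacement 1
    rewritten = []
    for t in terms:
        if t == "neuron":
            rewritten.append("nervous system")  # Replacement 3
        elif t.endswith(" neuron") and t not in NEURON_EXCEPTIONS:
            rewritten.append(t + "s")  # Replacement 4
        else:
            rewritten.append(t)
    rewritten = list(dict.fromkeys(rewritten))
    # Replacement 2 changes the lead-in when 'cell' occurs with other terms.
    lead = "expressed in several tissues including the" if has_cell else "expressed in the"
    return f"{gene} is {lead} {join_terms(rewritten)}."

print(expression_sentence("frm-1", ["intestine", "pharynx", "Cell"]))
print(expression_sentence("nhr-194", ["amphid neuron", "ciliated neuron",
                                      "head neuron", "sensory neuron"]))
```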

Rules for sentence construction: Sub-cellular localization/Component

  • Rule 1: Ignore all IEA and ISS GO terms, use only non-IEA GO terms
    • Sentence: <Gene> is localized to <component term>;
  • Rule 2: For 'integral component of ....' terms add the words 'is an';
    • Eg. for Rule 2: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
    • sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
  • Examples
    • WBGene00023405,sor-1,nucleoplasm[IDA],WB_REF:WBPaper00027128|PMID:16501168,,WB,nuclear speck[IDA]
    • Sentence: sor-1 is localized to the nucleoplasm and nuclear speck;
    • WBGene00004681,rsd-2,nucleolus[IDA],WB_REF:WBPaper00044261|PMID:18430922,,WB,endoplasmic reticulum[IDA],cytosol[IDA]
    • Sentence: rsd-2 is localized to the nucleolus, endoplasmic reticulum and cytosol;
    • WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
    • Sentence: dnc-6 is localized to the dynactin complex;
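The two component rules can be sketched as below, using the sup-10 data above as input. The function names are made up, and the sketch does not insert the article 'the' that some of the example sentences carry before the component term.

```python
def join_terms(terms):
    """Join names as 'a', 'a and b' or 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]

def component_sentence(gene, annotations):
    """annotations: list of (GO component term, evidence code) tuples."""
    # Rule 1: ignore IEA and ISS terms.
    terms = [t for t, ev in annotations if ev not in ("IEA", "ISS")]
    prefix = "integral component of"
    integral = [t for t in terms if t.startswith(prefix)]
    located = [t for t in terms if not t.startswith(prefix)]
    parts = []
    if integral:
        parts.append("is an " + join_terms(integral))  # Rule 2
    if located:
        parts.append("is localized to " + join_terms(located))
    if not parts:
        return None
    return f"{gene} " + " and ".join(parts) + ";"

sup10 = component_sentence("sup-10", [
    ("integral component of plasma membrane", "IDA"),
    ("striated muscle dense body", "IDA"),
])
print(sup10)
```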

Preliminary results

These descriptions are based on Homology predictions and the GO annotations for Process, Component and Function:

 

*alh-2
alh-2 encodes an ortholog of human dehydrogenase aldehyde dehydrogenase family 1 member dehydrogenase; alh-2 is predicted to have oxidoreductase activity and oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor based on protein domain information.

*asp-5
asp-5 encodes an ortholog of human cathepsin d; asp-5 is involved in cell death and locomotion; asp-5 is predicted to have aspartic-type endopeptidase activity based on protein domain information.

*cng-1
cng-1 encodes an ortholog of human cyclic nucleotide gated channel alpha 3; cng-1 is predicted to have ion channel activity based on protein domain information; cng-1 is localized to the neuronal cell body.

Mapping of automated concise description data to OA fields

Mapping of data to data fields in the OA
OA field number | OA field name | Data to be inserted | Example of data to be inserted | Required or not | OA table name
1 | WBGene | WBGene | WBGene000 | Required | con_wbgene
2 | Curator | Name of Curator | James Done (first, then replace with) Ranjana Kishore (insert for all rows) | Required | con_curator
3 | Curator History | Name of Curator | same as pgid (insert for all rows) | Required | con_curhistory
4 | Description Type | Automated_concise_description (insert for all rows) | Automated_concise_description | Required | con_desctype
5 | Description Text | the automated concise description | asp-19 encodes an ortholog... | Required | con_desctext
6 | Reference | WBPaper | WBPaper00026979 | Required | con_paper
7 | Accession Evidence | InterPro ID | INTERPRO:IPR002293 (comma separate multiple values) | Not required | con_accession
8 | Last Updated | Date when the descriptions were last generated | 2014-09-11 | Required | con_lastupdate
9 | pgid | pgid | 1149 (Postgres will generate) | Required |

Tab-delimited file for OA insert

  • Order of the data will be: WBGene, Date, Paper, Accession_evidence, Automated_concise_description
  • Format: tab-delimited file, comma separate the values when multiple values are present
  • Date will always be the last date that the script was run to generate the automated descriptions
  • File will be placed on textpresso-dev to be picked up by a cron job by JC
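Writing the file in the column order given above could look like the sketch below. The function name, the dict keys and the placeholder gene ID are assumptions for illustration. The other sample values are taken from the examples on this page.

```python
import csv
import io

def write_oa_rows(handle, rows):
    """Write one tab-delimited line per gene in the order:
    WBGene, Date, Paper, Accession_evidence, Automated_concise_description."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([
            row["wbgene"],
            row["date"],
            ",".join(row["papers"]),      # multiple values are comma separated
            ",".join(row["accessions"]),
            row["description"],
        ])

buffer = io.StringIO()
write_oa_rows(buffer, [{
    "wbgene": "WBGene00000000",           # placeholder ID
    "date": "2014-09-11",
    "papers": ["WBPaper00026979"],
    "accessions": ["INTERPRO:IPR002293", "INTERPRO:IPR015590"],
    "description": "asp-19 encodes an ortholog...",
}])
oa_text = buffer.getvalue()
print(oa_text, end="")
```

In practice the handle would be a file opened for writing in the directory that the cron job reads from.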

Publications related to Text-mining methods

  • Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B. Automatically generating gene summaries from biomedical literature. Pac Symp Biocomput. 2006:40-51. PMID:17094226
  • Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B. Generating gene summaries from biomedical literature: A study of semi-structured summarization. Information Processing and Management 43 (2007) 1777–1791.

