Textpresso-based automated extraction of concise descriptions

Revision as of 21:51, 2 June 2014

Generating gene sets with and without concise descriptions

Set of genes with a concise description

Genes with a concise description can be queried from Postgres. Relevant Postgres table names:

  • con_wbgene: Stores the WBGene ID and gene names
  • con_desctype: Type of description (relevant for us: Concise_description)
  • con_desctext: Text of the concise description

Query for all WBGenes that have a concise description (a non-null entry in both con_desctext and con_desctype):

SELECT DISTINCT(con_wbgene)
  FROM con_wbgene
 WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
   AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
 ORDER BY con_wbgene;

  • Number of genes with a concise description (as of 05.07.2014): 6,624
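
As a rough illustration, the query above can be tried against a small in-memory database. This is only a sketch: it uses SQLite instead of Postgres, the three tables are reduced to the two columns the query touches, and every row (gene IDs, joinkeys, description text) is a made-up placeholder rather than real WormBase data.

```python
import sqlite3

# The query from the text, with line breaks added for readability.
QUERY = """
SELECT DISTINCT(con_wbgene)
  FROM con_wbgene
 WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
   AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
 ORDER BY con_wbgene;
"""


def build_toy_db():
    """Create an in-memory stand-in for the three Postgres tables (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE con_wbgene   (joinkey TEXT, con_wbgene   TEXT);
        CREATE TABLE con_desctype (joinkey TEXT, con_desctype TEXT);
        CREATE TABLE con_desctext (joinkey TEXT, con_desctext TEXT);
        """
    )
    # Placeholder rows: gene 1 has type and text, gene 2 has a type but
    # NULL text, gene 3 has neither.
    conn.executemany(
        "INSERT INTO con_wbgene VALUES (?, ?)",
        [("1", "WBGene00000001"), ("2", "WBGene00000002"), ("3", "WBGene00000003")],
    )
    conn.executemany(
        "INSERT INTO con_desctype VALUES (?, ?)",
        [("1", "Concise_description"), ("2", "Concise_description")],
    )
    conn.executemany(
        "INSERT INTO con_desctext VALUES (?, ?)",
        [("1", "placeholder description text"), ("2", None)],
    )
    return conn


def genes_with_description(conn):
    """Run the query and return the matching gene IDs as a list."""
    return [row[0] for row in conn.execute(QUERY)]


if __name__ == "__main__":
    print(genes_with_description(build_toy_db()))
```

Only the gene whose joinkey appears with a non-null value in both con_desctext and con_desctype is returned.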

Set of genes with no concise description

Set of genes with no concise description and at least one published paper

Location of project-related files on Textpresso

http://textpresso-dev.caltech.edu/concise_descriptions/


Semantic categories in a Concise Description

1. Molecular identity
2. Orthology/Similarity
3. Mutant Phenotypes
4. Processes
5. Pathways
6. Genetic Interaction
7. Physical Interaction
8. Gene regulation data
9. Molecular Function
10. Tissue expression (may include life-stage)
11. Sub-cellular localization (may include life-stage)

Template for a Concise Description

Molecular identity
<Gene> encodes a <molecular identity>;
Orthology/Similarity
<Gene> is (orthologous, similar) to .....;
Phenotypes
<Gene> mutants exhibit the following phenotypes, <phenotypes>.
Process/Pathway
<Gene> is (required, functions, regulates, is involved in, is part of) ....., as mutants of <gene> exhibit <phenotypes>;
Genetic interaction with respect to Process or Pathway
<Gene> interacts genetically with <gene1, gene2> ..... in <Process, Pathway>;
Physical interaction
<Protein> physically interacts with (protein, DNA, RNA) .....;
Molecular Function
<Protein> has ..... activity in (in vitro, in vivo) assays;
Tissue Expression
<Gene/Protein> is expressed in ..... and expression in ..... is (positively, negatively) regulated by <Gene/Protein>.....;
Sub-cellular localization
<Protein> is localized to <cellular component> and expression in <cellular component> is (positively, negatively) regulated by .....

Note: Not all descriptions may follow the exact order or choice of words.
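
One possible way to turn the template into text is to build a clause for each category that has data and join the clauses with semicolons. The sketch below covers only a few categories. The dictionary keys (molecular_identity, orthologs, phenotypes, expressed_in), the gene name abc-1 and the example values are hypothetical and chosen for illustration only.

```python
def concise_description(gene, facts):
    """Join template clauses for the categories present in `facts`.

    `facts` maps a (hypothetical) category key to a string or list of strings.
    Categories without data are skipped, so the order of the template is kept.
    """
    clause_builders = [
        ("molecular_identity", lambda v: f"{gene} encodes a {v}"),
        ("orthologs", lambda v: f"{gene} is orthologous to {', '.join(v)}"),
        ("phenotypes",
         lambda v: f"{gene} mutants exhibit the following phenotypes, {', '.join(v)}"),
        ("expressed_in", lambda v: f"{gene} is expressed in {', '.join(v)}"),
    ]
    clauses = [build(facts[key]) for key, build in clause_builders if facts.get(key)]
    return "; ".join(clauses) + "." if clauses else ""


if __name__ == "__main__":
    # Made-up example values, not real annotations.
    example = {
        "molecular_identity": "protein kinase",
        "phenotypes": ["uncoordinated", "sterile"],
    }
    print(concise_description("abc-1", example))
```

Keeping the clause builders in a list makes it easy to reorder or reword individual clauses, which fits the note that descriptions need not follow one fixed order.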

Data mining (mining data from Postgres and/or Acedb) for the semantic categories

1. Molecular identity

  • Caltech and non-Caltech data?
  • Source 1: GO data
    • Download gene_association.wb.gz from the GO Consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
    • Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases.
    • Use the rows where column 9 has the value 'F' (Molecular Function). The associated gene is given in column 2 (UniProt ID) and column 3 (DB_Object symbol, e.g. wht-7); the GO term is in column 5.
  • Source 2: Homology data, see #2
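
The column rules for Source 1 could be applied with a small filter like the one below. It is a sketch under stated assumptions: the file is tab-separated with at least 15 columns, header lines start with '!', and the function names are invented for this example. The same function can be called with aspect 'P' to select Process rows.

```python
import gzip

# Zero-based indexes for the one-based columns named in the text.
COL_SYMBOL = 2        # column 3, DB_Object symbol (e.g. wht-7)
COL_GO_ID = 4         # column 5, GO term
COL_ASPECT = 8        # column 9, 'F' or 'P'
COL_ASSIGNED_BY = 14  # column 15, e.g. 'WB', 'UniProtKB', 'InterPro'


def go_terms_by_gene(lines, aspect, assigned_by=None):
    """Map gene symbol -> set of GO IDs for rows matching the filters."""
    terms = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # header/comment line or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # too few columns to apply the rules
        if cols[COL_ASPECT] != aspect:
            continue
        if assigned_by is not None and cols[COL_ASSIGNED_BY] != assigned_by:
            continue
        terms.setdefault(cols[COL_SYMBOL], set()).add(cols[COL_GO_ID])
    return terms


def read_gaf(path):
    """Yield text lines from a gzip-compressed association file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def sample_row(symbol, go_id, aspect, assigned_by):
    """Build a 15-column demo row; columns not used here stay empty."""
    cols = [""] * 15
    cols[COL_SYMBOL] = symbol
    cols[COL_GO_ID] = go_id
    cols[COL_ASPECT] = aspect
    cols[COL_ASSIGNED_BY] = assigned_by
    return "\t".join(cols)


if __name__ == "__main__":
    # Placeholder genes and GO IDs, not real annotations.
    rows = [
        "!gaf-version: 2.0",
        sample_row("abc-1", "GO:0000001", "F", "WB"),
        sample_row("abc-1", "GO:0000002", "P", "WB"),
        sample_row("xyz-2", "GO:0000003", "F", "UniProtKB"),
    ]
    print(go_terms_by_gene(rows, "F", assigned_by="WB"))
```

For the downloaded file, the call would look like go_terms_by_gene(read_gaf("gene_association.wb.gz"), "F"), assuming the file sits in the working directory.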

2. Orthology, Homology and Paralog data

  • Ace tags: ?Gene Ortholog_other, Paralog
  • Contact: Michael Paulini

3. Processes

  • Caltech data
  • GO data
  • Source 1: Download gene_association.wb.gz from the GO Consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
    • Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases.
    • Use the rows where column 9 has the value 'P' (Process). The associated gene is given in column 2 (UniProt ID) and column 3 (DB_Object symbol, e.g. wht-7); the GO term is in column 5.
  • GO data in Postgres
    • Paper -- gop_paper
    • WBGene -- gop_wbgene
    • GO -- gop_goontology
    • GO Term -- gop_goid
  • Contact Person: Kimberly, Ranjana
  • Source 2: Topic data
    • OA field: Gene, PG table name: pro_wbgene
    • OA field: WBPaper, PG table name: pro_paper
  • Contact Person: Karen

4. Pathway (No database source for now?)

5. Mutant Phenotypes

  • Caltech data
  • Source 1: Phenotype OA, PG table name: (phenotypes are attached to variations, not to genes)
  • Source 2: Acedb tags: under ?Gene, Reference_allele ?Variation and Allele ?Variation; under ?Variation, Phenotype
  • Source 3: phenotype_association file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
    • File name: phenotype_association.WS243.wb
    • Rows with 'NOT' should be ignored for this search.
  • Contact Person: Karen
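
The 'NOT' rule for the phenotype_association file might be applied as sketched below. The text does not say which columns hold the gene, the phenotype or the 'NOT' flag. This example assumes a tab-separated, GAF-like layout with the gene symbol in column 3 and the phenotype ID in column 5, and it drops any row in which some field carries 'NOT' (alone or as one of several '|'-separated qualifiers). The function name and sample rows are illustrative only.

```python
def phenotypes_by_gene(lines, gene_col=2, phenotype_col=4):
    """Map gene -> set of phenotype IDs, ignoring rows that carry 'NOT'.

    gene_col and phenotype_col are zero-based and are assumptions about
    the file layout, not values taken from the text.
    """
    result = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # header/comment line or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= max(gene_col, phenotype_col):
            continue  # row too short to read both columns
        if any("NOT" in field.split("|") for field in cols):
            continue  # negated annotation: ignored for this search
        result.setdefault(cols[gene_col], set()).add(cols[phenotype_col])
    return result


if __name__ == "__main__":
    # Placeholder rows, not real WormBase annotations.
    rows = [
        "WB\tWBGene00000001\tabc-1\t\tWBPhenotype:0000001",
        "WB\tWBGene00000001\tabc-1\tNOT\tWBPhenotype:0000002",
    ]
    print(phenotypes_by_gene(rows))
```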

6. Genetic Interaction and 7. Physical Interaction

  • Caltech data
  • Source 1: gene_association.wb
    • ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
    • File name: gene_association.WS243.wb.c_elegans
    • Rows with 'IGI' in column 7 indicate a genetic interaction between the WBGenes in columns 2/3 and column 8
    • Rows with 'IPI' in column 7 indicate a physical interaction between the WBGenes in columns 2/3 and column 8
  • Source 2: Interaction OA and tables
    • "Field Name" = Postgres Table:
    • "Paper" = int_paper
    • "Interaction Type" = int_type
    • "Bait overlapping gene" = int_genebait
    • "Target overlapping gene" = int_genetarget
    • "Non-directional Gene(s)" = int_genenondir
    • "Effector Gene(s)" = int_geneone
    • "Affected Gene(s)" = int_genetwo

Example statements:

If int_type = "Physical"

<int_genebait> interacts physically with <int_genetarget> (and vice versa)


If int_type = "Genetic - Synthetic ( Synthetic )"

<int_genenondir> interacts with <other int_genenondir(s)> in a synthetic genetic interaction


If int_type = "Genetic - Suppression ( Suppression )"

<int_geneone> genetically suppresses <int_genetwo>
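
The three example statements can be sketched as a single function that dispatches on int_type. The record layout is an assumption: each interaction is represented here as a dictionary keyed by the Postgres table names listed above, with int_genenondir holding a list of genes. The gene names in the demo are placeholders.

```python
def interaction_statement(record):
    """Return a sentence for a supported int_type, or None otherwise."""
    int_type = record.get("int_type")
    if int_type == "Physical":
        return (f'{record["int_genebait"]} interacts physically with '
                f'{record["int_genetarget"]} (and vice versa)')
    if int_type == "Genetic - Synthetic ( Synthetic )":
        genes = record["int_genenondir"]
        if len(genes) < 2:
            return None  # a synthetic interaction needs at least two genes
        first, others = genes[0], genes[1:]
        return (f'{first} interacts with {", ".join(others)} '
                f'in a synthetic genetic interaction')
    if int_type == "Genetic - Suppression ( Suppression )":
        return f'{record["int_geneone"]} genetically suppresses {record["int_genetwo"]}'
    return None  # other interaction types are not covered by the examples


if __name__ == "__main__":
    # Placeholder gene names, not real interaction data.
    print(interaction_statement(
        {"int_type": "Physical", "int_genebait": "abc-1", "int_genetarget": "xyz-2"}
    ))
```

Returning None for unsupported types leaves it to the caller to decide whether to skip the record or fall back to a generic sentence.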

8. Gene regulation

  • Caltech data
  • Source: Gene regulation data in genereg OA
    • Positive_regulate Anatomy_term "<grg_pos_anatomy>"
    • Positive_regulate Life_stage "<grg_pos_lifestage>"
    • Positive_regulate Subcellular_localization "<grg_pos_scl>"
    • Positive_regulate Subcellular_localization_text "<grg_pos_scltext>"
    • Negative_regulate Anatomy_term "<grg_neg_anatomy>"
    • Negative_regulate Life_stage "<grg_neg_lifestage>"
    • Negative_regulate Subcellular_localization "<grg_neg_scl>"
    • Negative_regulate Subcellular_localization_text "<grg_neg_scltext>"
    • Does_not_regulate Anatomy_term "<grg_not_anatomy>"
    • Does_not_regulate Life_stage "<grg_not_lifestage>"
    • Does_not_regulate Subcellular_localization "<grg_subcellloc>"
    • Does_not_regulate Subcellular_localization_text "<grg_not_scltext>"
    • Trans_regulated_gene "<grg_transregulated>"
    • Trans_regulator_gene "<grg_transregulator>"
    • No Subdata Result "<grg_result>"

9. Molecular Function

10. Tissue expression and life stage

  • Caltech data
  • Source 1: Expression data
  • OA (exprpat), PG table names:
    • exp_anatomy for anatomy terms
    • exp_goid for subcellular localization
    • exp_lifestage for life stage
    • exp_paper for paper
    • exp_gene for gene
  • Contact Person: Daniela

11. Sub-cellular localization

Publications related to Text-mining methods

  • Automatically generating gene summaries from biomedical literature.

Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.

Pac Symp Biocomput. 2006:40-51.

PMID:17094226

  • Generating gene summaries from biomedical literature: A study of semi-structured summarization.

Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.

Information Processing and Management. 2007;43:1777–1791.

