Textpresso-based automated extraction of concise descriptions
Contents
- 1 Generating gene sets with and without concise descriptions
- 2 Location of project-related files on Textpresso
- 3 Semantic categories in a Concise Description
- 4 Template for a Concise Description
- 5 Data mining (mining data from Postgres and/or Acedb) for the semantic categories
- 6 Publications related to Text-mining methods
Generating gene sets with and without concise descriptions
Set of genes with a concise description
Query Postgres for all genes that have a concise description. Relevant Postgres table names:
- con_wbgene: Stores the WBGene ID and gene names
- con_desctype: Type of description (relevant for us: Concise_description)
- con_desctext: Text of the concise description
Query for all WBGenes that have a concise description (in con_desctext AND con_desctype):
SELECT DISTINCT(con_wbgene)
FROM con_wbgene
WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
  AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
ORDER BY con_wbgene;
- Number of genes with a concise description (as of 05.07.2014): 6,624
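A minimal sketch of running the query above, using an in-memory SQLite database as a stand-in for Postgres so that it runs anywhere. The table layout (a joinkey column plus a value column named after the table) is inferred from the query; the toy rows, `build_toy_db`, and both helper functions are hypothetical and only illustrate the logic.

```python
import sqlite3

# Query from the text above; table and column names are as listed there.
QUERY = """
SELECT DISTINCT con_wbgene
FROM con_wbgene
WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
  AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
ORDER BY con_wbgene
"""

def build_toy_db():
    """Create an in-memory stand-in for the three Postgres tables (toy rows only)."""
    conn = sqlite3.connect(":memory:")
    for table in ("con_wbgene", "con_desctext", "con_desctype"):
        # Assumption: each table has a joinkey and a value column named after the table.
        conn.execute(f"CREATE TABLE {table} (joinkey TEXT, {table} TEXT)")
    conn.executemany("INSERT INTO con_wbgene VALUES (?, ?)",
                     [("1", "WBGene00000001"), ("2", "WBGene00000002"),
                      ("3", "WBGene00000003")])
    conn.executemany("INSERT INTO con_desctext VALUES (?, ?)",
                     [("1", "toy description text"), ("2", None)])
    conn.executemany("INSERT INTO con_desctype VALUES (?, ?)",
                     [("1", "Concise_description"), ("2", "Concise_description")])
    return conn

def genes_with_description(conn):
    """Genes that have both a description text and a description type."""
    return [row[0] for row in conn.execute(QUERY)]

def genes_without_description(conn):
    """Genes present in con_wbgene that the query above does not return."""
    all_genes = {row[0] for row in conn.execute("SELECT con_wbgene FROM con_wbgene")}
    return sorted(all_genes - set(genes_with_description(conn)))
```

Note that the complement here only covers genes that have a row in con_wbgene; a full list of genes without a description would need a complete gene list from another source. The query also accepts any non-null con_desctype; restricting it to Concise_description would be an additional filter that the original query does not apply.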
Set of genes with no concise description
Set of genes with no concise description and at least one published paper
Location of project-related files on Textpresso
http://textpresso-dev.caltech.edu/concise_descriptions/
Semantic categories in a Concise Description
1. Molecular identity
2. Orthology/Similarity
3. Mutant Phenotypes
4. Processes
5. Pathways
6. Genetic Interaction
7. Physical Interaction
8. Gene regulation data
9. Molecular Function
10. Tissue expression (may include life-stage)
11. Sub-cellular localization (may include life-stage)
Template for a Concise Description
- Molecular identity: <Gene> encodes a <molecular identity>;
- Orthology/Similarity: <Gene> is (orthologous, similar) to <human gene>;
- Phenotypes: <Gene> mutants exhibit the following phenotypes: <phenotypes>.
- Process/Pathway: <Gene> is (required, functions, regulates, is involved in, is part of) <process>;
- Genetic interaction with respect to Process or Pathway: <Gene> interacts genetically with <gene1, gene2>;
- Physical interaction: <Protein> physically interacts with (protein, DNA, RNA) .....;
- Molecular Function: <Protein> has <molecular function>..... activity in (in vitro, in vivo);
- Tissue Expression and sub-cellular localization: <Gene/Protein> is expressed in <tissue> and localized to <GO cellular component>;
Note: Not all descriptions may follow the exact order or choice of words.
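The template could be filled in programmatically along the following lines. This is only a sketch: the function name, the dictionary keys, and the fixed choice of one verb per category ("is orthologous to", "is involved in") are assumptions made for illustration, and categories with no data are simply skipped.

```python
def concise_description(gene, data):
    """Assemble template sentences for the categories present in `data`.

    `data` is a hypothetical dict; missing or empty categories are skipped.
    """
    parts = []
    if data.get("molecular_identity"):
        parts.append(f"{gene} encodes a {data['molecular_identity']}")
    if data.get("ortholog"):
        parts.append(f"{gene} is orthologous to {data['ortholog']}")
    if data.get("phenotypes"):
        phenotypes = ", ".join(data["phenotypes"])
        parts.append(f"{gene} mutants exhibit the following phenotypes: {phenotypes}")
    if data.get("process"):
        parts.append(f"{gene} is involved in {data['process']}")
    if data.get("genetic_interactors"):
        partners = ", ".join(data["genetic_interactors"])
        parts.append(f"{gene} interacts genetically with {partners}")
    tissue = data.get("tissue")
    component = data.get("cellular_component")
    if tissue and component:
        parts.append(f"{gene} is expressed in {tissue} and localized to {component}")
    elif tissue:
        parts.append(f"{gene} is expressed in {tissue}")
    elif component:
        parts.append(f"{gene} is localized to {component}")
    if not parts:
        return ""
    # Clauses are joined with semicolons, as in the template above.
    return "; ".join(parts) + "."
```

For example, `concise_description("abc-1", {"molecular_identity": "protein kinase"})` yields "abc-1 encodes a protein kinase." (abc-1 is a made-up gene name).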
Data mining (mining data from Postgres and/or Acedb) for the semantic categories
1. Molecular identity
- Caltech and non-Caltech data?
- Source 1: GO data
- Download gene_association.wb.gz for C. elegans from the GO Consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
- Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases.
- Need data from these rows:
- where column 9 has value 'F' (Molecular Function)
- column 2: associated gene, given as a UniProt ID or WormBase Gene ID; UniProt IDs need to be translated to WormBase Gene IDs
- column 3: DB_Object symbol, e.g. wht-7
- column 5: GO ID, e.g. GO:0000346
- column 6: DB:Reference, e.g. PMID:12062106, GO_REF:0000002
- column 7: Evidence code, e.g. IMP
- column 15: Assigned By, e.g. WB (the database that created the annotation)
- To translate UniProt IDs to WormBase Gene IDs, use the file gp2protein.wb.gz at http://www.geneontology.org/gp2protein/
- Source 2: Homology data, see #2
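The column selection described for the gene association file can be sketched as below. The same function would serve the Process ('P') and Cellular Component ('C') categories by changing the `aspect` argument. Function names are hypothetical. The layout assumed for gp2protein (two tab-separated columns, a `WB:`-prefixed gene ID and one or more `UniProtKB:`-prefixed accessions separated by semicolons) is an assumption and should be checked against the actual file.

```python
def parse_gaf(lines, aspect):
    """Yield selected columns from rows whose column 9 equals `aspect` ('F', 'P' or 'C')."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15 or cols[8] != aspect:
            continue
        yield {
            "object_id": cols[1],     # column 2: UniProt ID or WormBase Gene ID
            "symbol": cols[2],        # column 3: DB_Object symbol
            "go_id": cols[4],         # column 5: GO ID
            "reference": cols[5],     # column 6: DB:Reference
            "evidence": cols[6],      # column 7: evidence code
            "assigned_by": cols[14],  # column 15: Assigned By
        }

def load_gp2protein(lines):
    """Map UniProt accession -> WBGene ID (file layout is an assumption, see above)."""
    mapping = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        gene_id = fields[0].split(":", 1)[-1]  # drop the 'WB:' prefix
        for protein in fields[1].split(";"):
            mapping[protein.split(":", 1)[-1]] = gene_id  # drop 'UniProtKB:'
    return mapping

def to_wbgene(object_id, mapping):
    """Return a WBGene ID for a column-2 value, or None if it cannot be translated."""
    if object_id.startswith("WBGene"):
        return object_id
    return mapping.get(object_id)
```

Both functions take any iterable of text lines, so a gzipped download can be passed in directly with `gzip.open(path, "rt")`.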
2. Orthology, Homology and Paralog data
- Ace tags: ?Gene Ortholog_other, Paralog
- Contact: Michael Paulini
3. Processes
- Caltech data
- GO data
- Source 1: Download gene_association.wb.gz for C. elegans from the GO Consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
- Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases.
- Need data from these rows:
- where column 9 has value 'P' (Process)
- column 2 (DB_Object ID): the associated gene, given as a UniProt ID or WBGene ID
- column 3 (DB_Object symbol), e.g. wht-7
- column 5: GO ID (the GO term), e.g. GO:0000346
- column 6: DB:Reference, e.g. PMID:12062106, GO_REF:0000002
- column 7: Evidence code, e.g. IMP
- GO data in Postgres
- Paper -- gop_paper
- WBGene -- gop_wbgene
- GO -- gop_goontology
- GO Term -- gop_goid
- Contact Person: Kimberly, Ranjana
- Source 2: Topic data
- OA field: Gene, PG table name: pro_wbgene
- OA field: WBPaper, PG table name: pro_paper
- Contact Person: Karen
4. Pathway (No database source for now?)
5. Mutant Phenotypes
- Caltech data
- Source 1: Phenotype OA, PG table name: (phenotypes are attached to variations, not genes)
- Source 2: Acedb tags: under ?Gene, Reference_allele ?Variation and Allele ?Variation; under ?Variation, Phenotype
- Source 3: phenotype_association file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
- File name: phenotype_association.WS243.wb
- Use the following rows and columns:
- Ignore rows with 'NOT' for this search.
- column 2: associated gene, WormBase Gene ID
- column 3: DB_Object symbol, e.g. aap-1
- column 5: Phenotype ID, e.g. WBPhenotype:0000674
- column 6: DB:Reference, e.g. WB_REF:WBPaper00032243 or WB:WBVar00249743
- Contact Person: Karen
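Reading the phenotype_association file (Source 3) might look like the sketch below. It assumes the file is tab-separated with the same column layout as the gene association file, and that 'NOT' appears in the qualifier field (column 4); the text above does not say which column carries 'NOT', so that part is an assumption. The function name is hypothetical.

```python
from collections import defaultdict

def phenotypes_by_gene(lines):
    """Group (phenotype ID, reference) pairs by WormBase Gene ID, skipping NOT rows."""
    result = defaultdict(set)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue
        # Assumption: the qualifier is column 4 and may hold several '|'-separated values.
        if "NOT" in cols[3].split("|"):
            continue
        gene_id = cols[1]    # column 2: WormBase Gene ID
        phenotype = cols[4]  # column 5: Phenotype ID
        reference = cols[5]  # column 6: DB:Reference
        result[gene_id].add((phenotype, reference))
    return result
```

Keeping the reference next to each phenotype ID lets a generated sentence be traced back to the paper or variation that supports it.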
6. Genetic Interaction and 7. Physical Interaction
- Caltech data
- Source 1: gene_association.wb
- ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
- File name: gene_association.WS243.wb.c_elegans
- Rows with 'IGI' in column 7 indicate a genetic interaction between the WBGenes in columns 2/3 and column 8
- Rows with 'IPI' in column 7 indicate a physical interaction between the WBGenes in columns 2/3 and column 8
- Source 2: Interaction OA and tables
- "Field Name" = Postgres Table:
- "Paper" = int_paper
- "Interaction Type" = int_type
- "Bait overlapping gene" = int_genebait
- "Target overlapping gene" = int_genetarget
- "Non-directional Gene(s)" = int_genenondir
- "Effector Gene(s)" = int_geneone
- "Affected Gene(s)" = int_genetwo
Example statements:
If int_type = "Physical"
<int_genebait> interacts physically with <int_genetarget> (and vice versa)
If int_type = "Genetic - Synthetic ( Synthetic )"
<int_genenondir> interacts with <other int_genenondir(s)> in a synthetic genetic interaction
If int_type = "Genetic - Suppression ( Suppression )"
<int_geneone> genetically suppresses <int_genetwo>
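The example statements above, together with the IGI/IPI rule for the gene association file, can be sketched as follows. The record shape is an assumption: a dict keyed by the Postgres table names, with int_genenondir holding a list of gene names. Only the three interaction types shown above are handled, and both function names are hypothetical.

```python
def interaction_kind(evidence_code):
    """Classify a gene association row by its column-7 evidence code."""
    return {"IGI": "genetic", "IPI": "physical"}.get(evidence_code)

def interaction_statement(record):
    """Render one Interaction OA record as a sentence, or None if the type is not covered."""
    int_type = record.get("int_type", "")
    if int_type == "Physical":
        return (f"{record['int_genebait']} interacts physically with "
                f"{record['int_genetarget']}")
    if int_type == "Genetic - Synthetic ( Synthetic )":
        genes = record.get("int_genenondir", [])
        if len(genes) < 2:
            return None  # a synthetic interaction needs at least two genes
        others = ", ".join(genes[1:])
        return f"{genes[0]} interacts with {others} in a synthetic genetic interaction"
    if int_type == "Genetic - Suppression ( Suppression )":
        return f"{record['int_geneone']} genetically suppresses {record['int_genetwo']}"
    return None
```

Physical and synthetic interactions are symmetric, so the reverse statement ("and vice versa") could be produced by swapping the two genes before rendering.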
8. Gene regulation
- Caltech data
- Source: Gene regulation data in genereg OA
- Positive_regulate Anatomy_term "<grg_pos_anatomy>"
- Positive_regulate Life_stage "<grg_pos_lifestage>"
- Positive_regulate Subcellular_localization "<grg_pos_scl>"
- Positive_regulate Subcellular_localization_text "<grg_pos_scltext>"
- Negative_regulate Anatomy_term "<grg_neg_anatomy>"
- Negative_regulate Life_stage "<grg_neg_lifestage>"
- Negative_regulate Subcellular_localization "<grg_neg_scl>"
- Negative_regulate Subcellular_localization_text "<grg_neg_scltext>"
- Does_not_regulate Anatomy_term "<grg_not_anatomy>"
- Does_not_regulate Life_stage "<grg_not_lifestage>"
- Does_not_regulate Subcellular_localization "<grg_subcellloc>"
- Does_not_regulate Subcellular_localization_text "<grg_not_scltext>"
- Trans_regulated_gene "<grg_transregulated>"
- Trans_regulator_gene "<grg_transregulator>"
- No Subdata Result "<grg_result>"
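Each entry above pairs a tag with the Postgres table that supplies its value. As a rough sketch, the tag lines for one genereg record could be filled in as below. Only a subset of the tags is included, the record is assumed to be a dict keyed by table name, and the function name and output order are illustrative choices, not part of the source.

```python
# Tag templates taken from the list above (subset), keyed by Postgres table name.
TAG_TEMPLATES = [
    ("grg_transregulator", 'Trans_regulator_gene "{}"'),
    ("grg_transregulated", 'Trans_regulated_gene "{}"'),
    ("grg_pos_anatomy", 'Positive_regulate Anatomy_term "{}"'),
    ("grg_pos_lifestage", 'Positive_regulate Life_stage "{}"'),
    ("grg_neg_anatomy", 'Negative_regulate Anatomy_term "{}"'),
    ("grg_neg_lifestage", 'Negative_regulate Life_stage "{}"'),
    ("grg_not_anatomy", 'Does_not_regulate Anatomy_term "{}"'),
    ("grg_not_lifestage", 'Does_not_regulate Life_stage "{}"'),
]

def genereg_tag_lines(record):
    """Return the filled-in tag lines for the fields that have a value in `record`."""
    lines = []
    for table, template in TAG_TEMPLATES:
        value = record.get(table)
        if value:  # skip tables with no data for this record
            lines.append(template.format(value))
    return lines
```

The regulator and regulated gene from the same record, combined with whichever Positive_regulate or Negative_regulate field is filled, would be the raw material for a gene regulation sentence.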
9. Molecular Function
- Caltech data: GO Molecular Function
- Source 1: Download gene_association.wb.gz for C. elegans from the GO Consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
- Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases.
- Need data from these rows:
- where column 9 has value 'F' (Molecular Function)
- column 2 (DB_Object ID): the associated gene, given as a UniProt ID or WBGene ID
- column 3 (DB_Object symbol), e.g. wht-7
- column 5: GO ID (the GO term), e.g. GO:0000346
- column 6: DB:Reference, e.g. PMID:12062106, GO_REF:0000002
- column 7: Evidence code, e.g. IDA
- Contact Person: Kimberly, Ranjana
10. Tissue expression and life stage
- Caltech data
- Source 1: Expression data
- OA (exprpat), PG table names:
- exp_anatomy for anatomy terms
- exp_goid for subcellular localization
- exp_lifestage for life stage
- exp_paper for paper
- exp_gene for gene
- Contact Person: Daniela
11. Sub-cellular localization
- Caltech data
- GO data
- Source 1: gene_association file for C. elegans at: http://www.geneontology.org/GO.current.annotations.shtml?all
- Need data from these rows:
- where column 9 has value 'C' (Cellular Component)
- column 2: associated gene, given as a UniProt ID or WormBase Gene ID; UniProt IDs need to be translated to WormBase Gene IDs
- column 3: DB_Object symbol, e.g. wht-7
- column 5: GO ID, e.g. GO:0000346
- column 7: Evidence code, e.g. IDA
Publications related to Text-mining methods
- Automatically generating gene summaries from biomedical literature.
Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.
Pac Symp Biocomput. 2006:40-51.
PMID:17094226
- Generating gene summaries from biomedical literature: A study of semi-structured summarization.
Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.
Information Processing and Management. 2007;43:1777–1791.