− | ==Generating gene sets with and without concise descriptions==
| |
| | | |
− | ====Set of genes with a concise description====
| |
− | Query for all genes with a concise description from Postgres:
| |
− | Relevant Postgres table names:
| |
− | *con_wbgene: Stores the WBGene ID and gene names
| |
− | *con_desctype: Type of description (relevant for us: Concise_description)
| |
− | *con_desctext: Text of the concise description
| |
− |
| |
− | Query for all WBGenes that have a concise description (in con_desctext AND con_desctype):
| |
− |
| |
− | SELECT DISTINCT(con_wbgene) FROM con_wbgene WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL) AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL) ORDER BY con_wbgene;
| |
− |
| |
− | *Number of genes with a concise description (as of 05.07.2014): 6,624
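The query above can be tried on a toy copy of the three tables. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of the real Postgres instance. The table layout (a joinkey column plus a value column named after the table) is inferred from the query, and every row below is made up for illustration.

```python
import sqlite3

# Same logic as the query above, written with DISTINCT as a keyword.
QUERY = """
SELECT DISTINCT con_wbgene FROM con_wbgene
WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
  AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
ORDER BY con_wbgene;
"""


def build_toy_db():
    """Create an in-memory database with hypothetical rows (not real curation data)."""
    conn = sqlite3.connect(":memory:")
    for table in ("con_wbgene", "con_desctype", "con_desctext"):
        conn.execute(f"CREATE TABLE {table} (joinkey TEXT, {table} TEXT)")
    conn.executemany(
        "INSERT INTO con_wbgene VALUES (?, ?)",
        [("1", "WBGene00000001"), ("2", "WBGene00000002"), ("3", "WBGene00000003")],
    )
    conn.executemany(
        "INSERT INTO con_desctype VALUES (?, ?)",
        [("1", "Concise_description"), ("2", "Concise_description"), ("3", None)],
    )
    conn.executemany(
        "INSERT INTO con_desctext VALUES (?, ?)",
        [("1", "toy description"), ("2", None), ("3", "toy description")],
    )
    return conn


def genes_with_description(conn):
    """Return the gene IDs that have both a description type and a description text."""
    return [row[0] for row in conn.execute(QUERY)]
```

Note that the query only checks that con_desctype is not NULL. If other description types share these tables, an equality test on 'Concise_description' may be needed; that depends on the actual data and is not confirmed here.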
| |
− |
| |
− | ====Set of genes with no concise description====
| |
− | ====Set of genes with no concise description and at least one published paper====
| |
− |
| |
− | ==Location of project-related files on Textpresso==
| |
− | http://textpresso-dev.caltech.edu/concise_descriptions/
| |
− |
| |
− |
| |
− | ==Semantic categories in a Concise Description==
| |
− | 1. Molecular identity <br />
| |
− | 2. Orthology/Similarity <br />
| |
− | 3. Mutant Phenotypes <br />
| |
− | 4. Processes <br />
| |
− | 5. Pathways <br />
| |
− | 6. Genetic Interaction <br />
| |
− | 7. Physical Interaction <br />
| |
− | 8. Gene regulation data <br />
| |
− | 9. Molecular Function <br />
| |
− | 10. Tissue expression (may include life-stage) <br />
| |
− | 11. Sub-cellular localization (may include life-stage) <br />
| |
− |
| |
− | ==Data mining (mining data from Postgres and/or Acedb) for the semantic categories==
| |
− | '''1. Molecular identity'''
| |
− | *Caltech and non-Caltech data?
| |
− | *Source 1: GO data
| |
− | **Download gene_association.wb.gz for C. elegans from the GO Consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
| |
− | **Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases
| |
− | **Need data from these rows:
| |
− | ***where column 9 has value 'F' (Molecular Function)
| |
− | ***column 2 (DB_Object ID): the associated gene, as a UniProt ID or WormBase gene ID; UniProt IDs need to be translated to WormBase gene IDs
| |
− | ***column 3: DB_Object symbol, e.g., wht-7
| |
− | ***column 5: GO ID, e.g., GO:0000346
| |
− | ***column 6: DB:Reference, e.g., PMID:12062106, GO_REF:0000002
| |
− | ***column 7: Evidence code, e.g., IMP
| |
− | ***column 15: Assigned By, e.g., WB (the database that created the annotation)
| |
− | **To translate UniProt IDs to WormBase gene IDs, use the file gp2protein.wb.gz at http://www.geneontology.org/gp2protein/
| |
− | *Source 2: Homology data, see #2
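The column selection described for the GO annotation file can be sketched as follows. This is a minimal illustration, assuming the file is tab-delimited with '!' comment lines (as in the GO annotation file format); the function name and the dictionary keys are hypothetical, not part of any existing pipeline.

```python
from typing import Dict, Iterable, List


def extract_annotations(lines: Iterable[str], aspect: str = "F") -> List[Dict[str, str]]:
    """Keep rows whose column 9 equals `aspect` and return the columns listed above."""
    records = []
    for line in lines:
        # Header/comment lines start with '!'; skip them and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Column numbers in the text are 1-based; list indexes are 0-based.
        if len(cols) < 15 or cols[8] != aspect:
            continue
        records.append(
            {
                "object_id": cols[1],     # column 2: UniProt ID or WBGene ID
                "symbol": cols[2],        # column 3: e.g. wht-7
                "go_id": cols[4],         # column 5: e.g. GO:0000346
                "reference": cols[5],     # column 6: e.g. PMID:12062106
                "evidence": cols[6],      # column 7: e.g. IMP
                "assigned_by": cols[14],  # column 15: e.g. WB
            }
        )
    return records
```

Passing aspect="P" or aspect="C" selects Process or Cellular Component rows in the same way.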
| |
− |
| |
− | '''2. Orthology, Homology and Paralog data'''
| |
− | *Ace tags: ?Gene Ortholog_other, Paralog
| |
− | *Contact: Michael Paulini
| |
− |
| |
− | '''3. Processes'''
| |
− | *Caltech data
| |
− | *GO data
| |
− | *Source 1: Download gene_association.wb.gz for C. elegans from GO consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
| |
− | **Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases
| |
− | **Need data from these rows:
| |
− | ***where column 9 has value 'P' (Process)
| |
− | ***column 2 (DB_Object ID): the associated gene, as a UniProt ID or WBGene ID
| |
− | ***column 3 (DB_Object symbol), e.g., wht-7
| |
− | ***column 5: GO ID, e.g., GO:0000346 (GO terms come from this column)
| |
− | ***column 6: DB:Reference, e.g., PMID:12062106, GO_REF:0000002
| |
− | ***column 7: Evidence code, e.g., IMP
| |
− | ***column 8: With, e.g., 'WB:WBRNAi00000785|WBPhenotype:0000050'
| |
− |
| |
− | *GO data in postgres
| |
− | **Paper -- gop_paper
| |
− | **WBGene -- gop_wbgene
| |
− | **GO -- gop_goontology
| |
− | **GO Term -- gop_goid
| |
− | *Contact Person: Kimberly, Ranjana
| |
− | *Source 2: Topic data
| |
− | **OA field: Gene, PG table name: pro_wbgene
| |
− | **OA field: WBPaper, PG table name: pro_paper
| |
− | *Contact Person: Karen
| |
− |
| |
− | '''4. Pathway'''
| |
− | (No database source for now?)
| |
− |
| |
− | '''5. Mutant Phenotypes'''
| |
− | *Caltech data
| |
− | *Source 1: Phenotype OA, PG table name: (phenotypes are attached to variations, not genes)
| |
− | *Source 2: Acedb tags: under ?Gene, Reference_allele ?Variation and Allele ?Variation; under ?Variation, Phenotype
| |
− |
| |
− | *Source 3: phenotype_association file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
| |
− | **File name: phenotype_association.WS243.wb
| |
− | **Need data from these rows:
| |
− | ***Rows with 'NOT' should be ignored for this search.
| |
− | ***column 2: associated gene, WormBase gene ID
| |
− | ***column 3: DB_Object symbol, e.g., aap-1
| |
− | ***column 5: Phenotype ID, e.g., WBPhenotype:0000674
| |
− | ***column 6: DB:Reference, e.g., WB_REF:WBPaper00032243 or WB:WBVar00249743
| |
− | *Contact Person: Karen
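The row filter for the phenotype_association file might look like the sketch below. It assumes the file is tab-delimited like the GO annotation file and that 'NOT' appears in the qualifier column (column 4); neither point is stated above, so both should be checked against the actual file. The function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def phenotypes_by_gene(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group (symbol, phenotype ID, reference) tuples by WormBase gene ID, skipping NOT rows."""
    result: Dict[str, List[Tuple[str, str, str]]] = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue
        # Assumption: the NOT qualifier is in column 4, possibly pipe-separated.
        if "NOT" in cols[3].split("|"):
            continue
        gene_id, symbol, phenotype_id, reference = cols[1], cols[2], cols[4], cols[5]
        result.setdefault(gene_id, []).append((symbol, phenotype_id, reference))
    return result
```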
| |
− |
| |
− | '''6. Genetic Interaction and 7. Physical Interaction'''
| |
− | *Caltech data
| |
− | *Source 1: gene_association.wb
| |
− | **ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
| |
− | **File name:gene_association.WS243.wb.c_elegans
| |
− | **Rows with 'IGI' in column 7 indicate a genetic interaction between the WBGenes in columns 2/3 and column 8
| |
− | **Rows with 'IPI' in column 7 indicate a physical interaction between the WBGenes in columns 2/3 and column 8
| |
− | *Source 2: Interaction OA and tables
| |
− | **"Field Name" = Postgres Table:
| |
− | **"Paper" = int_paper
| |
− | **"Interaction Type" = int_type
| |
− | **"Bait overlapping gene" = int_genebait
| |
− | **"Target overlapping gene" = int_genetarget
| |
− | **"Non-directional Gene(s)" = int_genenondir
| |
− | **"Effector Gene(s)" = int_geneone
| |
− | **"Affected Gene(s)" = int_genetwo
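The IGI/IPI rows described under Source 1 can be pulled out of the gene_association file with something like the following sketch. It assumes tab-delimited rows and treats the pipe character in column 8 as a separator between partners; comma-joined entries are left as they are. The function name and output tuple layout are hypothetical.

```python
from typing import Iterable, List, Tuple

# Evidence code in column 7 -> kind of interaction, as described above.
INTERACTION_KINDS = {"IGI": "genetic", "IPI": "physical"}


def interaction_pairs(lines: Iterable[str]) -> List[Tuple[str, str, str, str]]:
    """Return (kind, gene ID, gene symbol, partner) for every IGI or IPI row."""
    pairs = []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[6] not in INTERACTION_KINDS:
            continue
        # Column 8 (With) holds the interacting partner(s).
        for partner in cols[7].split("|"):
            if partner:
                pairs.append((INTERACTION_KINDS[cols[6]], cols[1], cols[2], partner))
    return pairs
```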
| |
− |
| |
− | Example statements:
| |
− |
| |
− | If int_type = "Physical"
| |
− |
| |
− | <int_genebait> interacts physically with <int_genetarget> (and vice versa)
| |
− |
| |
− |
| |
− | If int_type = "Genetic - Synthetic ( Synthetic )"
| |
− |
| |
− | <int_genenondir> interacts with <other int_genenondir(s)> in a synthetic
| |
− | genetic interaction
| |
− |
| |
− |
| |
− | If int_type = "Genetic - Suppression ( Suppression )"
| |
− |
| |
− | <int_geneone> genetically suppresses <int_genetwo>
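The three example statements above can be produced from one interaction record with a small dispatch on int_type. This is only a sketch: the record is assumed to be a dictionary keyed by the Postgres table names, with gene lists as values, and the gene names used in any test data are invented.

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[str, List[str]]]


def join_names(names: List[str]) -> str:
    """Join names as 'a', 'a and b' or 'a, b and c'."""
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def interaction_statement(record: Record) -> Optional[str]:
    """Build a statement for the three interaction types shown above; None otherwise."""
    int_type = record.get("int_type", "")
    if int_type == "Physical":
        bait = join_names(list(record["int_genebait"]))
        target = join_names(list(record["int_genetarget"]))
        return f"{bait} interacts physically with {target}"
    if int_type == "Genetic - Synthetic ( Synthetic )":
        genes = list(record["int_genenondir"])
        if len(genes) < 2:
            return None  # a non-directional interaction needs at least two genes
        return f"{genes[0]} interacts with {join_names(genes[1:])} in a synthetic genetic interaction"
    if int_type == "Genetic - Suppression ( Suppression )":
        effector = join_names(list(record["int_geneone"]))
        affected = join_names(list(record["int_genetwo"]))
        return f"{effector} genetically suppresses {affected}"
    return None
```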
| |
− |
| |
− | '''8. Gene regulation'''
| |
− | *Caltech data
| |
− | *Source: Gene regulation data in genereg OA
| |
− | **Positive_regulate Anatomy_term "<grg_pos_anatomy>"
| |
− | **Positive_regulate Life_stage "<grg_pos_lifestage>"
| |
− | **Positive_regulate Subcellular_localization "<grg_pos_scl>"
| |
− | **Positive_regulate Subcellular_localization_text "<grg_pos_scltext>"
| |
− | **Negative_regulate Anatomy_term "<grg_neg_anatomy>"
| |
− | **Negative_regulate Life_stage "<grg_neg_lifestage>"
| |
− | **Negative_regulate Subcellular_localization "<grg_neg_scl>"
| |
− | **Negative_regulate Subcellular_localization_text "<grg_neg_scltext>"
| |
− | **Does_not_regulate Anatomy_term "<grg_not_anatomy>"
| |
− | **Does_not_regulate Life_stage "<grg_not_lifestage>"
| |
− | **Does_not_regulate Subcellular_localization "<grg_subcellloc>"
| |
− | **Does_not_regulate Subcellular_localization_text "<grg_not_scltext>"
| |
− | **Trans_regulated_gene "<grg_transregulated>"
| |
− | **Trans_regulator_gene "<grg_transregulator>"
| |
− | **No Subdata Result "<grg_result>"
| |
− |
| |
− | '''9. Molecular Function'''
| |
− | *Caltech data: GO Molecular Function
| |
− | *Source 1: Download gene_association.wb.gz for C. elegans from GO consortium: http://www.geneontology.org/GO.current.annotations.shtml?all
| |
− | **Rows with 'WB' in column 15 (Assigned By) are WormBase annotations; rows with 'UniProtKB' or 'InterPro' come from those databases
| |
− | **Need data from these rows:
| |
− | ***where column 9 has value 'F' (Molecular Function)
| |
− | ***column 2 (DB_Object ID): the associated gene, as a UniProt ID or WBGene ID
| |
− | ***column 3 (DB_Object symbol), e.g., wht-7
| |
− | ***column 5: GO ID, e.g., GO:0000346 (GO terms come from this column)
| |
− | ***column 6: DB:Reference, e.g., PMID:12062106, GO_REF:0000002
| |
− | ***column 7: Evidence code, e.g., IDA
| |
− | *Contact Person: Kimberly, Ranjana
| |
− |
| |
− | '''10. Tissue expression and life stage'''
| |
− | *Caltech data
| |
− | *Source 1: Expression data
| |
− | *OA (exprpat), PG table names:
| |
− | **exp_anatomy for anatomy terms
| |
− | **exp_goid for sub-cellular localization
| |
− | **exp_lifestage for life stage
| |
− | **exp_paper for paper
| |
− | **exp_gene for gene
| |
− | *Contact Person: Daniela
| |
− |
| |
− | '''11. Sub-cellular localization'''
| |
− | *Caltech data
| |
− | *GO data
| |
− | *Source 1: gene_association file for C. elegans at: http://www.geneontology.org/GO.current.annotations.shtml?all
| |
− | **Need data from these rows:
| |
− | ***where column 9 has value 'C' (Cellular Component)
| |
− | ***column 2 (DB_Object ID): the associated gene, as a UniProt ID or WormBase gene ID; UniProt IDs need to be translated to WormBase gene IDs
| |
− | ***column 3: DB_Object symbol, e.g., wht-7
| |
− | ***column 5: GO ID, e.g., GO:0000346
| |
− | ***column 7: Evidence code, e.g., IDA
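The translation of UniProt IDs in column 2 to WormBase gene IDs could be sketched as below. The layout assumed for the gp2protein file (one gene per line as 'WB:WBGene...', a tab, then one or more 'UniProtKB:...' entries separated by semicolons) is an assumption and should be verified against the downloaded file. Both function names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def load_uniprot_to_wbgene(lines: Iterable[str]) -> Dict[str, str]:
    """Build a UniProt accession -> WBGene ID lookup from gp2protein-style lines."""
    mapping: Dict[str, str] = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        gene_id = parts[0].split(":", 1)[-1]  # drop the 'WB:' prefix
        for entry in parts[1].split(";"):
            database, _, accession = entry.partition(":")
            if database == "UniProtKB" and accession:
                mapping[accession] = gene_id
    return mapping


def to_wbgene(object_id: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the WBGene ID for a column-2 value, or None if it cannot be translated."""
    if object_id.startswith("WBGene"):
        return object_id
    return mapping.get(object_id)
```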
| |
− |
| |
− | Rule 1:
| |
− |
| |
− | ==Template for a Concise Description==
| |
− | For the test phase, order of sentences:
| |
− | *Orthology
| |
− | *Process
| |
− | *Component
| |
− | *Function/identity
| |
− |
| |
− |
| |
− |
| |
− | '''Orthology/Similarity'''
| |
− | <Gene> is (orthologous, similar) to <human gene>;
| |
− | '''Phenotypes'''
| |
− | <Gene> mutants exhibit the following phenotypes: <phenotypes>.
| |
− | '''Process/Pathway'''
| |
− | <Gene> is (required, functions, regulates, is involved in, is part of) <process>;
| |
− | '''Genetic interaction with respect to Process or Pathway'''
| |
− | <Gene> interacts genetically with <gene1, gene2>;
| |
− | '''Physical interaction'''
| |
− | <Protein> physically interacts with (protein, DNA, RNA) .....;
| |
− | '''Molecular Function'''
| |
− | <Protein> has <molecular function>..... activity in (in vitro, in vivo);
| |
− | '''Tissue Expression and sub-cellular localization'''
| |
− | <Gene/Protein> is expressed in <tissue> and localized to <GO cellular component>;
| |
− |
| |
− | Note: Not all descriptions may follow the exact order or choice of words.
| |
− |
| |
− | ==Rules for automated sentence construction==
| |
− | '''Homology'''
| |
− |
| |
− | <Gene> encodes an ortholog of human <human protein name>;
| |
− |
| |
− | '''Process'''
| |
− | *Rule 1: Ignore all IEA and ISS process terms
| |
− | *Rule 2: Exclusions:
| |
− | **Ignore the term 'reproduction'[IMP]
| |
− | **Ignore the term 'embryo development ending in birth or egg hatching'[IMP]
| |
− | *Rule 3: If a GO term contains the words 'involved in' anywhere (beginning, middle or end), build the sentence with 'functions in' instead of 'is involved in'
| |
− | **Data: WBGene00006495,cpna-1,striated muscle contraction involved in embryonic body morphogenesis[IMP],WB_REF:WBPaper00041875|PMID:23283987
| |
− | **Sentence: cpna-1 functions in striated muscle contraction involved in embryonic body morphogenesis;
| |
− | *Rule 4: For the GO term 'synaptic transmission, <word>' switch the order of words to make it '<word> synaptic transmission'.
| |
− | *Example for Rule 4:
| |
− | **WBGene00011488,nra-2, synaptic transmission, cholinergic[IMP],WB_REF:WBPaper00034730|PMID:19609303,,WB,locomotion[IMP],protein processing[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR008710,regulation of signal transduction[IEA],INTERPRO:IPR016574
| |
− | **nra-2 is involved in cholinergic synaptic transmission and locomotion.
| |
− |
| |
− | *Rule 5: For all other Process terms the sentence will be:
| |
− | **<Gene> is involved in <process term>;
| |
− | **Examples:
| |
− | **WBGene00014123,elpc-3,tRNA wobble uridine modification[IMP],WB_REF:WBPaper00034713|PMID:19593383,,WB,translation[IMP],spermatogenesis[IGI],olfactory learning[IMP],vulval development[IGI],embryonic morphogenesis[IGI],oocyte development[IGI]
| |
− | **Sentence: elpc-3 is involved in tRNA wobble uridine modification, translation, spermatogenesis, olfactory learning, vulval development and embryonic morphogenesis.
| |
− |
| |
− | *Rule 6: For the GO term 'molting cycle, collagen and cuticulin-based cuticle', drop the comma and the words 'collagen and cuticulin-based cuticle'
| |
− | **Example:
| |
− | **WBGene00016643,vps-45,molting cycle, collagen and cuticulin-based cuticle[IMP],WB_REF:WBPaper00029049|PMID:17235359,,WB,vesicle docking involved in exocytosis[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR001619,vesicle-mediated transport[IEA]
| |
− | **Sentence: vps-45 is involved in '''the''' molting cycle;
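Process Rules 1 to 6 can be combined into a single function, sketched below. Two points are assumptions rather than stated rules: when a gene has both 'functions in' and 'is involved in' terms the two clauses are joined with 'and', and every sentence ends with a semicolon. The function and constant names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

SKIPPED_EVIDENCE = {"IEA", "ISS"}  # Rule 1
EXCLUDED = {                       # Rule 2
    ("reproduction", "IMP"),
    ("embryo development ending in birth or egg hatching", "IMP"),
}
SYNAPTIC_PREFIX = "synaptic transmission, "                         # Rule 4
MOLTING_TERM = "molting cycle, collagen and cuticulin-based cuticle"  # Rule 6


def join_terms(terms: List[str]) -> str:
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def process_sentence(gene: str, annotations: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Build a Process sentence from (GO term, evidence code) pairs; None if nothing is left."""
    functions_in: List[str] = []
    involved_in: List[str] = []
    for term, evidence in annotations:
        term = term.strip()
        if evidence in SKIPPED_EVIDENCE or (term, evidence) in EXCLUDED:
            continue
        if term == MOLTING_TERM:
            involved_in.append("the molting cycle")
        elif term.startswith(SYNAPTIC_PREFIX):
            involved_in.append(term[len(SYNAPTIC_PREFIX):] + " synaptic transmission")
        elif "involved in" in term:  # Rule 3
            functions_in.append(term)
        else:                        # Rule 5
            involved_in.append(term)
    clauses = []
    if functions_in:
        clauses.append("functions in " + join_terms(functions_in))
    if involved_in:
        clauses.append("is involved in " + join_terms(involved_in))
    if not clauses:
        return None
    return f"{gene} " + " and ".join(clauses) + ";"
```

With the nra-2 data from the Rule 4 example, the two IEA terms are dropped and the remaining terms give 'nra-2 is involved in cholinergic synaptic transmission and locomotion;'.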
| |
− |
| |
− | '''Molecular identity/function'''
| |
− | *Rule 1: If the evidence code is 'IEA' and an 'activity' term is present, add the words 'based on protein domain information' to the sentence:
| |
− | **Sentence: <Gene> is predicted to have <activity term>, based on protein domain information.
| |
− | *Examples:
| |
− | **WBGene00000108,alh-2,oxidoreductase activity[IEA],PMID:12520011|PMID:12654719,INTERPRO:IPR015590,WB,oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor[IEA],INTERPRO:IPR016163
| |
− | alh-2 is predicted to have
| |
− |
| |
− |
| |
− | *Rule 2: If the evidence code is 'ISS', add the words 'based on sequence information' to the sentence.
| |
− | *Rule 3: If the evidence code is 'IDA' or 'IMP', add the words 'based on experimental evidence' to the sentence.
| |
− | *Rule 3 exclusion list:
| |
− | **Ignore the term 'protein binding'
| |
− |
| |
− | **Sentence: <Gene> exhibits <activity term, binding term>, based on experimental evidence.
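The evidence-code rules for Molecular identity/function can be sketched for a single term as follows. Several choices here are assumptions, not stated rules: ISS terms reuse the 'is predicted to have' wording, IEA terms without the word 'activity' are skipped, 'protein binding' is ignored for every evidence code, and other evidence codes produce no sentence. The function name is hypothetical.

```python
from typing import Optional

BASIS = {
    "IEA": "based on protein domain information",  # Rule 1
    "ISS": "based on sequence information",        # Rule 2
    "IDA": "based on experimental evidence",       # Rule 3
    "IMP": "based on experimental evidence",       # Rule 3
}
EXPERIMENTAL = {"IDA", "IMP"}


def function_sentence(gene: str, term: str, evidence: str) -> Optional[str]:
    """Build a Molecular Function sentence for one GO term, or None if the term is skipped."""
    if term == "protein binding" or evidence not in BASIS:
        return None
    if evidence == "IEA" and "activity" not in term:
        return None  # assumption: Rule 1 only covers 'activity' terms
    verb = "exhibits" if evidence in EXPERIMENTAL else "is predicted to have"
    return f"{gene} {verb} {term}, {BASIS[evidence]}."
```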
| |
− |
| |
− |
| |
− | '''Component'''
| |
− | *Rule 1: Ignore all IEA and ISS GO terms; use only terms with other evidence codes
| |
− | **Sentence: <Gene> is localized to <component term>;
| |
− | *Rule 2: For 'integral component of ....' terms, add the words 'is an' in place of 'is localized to';
| |
− | **Example for Rule 2: WBGene00006319,sup-10,integral component of plasma membrane[IDA],WB_REF:WBPaper00006135|PMID:14534247,,WB,striated muscle dense body[IDA]
| |
− | **Sentence: sup-10 is an integral component of plasma membrane and is localized to striated muscle dense body;
| |
− | *Examples
| |
− | **WBGene00023405,sor-1,nucleoplasm[IDA],WB_REF:WBPaper00027128|PMID:16501168,,WB,nuclear speck[IDA]
| |
− | **Sentence: sor-1 is localized to the nucleoplasm and nuclear speck;
| |
− |
| |
− | **WBGene00004681,rsd-2,nucleolus[IDA],WB_REF:WBPaper00044261|PMID:18430922,,WB,endoplasmic reticulum[IDA],cytosol[IDA]
| |
− | **Sentence: rsd-2 is localized to the nucleolus, endoplasmic reticulum and cytosol;
| |
− |
| |
− | **WBGene00021827,dnc-6,dynactin complex[IDA],WB_REF:WBPaper00037699|PMID:20964796,,WB
| |
− | **Sentence: dnc-6 is localized to the dynactin complex;
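The two Component rules can be sketched as one function. The examples above are not consistent about the article 'the' before a component term, so this sketch leaves articles out entirely; that is a simplification, and its output differs from the sor-1, rsd-2 and dnc-6 sentences by the missing 'the'. The function name is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

SKIPPED_EVIDENCE = {"IEA", "ISS"}  # Rule 1
INTEGRAL_PREFIX = "integral component of"  # Rule 2


def join_terms(terms: List[str]) -> str:
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    if len(terms) <= 1:
        return "".join(terms)
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def component_sentence(gene: str, annotations: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Build a Component sentence from (GO term, evidence code) pairs; None if nothing is left."""
    integral: List[str] = []
    localized: List[str] = []
    for term, evidence in annotations:
        term = term.strip()
        if evidence in SKIPPED_EVIDENCE:
            continue
        if term.startswith(INTEGRAL_PREFIX):
            integral.append(term)
        else:
            localized.append(term)
    clauses = []
    if integral:
        clauses.append("is an " + join_terms(integral))
    if localized:
        clauses.append("is localized to " + join_terms(localized))
    if not clauses:
        return None
    return f"{gene} " + " and ".join(clauses) + ";"
```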
| |
− |
| |
− | ==Publications related to Text-mining methods==
| |
− | *Automatically generating gene summaries from biomedical literature.
| |
− |
| |
− | Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.
| |
− |
| |
− | Pac Symp Biocomput. 2006:40-51.
| |
− |
| |
− | PMID:17094226
| |
− |
| |
− | *Generating gene summaries from biomedical literature: A study of semi-structured summarization
| |
− |
| |
− | Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz
| |
− |
| |
− | Information Processing and Management 43 (2007) 1777–1791
| |
− |
| |
− |
| |
| |