Textpresso-based automated extraction of concise descriptions
Revision as of 20:56, 28 May 2014
Contents
- 1 Generating gene sets with and without concise descriptions
- 2 Location of project-related files on Textpresso
- 3 Semantic categories in a Concise Description
- 4 Template for a Concise Description
- 5 Data mining (mining data from Postgres and/or Acedb) for the semantic categories
- 6 Publications related to Text-mining methods
Generating gene sets with and without concise descriptions
Set of genes with a concise description
Query Postgres for all genes with a concise description. Relevant Postgres table names:
- con_wbgene: Stores the WBGene ID and gene names
- con_desctype: Type of description (relevant for us: Concise_description)
- con_desctext: Text of the concise description
Query for all WBGenes that have a concise description (in con_desctext AND con_desctype):
SELECT DISTINCT(con_wbgene)
FROM con_wbgene
WHERE joinkey IN (SELECT joinkey FROM con_desctext WHERE con_desctext IS NOT NULL)
  AND joinkey IN (SELECT joinkey FROM con_desctype WHERE con_desctype IS NOT NULL)
ORDER BY con_wbgene;
- Number of genes with a concise description (as of 05.07.2014): 6,624
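The query logic can be tried out without access to the curation database. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for Postgres. The table layout (a `joinkey` column plus a value column named after its table) is inferred from the query above, the helper names and toy rows are hypothetical, and the stricter filter `con_desctype = 'Concise_description'` assumes that this is the literal value stored in the table.

```python
import sqlite3


def build_toy_database():
    """Create hypothetical miniature versions of the three con_* tables."""
    conn = sqlite3.connect(":memory:")
    for table in ("con_wbgene", "con_desctype", "con_desctext"):
        # Assumed layout: joinkey plus a value column named like the table.
        conn.execute(f"CREATE TABLE {table} (joinkey TEXT, {table} TEXT)")
    conn.executemany(
        "INSERT INTO con_wbgene VALUES (?, ?)",
        [("1", "WBGene00000001"), ("2", "WBGene00000002"), ("3", "WBGene00000003")],
    )
    conn.executemany(
        "INSERT INTO con_desctype VALUES (?, ?)",
        [("1", "Concise_description"),
         ("2", "Other_description"),      # made-up type, only for the demo
         ("3", "Concise_description")],
    )
    conn.executemany(
        "INSERT INTO con_desctext VALUES (?, ?)",
        [("1", "placeholder text"), ("2", "placeholder text"), ("3", None)],
    )
    return conn


def genes_with_concise_description(conn):
    """Return gene IDs whose entry has description text and the concise type."""
    query = """
        SELECT DISTINCT con_wbgene
        FROM con_wbgene
        WHERE joinkey IN (SELECT joinkey FROM con_desctext
                          WHERE con_desctext IS NOT NULL)
          AND joinkey IN (SELECT joinkey FROM con_desctype
                          WHERE con_desctype = 'Concise_description')
        ORDER BY con_wbgene
    """
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    print(genes_with_concise_description(build_toy_database()))
```

In the toy data, the second gene is excluded because of its description type and the third because its text is NULL. Whether the production query should filter on the type value rather than on `IS NOT NULL` depends on what other description types share these tables.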
Set of genes with no concise description
Set of genes with no concise description and at least one published paper
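Both gene sets named above are set differences against the genes that already have a description. A minimal sketch, assuming the full gene list and a gene-to-papers mapping are already available from elsewhere; the function name, argument names and identifiers are hypothetical:

```python
def genes_lacking_description(all_genes, described_genes, papers_by_gene):
    """Return (genes without a description, those of them with >= 1 paper)."""
    undescribed = set(all_genes) - set(described_genes)
    # Keep only genes whose paper list exists and is non-empty.
    with_paper = {gene for gene in undescribed if papers_by_gene.get(gene)}
    return undescribed, with_paper


if __name__ == "__main__":
    undescribed, with_paper = genes_lacking_description(
        all_genes=["WBGene00000001", "WBGene00000002", "WBGene00000003"],
        described_genes=["WBGene00000001"],
        papers_by_gene={"WBGene00000002": ["WBPaper00000001"], "WBGene00000003": []},
    )
    print(sorted(undescribed), sorted(with_paper))
```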
Location of project-related files on Textpresso
http://textpresso-dev.caltech.edu/concise_descriptions/
Semantic categories in a Concise Description
1. Molecular identity
2. Orthology/Similarity
3. Mutant Phenotypes
4. Processes
5. Pathways
6. Genetic Interaction
7. Physical Interaction
8. Molecular Function
9. Tissue expression (may include life-stage)
10. Sub-cellular localization (may include life-stage)
Template for a Concise Description
- Molecular identity: <Gene> encodes .....;
- Orthology/Similarity: <Gene> is (orthologous, similar) to .....;
- Phenotypes: <Gene> mutants exhibit the following phenotypes, <phenotypes>.
- Process/Pathway: <Gene> is (required, functions, regulates, is involved in, is part of) ....., as mutants of <gene> exhibit <phenotypes>;
- Genetic interaction with respect to Process or Pathway: <Gene> interacts genetically with <gene1, gene2> ..... in <Process, Pathway>;
- Physical interaction: <Protein> physically interacts with (protein, DNA, RNA) .....;
- Molecular Function: <Protein> has ..... activity in (in vitro, in vivo) assays;
- Tissue Expression: <Gene/Protein> is expressed in ..... and expression in ..... is (positively, negatively) regulated by <Gene/Protein>.....;
- Sub-cellular localization: <Protein> is localized to <cellular component> and expression in <cellular component> is <positively, negatively> regulated by .....
Note: Not all descriptions may follow the exact order or choice of words.
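One way to picture the template is as an ordered list of sentence patterns that are filled in only when data exists for a category. The sketch below is illustrative and not the project's implementation: the category keys, the function name, and the example gene `abc-1` are hypothetical, and only a few of the simpler patterns are included.

```python
# Ordered (category key, pattern) pairs; the wording follows the template.
PATTERNS = [
    ("molecular_identity", "{gene} encodes {value}"),
    ("phenotypes", "{gene} mutants exhibit the following phenotypes, {value}"),
    ("physical_interaction", "{protein} physically interacts with {value}"),
    ("tissue_expression", "{gene} is expressed in {value}"),
    ("localization", "{protein} is localized to {value}"),
]


def concise_description(gene, protein, facts):
    """Join the clauses for which `facts` has a value, in template order."""
    clauses = [
        pattern.format(gene=gene, protein=protein, value=facts[key])
        for key, pattern in PATTERNS
        if facts.get(key)
    ]
    return "; ".join(clauses) + "." if clauses else ""


if __name__ == "__main__":
    print(concise_description(
        "abc-1", "ABC-1",
        {"molecular_identity": "a kinase", "tissue_expression": "the pharynx"},
    ))
```

Keeping the patterns in a list rather than in one large format string makes it easy to skip categories with no data, which matches the note that not every description follows the full template.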
Data mining (mining data from Postgres and/or Acedb) for the semantic categories
1. Molecular identity and 2. Orthology/Similarity (homology, orthology and paralog data)
- Non-Caltech data, curated at Hinxton
- Ace tags: ?Gene Ortholog_other, Paralog
- Contact: Michael Paulini
3. Processes
- Caltech data
- Two sources: GO data and Topic data
- Sources for GO data: GO OA and GO file from Protein2GO
  - GO OA, Postgres (PG) table name:
- Easier to use the gene_association file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
  - File name: gene_association.WS243.wb.c_elegans
  - Rows with a 'P' in column 9 indicate a GO Biological Process associated with a gene.
- Contact Person: Kimberly, Ranjana
- Topic OA:
  - OA field: Gene, PG table name: pro_wbgene
  - OA field: WBPaper, PG table name: pro_paper
- Contact Person: Karen
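Filtering the gene_association file on column 9 can be sketched as follows. This assumes the file follows the standard tab-separated GAF layout (comment lines start with `!`, gene identifier in column 2, qualifier in column 4, GO ID in column 5, aspect in column 9). The function name and the sample rows are hypothetical; the GO IDs in the sample are placeholders, not real annotations.

```python
from collections import defaultdict


def go_terms_by_gene(lines, aspect):
    """Map gene ID -> set of GO IDs for rows whose column 9 equals `aspect`.

    `aspect` is 'P' (Biological Process), 'F' (Molecular Function) or
    'C' (Cellular Component), as defined by the GAF format.
    """
    terms = defaultdict(set)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 9 or cols[8] != aspect:
            continue
        if "NOT" in cols[3].split("|"):
            continue  # a NOT qualifier negates the annotation
        terms[cols[1]].add(cols[4])
    return dict(terms)


# Made-up rows in GAF column order, truncated after column 9 for brevity.
SAMPLE_ROWS = [
    "!gaf-version: 2.0\n",
    "WB\tWBGene00000001\tabc-1\t\tGO:0000001\tWB_REF:WBPaper00000001\tIMP\t\tP\n",
    "WB\tWBGene00000001\tabc-1\t\tGO:0000002\tWB_REF:WBPaper00000001\tIDA\t\tF\n",
    "WB\tWBGene00000002\tabc-2\tNOT\tGO:0000003\tWB_REF:WBPaper00000002\tIMP\t\tP\n",
]

if __name__ == "__main__":
    print(go_terms_by_gene(SAMPLE_ROWS, "P"))
```

For a downloaded file, the same function can be called with an open file handle, for example `go_terms_by_gene(open(path), "P")`. Passing the aspect as a parameter lets the same code serve the Molecular Function ('F') and Cellular Component ('C') categories.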
4. Pathway (No database source for now?)
5. Mutant Phenotypes
- Caltech data
- Phenotype OA, PG table name: (phenotypes are attached to variations, not genes)
- Acedb tags: under ?Gene, Reference_allele ?Variation and Allele ?Variation; under ?Variation, Phenotype
- Might be easier to use the phenotype_association file: ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
  - File name: phenotype_association.WS243.wb
- Contact Person: Karen
6. Genetic Interaction and 7. Physical Interaction
- Caltech data
- Source: Interaction OA and tables
  - "Field Name" = Postgres Table:
  - "Paper" = int_paper
  - "Interaction Type" = int_type
  - "Bait overlapping gene" = int_genebait
  - "Target overlapping gene" = int_genetarget
  - "Non-directional Gene(s)" = int_genenondir
  - "Effector Gene(s)" = int_geneone
  - "Affected Gene(s)" = int_genetwo
Example statements:
If int_type = "Physical"
  <int_genebait> interacts physically with <int_genetarget> (and vice versa)
If int_type = "Genetic - Synthetic ( Synthetic )"
  <int_genenondir> interacts with <other int_genenondir(s)> in a synthetic genetic interaction
If int_type = "Genetic - Suppression ( Suppression )"
  <int_geneone> genetically suppresses <int_genetwo>
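The three example statements can be expressed as a small dispatch on `int_type`. This is a sketch under assumptions: each interaction is represented as a dict keyed by the Postgres table names listed above, the non-directional genes arrive as a list, and the function name and gene names are hypothetical. Prefix matching is used for the genetic types so that spacing differences inside the parentheses do not matter.

```python
def interaction_statement(row):
    """Return an example sentence for one interaction, or None if unhandled."""
    int_type = row["int_type"]
    if int_type == "Physical":
        return (f"{row['int_genebait']} interacts physically with "
                f"{row['int_genetarget']}")
    if int_type.startswith("Genetic - Synthetic"):
        # Assumption: at least two non-directional genes are present.
        first, *others = row["int_genenondir"]
        return (f"{first} interacts with {', '.join(others)} "
                "in a synthetic genetic interaction")
    if int_type.startswith("Genetic - Suppression"):
        return f"{row['int_geneone']} genetically suppresses {row['int_genetwo']}"
    return None  # interaction types not covered by the examples


if __name__ == "__main__":
    print(interaction_statement({
        "int_type": "Genetic - Suppression ( Suppression )",
        "int_geneone": "abc-1",
        "int_genetwo": "abc-2",
    }))
```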
8. Molecular Function
- Caltech data: GO Molecular Function
- File for GO data (for protein-encoding genes only):
- GO OA, PG table name:
- Might be easier to use the gene_association.WSXXX.c_elegans file:
  - ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
  - Rows with an 'F' in column 9 indicate a GO Molecular Function associated with a gene.
- Contact Person: Kimberly, Ranjana
9. Tissue expression and life stage
- Caltech data
- Expression OA (exprpat), PG table names:
  - exp_anatomy for anatomy terms
  - exp_goid for subcellular localization
  - exp_lifestage for life stage
  - exp_paper for paper
  - exp_gene for gene
- Contact Person: Daniela
10. Sub-cellular localization
- Caltech data
- Source: GO cellular component
- Source 1: gene_association file
  - ftp://ftp.wormbase.org/pub/wormbase/releases/WS243/ONTOLOGY/
  - File name: gene_association.WS243.wb.c_elegans
Publications related to Text-mining methods
- Automatically generating gene summaries from biomedical literature.
  Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B.
  Pac Symp Biocomput. 2006:40-51.
  PMID: 17094226
- Generating gene summaries from biomedical literature: A study of semi-structured summarization.
  Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz.
  Information Processing and Management 43 (2007) 1777–1791.