UserGuide:SimpleMine

From WormBaseWiki

Latest revision as of 23:51, 9 May 2023

SimpleMine Users' Guide

SimpleMine is designed for biologists who want essential information for a list of genes without needing any command-line or programming skills. The fields described below are what we consider "essential information", based on user feedback. Please feel free to contact us if you would like more information added to the list.

Input: Users can submit CGC names, sequence names, WormBase Gene IDs, WormPep IDs, UniProt IDs, TreeFam IDs, and RefSeq IDs.

Output: Users can opt for an HTML display or download a tab-delimited file. The output has one row per gene, and each cell contains one data field. Within a cell, information is separated at up to three tiers: commas first, then bars, then semicolons. See the explanation of each data field below for how its information is organized.
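As a rough illustration of the three delimiter tiers, the following Python sketch splits one cell of a tab-delimited download. The sample text, the gene names in it, and the helper name `split_cell` are assumptions made for illustration and are not part of SimpleMine. Fields differ in how many tiers they use, and free-text fields such as descriptions may contain ordinary commas, so apply this only to fields whose description mentions these separators.

```python
import csv
import io

def split_cell(cell):
    """Split a cell into entries (commas), fields (bars), and values (semicolons)."""
    entries = [e.strip() for e in cell.split(",") if e.strip()]
    return [[field.split(";") for field in entry.split("|")] for entry in entries]

# Hypothetical two-column excerpt of a download; gene names are made up.
sample = (
    "Public Name\tInteracting Gene\n"
    "abc-1\txyz-2|Genetic;Physical, def-3|Regulatory\n"
)

# Read the tab-delimited text with one dictionary per gene row.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
parsed = split_cell(rows[0]["Interacting Gene"])
print(parsed)
```

Each entry becomes a list of fields, and each field becomes a list of values, so a field without semicolons is a one-element list.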


Names, Identifiers, Sequences, Species

WormBase Gene ID: Unique gene identifiers used by WormBase.

Public Name: Official gene names specified by WormBase. A public name can be a CGC name or a sequence name.

Species: Each gene can only be associated with one species.

Biotype: Classification of genes, such as protein_coding_gene, ncRNA_gene, pseudogene, etc.

Sequence Name: Sequence name of the gene.

Other Name: All names that have been used by the gene in publications.

Transcript: Transcript names of the gene.

Operon: A set of genes transcribed under the control of an operator gene.

WormPep: Protein IDs used by WormBase.

Protein Domain: Protein domains associated with the gene. If a gene has multiple peptides, the peptides are comma-separated. Each peptide entry lists all of its predicted protein domains, separated by bars.

UniProt: Official Protein Identifiers used by the UniProt database.

Reference UniProt ID: There has been an attempt to assign a single, unique UniProt ID for each gene to act as a "reference" UniProt ID for the gene. This is called the Gene-Centric Reference Proteome (GCRP) ID and is usually an existing SwissProt or TrEMBL ID that has been selected to represent the gene.

TreeFam: Official gene identifiers used by the TreeFam database.

RefSeq_mRNA: Sequence IDs used by the RefSeq database.

RefSeq_protein: Protein IDs used by the RefSeq database.


Genetics, Phenotypes, Interactions

Genetic Map Position: Display the chromosome and chromosomal position of the gene.

Chromosome Coordinates: Chromosome, strand, and left and right positions of the gene on the genome. These come from the genome feature file c_elegans.PRJNA13758.WS*.annotations.gff3.gz.

Transcription Factors: Transcription factors located within 2 kb upstream of each transcript of the gene. These come from the GFF3 tracks "TF_binding_site" and "TF_binding_site_region".
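To make "upstream of each transcript" concrete, here is a minimal Python sketch of a strand-aware upstream window. It assumes the window is 2,000 bp and that coordinates are 1-based and inclusive, as in GFF3. The function names and the overlap rule are hypothetical, and the actual SimpleMine pipeline may define the window differently.

```python
def upstream_window(start, end, strand, size=2000):
    """Return the (left, right) interval lying `size` bp upstream of a transcript.

    Coordinates are 1-based and inclusive. On the "+" strand, upstream means
    smaller coordinates; on the "-" strand, upstream means larger coordinates.
    """
    if strand == "+":
        return max(1, start - size), start - 1
    return end + 1, end + size

def overlaps(a_left, a_right, b_left, b_right):
    """True if two inclusive intervals share at least one base."""
    return a_left <= b_right and b_left <= a_right

# Hypothetical transcript at 5000-6000 on the forward strand.
window = upstream_window(5000, 6000, "+")
# Hypothetical binding site at 4200-4350.
site_in_window = overlaps(window[0], window[1], 4200, 4350)
print(window, site_in_window)
```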

RNAi Clone: RNAi library clones for the gene and their locations (if available), for example, sjj_F54D12.3|II-1H24, mv_F54D12.3, mv_F54D12.f, cenix:67-g4.
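The example value above can be taken apart with a short Python sketch. The function name `parse_rnai_clones` is hypothetical, and the sketch assumes that the bar separates a clone name from its location and that clones without a location simply omit the bar.

```python
def parse_rnai_clones(cell):
    """Return a list of (clone_name, location) pairs; location is None if absent."""
    clones = []
    for entry in cell.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # partition() splits on the first bar only and tolerates a missing bar.
        name, _, location = entry.partition("|")
        clones.append((name, location or None))
    return clones

example = "sjj_F54D12.3|II-1H24, mv_F54D12.3, mv_F54D12.f, cenix:67-g4"
clones = parse_rnai_clones(example)
print(clones)
```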

RNAi Phenotype Observed: Display the phenotype ontology names sorted in case-insensitive alphabetical order.

Allele Phenotype Observed: Display the phenotype ontology names sorted in case-insensitive alphabetical order.
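Case-insensitive alphabetical order differs from a plain string sort, which places all uppercase letters before lowercase ones. A minimal Python sketch of the difference, using made-up phenotype names:

```python
# Illustrative phenotype names, not taken from a real SimpleMine result.
phenotypes = ["sterile", "Dumpy", "embryonic lethal", "Uncoordinated"]

# A plain sort compares character codes, so capitalized names come first.
plain = sorted(phenotypes)

# Sorting on the case-folded name ignores capitalization.
case_insensitive = sorted(phenotypes, key=str.casefold)

print(plain)
print(case_insensitive)
```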

Coding_exon Non_silent Allele: Display a list of alleles that fall in any coding exon. Polymorphisms and silent alleles are excluded. Alleles are sorted and displayed according to the following order of their molecular changes: Deletion, Insertion, Substitution, Tandem_duplication. In each molecular change category, alleles are sorted in alphabetical order. Each allele entry contains three bar-separated fields: allele name, molecular change, and protein effect.
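The ordering described above can be sketched in Python as a two-part sort key: first the rank of the molecular change, then the allele name. The allele names and protein effect labels below are invented, the helper names are hypothetical, and the ", " separator between entries is an assumption.

```python
# Category order as stated in the field description.
CHANGE_ORDER = ["Deletion", "Insertion", "Substitution", "Tandem_duplication"]

def sort_alleles(alleles):
    """Sort (name, molecular_change, protein_effect) tuples by change, then name."""
    rank = {change: i for i, change in enumerate(CHANGE_ORDER)}
    return sorted(alleles, key=lambda a: (rank[a[1]], a[0]))

def format_alleles(alleles):
    """Join each allele's three fields with bars and the entries with commas."""
    return ", ".join("|".join(a) for a in sort_alleles(alleles))

# Invented alleles for illustration only.
alleles = [
    ("zz20", "Substitution", "Missense"),
    ("zz3", "Deletion", "Frameshift"),
    ("aa7", "Substitution", "Nonsense"),
    ("zz1", "Insertion", "Frameshift"),
]
formatted = format_alleles(alleles)
print(formatted)
```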

Interacting Gene: We only display experimentally confirmed gene interactions (Physical, Genetic, and Regulatory). The genes are displayed in the following order: genes with all three types of interactions detected, genes with two out of three types of interactions detected, and genes with one type of interaction detected. In each category, gene names are sorted in case-insensitive alphabetical order. Each gene entry contains two bar-separated fields: gene name and interaction types. The interaction types are separated with semicolons.
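One way to reproduce this ordering is to sort on the number of interaction types (largest first) and then on the case-folded gene name. The Python sketch below uses invented gene names and hypothetical helper names. Listing the interaction types alphabetically within an entry and joining entries with ", " are assumptions, since the description does not specify them.

```python
def order_interactors(interactions):
    """Order genes by number of interaction types (descending), then by name.

    `interactions` maps a gene name to the set of its interaction types.
    """
    return sorted(
        interactions.items(),
        key=lambda item: (-len(item[1]), item[0].casefold()),
    )

def format_interactors(interactions):
    """Build entries of the form gene|type1;type2, separated by commas."""
    return ", ".join(
        f"{gene}|{';'.join(sorted(types))}"
        for gene, types in order_interactors(interactions)
    )

# Invented interaction data for illustration only.
interactions = {
    "xyz-2": {"Genetic"},
    "Abc-1": {"Physical", "Genetic", "Regulatory"},
    "def-3": {"Physical", "Regulatory"},
    "abd-4": {"Genetic"},
}
formatted_interactors = format_interactors(interactions)
print(formatted_interactors)
```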

Interacting Gene Exclude High-throughput: Same as "Interacting Gene" but without high-throughput experiments.

Expression

Expr_pattern Tissue: Anatomical expression based on GFP, immunoprecipitation, In_situ, etc. Anatomy names are displayed in case-insensitive alphabetical order.

Genomic Study Tissue: Tissue enrichment based on microarray, RNA-Seq, and proteomics studies. Anatomy names are displayed in case-insensitive alphabetical order.

Expr_pattern LifeStage: Developmental expression based on GFP, immunoprecipitation, In_situ, etc. Life stages are displayed in case-insensitive alphabetical order.

Genomic Study LifeStage: Developmental expression based on microarray, RNA-Seq, and proteomics studies. Life stages are displayed in case-insensitive alphabetical order.


Human Orthologs and Disease

Disease Info: Display the disease names associated with the gene. Each disease entry contains two bar-separated fields: disease name and the evidence (By Orthology or By Experiment).

Human Ortholog: Display the human orthologs of the gene. Each ortholog entry contains two bar-separated fields: ortholog name and algorithms that predicted the orthology. The algorithms are separated with semicolons.
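Because each ortholog entry lists the algorithms that predicted it, the field can be used to keep only orthologs supported by several algorithms. The Python sketch below assumes entries are comma-separated, as in the other fields. The ortholog and algorithm names are placeholders, and the function names are hypothetical.

```python
def parse_orthologs(cell):
    """Map each ortholog name to the list of algorithms that predicted it."""
    orthologs = {}
    for entry in cell.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, _, algorithms = entry.partition("|")
        orthologs[name] = [a for a in algorithms.split(";") if a]
    return orthologs

def well_supported(cell, min_algorithms=2):
    """Return ortholog names predicted by at least `min_algorithms` algorithms."""
    return [
        name
        for name, algorithms in parse_orthologs(cell).items()
        if len(algorithms) >= min_algorithms
    ]

# Placeholder ortholog and algorithm names for illustration only.
cell = "GENE_A|MethodX;MethodY;MethodZ, GENE_B|MethodX"
supported = well_supported(cell)
print(supported)
```

The threshold of two algorithms is an arbitrary choice for the example, not a recommendation from the guide.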


Functional Annotation and References

Gene Ontology Association: Display the names of the Gene Ontology terms annotated to the gene, sorted in alphabetical order.

Concise Description: Manually written descriptions of the gene's function, now outdated.

Automated Description: Up-to-date, machine-generated gene description based on current WormBase data.

Expression Cluster Summary: Summary of gene regulation, molecular regulation, and tissue enrichment based on microarray, RNA-Seq, and proteomics studies.

Reference: Primary research articles that studied the gene.