Difference between revisions of "FAQs"

From WormBaseWiki
(updated the URL)
 
(108 intermediate revisions by 9 users not shown)
Line 1: Line 1:
== General Questions ==
+
[[File:Warning.jpg|100px]]
 +
<span style="color:red">
 +
'''Warning:''' Most of this page's content has been transferred to the [http://www.wormbase.org/about/Frequently_asked_questions FAQ pages] on the new www.wormbase.org site.
 +
</span>[[File:Warning.jpg|100px]]
  
=== How should I cite WormBase? ===
 
  
see [[Citing_WormBase|Citing_WormBase]]
+
== General Questions  ==
  
=== What database technology are you using for WormBase? ===
+
=== How should I cite WormBase? ===
  
1) At the back end, WormBase data are deposited in an object-oriented database, ACeDB, which is the "master" database containing all data. ACeDB can be accessed both remotely and locally, through both commandline and web server.
+
see [[Citing and Acknowledging WormBase|Citing and Acknowledging WormBase]]<br>
  
2) Some data (especially sequence data including genomic sequence, ESTs, OSTs, SNPs, genes, RNAs etc.) are extracted from ACeDB and are deposited in a "slave" MySQL database, to support some key features like gbrowse (see below).
+
=== What database technology are you using for WormBase?  ===
  
3) At the front end sits the Apache server with mod_perl. The WormBase software package, which contains configuration files and a series of CGI scripts, runs on the Apache server. The CGI scripts provide users with a number of ways to browse and search WormBase.
+
1) At the back end, WormBase data are deposited in an object-oriented database, ACeDB, which is the "master" database containing all data. ACeDB can be accessed both remotely and locally, through both commandline and web server.  
  
4) Some key features of the WormBase package: i. gbrowse (http://www.wormbase.org/db/seq/gbrowse?source=wormbase): developed by Lincoln Stein for the GMOD consortium and widely used for other model organisms. It allows users to browse through the whole genome for feature tracks corresponding to specific genome regions. gbrowse is highly configurable and supports multiple foreign languages. ii. synteny browser (http://www.wormbase.org/db/seq/ebsyn?name=CBG22984): also recently developed by Lincoln Stein for the GMOD consortium. It allows a comparative view of two genomes side by side, focusing on the syntenic regions.
+
2) Some data (especially sequence data including genomic sequence, ESTs, OSTs, SNPs, genes, RNAs etc.) are extracted from ACeDB and are deposited in a "slave" MySQL database, to support some key features like gbrowse (see below).
  
=== How are the WormBase entries created and maintained? ===
+
3) At the front end sits the Apache server with mod_perl. The WormBase software package, which contains configuration files and a series of CGI scripts, runs on the Apache server. The CGI scripts provide users with a number of ways to browse and search WormBase.
  
There is no simple answer to that. WormBase has a team of about 30 people who generate and curate data in many different ways. The genome sequence of C. elegans was determined at two of the four WormBase groups, and so a lot of data pertaining to gene predictions and other features annotated on to the genome are created and maintained by those groups.
+
4) Some key features of the WormBase package: i. gbrowse (http://www.wormbase.org/db/seq/gbrowse?source=wormbase): developed by Lincoln Stein for the GMOD consortium and widely used for other model organisms. It allows users to browse through the whole genome for feature tracks corresponding to specific genome regions. gbrowse is highly configurable and supports multiple foreign languages. ii. synteny browser (http://www.wormbase.org/db/seq/ebsyn?name=CBG22984): also recently developed by Lincoln Stein for the GMOD consortium. It allows a comparative view of two genomes side by side, focusing on the syntenic regions.
  
The group at Caltech do a lot of literature curation and extract all sorts of information from the published literature (from hand-curated descriptions of gene function to details of individual RNAi experiments).
+
=== How are the WormBase entries created and maintained?  ===
  
Also a lot of data comes from 3rd party collaborators who submit bulk datasets direct to WormBase (e.g. Orfeome data, 'knockout' deletion alleles). In contrast we also get directly submitted data from users at a very small level, e.g. individual allele submissions.
+
There is no simple answer to that. WormBase has a team of about 30 people who generate and curate data in many different ways. The genome sequence of C. elegans was determined at two of the four WormBase groups, and so a lot of data pertaining to gene predictions and other features annotated on to the genome are created and maintained by those groups.  
  
Finally, we also generate data de novo as part of the database build procedure, e.g. calculating molecular weights of proteins.
+
The group at Caltech do a lot of literature curation and extract all sorts of information from the published literature (from hand-curated descriptions of gene function to details of individual RNAi experiments).  
  
=== Can you give me medical advice on how to deal with infectious or parasitic worms? ===
+
Also a lot of data comes from 3rd party collaborators who submit bulk datasets direct to WormBase (e.g. Orfeome data, 'knockout' deletion alleles). In contrast we also get directly submitted data from users at a very small level, e.g. individual allele submissions.
 +
 
 +
Finally, we also generate data de novo as part of the database build procedure, e.g. calculating molecular weights of proteins.
 +
 
 +
=== Can you give me medical advice on how to deal with infectious or parasitic worms? ===
  
 
Unfortunately, no; WormBase is specifically dedicated to the biology of ''Caenorhabditis elegans'', is staffed by Ph.D.s rather than physicians, and would not lawfully be able to provide medical advice over the Internet even if it were an M.D.-staffed database oriented towards pathogenic worms. Please consult your local physician for all medical advice.
 
Unfortunately, no; WormBase is specifically dedicated to the biology of ''Caenorhabditis elegans'', is staffed by Ph.D.s rather than physicians, and would not lawfully be able to provide medical advice over the Internet even if it were an M.D.-staffed database oriented towards pathogenic worms. Please consult your local physician for all medical advice.
  
== [http://www.wormbase.org/db/gene/gene Gene Summary Page] questions ==
+
== [http://www.wormbase.org/db/gene/gene Gene Summary Page] questions ==
 
 
=== What does the "% length" mean in the Best Blast Hits table? ===
 
  
BLAST queries can have matches with multiple regions on the same hit. WormBase attempts to reconcile this information and present a value which represents the extent of coverage of all matches on the target sequence.
+
=== What does the "% length" mean in the Best Blast Hits table?  ===
  
--[[User:Tharris|Tharris]] 15:13, 9 February 2006 (EST)
+
BLAST queries can have matches with multiple regions on the same hit. WormBase attempts to reconcile this information and present a value which represents the extent of coverage of all matches on the target sequence.
  
=== What do the different gene model Status lines (e.g. confirmed) shown on the Gene Page and in the GFF files mean? ===
+
-- [[Tharris|Tharris]] 15:13, 9 February 2006 (EST)
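As a rough illustration of the idea above, the sketch below merges overlapping match intervals on a target sequence and reports the fraction of the target they cover. It is only a sketch under stated assumptions: the function name <code>percent_length</code>, the 1-based inclusive coordinates, and the "union of intervals divided by target length" formula are illustrative and are not taken from the WormBase code.

```python
def percent_length(hits, target_length):
    """Percentage of the target covered by the union of hit intervals.

    hits: iterable of (start, end) pairs, 1-based and inclusive (an assumption).
    target_length: length of the target sequence in residues.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")

    covered = 0
    current_start = current_end = None
    for start, end in sorted(hits):
        if current_end is None or start > current_end + 1:
            # Gap before this hit: close the previous merged block, if any.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or directly adjacent hit: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1

    return 100.0 * covered / target_length


if __name__ == "__main__":
    # Two overlapping matches plus one separate match on a 200-residue target.
    print(percent_length([(1, 50), (40, 100), (151, 200)], 200))
```

Merging first means that a region hit by several matches is counted only once, which is why the value cannot exceed 100.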
  
confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.
+
=== What do the different gene model Status lines (e.g. confirmed) shown on the Gene Page and in the GFF files mean? ===
  
confirmed_est - an intron confirmed by EST transcript sequence data
+
confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.
  
confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data
+
confirmed_est - an intron confirmed by EST transcript sequence data  
  
confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.
+
confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data
  
other confirmed_* types that a curator can add are:
+
confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.
  
confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.
+
other confirmed_* types that a curator can add are:
  
confirmed_Homology - although I don't think this has ever been used.
+
confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.  
  
== Fetching Sequences ==
+
confirmed_Homology - although I don't think this has ever been used.
  
=== I'd like to fetch the DNA sequence of (some feature or coordinates). How do I go about this? ===
+
== Fetching Sequences  ==
  
WormBase offers many ways to fetch sequences of features.
+
=== I'd like to fetch the DNA sequence of (some feature or coordinates). How do I go about this?  ===
  
* Using the Genome Browser
+
WormBase offers many ways to fetch sequences of features.
  
# Enter the desired coordinates into the [http://www.wormbase.org/db/seq/gbrowse/wormbase Genome Browser] using the format "CHROMOSOME:START..STOP" (eg. X:10..1000). If you don't know the coordinates of the feature of interest, just search for the feature itself.
+
*Using the Genome Browser
# Select "Display Decorated FASTA File" from the "Reports &amp; Analysis" popup menu. Click Go to retrieve the sequence of the region. You can also specify optional formatting of features contained within the sequence by clicking "configure...".
 
# '''HINT'''<nowiki>: You can adjust the coordinates of the segment to be retrieved manually or by zooming in or out. </nowiki>
 
  
* Using WormMart
+
#Enter the desired coordinates into the [http://www.wormbase.org/db/seq/gbrowse/wormbase Genome Browser] using the format "CHROMOSOME:START..STOP" (eg. X:10..1000). If you don't know the coordinates of the feature of interest, just search for the feature itself.
 +
#Select "Display Decorated FASTA File" from the "Reports &amp; Analysis" popup menu. Click Go to retrieve the sequence of the region. You can also specify optional formatting of features contained within the sequence by clicking "configure...".
 +
#'''HINT'''<nowiki>: You can adjust the coordinates of the segment to be retrieved manually or by zooming in or out. </nowiki>
 +
-- [[Tharris|Tharris]] 11:15, 16 February 2006 (EST)
 +
*Using the Genome Browser for a batch of sequences
 +
# On the Genome Browser page, select "Download Sequence File" from the "Reports & Analysis:" pull-down list; click "Configure"; paste your coordinates (e.g. X:10..1000, one per line) into the "Sequence IDs" box; select the output options you want; then click Go.
 +
--[[User:Raymond|Raymond]] 13:01, 14 July 2009 (EDT)
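When preparing a batch of coordinates for the "Sequence IDs" box, it can help to validate them first. The following is a minimal sketch that parses and re-emits the "CHROMOSOME:START..STOP" format described above; the helper names <code>parse_region</code> and <code>format_regions</code> are hypothetical and are not part of any WormBase tool.

```python
import re

# CHROMOSOME:START..STOP, e.g. "X:10..1000"
_REGION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)\.\.(?P<stop>\d+)$")


def parse_region(text):
    """Return (chromosome, start, stop) for a 'CHROMOSOME:START..STOP' string."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not in CHROMOSOME:START..STOP format: {text!r}")
    start, stop = int(match["start"]), int(match["stop"])
    if start > stop:
        raise ValueError(f"start is greater than stop: {text!r}")
    return match["chrom"], start, stop


def format_regions(regions):
    """Format (chromosome, start, stop) tuples one per line, ready for pasting."""
    return "\n".join(f"{chrom}:{start}..{stop}" for chrom, start, stop in regions)


if __name__ == "__main__":
    regions = [parse_region(line) for line in ["X:10..1000", "I:500..2500"]]
    print(format_regions(regions))
```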
 +
*Using WormMart
  
Click [http://www.wormbase.org/wiki/index.php?title=FAQs#WormMart_questions Here] for some example WormMart Queries.
+
Click [http://www.wormbase.org/wiki/index.php?title=FAQs#WormMart_questions Here] for some example WormMart Queries.  
  
--[[User:Tharris|Tharris]] 11:15, 16 February 2006 (EST)
+
-- [[Tharris|Tharris]] 11:15, 16 February 2006 (EST)
  
=== How can I download all the [3' UTR] sequences from the C. elegans genome? ===
+
=== How can I download all the [3' UTR] sequences from the C. elegans genome? ===
  
The best way to download all sequences (for example the 3' UTR sequences) is through WormMart.
+
The best way to download all sequences (for example the 3' UTR sequences) is through WormMart.  
  
Here are the steps: Go to:
+
Here are the steps: Go to:  
  
http://www.wormbase.org/biomart/martview
+
http://www.wormbase.org/biomart/martview  
  
 
  - Select the most recent version of WormBase and then select the 'Gene' dataset.  Hit 'next'.
 
  - Select the most recent version of WormBase and then select the 'Gene' dataset.  Hit 'next'.
Line 87: Line 96:
 
  - Let the page reload and then select '3' UTR' from the sequence type menu.  Hit 'export'.
 
  - Let the page reload and then select '3' UTR' from the sequence type menu.  Hit 'export'.
  
=== Where can I get repeatmasked genomic sequences for ''C. elegans'', ''C. briggsae'', or ''C. remanei''? ===
+
=== Where can I get repeatmasked genomic sequences for ''C. elegans'', ''C. briggsae'', or ''C. remanei''? ===
  
For ''C. elegans'', using the current (17 July 2007) most recent archival release of the database, you can get repeatmasked chromosomal sequences here:
+
[[How are the repeats determined?|How are the repeats determined?]]
 +
 
 +
For ''C. elegans'', using the current (17 July 2007) most recent archival release of the database, you can get repeatmasked chromosomal sequences here:  
  
 
     ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_I_masked.dna.gz
 
     ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_I_masked.dna.gz
Line 97: Line 108:
 
     ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_V_masked.dna.gz
 
     ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_V_masked.dna.gz
 
     ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_X_masked.dna.gz
 
     ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_X_masked.dna.gz
 +
  
You will need to uncompress the relevant files with "gunzip CHROMOSOME_*_masked.dna.gz" or a similar command.
+
You will need to uncompress the relevant files with "gunzip CHROMOSOME_*_masked.dna.gz" or a similar command.  
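If gunzip is not available (for example on some Windows machines), the same step can be done with Python's standard library. This is only a sketch: the function names are hypothetical, and unlike gunzip it leaves the original .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(path):
    """Decompress a .gz file next to itself and return the path of the new file."""
    source = Path(path)
    if source.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {source.name}")
    target = source.with_suffix("")  # strips only the trailing ".gz"
    with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return target


def gunzip_all(directory, pattern="CHROMOSOME_*_masked.dna.gz"):
    """Decompress every file in directory that matches pattern."""
    return [gunzip(p) for p in sorted(Path(directory).glob(pattern))]


if __name__ == "__main__":
    # Assumes the masked chromosome files were downloaded to the current directory.
    for output in gunzip_all("."):
        print(output)
```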
  
The ''C. briggsae'' repeatmasked genomic sequence is here, for assembly cb25.agp8:
+
The ''C. briggsae'' repeatmasked genomic sequence is here, for assembly cb25.agp8:  
  
 
     ftp://ftp.wormbase.org/pub/wormbase/genomes/briggsae/sequences/dna/cb25.agp8.supercontigs.masked.fa.gz
 
     ftp://ftp.wormbase.org/pub/wormbase/genomes/briggsae/sequences/dna/cb25.agp8.supercontigs.masked.fa.gz
 +
  
At some point there should be a repeatmasked version of the cb3 assembly, but as of 17 July 2007 there isn't yet.
+
At some point there should be a repeatmasked version of the cb3 assembly, but as of 17 July 2007 there isn't yet.  
  
''C. remanei'' is here:
+
''C. remanei'' is here:  
  
 
     ftp://ftp.wormbase.org/pub/wormbase/genomes/remanei/preliminary_analysis/assembly/C_remanei_masked.tar.gz
 
     ftp://ftp.wormbase.org/pub/wormbase/genomes/remanei/preliminary_analysis/assembly/C_remanei_masked.tar.gz
 +
  
 
Note that in all of these cases, the sequences are '''hardmasked'''<nowiki>: i.e., the repeat sequences have been replaced by stretches of "N" residues, instead of being marked in some less information-destroying way. By contrast, </nowiki>'''softmasked''' sequences would keep the repeat sequences but distinguish them by changing their case: non-repeat sequences would be ''UPPERCASE'', while repeat sequences embedded between the non-repeat sequences would be ''lowercase''.
 
Note that in all of these cases, the sequences are '''hardmasked'''<nowiki>: i.e., the repeat sequences have been replaced by stretches of "N" residues, instead of being marked in some less information-destroying way. By contrast, </nowiki>'''softmasked''' sequences would keep the repeat sequences but distinguish them by changing their case: non-repeat sequences would be ''UPPERCASE'', while repeat sequences embedded between the non-repeat sequences would be ''lowercase''.
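The difference between the two masking styles can be shown with a few lines of Python. This is a minimal sketch with hypothetical function names; it converts a softmasked sequence into a hardmasked one, which is the only direction possible, since hardmasking discards the repeat sequence.

```python
def hardmask(softmasked):
    """Replace every lowercase (repeat) residue with 'N'; keep everything else."""
    return "".join("N" if base.islower() else base for base in softmasked)


def masked_fraction(hardmasked):
    """Fraction of residues that are 'N' in a hardmasked sequence."""
    if not hardmasked:
        return 0.0
    return hardmasked.count("N") / len(hardmasked)


if __name__ == "__main__":
    soft = "ACGTacgtacgtACGT"  # lowercase marks the repeat
    hard = hardmask(soft)
    print(hard)
    print(masked_fraction(hard))
```

Note that a genuine "N" already present in the input (an unknown base) is indistinguishable from a masked repeat after hardmasking.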
  
=== How do I get the sequence of a ''Brugia malayi'' protein? ===
+
=== How do I get the sequence of a ''Brugia malayi'' protein? ===
 +
 
 +
'''For instance, ''[http://wormbase.org/db/gene/gene?name=sna-1;class=Gene sna-1]'' is annotated as being orthologous to ''B. malayi'' 13258.m00169, and paralogous to ''B. malayi'' 14704.m00455; but where do I go to get these sequences?'''
 +
 
 +
For the time being, the best method for getting B. malayi sequences quickly (and without having to download the entire predicted B. malayi proteome by 'licensed FTP') is to do a BlastP search, against the "BMA1_pep" protein sequence set, on TIGR's B. malayi Blast server at:
  
'''For instance, ''[http://wormbase.org/db/gene/gene?name=sna-1;class=Gene sna-1]'' is annotated as being orthologous to ''B. malayi'' 13258.m00169, and paralogous to ''B. malayi'' 14704.m00455; but where do I go to get these sequences?'''
+
http://tigrblast.tigr.org/er-blast/index.cgi?project=bma1
  
For the time being, the best method for getting B. malayi sequences quickly (and without having to download the entire predicted B. malayi proteome by 'licensed FTP') is to do a BlastP search, against the "BMA1_pep" protein sequence set, on TIGR's B. malayi Blast server at:
+
A successful BlastP search will give a report that has hypertext links to individual protein sequences such as [http://tigrblast.tigr.org/er-blast/getSeq.cgi?id=13258.m00169&db=/usr/local/db/euk/private/b_malayi/annotation_dbs/BMA1.pep 13258.m00169] and [http://tigrblast.tigr.org/er-blast/getSeq.cgi?id=14704.m00455&db=/usr/local/db/euk/private/b_malayi/annotation_dbs/BMA1.pep 14704.m00455].
  
    http://tigrblast.tigr.org/er-blast/index.cgi?project=bma1
+
This workaround should stop being needed when TIGR and WormBase have worked out some agreement for WormBase to present individual protein sequences through WormBase's own FTP site and interface; but, as of 20 July 2007, this is the best we can (legally) do.  
  
A successful BlastP search will give a report that has hypertext links to individual protein sequences such as [http://tigrblast.tigr.org/er-blast/getSeq.cgi?id=13258.m00169&db=/usr/local/db/euk/private/b_malayi/annotation_dbs/BMA1.pep 13258.m00169] and [http://tigrblast.tigr.org/er-blast/getSeq.cgi?id=14704.m00455&db=/usr/local/db/euk/private/b_malayi/annotation_dbs/BMA1.pep 14704.m00455].
+
Another option would be to do a bulk download of the relevant sequence data from TIGR itself. See TIGR's data release policy at http://www.tigr.org/tdb/e2k1/bma1/ for more details.  
  
This workaround should stop being needed when TIGR and WormBase have worked out some agreement for WormBase to present individual protein sequences through WormBase's own FTP site and interface; but, as of 20 July 2007, this is the best we can (legally) do.
+
To give you a glimpse of the things to come, ''Brugia malayi'' data imported from Genbank which will soon appear on the main site can be found at:
  
Another option would be to do a bulk download of the relevant sequence data from TIGR itself. See TIGR's data release policy at http://www.tigr.org/tdb/e2k1/bma1/ for more details.
+
*[ftp://anonymous@ftp.sanger.ac.uk:21/pub2/wormbase/bmalayi/brugia_cdses_WS187.fa.bz2 cds sequences] (fasta) based on WS187
 +
*[ftp://anonymous@ftp.sanger.ac.uk:21/pub2/wormbase/bmalayi/brugia_WS187.gff3.bz2 GFF3 genome annotation] based on WS187
 +
*[ftp://anonymous@ftp.sanger.ac.uk:21/pub2/wormbase/bmalayi/brugpep185.pep Protein Set] based on translations of the GenBank submissions (first imported for WS185)
 +
*GBrowse of the updated data is soon to come(tm)
 +
*orthology predictions of ''Brugia malayi'' are included in the TreeView pages of C.elegans / C.briggsae and C.remanei genes starting with WS187
  
== Gene structures and gene predictions ==
+
== Gene structures and gene predictions ==
  
=== I think there should be a gene at the end of clone 'X' but WormBase doesn't show any genes in this region. Why not? ===
+
=== I think there should be a gene at the end of clone 'X' but WormBase doesn't show any genes in this region. Why not? ===
  
A full and complete description of all C. elegans genes is not known (and may not accurately be known for many years). WormBase attempts to represent all genes that have good experimental evidence plus a number of genes which have less experimental evidence but which were generated using gene-finding software. If there are any publicly available transcript data (EST, mRNA etc.) then WormBase should nearly always have attempted to make a gene prediction in that region. However, many poorly expressed genes may not have any transcript evidence and so may not be represented in WormBase at this time. Please help us by [http://www.wormbase.org/db/misc/feedback letting us know] if you have any evidence for a gene that is currently not displayed in WormBase. Aside from transcript evidence (which we would always encourage people to submit to GenBank/EMBL/DDBJ), a strong case can be made for creating a new gene if there is good conservation with other species (particularly C. briggsae or C. remanei) and if there are other supporting data (such as a positive RNAi phenotype).
+
A full and complete description of all C. elegans genes is not known (and may not accurately be known for many years). WormBase attempts to represent all genes that have good experimental evidence plus a number of genes which have less experimental evidence but which were generated using gene-finding software. If there are any publicly available transcript data (EST, mRNA etc.) then WormBase should nearly always have attempted to make a gene prediction in that region. However, many poorly expressed genes may not have any transcript evidence and so may not be represented in WormBase at this time. Please help us by [http://www.wormbase.org/db/misc/feedback letting us know] if you have any evidence for a gene that is currently not displayed in WormBase. Aside from transcript evidence (which we would always encourage people to submit to GenBank/EMBL/DDBJ), a strong case can be made for creating a new gene if there is good conservation with other species (particularly C. briggsae or C. remanei) and if there are other supporting data (such as a positive RNAi phenotype).
  
Please also note that your gene may be there but may not be represented in the standard set of tracks in the [http://www.wormbase.org/db/seq/gbrowse/wormbase/ Genome Browser]. Check alternative gene predictions by turning on tracks for the 'GeneFinder' and 'Twinscan' gene predictions. Also consider turning on the 'Obsolete gene models' track as the gene may have existed in WormBase in the past but may have been removed.
+
Please also note that your gene may be there but may not be represented in the standard set of tracks in the [http://www.wormbase.org/db/seq/gbrowse/wormbase/ Genome Browser]. Check alternative gene predictions by turning on tracks for the 'GeneFinder' and 'Twinscan' gene predictions. Also consider turning on the 'Obsolete gene models' track as the gene may have existed in WormBase in the past but may have been removed.  
  
--[[User:Kbradnam|Kbradnam]] 17:47, 12 February 2006 (EST)
+
-- [[Kbradnam|Kbradnam]] 17:47, 12 February 2006 (EST)
  
=== I have found experimentally that the transcript of gene X is different from the gene model reported in WormBase. We would therefore like to update the gene model on WormBase. ===
+
=== I have found experimentally that the transcript of gene X is different from the gene model reported in WormBase. We would therefore like to update the gene model on WormBase. ===
  
<br /> Please send the new transcript sequence with a brief description of the required gene model change to [http://www.wormbase.org/db/misc/feedback wormbase-help@wormbase.org] and a curator will make the appropriate change.
+
<br> Please send the new transcript sequence with a brief description of the required gene model change to [http://www.wormbase.org/db/misc/feedback wormbase-help@wormbase.org] and a curator will make the appropriate change.  
  
Please also submit your sequence to the EMBL/GenBank/DDBJ database. This provides confirming evidence for the WormBase gene prediction, as we routinely retrieve sequence data from the public databases. It also makes the data public, allowing appropriate reference and acknowledgement to yourself.
+
Please also submit your sequence to the EMBL/GenBank/DDBJ database. This provides confirming evidence for the WormBase gene prediction, as we routinely retrieve sequence data from the public databases. It also makes the data public, allowing appropriate reference and acknowledgement to yourself.
  
--[/wiki/index.php?title=User:Gw3&action=edit Gw3] 09:22, 17 August 2007 (GMT)
+
--[/wiki/index.php?title=User:Gw3&amp;action=edit Gw3] 09:22, 17 August 2007 (GMT)  
  
=== What criteria does WormBase use to classify a gene as a Pseudogene? ===
+
=== What criteria does WormBase use to classify a gene as a Pseudogene? ===
  
See entry in [[Glossary_of_terms#P]]
+
See entry in [[Glossary of terms#P|Glossary_of_terms#P]]  
  
=== I'd like to create a diagram of a genomic region similar to that shown on the Genome Browser. How can I do this? ===
+
=== I'd like to create a diagram of a genomic region similar to that shown on the Genome Browser. How can I do this? ===
  
One approach is to write your own scripts in Perl using the Bio::Graphics modules that are part of BioPerl. A second approach is to use the [http://wormbase.org/db/seq/frend web interface] to this software.
+
One approach is to write your own scripts in Perl using the Bio::Graphics modules that are part of BioPerl. A second approach is to use the [http://wormbase.org/db/seq/frend web interface] to this software.  
  
=== In the gff files I see confirmed_inconsistent and other confirmed statuses for introns, what do they all mean? ===
+
=== In the gff files I see confirmed_inconsistent and other confirmed statuses for introns, what do they all mean? ===
  
confirmed_est - an intron confirmed by EST transcript sequence data
+
confirmed_est - an intron confirmed by EST transcript sequence data  
  
confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data
+
confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data  
  
confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.
+
confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.  
  
confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.
+
confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.  
  
confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.
+
confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.  
  
 
confirmed_Homology - where protein homology looks to confirm the intron, this has seen limited use.
 
confirmed_Homology - where protein homology looks to confirm the intron, this has seen limited use.
  
== Gene Model Naming ==
+
=== What are seg, signalp and tmhmm motifs on the Protein report page? ===
 +
 
 +
* seg - low complexity regions e.g. homopolymer runs - [http://www.hku.hk/bruhk/gcgdoc/seg.html explanation]
 +
 
 +
* signalp - predicts the presence and location of signal peptide cleavage sites [http://www.nature.com/nprot/journal/v2/n4/abs/nprot.2007.131.html Emanuelsson <i>et al</i> 2007]
 +
 
 +
* tmhmm - predicts transmembrane helices in proteins [http://www.ncbi.nlm.nih.gov/pubmed/11152613 Krogh <i>et al</i> 2001]
 +
 
 +
== Old database releases ==
 +
 
 +
=== How do I remap the chromosomal coordinates between releases? ===
 +
 
 +
There is a page describing a perl script and the data to change the coordinates of GFF files [http://www.wormbase.org/wiki/index.php/Converting_Coordinates_between_releases here]
  
=== What do all the different gene names mean? ===
+
== Gene Model Naming  ==
  
* All genes have a corresponding sequence name, which is derived from the cosmid, fosmid or YAC clone on which they reside.
+
=== What do all the different gene names mean?  ===
  
: For instance the gene [http://www.wormbase.org/db/gene/gene?name=WBGene00000254;class=Gene bli-4] has a sequence name of [http://www.wormbase.org/db/seq/sequence?name=K04F10.4;class=Gene_name K04F10.4], indicating it was identified when the cosmid [http://www.wormbase.org/db/seq/clone?name=K04F10;class=Clone K04F10] was sequenced and annotated, and there are at least 3 other genes associated with that cosmid.
+
*All genes have a corresponding sequence name, which is derived from the cosmid, fosmid or YAC clone on which they reside.
  
* Any gene can code for multiple proteins (CDS) as a result of alternative splicing. In the case of [http://www.wormbase.org/db/gene/gene?name=WBGene00000254;class=Gene bli-4] there are 6 known isoforms, called K04F10.4a, K04F10.4b, ..... K04F10.4f.
+
:For instance the gene [http://www.wormbase.org/db/gene/gene?name=WBGene00000254;class=Gene bli-4] has a sequence name of [http://www.wormbase.org/db/seq/sequence?name=K04F10.4;class=Gene_name K04F10.4], indicating it was identified when the cosmid [http://www.wormbase.org/db/seq/clone?name=K04F10;class=Clone K04F10] was sequenced and annotated, and there are at least 3 other genes associated with that cosmid.
* The corresponding transcripts for the isoforms are called [http://www.wormbase.org/db/seq/sequence?name=K04F10.4a.1;class=Sequence K04F10.4a.1], [http://www.wormbase.org/db/seq/sequence?name=K04F10.4b.1;class=Sequence K04F10.4b.1], ..... K04F10.4f.1
 
* However if there is alternative splicing in the UTRs, which doesn't change the protein sequence, the alternatively-spliced transcripts are named [http://www.wormbase.org/db/seq/sequence?name=K04F10.4a.1;class=Sequence K04F10.4a.1] and [http://www.wormbase.org/db/seq/sequence?name=K04F10.4a.2;class=Sequence K04F10.4a.2].
 
* ... and if there are no isoforms of the coding gene, for example [http://wormbase.sanger.ac.uk/db/gene/gene?name=WBGene00007071;class=Gene AC3.5], but there is alternative splicing in the UTRs, there will be multiple transcripts named [http://wormbase.sanger.ac.uk/db/seq/sequence?name=AC3.5.1;class=Transcript AC3.5.1] and [http://wormbase.sanger.ac.uk/db/seq/sequence?name=AC3.5.2;class=Transcript AC3.5.2], etc.
 
* But if there are no alternate UTR transcripts, the single coding_transcript is named the same as the CDS and does not have the .1 appended, as in the case of [http://www.wormbase.org/db/seq/sequence?name=K04F10.4f;class=Sequence K04F10.4f].
 
  
== ''C. remanei'' ==
+
*Any gene can code for multiple proteins (CDS) as a result of alternative splicing. In the case of [http://www.wormbase.org/db/gene/gene?name=WBGene00000254;class=Gene bli-4] there are 6 known isoforms, called K04F10.4a, K04F10.4b, ..... K04F10.4f.
 +
*The corresponding transcripts for the isoforms are called [http://www.wormbase.org/db/seq/sequence?name=K04F10.4a.1;class=Sequence K04F10.4a.1], [http://www.wormbase.org/db/seq/sequence?name=K04F10.4b.1;class=Sequence K04F10.4b.1], ..... K04F10.4f.1
 +
*However if there is alternative splicing in the UTRs, which doesn't change the protein sequence, the alternatively-spliced transcripts are named [http://www.wormbase.org/db/seq/sequence?name=K04F10.4a.1;class=Sequence K04F10.4a.1] and [http://www.wormbase.org/db/seq/sequence?name=K04F10.4a.2;class=Sequence K04F10.4a.2].
 +
*... and if there are no isoforms of the coding gene, for example [http://wormbase.sanger.ac.uk/db/gene/gene?name=WBGene00007071;class=Gene AC3.5], but there is alternative splicing in the UTRs, there will be multiple transcripts named [http://wormbase.sanger.ac.uk/db/seq/sequence?name=AC3.5.1;class=Transcript AC3.5.1] and [http://wormbase.sanger.ac.uk/db/seq/sequence?name=AC3.5.2;class=Transcript AC3.5.2], etc.
 +
*Note, however, that if there are no alternate UTR transcripts, the single coding_transcript is named the same as the CDS and does not have the .1 appended, as in the case of [http://www.wormbase.org/db/seq/sequence?name=K04F10.4f;class=Sequence K04F10.4f].
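
The naming rules above can be sketched as a small parser. This is only an illustration: the function name <code>parse_transcript_name</code> and the regular expression are assumptions inferred from the examples in this FAQ, not a WormBase tool, and names that follow other conventions will not match.

```python
import re

# Pattern inferred from the examples above (an assumption, not an official rule):
#   <clone>.<gene number>[isoform letter][.<transcript number>]
NAME_RE = re.compile(
    r"^(?P<clone>[A-Za-z0-9_]+)\.(?P<number>\d+)"
    r"(?P<isoform>[a-z])?(?:\.(?P<transcript>\d+))?$"
)


def parse_transcript_name(name):
    """Split a CDS or transcript name into its parts (hypothetical helper)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed pattern: {name!r}")
    isoform = match.group("isoform")
    transcript = match.group("transcript")
    base = f"{match.group('clone')}.{match.group('number')}"
    return {
        "gene_sequence_name": base,            # e.g. K04F10.4
        "cds": base + (isoform or ""),         # e.g. K04F10.4a
        "isoform": isoform,                    # None when the gene has no isoforms
        # None means the single coding_transcript shares the CDS name
        "transcript_number": int(transcript) if transcript else None,
    }


if __name__ == "__main__":
    for example in ("K04F10.4a.2", "AC3.5.1", "K04F10.4f"):
        print(example, parse_transcript_name(example))
```

A missing transcript number corresponds to the last rule in the list: the transcript simply carries the CDS name.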
  
=== I see ''C. remanei'' homologies in the Best Blast Hits table on the Gene Page, but I can't find the sequence. What gives? ===
+
== ''C. remanei'' ==
  
The ''C. remanei'' predictions are a very preliminary set and are not completely integrated into the database yet. Currently we are only using these sequences to generate blast homologies that are displayed on the Genome Browser. The actual ''C. remanei'' genome, gene predictions and peptide sequences are not in the database, so not accessible through the website.
+
[[C.remanei|state of C.remanei integration]]
  
The gene predictions used are available from the [ftp://ftp.wormbase.org/pub/wormbase/genomes/remanei/preliminary_analysis WormBase FTP site]. These gene predictions are generated by John Spieth and colleagues at the Genome Sequencing Center at Washington University by merging multiple gene prediction algorithms into a single hybrid set. The merged set sequences in FASTA format are available at:
+
== Interpolated Map Positions  ==
  
[ftp://ftp.wormbase.org/pub/wormbase/genomes/remanei/preliminary_analysis/WU_MERGED/merged_set.dna.fa.gz DNA], [ftp://ftp.wormbase.org/pub/wormbase/genomes/remanei/preliminary_analysis/WU_MERGED/merged_set.protein.fa.gz protein]
+
For genes without known genetic map positions, we calculate a theoretical interpolated position based on a linear interpolation between the surrounding genetic markers. A gene is considered a genetic marker if:
  
=== When will the complete sequence of ''C. remanei'' be available? ===
+
*it has a physical map position
 +
*it has a CGC name
 +
*it has a genetic map position (experimental or promoted)
  
''C. remanei'' is currently undergoing an additional round of sequencing and finishing. It should be available in fall 2007.
+
Promoted map positions are made for genes that fulfil all other genetic marker requirements and were interpolated during previous builds. The logical order of the genetic markers is checked and curated during the build process to accommodate new experimental data as well as changes in the genomic sequence.
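
As a rough illustration of linear interpolation between two flanking markers, consider the sketch below. The exact procedure used in the WormBase build is not described here, so the function name, the argument layout and the numbers in the usage example are all assumptions made for the example.

```python
def interpolate_map_position(gene_bp, left_marker, right_marker):
    """Linearly interpolate a genetic position for a gene.

    Each marker is a (physical_position_bp, genetic_position_cM) pair.
    The gene is assumed to lie between the two markers on the same chromosome.
    """
    left_bp, left_cm = left_marker
    right_bp, right_cm = right_marker
    if right_bp == left_bp:
        raise ValueError("markers must have different physical positions")
    # Fraction of the physical distance between the markers covered by the gene
    fraction = (gene_bp - left_bp) / (right_bp - left_bp)
    return left_cm + fraction * (right_cm - left_cm)


if __name__ == "__main__":
    # Hypothetical markers: 1,000 bp at -2.0 cM and 2,000 bp at 0.0 cM
    print(interpolate_map_position(1500, (1000, -2.0), (2000, 0.0)))
```

A gene halfway between the two markers physically is placed halfway between them genetically.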
  
== Interpolated Map Positions ==
+
== Obtaining WormBook articles  ==
  
For genes without known genetic map positions, we calculate a theoretical interpolated position based on a linear interpolation between the surrounding genetic markers. A gene is considered a genetic marker if:
+
WormBook articles are available directly from the [http://www.WormBook.org WormBook] website, the [http://dev.WormBook.org development] site, the [http://wormbook.sanger.ac.uk Sanger WormBook] mirror, as well as via [http://www.textpresso.org/cgi-bin/wb/textpressoforwormbase.cgi?allabstracts=on&searchmode=sentence&searchtargets=Paper&searchtargets=Abstract&searchtargets=Title Textpresso] searches in WormBase.
  
* it has a physical map position
+
The entire WormBook may also be downloaded as a [http://dev.wormbook.org/WormBook.zip zip archive] (~400 Mb).
* it has a CGC name
 
* it has a genetic map position (experimental or promoted)
 
  
Promoted map positions are made for genes that fulfil all other genetic marker requirements and were interpolated during previous builds. The logical order of the genetic markers is checked and curated during the build process to accommodate new experimental data as well as changes in the genomic sequence.
+
== How do I get a list of genes with alternatively spliced isoforms?  ==
  
== Obtaining WormBook articles ==
+
Go to the [http://www.wormbase.org/db/searches/wb_query WormBase Query Language Search page] and enter the query ''''find CDS; Isoform; follow Gene ''''
  
WormBook articles are available directly from the [http://www.WormBook.org WormBook] website, the [http://dev.WormBook.org development] site, the [http://wormbook.sanger.ac.uk Sanger WormBook] mirror, as well as via [http://www.textpresso.org/cgi-bin/wb/textpressoforwormbase.cgi?allabstracts=on&searchmode=sentence&searchtargets=Paper&searchtargets=Abstract&searchtargets=Title Textpresso] searches in WormBase.
+
If you want to investigate this group of genes further, set the output to 'Text' and copy the list for use in [http://www.wormbase.org/biomart/martview WormMart].
  
The entire WormBook may also be downloaded as a [http://dev.wormbook.org/WormBook.zip zip archive] (~400 Mb).
+
== How do I get a list of coding gene transcripts together with the supporting cDNA evidence for them?  ==
  
== How do I get a list of genes with alternatively spliced isoforms? ==
+
Go to the [http://www.wormbase.org/db/searches/wb_query WormBase Query Language Search page] and enter the query:
  
Go to the [http://www.wormbase.org/db/searches/wb_query WormBase Query Language Search page] and enter the query '''' find CDS; Isoform; follow Gene ''''
+
''''select l, cdna from l in class Transcript where l->Method = "Coding_transcript", cdna in l->Matching_cDNA ''''  
  
If you want to investigate this group of genes further, set the output to 'Text' and copy the list for use in [http://www.wormbase.org/biomart/martview WormMart].
 
  
== What are WormBase's precomputed BLAST parameters? ==
+
== What are WormBase's precomputed BLAST parameters? ==
  
These parameters refer to the precomputed BLAST results shown on the gene and protein report pages.
+
These parameters refer to the precomputed BLAST results shown on the gene and protein report pages.  
  
For BLASTP and BLASTX we use [http://blast.wustl.edu/ WU-BLAST2.0] with the following parameters
+
For BLASTP and BLASTX we use [http://blast.wustl.edu/ WU-BLAST2.0] with the following parameters<br>
  
Z=10000000 (sets size of database in letters)
+
{|
V=1000   (sets the number of one line summaries)
+
|-
B=1000000 (sets number of database hits to report)
+
| Z=10000000 (sets size of database in letters)
E=0.1   (E from the Karlin-Altschul equation - will not report hits with E-value greater than this)
+
|-
cpus=1   (sets number of processors to use)
+
| V=1000 (sets the number of one line summaries)
hitdist=40 (Max distance between word hits for 2-word seeding algorithm)
+
|-
 +
| B=1000000 (sets number of database hits to report)
 +
|-
 +
| E=0.1 (E from the Karlin-Altschul equation - will not report hits with E-value greater than this)
 +
|-
 +
| cpus=1 (sets number of processors to use)
 +
|-
 +
| hitdist=40 (Max distance between word hits for 2-word seeding algorithm)
 +
|}
  
 
No low-complexity filtering is done for BLASTP. DNA sequences are masked with RepeatMasker and TRF before BLASTing.
 
No low-complexity filtering is done for BLASTP. DNA sequences are masked with RepeatMasker and TRF before BLASTing.
  
== About Cell and Anatomy ==
+
== About Cell and Anatomy ==
  
=== How do I find out cell lineage pedigrees? ===
+
=== How do I find out cell lineage pedigrees? ===
  
There are two kinds of pedigree display: the Cell pedigree tree (located on the Cell Page) and the Lineage pedigree tree (located on the Pedigree Browser). The Cell Page is simple and easy to use, with a full description of the cell layout, while the advantage of the Pedigree Browser is that it displays complete lineage pathways (from P0) with the cell(s) of interest highlighted.
+
There are two kinds of pedigree display: the Cell pedigree tree (located on the Cell Page) and the Lineage pedigree tree (located on the Pedigree Browser). The Cell Page is simple and easy to use, with a full description of the cell layout, while the advantage of the Pedigree Browser is that it displays complete lineage pathways (from P0) with the cell(s) of interest highlighted.
  
Start from the Search on the WormBase home page: select "Cell" from the pull-down menu and enter the cell name. A "cell summary" display will appear with a Cell pedigree display box showing three generations of cells. Your cell will appear in red on the pedigree. You can move the pedigree tree up or down in the lineage by clicking on the parent cell or daughter cells. Another way to access the pedigree is from "Cell and Pedigree Search" (under the More Searches menu), which searches for specific cells, cell groups, or lineages.
+
Start from the Search on the WormBase home page: select "Cell" from the pull-down menu and enter the cell name. A "cell summary" display will appear with a Cell pedigree display box showing three generations of cells. Your cell will appear in red on the pedigree. You can move the pedigree tree up or down in the lineage by clicking on the parent cell or daughter cells. Another way to access the pedigree is from "Cell and Pedigree Search" (under the More Searches menu), which searches for specific cells, cell groups, or lineages.
  
=== What's the nomenclature for ''C. elegans'' cells? ===
+
=== What's the nomenclature for ''C. elegans'' cells? ===
  
There is a very good article explaining everything about embryonic cell lineage and nomenclature:
+
There is a very good article explaining everything about embryonic cell lineage and nomenclature:  
  
Sulston JE et al (1983) Dev Biol. "The embryonic cell lineage of the nematode C. elegans."
+
Sulston JE et al (1983) Dev Biol. "The embryonic cell lineage of the nematode C. elegans."  
  
That article is the "dictionary" everyone refers to.
+
That article is the "dictionary" everyone refers to.
  
P0 is the founder cell for C. elegans. It is the zygote after fertilization. The first few rounds of divisions produce six "founder cells": E, MS, AB, C, D and P4. Each of these founder cells generates different tissues. From then on, cells are named after these founder cells. For example, the daughters of AB are called ABa ('a' means anterior) and ABp ('p' means posterior). ABa will generate daughters ABal ('l' means left) and ABar ('r' means right)... If a cell divides dorsal-ventrally, 'd' or 'v' will be added to the names of the daughters.
+
P0 is the founder cell for C. elegans. It is the zygote after fertilization. The first few rounds of divisions produce six "founder cells": E, MS, AB, C, D and P4. Each of these founder cells generates different tissues. From then on, cells are named after these founder cells. For example, the daughters of AB are called ABa ('a' means anterior) and ABp ('p' means posterior). ABa will generate daughters ABal ('l' means left) and ABar ('r' means right)... If a cell divides dorsal-ventrally, 'd' or 'v' will be added to the names of the daughters.
  
Now you know that when you see ABalppp, it comes from:
+
Now you know that when you see ABalppp, it comes from:
  
P0-&gt;AB-&gt;ABa-&gt;ABal-&gt;ABalp-&gt;ABalpp-&gt;ABalppp
+
P0-&gt;AB-&gt;ABa-&gt;ABal-&gt;ABalp-&gt;ABalpp-&gt;ABalppp  
  
Not only will you see the lineage pathway from the cell name, you will also see in which direction cells have divided and what the sister cells are for each step of the division.
+
Not only will you see the lineage pathway from the cell name, you will also see in which direction cells have divided and what the sister cells are for each step of the division.  
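
The way a lineage name encodes its own history can be sketched in a few lines of code. The function names below are hypothetical, and the sketch only expands the part of the name after the founder cell; the divisions that lead from P0 to each founder are not reconstructed.

```python
# Founder cells named in the text; two-letter names are listed first so that
# "MS" is tried before the single-letter founders.
FOUNDERS = ("AB", "MS", "P4", "E", "C", "D")

DIRECTIONS = {
    "a": "anterior", "p": "posterior",
    "l": "left", "r": "right",
    "d": "dorsal", "v": "ventral",
}
OPPOSITE = {"a": "p", "p": "a", "l": "r", "r": "l", "d": "v", "v": "d"}


def split_lineage_name(name):
    """Return (founder, suffix) for a lineage name such as ABalppp or AB.plappapp."""
    cleaned = name.replace(".", "")
    for founder in FOUNDERS:
        if cleaned.startswith(founder):
            suffix = cleaned[len(founder):]
            break
    else:
        raise ValueError(f"no known founder cell at the start of {name!r}")
    unknown = set(suffix) - set(DIRECTIONS)
    if unknown:
        raise ValueError(f"unexpected division letters {sorted(unknown)} in {name!r}")
    return founder, suffix


def lineage_path(name):
    """List the ancestors of a cell, from P0 down to the cell itself."""
    founder, suffix = split_lineage_name(name)
    return ["P0", founder] + [founder + suffix[:i] for i in range(1, len(suffix) + 1)]


def division_directions(name):
    """Spell out the direction taken at each division after the founder cell."""
    _, suffix = split_lineage_name(name)
    return [DIRECTIONS[letter] for letter in suffix]


def sister_cell(name):
    """Return the sister of a cell by flipping the last division letter."""
    founder, suffix = split_lineage_name(name)
    if not suffix:
        raise ValueError("a founder cell's sister is not encoded in its name")
    return founder + suffix[:-1] + OPPOSITE[suffix[-1]]


if __name__ == "__main__":
    print("->".join(lineage_path("ABalppp")))
    print(division_directions("ABalppp"))
    print(sister_cell("ABal"))
```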
  
=== How can I know each C. elegans cell's function and exactly at which stage of the embryonic lineage it appears? ===
+
=== How can I know each C. elegans cell's function and exactly at which stage of the embryonic lineage it appears? ===
  
Most of the information you need for a cell should be contained in the Cell Report, which can be located by the "Cell and Pedigree" search. In WormBase, if you read the Tree Display of a Cell Report, there is a tag called "Embryo_division_time"; it is the time when the cell divides or dies. Unfortunately, for cells generated after hatching, there is no such information in WormBase.
+
Most of the information you need for a cell should be contained in the Cell Report, which can be located by the "Cell and Pedigree" search. In WormBase, if you read the Tree Display of a Cell Report, there is a tag called "Embryo_division_time"; it is the time when the cell divides or dies. Unfortunately, for cells generated after hatching, there is no such information in WormBase.
  
=== What is the connection between the cell P0, and the cells P1, P2, P3, ..., P7, P8, etc? ===
+
=== What is the connection between the cell P0, and the cells P1, P2, P3, ..., P7, P8, etc? ===
  
There are two sets of P cells. One set arises from early embryonic divisions; these cells are called P0, P1', P2', P3' ... in WormBase, and these are the lineage names. The other set is called P1, P2, P3, ... These are postembryonic blast cells, which are not related to the embryonic founder cells.
+
There are two sets of P cells. One set arises from early embryonic divisions; these cells are called P0, P1', P2', P3' ... in WormBase, and these are the lineage names. The other set is called P1, P2, P3, ... These are postembryonic blast cells, which are not related to the embryonic founder cells.
  
P1, P2, P3 ... are adult names for postembryonic blast cells present from hatching until the middle of the first larval stage (L1). A lot of cells have two names: a lineage name and an adult name. The adult name is the name people give to cells that become terminal and differentiate (such as neurons), or that do not differentiate but will divide into an important lineage (such as the P1, P2 ... lineages). Adult names are given by cell position and function, so it is a different naming system. Cells with the same adult name can come from different lineages depending on how bilateral symmetry is broken; for example, P7 can develop either from AB.plappapp or AB.prappapp.
+
P1, P2, P3 ... are adult names for postembryonic blast cells present from hatching until the middle of the first larval stage (L1). A lot of cells have two names: a lineage name and an adult name. The adult name is the name people give to cells that become terminal and differentiate (such as neurons), or that do not differentiate but will divide into an important lineage (such as the P1, P2 ... lineages). Adult names are given by cell position and function, so it is a different naming system. Cells with the same adult name can come from different lineages depending on how bilateral symmetry is broken; for example, P7 can develop either from AB.plappapp or AB.prappapp.
  
Lineage names are accurate and unique but hard for most people to remember, so researchers usually use adult names when running queries. That is why, in WormBase cell nomenclature, whenever an adult name exists we use it to name the cell and store its lineage name inside a data field.
+
Lineage names are accurate and unique but hard for most people to remember, so researchers usually use adult names when running queries. That is why, in WormBase cell nomenclature, whenever an adult name exists we use it to name the cell and store its lineage name inside a data field.
  
=== How to get all the cell types (neurons, actually) in which a gene is expressed? ===
+
=== How to get all the cell types (neurons, actually) in which a gene is expressed? ===
  
When we curate a gene, we enter all the cells and cell groups that express the gene. This information can be easily viewed by clicking the "details" button at the gene page. For example, if you search for eat-16, which is expressed in neurons:
+
When we curate a gene, we enter all the cells and cell groups that express the gene. This information can be easily viewed by clicking the "details" button at the gene page. For example, if you search for eat-16, which is expressed in neurons:  
  
1. At the WormBase home page, select "Any gene", search for "eat-16", and select "Exact Match"; this will take you to the Gene Summary page for eat-16. 2. In the Function section, you will see "Anatomic Expression Pattern". Here you will see some information about the eat-16 expression pattern; at the very end of the entry, you will see a link "Details". 3. If you click this link, you will be brought to the Expression Pattern page for eat-16. On this page you will see the detailed cell and cell group information associated with eat-16. (To keep annotation easy, when a gene is expressed in lots of cells, we enter the cell group name instead of all the cell names one by one. Each cell group includes the list of associated cells.)
+
1. At the WormBase home page, select "Any gene", search for "eat-16", and select "Exact Match"; this will take you to the Gene Summary page for eat-16. 2. In the Function section, you will see "Anatomic Expression Pattern". Here you will see some information about the eat-16 expression pattern; at the very end of the entry, you will see a link "Details". 3. If you click this link, you will be brought to the Expression Pattern page for eat-16. On this page you will see the detailed cell and cell group information associated with eat-16. (To keep annotation easy, when a gene is expressed in lots of cells, we enter the cell group name instead of all the cell names one by one. Each cell group includes the list of associated cells.)
  
=== Is there a file showing the lineage map of the worm? ===
+
=== Is there a file showing the lineage map of the worm? ===
  
Leon Avery has something like that on the Web: http://elegans.swmed.edu/parts/
+
Leon Avery has something like that on the Web: http://elegans.swmed.edu/parts/  
  
== About Orthologs and Homologs ==
+
== About Orthologs and Homologs ==
  
=== How do I find the ortholog / paralog / etc. of gene X? ===
+
=== How do I find the ortholog / paralog / etc. of gene X? ===
  
We do not explicitly make ortholog assignments in WormBase. This is a non-trivial task and something that we leave to external experts whose results we try to make available. There are several sources that may be useful in WormBase. NCBI COGS, InParanoid and TreeFam are all programs that attempt to predict orthologous relationships. InParanoid and TreeFam are visible from the gene pages (see the [http://www.wormbase.org/db/gene/gene?name=WBGene00000405;class=Gene cdk-1] page for an example). The COGs are found on the respective [http://www.wormbase.org/db/seq/protein?name=WP%3ACE00315;class=Protein protein page.]
+
We do not explicitly make ortholog assignments in WormBase. This is a non-trivial task and something that we leave to external experts whose results we try to make available. There are several sources that may be useful in WormBase. NCBI COGS, InParanoid and TreeFam are all programs that attempt to predict orthologous relationships. InParanoid and TreeFam are visible from the gene pages (see the [http://www.wormbase.org/db/gene/gene?name=WBGene00000405;class=Gene cdk-1] page for an example). The COGs are found on the respective [http://www.wormbase.org/db/seq/protein?name=WP%3ACE00315;class=Protein protein page.]
  
[[Glossary_of_terms#inparanoid Glossary_of_terms#inparanoid]
+
[[Glossary of terms#I|inparanoid]]<br>
  
[[Glossary_of_terms#KOGs Glossary_of_terms#KOGs]
+
[[Glossary of terms#K|KOGs]]  
  
 
[http://www.treefam.org TreeFam]
 
[http://www.treefam.org TreeFam]
  
There are also the precomputed BLAST results that are summarised on the gene pages. Each release we also produce a file of best blastp hits for each worm protein which can be found on the [ftp://ftp.wormbase.org/pub/wormbase/acedb/current_release FTP site] called best_blastp_hits.WSXXX.gz
+
<br> There are also the precomputed BLAST results that are summarised on the gene pages. Each release we also produce a file of best blastp hits for each worm protein which can be found on the [ftp://ftp.wormbase.org/pub/wormbase/acedb/current_release FTP site] called best_blastp_hits.WSXXX.gz  
 +
 
 +
In addition, since WS164 we include predicted orthologue assignments based on [http://www.ensembl.org/info/software/compara/index.html Ensembl COMPARA], which predicts orthology of the longest isoform based on homology as well as conserved gene order.
  
In addition, since WS164 we include predicted orthologue assignments based on [http://www.ensembl.org/info/software/compara/index.html Ensembl COMPARA], which predicts orthology of the longest isoform based on homology as well as conserved gene order.
+
You can run this prepared query in WormMart for [http://tinyurl.com/worm-orthologs compara orthologs]
  
=== How do I get a list of all ''C. elegans'' orthologs of ''H. sapiens'' disease genes? ===
+
=== How do I get a list of all ''C. elegans'' orthologs of ''H. sapiens'' disease genes? ===
  
One possible solution is to use EnsMart from Ensembl to query the EnsEMBL databases, which include C.elegans. Go to [http://www.biomart.org/ BioMart ] and pick Caenorhabditis elegans (homology) and as set C.elegans and H.sapiens. Select orthologues_gene and filter as needed by the different types of orthology (one2one and/or one2many). Then pick H.sapiens as the second dataset and select "associated with disease". That should return a list of all orthologues of human disease genes. The only possible problem is that, due to the different release cycles of EnsEMBL and WormBase, the EnsEMBL data might be slightly out of date (the CELXXX on the EnsEMBL pages refers to the corresponding WSXXX WormBase release).
+
One possible solution is to use EnsMart from Ensembl to query the EnsEMBL databases, which include C.elegans. Go to [http://www.biomart.org/ BioMart ] and pick Caenorhabditis elegans (homology) and as set C.elegans and H.sapiens. Select orthologues_gene and filter as needed by the different types of orthology (one2one and/or one2many). Then pick H.sapiens as the second dataset and select "associated with disease". That should return a list of all orthologues of human disease genes. The only possible problem is that, due to the different release cycles of EnsEMBL and WormBase, the EnsEMBL data might be slightly out of date (the CELXXX on the EnsEMBL pages refers to the corresponding WSXXX WormBase release).
  
=== How can I retrieve nematode specific genes with no homology to yeast, fly, mouse, and human? ===
+
=== How can I retrieve nematode specific genes with no homology to yeast, fly, mouse, and human? ===
  
From the "advanced query" WormBase page, construct the following query:
+
/// out dated (Raymond 20080214) /// From the "advanced query" WormBase page, construct the following query:  
  
find predicted_gene NOT Pep_homol
+
find predicted_gene NOT Pep_homol  
  
=== What is the meaning of several abbreviations for proteins that are used by WormBase, like "Protein SW"? ===
+
=== What is the meaning of several abbreviations for proteins that are used by WormBase, like "Protein SW"? ===
  
'''Or "Protein TR" and "Protein WP"? In addition, using the TR Database, sometimes the species origin (e.g., ''C. elegans'') is missing - how can I find out? Furthermore, how can I get from a TR Database entry to the corresponding predicted gene in the ''C. elegans'' genome?'''
+
'''Or "Protein TR" and "Protein WP"? In addition, using the TR Database, sometimes the species origin (e.g., ''C. elegans'') is missing - how can I find out? Furthermore, how can I get from a TR Database entry to the corresponding predicted gene in the ''C. elegans'' genome?'''  
  
SW stands for Swiss-Prot, TR stands for TrEMBL and WP stands for WormPep. In case you're not familiar with any of these protein databases you can go to: http://www.expasy.org/sprot/ and http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ for an explanation and access to them.
+
SW stands for Swiss-Prot, TR stands for TrEMBL and WP stands for WormPep. In case you're not familiar with any of these protein databases you can go to: http://www.expasy.org/sprot/ and http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ for an explanation and access to them.
  
Inside Protein SW or Protein TR, you may find the accession number of Swiss-Prot or TrEMBL. You can get all details of the protein (including species origin) by going to http://www.expasy.org/sprot/ and entering the accession numbers.
+
Inside Protein SW or Protein TR, you may find the accession number of Swiss-Prot or TrEMBL. You can get all details of the protein (including species origin) by going to http://www.expasy.org/sprot/ and entering the accession numbers.
  
=== What do those colorful bars for ''C. briggsae'' alignments mean? ===
+
=== What do those colorful bars for ''C. briggsae'' alignments mean? ===
  
Dark blue bars are regions of strong similarity. Light blue bars are regions of weak similarity. Dashed areas don't match.
+
Dark blue bars are regions of strong similarity. Light blue bars are regions of weak similarity. Dashed areas don't match.  
  
When there are multiple bars in the same region, it means that there are several ''C. briggsae'' clones that all match the region.
+
When there are multiple bars in the same region, it means that there are several ''C. briggsae'' clones that all match the region.  
  
=== How can I retrieve the best blast_p scored homologies of worm genes? ===
+
=== How can I retrieve the best blast_p scored homologies of worm genes? ===
  
(I.e., the homologies produced automatically with each WormBase build -- roughly every 2 weeks?)
+
(I.e., the homologies produced automatically with each WormBase build -- roughly every 2 weeks?)  
  
a. Go to the Wormbase ftp site by following the "Bulk Downloads" link in the "Web Site Directory" section of the Wormbase homepage or by entering the following URL in your browser: ftp://ftp.sanger.ac.uk/pub/wormbase. Select the most current Wormbase release (e.g. WS130).
+
a. Go to the Wormbase ftp site by following the "Bulk Downloads" link in the "Web Site Directory" section of the Wormbase homepage or by entering the following URL in your browser: ftp://ftp.sanger.ac.uk/pub/wormbase. Select the most current Wormbase release (e.g. WS130).
  
b. Download the two best blastp files in this folder: best_blastp_hits.WS130.gz (elegans homologies) and best_blastp_hits_brigpep.WS130.gz (briggsae homologies)
+
b. Download the two best blastp files in this folder: best_blastp_hits.WS130.gz (elegans homologies) and best_blastp_hits_brigpep.WS130.gz (briggsae homologies)
  
c. Unpack the compressed files using suitable software, e.g. gunzip (Linux)
+
c. Unpack the compressed files using suitable software, e.g. gunzip (Linux)
  
d. The files have 15 columns delimited by commas. The contents of the columns are as follows:
+
d. The files have 15 columns delimited by commas. The contents of the columns are as follows:
  
 
     Column 1 : Wormbase peptide accession number for elegans peptide
 
     Column 1 : Wormbase peptide accession number for elegans peptide
Line 358: Line 401:
 
   
 
   
 
     Column 15: e value for best elegans peptide/TrEMBL sequence hit
 
     Column 15: e value for best elegans peptide/TrEMBL sequence hit
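
Reading such a file programmatically might look like the sketch below. It relies only on what is stated above (15 comma-delimited columns, column 1 being the peptide accession and column 15 an e value); the function names are hypothetical, and the meaning of the remaining columns is deliberately left untouched.

```python
import csv
import gzip

EXPECTED_COLUMNS = 15  # as stated in step d


def read_best_hits(lines):
    """Index best-blastp rows by the peptide accession in column 1.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    hits = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != EXPECTED_COLUMNS:
            raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(row)}")
        hits[row[0]] = {
            "columns": row,            # the full row, unchanged
            "trembl_evalue": row[14],  # column 15, kept as text
        }
    return hits


def read_best_hits_file(path):
    """Read a gzip-compressed best_blastp_hits file without unpacking it first."""
    with gzip.open(path, "rt", newline="") as handle:
        return read_best_hits(handle)
```

Because <code>gzip.open</code> reads the compressed file directly, step c can be skipped when working this way.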
 +
  
 
e. You might also want a file that maps Wormbase peptide accession numbers to the corresponding Gene in Wormbase (warning: a single gene may correspond to multiple peptides). For this you will have to perform an AQL query on Wormbase:
* On the banner at the top of the Wormbase homepage select "Searches".
* Select the top search from the resulting list, "Acedb Searches(AQL)".
* Copy and paste the following text into the search text box: select a, a-&gt;Cgc_name, c from a in class Gene, c in a-&gt;Molecular_name where c like "CE*" order by :1 asc
* Choose the "Text output" radio button and click Query ACeDB (the search may take a few minutes).
* The resulting file contains a tab-delimited mapping of Wormbase gene accession numbers to the CGC approved name for that gene (if it has one) and to the peptide accession number for that gene. Save the results file to your hard drive.
 
e. You might also want a file that maps Wormbase peptide accession numbers to the corresponding Gene in Wormbase (warning: a single gene may correspond to multiple peptides). For this you will have to perform an AQL query on Wormbase:
* On the banner at the top of the Wormbase homepage select "Searches".
* Select the top search from the resulting list, "Acedb Searches(AQL)".
* Copy and paste the following text into the search text box: select a, a-&gt;Cgc_name, c from a in class Gene, c in a-&gt;Molecular_name where c like "CE*" order by :1 asc
* Choose the "Text output" radio button and click Query ACeDB (the search may take a few minutes).
* The resulting file contains a tab-delimited mapping of Wormbase gene accession numbers to the CGC approved name for that gene (if it has one) and to the peptide accession number for that gene. Save the results file to your hard drive.
  
=== How can I download the ''C. elegans''-human gene homology map? ===
+
=== How can I download the ''C. elegans''-human gene homology map? ===
  
You can download a file that lists the best blastp match to human, fly, yeast, ''C. briggsae'', SwissProt, and TrEMBL proteins for every C. elegans protein from the WormBase ftp site:
+
You can download a file that lists the best blastp match to human, fly, yeast, ''C. briggsae'', SwissProt, and TrEMBL proteins for every C. elegans protein from the WormBase ftp site:
  
ftp://ftp.sanger.ac.uk/pub/wormbase/current_release
+
ftp://ftp.sanger.ac.uk/pub/wormbase/current_release
  
 
The file name is best_blastp_hits.WSXXX.gz where XXX is the release number.
 
The file name is best_blastp_hits.WSXXX.gz where XXX is the release number.
  
=== How can I download ''C. elegans''-''C.briggsae'' orthologs and their protein-coding DNA sequences? ===
+
=== How can I download ''C. elegans''-''C.briggsae'' orthologs and their protein-coding DNA sequences? ===
 +
 
 +
One possible way to retrieve those would be to download a C. elegans-C.briggsae ortholog file:
 +
 
 +
ftp://ftp.wormbase.org/pub/wormbase/briggsae/supporting_data_stein_2003/orthologs_and_orphans/orthologs.txt
 +
and C. briggsae gene sequences in fasta format (briggenes.fa.gz):
 +
 +
ftp://ftp.wormbase.org/pub/wormbase/briggsae/
 +
 
 +
and write a script that would parse ''C. briggsae'' ortholog sequences based on ''C. elegans'' gene names.
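
A script of the kind mentioned above might look like the following sketch. The layout of orthologs.txt is not described in this FAQ, so the code assumes a whitespace-delimited file with the ''C. elegans'' gene in the first column and the ''C. briggsae'' gene in the second, and FASTA headers whose first word is the ''C. briggsae'' gene name. All function names are hypothetical; check both assumptions against the real files before use.

```python
def parse_fasta(lines):
    """Read FASTA text into a dict of {sequence id: sequence}."""
    chunks = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]  # assumed: first word is the gene name
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def read_ortholog_pairs(lines):
    """Map each C. elegans gene to its C. briggsae ortholog.

    Assumed layout: elegans gene in column 1, briggsae gene in column 2.
    """
    pairs = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or line.startswith("#"):
            continue  # skip blank, short, or comment lines
        pairs[fields[0]] = fields[1]
    return pairs


def briggsae_orthologs_for(elegans_genes, pairs, sequences):
    """Return {elegans gene: (briggsae gene, sequence)} for genes with a match."""
    found = {}
    for gene in elegans_genes:
        briggsae_gene = pairs.get(gene)
        if briggsae_gene in sequences:
            found[gene] = (briggsae_gene, sequences[briggsae_gene])
    return found
```

Genes without an ortholog entry, or whose ortholog is missing from the FASTA file, are simply left out of the result.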
 +
 
 +
Another way would be to use WormMart to get a list of genes with orthologs (filter by Homolog/Ortholog -&gt; Homolog[Compara Orholog]). In the Attribute part you can select whether you want the sequences or just a table of orthologs.
 +
 
 +
== About the User Interface  ==
  
One possible way to retrieve those would be to download a C. elegans-C.briggsae ortholog file:
+
=== What's the difference between the sequence displayed in "Sequence Report" and that in "Genome Browser"?  ===
  
ftp://ftp.wormbase.org/pub/wormbase/briggsae/supporting_data_stein_2003/orthologs_and_orphans/orthologs.txt
+
(For example: Sequence F35E8) The coordinates given in the Sequence Report under 'Genomic Location' are for the sequence F35E8, which is not the full sequence of the clone F35E8. The clone is represented under the diagram of the sequence features and has an arrow pointing off the left end, indicating that the clone extends to the left.
  
and C. briggsae gene sequences in fasta format (briggenes.fa.gz):
+
When you click through to the Genome Browser you're seeing the sequence of F35E8, with the clone again represented under the diagram of sequence features with an arrow pointing left. You have to zoom out to see the full extent of the clone F35E8.
  
  ftp://ftp.wormbase.org/pub/wormbase/briggsae/
+
== About Nomenclature ==
  
and write a script that would parse ''C. briggsae'' ortholog sequences based on ''C. elegans'' gene names.
+
=== How can I register a new lab?  ===
  
Another way would be to use WormMart to get a list of genes with orthologs (filter by Homolog/Ortholog -&gt; Homolog[Compara Orholog]). In the Attribute part you can select whether you want the sequences or just a table of orthologs.
+
New lab and allele designations should be registered directly with Jonathan Hodgkin (jah@bioch.ox.ac.uk), WormBase genetic nomenclature coordinator.
 +
 
 +
== About Reagents (such as cosmid clones)  ==
 +
 
 +
=== What are the different types of clones in WormBase?  ===
 +
 
 +
1. There are seven different types of "Clone" objects in WormBase:<br>
 +
 
 +
{|
 +
|-
 +
| Type
 +
| Nomenclature
 +
|-
 +
| Cosmid
 +
| A*, B*, C*, D*, F*,J*,K*,M*,R*,T*,W*,Z*
 +
|-
 +
| Fosmid
 +
| H* WRM*<br>
 +
|-
 +
| YAC
 +
| Y*
 +
|-
 +
| cDNA
 +
| yk*, EC*, EB*, OST*, CK*, EF*, CEE*, CEM*, CES*, CB*, CN*, cm*<br>
 +
|-
 +
| Plasmid: PCR clones
 +
| V*,EGAP*
 +
|-
 +
| Other
 +
| telo clones, 1 BAC, plasmid<br>
 +
|}
 +
 
 +
<br> Most cosmids, fosmids, YACs can be requested from Sanger, cDNA (yk*) from Dr. Yuji Kohara. The EGAP* plasmids can be obtained from MRC Geneservice. The V* plasmids are no longer available.
 +
 
 +
=== Whom could I contact about getting a cDNA clone?  ===
 +
 
 +
All cDNA clones with a yk prefix can be ordered by the following method. All other cDNA clones will have to be requested from the submitting party (found by looking at the EMBL/GenBank entry).
 +
 
 +
Please go to NextDB (http://nematode.lab.nig.ac.jp), Yuji Kohara's cDNA database repository. You can obtain cDNA clones from Yuji Kohara at the National Institute of Genetics, Mishima, Japan: ykohara@LAB.nig.ac.jp
 +
 
 +
=== Useful Clone info  ===
 +
 
 +
'''1) How do I find out about the vectors used in the genome sequencing project?'''
 +
 
 +
If you want the actual [ftp://ftp.sanger.ac.uk/pub/vector_sequences/ sequence of the vectors] used, they are on the Sanger FTP site. This page should help you [http://www.sanger.ac.uk/Projects/C_elegans/DOCS/vectors.shtml identify the vector] for the clone you are interested in.
 +
 
 +
=== How do I order C. elegans Cosmids Fosmids and Yacs?  ===
 +
 
 +
Cosmids and Fosmids are available to the community via these routes:
 +
 
 +
==== '''1) How do I obtain C.elegans cosmids/Yacs?'''  ====
 +
 
 +
Information can be found here: [[Cosmids/YACs]]
 +
 
 +
==== '''2) How do I obtain C.elegans fosmid from the Moerman fosmid library?'''  ====
 +
 
 +
Information can be found here: [[Fosmids]]
 +
 
 +
==== '''3) How do I obtain C. elegans fosmid from the Incyte Genomics Inc. fosmid library?'''  ====
 +
 
 +
 
 +
Information can be found [http://wiki.wormbase.org/index.php/Fosmids#C._elegans_fosmid_from_the_Incyte_Genomics_Inc._fosmid_library. Here]
 +
 
 +
=== How do I view alternate C. elegans clones that contain my gene? Similar to a physical map containing sequences and un-sequenced clones from the Sequencing project.  ===
 +
 
 +
There is a view available on the website that displays the ACeDB graphical display for the physical map.
 +
 
 +
It is available from the clone summary page for the clone that your gene resides on.
 +
 
 +
Long Example:
 +
 
 +
* I'm interested in lgc-4 (WBGene00017580)
 +
 
 +
* On the [http://www.wormbase.org/db/gene/gene?name=WBGene00017580;class=Gene Gene Summary] page I can click on the [http://www.wormbase.org/db/seq/sequence?name=F18G5.4;class=Gene_name Sequence name: F18G5.4] in the IDs: table.
 +
 
 +
* This takes me to the [http://www.wormbase.org/db/seq/sequence?name=F18G5.4;class=Gene_name Sequence Summary] page for F18G5.4.
 +
 
 +
* On here I can click on [http://www.wormbase.org/db/seq/clone?name=F18G5;class=Clone Source clone: F18G5].
 +
 
 +
* Finally we are on the Clone report page.
 +
 
 +
* On the F18G5 [http://www.wormbase.org/db/seq/clone?name=F18G5;class=Clone Clone Report] page you can click on [http://www.wormbase.org/db/misc/epic?name=F18G5;class=Clone "Acedb Image"] in the yellow banner near the top of the page.
 +
 
 +
* This takes you to the view you require. (The clones highlighted in yellow and red are the "golden path"; the rest are unsequenced clones not used in the assembly.)
 +
 
 +
<b>or</b>
  
== About the User Interface ==
+
Alternatively, you can enter a URL into your browser directly.
  
=== What's the difference between the sequence displayed in "Sequence Report" and that in "Genome Browser"? ===
+
Example:
  
(For example: Sequence F35E8) The coordinates given in the Sequence Report under 'Genomic Location' are for the sequence F35E8, which is not the full sequence of the clone F35E8. The clone is represented under the diagram of the sequence features and has an arrow pointing off the left end, indicating that the clone extends to the left.
+
<pre>
 +
If you are interested in AH6.1 then this url would return a cosmid map view of the old cosmids,
 +
fosmids and yacs including un-sequenced clones.
  
When you click through to the Genome Browser, you're seeing the sequence of F35E8, with the clone again represented under the diagram of sequence features with an arrow pointing left. You have to zoom out to see the full extent of the clone F35E8.
+
http://www.wormbase.org/db/misc/epic?name=AH6;class=Clone
  
== About Nomenclature ==
+
</pre>
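The URL pattern above can also be built programmatically. The following is a minimal sketch, not an official WormBase client: the function names are hypothetical, and the rule of dropping everything after the last dot to get the clone name is an assumption that fits cosmid-style names such as AH6.1 or F18G5.4 but may not hold for every sequence name (for example, YAC segment names).

```python
from urllib.parse import quote

# Base address taken from the example URL above.
EPIC_BASE = "http://www.wormbase.org/db/misc/epic"


def clone_from_sequence(seq_name):
    """Guess the source clone from a sequence name, e.g. 'AH6.1' -> 'AH6'.

    Assumption: the clone name is everything before the last dot.
    """
    return seq_name.rsplit(".", 1)[0]


def physical_map_url(clone):
    """Build the 'Acedb Image' physical map URL for a clone name."""
    return f"{EPIC_BASE}?name={quote(clone)};class=Clone"


if __name__ == "__main__":
    print(physical_map_url(clone_from_sequence("AH6.1")))
```

If the clone name is already known, `physical_map_url` can be called with it directly.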
  
=== How can I register a new lab? ===
+
------
  
New lab and allele designations should be registered directly with Jonathan Hodgkin (jah@bioch.ox.ac.uk), the WormBase genetic nomenclature coordinator.
+
=== How can I find allele information?  ===
  
=== What are the different types of clones in WormBase? ===
+
We do have information on many thousands of alleles in WormBase. We have also tried to extract the molecular details of the mutations (where known) and add those to WormBase. Some examples:
  
1. There are seven different types of "Clone" objects in WormBase:
+
Go to a gene page: http://www.wormbase.org/db/gene/gene?name=unc-71;class=Locus Then click on the link to the 'ay47' allele (near the bottom of the page), this takes you to: http://www.wormbase.org/db/gene/allele?name=ay47;class=Allele You can see that there is a 'c' to 't' substitution in this gene. If you go to the genome browser display for this gene: http://www.wormbase.org/db/seq/gbrowse/wormbase?name=unc-71 Then turn on the 'SNPs, Knockouts, and other Alleles' track and you will see the positions of the alleles in this gene.
  
Cosmid : cosmids (A*,B*,C*,D*,F*,J*,K*,M*,R*,T*,W*,Z*)
+
To find other alleles, you can go to the query page: http://www.wormbase.org/db/searches/wb_query and type the following queries (everything between the single quotes): 'Find Allele; Substitution' 'Find Allele; Deletion' 'Find Allele; Insertion'
Fosmid : fosmids (H*)
 
YAC : yacs (Y*)
 
cDNA : Yuji et al (yk*)
 
Plasmid : PCR clones (V*,EGAP*)
 
Other Text : telo clones, 1 BAC
 
  
Most cosmids, fosmids, YACs can be requested from Sanger, cDNA (yk*) from Dr. Yuji Kohara. The EGAP* plasmids can be obtained from MRC Geneservice. The V* plasmids are no longer available.
+
=== I'm interested in the CB4858 pas* SNP data. Can I get a bulk download?  ===
  
== About Reagents (such as cosmid clones) ==
+
A complete dataset of pas snp data is available from [ftp://ftp.sanger.ac.uk/pub/wormbase/misc_datasets/pas_data.txt.gz Here]
  
=== What are the different types of clones in WormBase? ===
+
Explanation of dataset:
  
[/wiki/index.php?title=FAQs:Name&action=edit Answered here]
+
Substitution - the snp sits between Flank1 and Flank2 (gggtAtcg) and this makes up the N2 genomic hit. This is a 1bp feature as the snp is contained in the N2 genomic.  
  
=== Whom could I contact about getting a cDNA clone? ===
+
ID              Type            N2/CB  Chrom  Coordinate1    Coordinate2    Flank1            Flank2
 +
pas10021        Substitution    A/G    V      19689782        19689782        -cut-aattttgggt    tcgaccttgaaa-cut-
 +
  
Please go to NextDB (http://nematode.lab.nig.ac.jp/db/index.html), Yuji Kohara's cDNA database repository. You can obtain cDNA clones from Yuji Kohara at the National Institute of Genetics, Mishima, Japan: ykohara@LAB.nig.ac.jp
+
Deletion - the snp sits between Flank1 and Flank2 (ttttCacacttt) and this makes up the N2 genomic hit. This is a 1bp feature as the snp is contained in the N2 genomic.  
  
=== How do I order C. elegans Cosmids Fosmids and Yacs? ===
+
ID              Type            N2      Chrom  Coordinate1    Coordinate2    Flank1            Flank2
 +
pas44643        Deletion        C       X      193667          193667          -cut-aaccattttt    acactttttggctta-cut-
 +
  
Cosmids and Fosmids are available to the community via these routes:
+
Insertion - The insertion is in CB4858, so the 2 flanking sequences abut each other (accttaaaaaaaa) and you get a 2bp feature, as the N2 bases to the left and right are marked up (notice the pair of coordinates). In this case, CB4858 has an A between the relative N2 positions 116070 and 116071.
  
'''1) How do I find out about the vectors used in the genome sequencing project?'''
+
ID              Type            CB4858  Chrom  Coordinate1    Coordinate2    Flank1            Flank2
 +
pas44644        Insertion      A      I       116070          116071          -cut-aactcaaaacctt aaaaaaaa-cut-
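The three record types above can be read with a few lines of code. This is a minimal sketch that assumes each data row in the download has the same eight whitespace-separated columns as the example rows shown here; the actual layout of pas_data.txt.gz has not been checked, and the names `PasRecord`, `parse_pas_line` and `feature_length` are hypothetical.

```python
from typing import NamedTuple


class PasRecord(NamedTuple):
    snp_id: str
    kind: str      # Substitution, Deletion or Insertion
    allele: str    # e.g. 'A/G' for a substitution, a single base otherwise
    chrom: str
    start: int     # Coordinate1 (N2 genomic)
    end: int       # Coordinate2 (N2 genomic)
    flank1: str
    flank2: str


def parse_pas_line(line):
    """Parse one data row (not the header) into a PasRecord."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return PasRecord(fields[0], fields[1], fields[2], fields[3],
                     int(fields[4]), int(fields[5]), fields[6], fields[7])


def feature_length(rec):
    """Length of the marked-up N2 feature: 1bp for substitutions and
    deletions, 2bp for insertions (the two N2 bases either side)."""
    return rec.end - rec.start + 1


if __name__ == "__main__":
    row = "pas44644 Insertion A I 116070 116071 -cut-aactcaaaacctt aaaaaaaa-cut-"
    rec = parse_pas_line(row)
    print(rec.snp_id, rec.kind, feature_length(rec))
```

Header rows start with "ID" and should be skipped before calling `parse_pas_line`.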
  
If you want the actual [ftp://ftp.sanger.ac.uk/pub/vector_sequences/ sequence of the vectors] used they are on the Sanger FTP site.
+
=== What's the easiest way for users to find the 'TRUE' ends, and thus the insert size, of a clone?  ===
  
and this should help you [http://www.sanger.ac.uk/Projects/C_elegans/DOCS/vectors.shtml identify the vector] for the clone you are interested in
+
The set of clone ends is dumped as part of the gff files:  
  
'''2) How do I obtain C.elegans cosmids/Yacs?'''
+
http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/GFF_files.shtml
  
Please send all requests for C.elegans genomic clones to Audrey Fraser (aef@sanger.ac.uk) Please remember to include a FedEx/DHL Account number to which we can charge the shipping cost. Note that the suffixes to YAC names (e.g. Y73B3A, Y73B3B) indicate sequenced segments of the YAC (Y73B3 in this case). They do not exist as subclones. You will be sent the whole YAC.
+
This is the source for the extents displayed in WormBase.  
  
'''3) How do I obtain C.elegans fosmid library?'''
+
The caveat with this is that the 'true' end is not marked up for all clones. The early cosmids do not have such annotations because nobody thought about marking them up. Later cosmids do have clone left and right ends as this became part of the standard procedure. Finally, many of the YACs do not have clone ends because the segment submitted to GenBank/EMBL is much smaller than the full clone, and hence the true ends lie within sequences already finished at that stage of the sequencing (i.e. we never went back to update clone ends in sequence already finished).  
  
Geneservice Ltd; a UK based reagent and service provider distribute many genomic libraries worldwide. Here is the link for their distribution of Don Moerman's C.elegans fosmid library. http://www.geneservice.co.uk/products/clones/Celegans_Fos.jsp
+
=== How can I FTP-download the genomic DNA database and the EST database for ''C. elegans''?  ===
  
=== How can I find allele information? ===
+
Our underlying database for WormBase is built on the acedb software (available freely from www.acedb.org). If you have acedb installed locally, you can download the entirety of our database from: ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release
  
We do have information on many thousands of alleles in WormBase. We have also tried to extract the molecular details of the mutations (where known) and add those to WormBase. Some examples:
+
However, a simpler approach may be to just download a GFF file and DNA file for each chromosome from: ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release/CHROMOSOMES/
  
Go to a gene page: http://www.wormbase.org/db/gene/gene?name=unc-71;class=Locus Then click on the link to the 'ay47' allele (near the bottom of the page), this takes you to: http://www.wormbase.org/db/gene/allele?name=ay47;class=Allele You can see that there is a 'c' to 't' substitution in this gene. If you go to the genome browser display for this gene: http://www.wormbase.org/db/seq/gbrowse/wormbase?name=unc-71 Then turn on the 'SNPs, Knockouts, and other Alleles' track and you will see the positions of the alleles in this gene.
+
=== Where are the flat files for the gene annotation of each chromosome of ''C. elegans''? ===
  
To find other alleles, you can go to the query page: http://www.wormbase.org/db/searches/wb_query and type the following queries (everything between the single quotes): 'Find Allele; Substitution' 'Find Allele; Deletion' 'Find Allele; Insertion'
+
You should take a look at the Feature Tables (GFF), which you can pick up from the same 'WormBase Downloads' page where you found the "Summary Tables" (http://www.wormbase.org/downloads.html).
  
=== What's the easiest way for users to find the 'TRUE' ends, and thus the insert size, of a clone? ===
+
You should also look at the 'Batch Downloads' page at WormBase (http://www.wormbase.org/db/searches/info_dump), where you can build your own tables of gene annotations.
  
The set of clone ends is dumped as part of the gff files:
+
One other WormBase page you should look at is the "Genome Dumper" (http://www.wormbase.org/db/searches/advanced/dumper).
  
http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/GFF_files.shtml
+
=== Where can I find details of the methods used to create wormpep?  ===
  
This is the source for the extents displayed in WormBase.
+
We make WormPep during each release of WormBase and the starting point is always a translation of our latest set of gene predictions. Gene predictions are initially based on the GeneFinder prediction program with human modification as is deemed necessary. The level of human involvement really depends on what other supporting data is available. Aside from routine inspection of gene predictions based on EST/mRNA data we also evaluate our predictions based on information from published papers and direct contact from the worm community. All gene predictions have been looked at by a human to some level.  
  
The caveat with this is that the 'true' end is not marked up for all clones. The early cosmids do not have such annotations because nobody thought about marking them up. Later cosmids do have clone left and right ends as this became part of the standard procedure. Finally, many of the YACs do not have clone ends because the segment submitted to GenBank/EMBL is much smaller than the full clone, and hence the true ends lie within sequences already finished at that stage of the sequencing (i.e. we never went back to update clone ends in sequence already finished).
+
We have started to distinguish subsets of WormPep. Thus all WormPep proteins can be thought of as either 'CONFIRMED', 'PARTIALLY CONFIRMED', or 'PREDICTED'. The first set contains all genes where there is transcript evidence for every base of every exon of the gene (note that this can still - in theory - mean that there are unpredicted exons in a 'CONFIRMED' gene). The second set contains genes for which there is some transcript evidence but the whole gene is not yet supported...either due to lack of transcript evidence or errors in our current gene prediction. The last set is everything else, i.e. genes with no transcript support. In the future we may expand this classification system to take account of other evidence (e.g. homology info from C. briggsae).  
  
=== How can I FTP-download the genomic DNA database and the EST database for ''C. elegans''? ===
+
Each new build usually sees a slight increase in the first two categories and a drop in the third category. The relevant status of each Wormpep entry is added into the FASTA header of every entry in each WormPep release.  
  
Our underlying database for WormBase is built on the acedb software (available freely from www.acedb.org). If you have acedb installed locally, you can download the entirety of our database from: ftp://ftp.sanger.ac.uk/pub/wormbase/current_release
+
=== I cannot find information for a gene mentioned in early microarray or RNAi studies: why?  ===
  
However, a simpler approach may be to just download a GFF file and DNA file for each chromosome from: ftp://ftp.sanger.ac.uk/pub/wormbase/current_release/CHROMOSOMES/
+
In early versions of microarray and RNAi libraries, clone and gene names were often used synonymously. Because gene models and names change between WormBase releases and the history has not been carefully preserved, this caused much confusion. Fortunately, for those clones for which we have sequence information, we provide an up-to-date mapping from each clone to the current gene models. However, sometimes we don't have sufficient sequence information for a given clone and are thus unable to provide any information about its identity; in that case one must ask the primary generators of that clone (the corresponding authors of the publication) for more information. Below is an example of a 'lost' clone:
  
These two sets of files contain information on all sequence features (coordinates of genomic clones, genes, BLAST hits etc.). EST and mRNA sequences can be downloaded from: ftp://ftp.sanger.ac.uk/pub/C.elegans_sequences/ESTS/C.elegans_nematode_ESTs.gz
+
"what is the present location of Y41D4A_2491.a?"
  
=== Where are the flat files for the gene annotation of each chromosome of ''C. elegans''? ===
+
Simple answer: there is none. A simple search for "Anything" "Y41D4A_2491.a" produces hits indicating that Y41D4A_2491.a was used as a clone/gene name in the Stanford microarray library. For sequence information, WormBase177 only has sequences for the oligos, not for a PCR_product. The pair of oligo sequences (Oligo: sjj_Y41D4A_2491.a_b; Oligo: sjj_Y41D4A_2491.a_f) fails to produce an ePCR product, and each oligo individually fails to map to the genome when searched with the Genome Browser oligo mapping tool.
  
You should take a look at the Feature Tables (GFF), which you can pick up from the same 'WormBase Downloads' page where you found the "Summary Tables" (http://www.wormbase.org/downloads.html).
+
We keep up-to-date mapping files here: ftp://caltech.wormbase.org/pub/annots/rnai/
  
You should also look at the 'Batch Downloads' page at WormBase (http://www.wormbase.org/db/searches/info_dump), where you can build your own tables of gene annotations.
+
<br>
  
One other WormBase page you should look at is the "Genome Dumper" (http://www.wormbase.org/db/searches/advanced/dumper).
+
=== I am trying to figure what convention was used when the gene names were changed from a letter code to a number code (for example Y17G9B.a-i to Y17G9B.1-9). ===
  
=== Where can I find details of the methods used to create wormpep? ===
+
This naming convention change occurred following the initial annotation phase back in the 90's. Genomic clones were originally submitted with cosmid.letter annotations prior to 1998 but this was changed to increase the depth of the nomenclature as some clones started to have more than 26 genes.<br> <br> There are 3 approaches to identify the current gene that corresponds to an original letter code gene locus.
  
We make WormPep during each release of WormBase and the starting point is always a translation of our latest set of gene predictions. Gene predictions are initially based on the GeneFinder prediction program with human modification as is deemed necessary. The level of human involvement really depends on what other supporting data is available. Aside from routine inspection of gene predictions based on EST/mRNA data we also evaluate our predictions based on information from published papers and direct contact from the worm community. All gene predictions have been looked at by a human to some level.
+
1) Search through the wormpep.history file within the [ftp://ftp.sanger.ac.uk/pub/wormbase/WS190/wormpep190.tar.gz Wormpep190.tar.gz] archive found [ftp://ftp.sanger.ac.uk/pub/wormbase/WS190/ here].  
  
We have started to distinguish subsets of WormPep. Thus all WormPep proteins can be thought of as either 'CONFIRMED', 'PARTIALLY CONFIRMED', or 'PREDICTED'. The first set contains all genes where there is transcript evidence for every base of every exon of the gene (note that this can still - in theory - mean that there are unpredicted exons in a 'CONFIRMED' gene). The second set contains genes for which there is some transcript evidence but the whole gene is not yet supported...either due to lack of transcript evidence or errors in our current gene prediction. The last set is everything else, i.e. genes with no transcript support. In the future we may expand this classification system to take account of other evidence (e.g. homology info from C. briggsae).
+
If you look for your gene eg. Y17G9B.g you get:
 +
Y17G9B.g CE21394 17 18
 +
 +
Then if you look for all occurrences of the CE21394 number:
 +
Y17G9B.g CE21394 17 18
 +
Y17G9B.4 CE21394 18 72
 +
 +
From this you can assume that .g was renamed .4 as the gene encodes the same protein.  
 +
This doesn't always work as the gene may have undergone some annotation changes which breaks this link.  
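The lookup in approach 1 can be scripted. The sketch below assumes each line of wormpep.history starts with a gene name followed by its CE protein ID, as in the example rows above; any further columns are ignored, and the function name `candidate_renames` is hypothetical. As noted, a shared protein ID is only a hint, so an empty result does not prove that the gene was retired.

```python
from collections import defaultdict


def candidate_renames(old_name, history_lines):
    """Return other gene names that share a CE protein ID with old_name.

    history_lines: iterable of lines such as 'Y17G9B.g CE21394 17 18'.
    """
    names_by_protein = defaultdict(set)
    proteins_by_name = defaultdict(set)
    for line in history_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, protein = fields[0], fields[1]
        names_by_protein[protein].add(name)
        proteins_by_name[name].add(protein)

    candidates = set()
    for protein in proteins_by_name.get(old_name, ()):
        candidates |= names_by_protein[protein]
    candidates.discard(old_name)
    return sorted(candidates)


if __name__ == "__main__":
    sample = [
        "Y17G9B.g CE21394 17 18",
        "Y17G9B.4 CE21394 18 72",
        "Y17G9B.a CE21388 17 18",
    ]
    print(candidate_renames("Y17G9B.g", sample))
```

For a real run, pass the open wormpep.history file object instead of the sample list.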
 +
  
Each new build usually sees a slight increase in the first two categories and a drop in the third category. The relevant status of each Wormpep entry is added into the FASTA header of every entry in each WormPep release.
+
<br> 2) If the above doesn't work and you have a small list you can blast the old peptide against the genome and see which gene it overlaps.  
  
=== I cannot find information for a gene mentioned in early microarray or RNAi studies: why? ===
+
look in the wormpep.fasta190 file for the peptide sequence.
 +
 +
grep Y17G9B.a wormpep.history190 as before
 +
Y17G9B.a CE21388 17 18
 +
 +
wormpep.fasta190:
 +
CE21388 MLRLKNFSNLRELSTDS--snip--PVDDLISFLETFELDEEDE
 +
 
 +
TBLASTN the peptide sequence against the elegans genome and see where it hits the current assembly.
 +
  
In early versions of microarray and RNAi libraries, clone and gene names were often used synonymously. Because gene models and names change between WormBase releases and the history has not been carefully preserved, this caused much confusion. Fortunately, for those clones for which we have sequence information, we provide an up-to-date mapping from each clone to the current gene models. However, sometimes we don't have sufficient sequence information for a given clone and are thus unable to provide any information about its identity; in that case one must ask the primary generators of that clone (the corresponding authors of the publication) for more information. Below is an example of a 'lost' clone:
+
3) If you are interested in genes used in microarray and RNAi libraries see the previous [[#I_cannot_find_information_for_a_gene_mentioned_in_early_microarray_or_RNAi_studies:_why.3F|FAQ]].
  
"what is the present location of Y41D4A_2491.a?"
+
=== Where can I obtain C. elegans strains? ===
  
Simple answer, there's none. A simple search for "Anything" "Y41D4A_2491.a" produces hits that indicate that Y41D4A_2491.a was used as a clone/gene name in Stanford microarray library. For sequence information, WormBase177 only has sequences for the oligos but not a PCR_product. The pair of oligo sequences (Oligo: sjj_Y41D4A_2491.a_b ; Oligo: sjj_Y41D4A_2491.a_f) fails to produce an ePCR product and each individually fails to map to the genome when searched with Genome browser oligo mapping tool.
+
Please refer to the following link to obtain C. elegans strains:
  
We keep up-to-date mapping files here: ftp://caltech.wormbase.org/pub/annots/rnai/
+
[http://www.cbs.umn.edu/CGC/strains/ http://www.cbs.umn.edu/CGC/strains/]
  
== About Database Queries ==
+
== About Database Queries ==
  
=== How do I get [http://wormbase.org/db/searches/aql_query AQL] to search data in hashes? ===
+
=== How do I get [http://wormbase.org/db/searches/aql_query AQL] to search data in hashes? ===
  
Like this example:
+
Like this example:  
  
select p-&gt;Standard_name, a[Institution], a[Email] from p in class Person, a in p-&gt;Address[0] where exists p-&gt;Supervised
+
select p-&gt;Standard_name, a[Institution], a[Email] from p in class Person, a in p-&gt;Address[0] where exists p-&gt;Supervised  
  
More documentation is available at http://www.acedb.org/Software/whelp/AQL/examples_worm.shtml<nowiki>; scroll down to "Queries on objects containing hash structures." </nowiki>
+
More documentation is available at http://www.acedb.org/Software/whelp/AQL/examples_worm.shtml<nowiki>; scroll down to "Queries on objects containing hash structures." </nowiki>  
  
=== How can I obtain all the abstracts on Wormbase and the particular genes that they are associated with? ===
+
=== How can I obtain all the abstracts on Wormbase and the particular genes that they are associated with? ===
  
There are two ways:
+
There are two ways:  
  
1) go to ftp://ftp.sanger.ac.uk/pub/wormbase/current_release/ and get the acedb data files.
+
1) go to ftp://ftp.sanger.ac.uk/pub/wormbase/current_release/ and get the acedb data files.  
  
2) Use AcePerl to get the abstracts. You can do this easily with an Aceperl script:
+
2) Use AcePerl to get the abstracts. You can do this easily with an Aceperl script:  
  
 
     my $db = Ace-&gt;connect(-host=&gt;'yourhost.com') || warn 'yikes';
 
     my $db = Ace-&gt;connect(-host=&gt;'yourhost.com') || warn 'yikes';
Line 516: Line 678:
 
     }
 
     }
  
=== How can I find out how many genes contain expression patterns generated with a specific method? ===
+
=== How can I find out how many genes contain expression patterns generated with a specific method? ===
  
'''(For example, by in situ hybridization?)'''
+
'''(For example, by in situ hybridization?)'''  
  
Type the following command into the "Advanced Search" box. The following line searches for Expr_patterns containing all three types of methods. If the '&amp;' is replaced by '|', the command will search for Expr_patterns with In_situ OR Antibody OR Reporter_gene data.
+
Type the following command into the "Advanced Search" box. The following line searches for Expr_patterns containing all three types of methods. If the '&amp;' is replaced by '|', the command will search for Expr_patterns with In_situ OR Antibody OR Reporter_gene data.
  
 
  find Expr_pattern Type="In_situ" &amp; Type="Antibody" &amp; Type="Reporter_gene"
 
  find Expr_pattern Type="In_situ" &amp; Type="Antibody" &amp; Type="Reporter_gene"
 +
  
 
You may change the words following the same syntax to search for other objects.
 
You may change the words following the same syntax to search for other objects.
  
=== How can I download the alignments of EST sequences to genomic sequences? ===
+
=== How can I download the alignments of EST sequences to genomic sequences? ===
  
You can extract it from the GFF files that we provide with every release of WormBase. For more information on GFF files see:
+
You can extract it from the GFF files that we provide with every release of WormBase. For more information on GFF files see:  
  
http://www.sanger.ac.uk/Software/formats/GFF/
+
http://www.sanger.ac.uk/Software/formats/GFF/  
  
Basically, we release one GFF file per chromosome and this contains the coordinates and details of most features that we can map onto chromosome base pair coordinates.
+
Basically, we release one GFF file per chromosome and this contains the coordinates and details of most features that we can map onto chromosome base pair coordinates.  
  
These files are accessible from the main WormBase page (see the Feature table links on the right) and should also be on the WormBase and Sanger Institute FTP sites.
+
These files are accessible from the main WormBase page (see the Feature table links on the right) and should also be on the WormBase and Sanger Institute FTP sites.
  
You will need to extract only a subset of these files, i.e. lines that match the pattern 'BLAT_EST_'. This is very easy to do if you have access to a UNIX/Linux system (use the 'grep' command).
+
You will need to extract only a subset of these files, i.e. lines that match the pattern 'BLAT_EST_'. This is very easy to do if you have access to a UNIX/Linux system (use the 'grep' command).  
  
E.g. here are two sample lines from the Chromosome II file (these will probably wrap around your screen):
+
E.g. here are two sample lines from the Chromosome II file (these will probably wrap around your screen):  
  
 
  CHROMOSOME_II BLAT_EST_BEST similarity 5754433 5755008 100
 
  CHROMOSOME_II BLAT_EST_BEST similarity 5754433 5755008 100
Line 544: Line 707:
 
  CHROMOSOME_II BLAT_EST_OTHER similarity 5755968 5755971 98.4
 
  CHROMOSOME_II BLAT_EST_OTHER similarity 5755968 5755971 98.4
 
  . . Target "Sequence:yk4g4.5" 116 119
 
  . . Target "Sequence:yk4g4.5" 116 119
 +
  
 
Within these lines are details of the chromosome coordinates, the BLAT score, the matching sequence name, and the coordinates within the matching sequence.
 
Within these lines are details of the chromosome coordinates, the BLAT score, the matching sequence name, and the coordinates within the matching sequence.
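If 'grep' is not available, the same extraction can be sketched in a few lines of Python. This is an illustration only: it assumes the nine-column GFF layout seen in the sample lines above (sequence, source, feature, start, end, score, strand, frame, attributes) and a Target attribute of the form `Target "Sequence:NAME" start end`; the function name `blat_est_hits` is hypothetical.

```python
import re

# Matches the attribute column of the sample lines, e.g.
#   Target "Sequence:yk4g4.5" 116 119
TARGET_RE = re.compile(r'Target\s+"Sequence:([^"]+)"\s+(\d+)\s+(\d+)')


def blat_est_hits(lines):
    """Yield one dict per GFF line whose source starts with 'BLAT_EST_'."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        cols = line.rstrip("\n").split(None, 8)
        if len(cols) < 9 or not cols[1].startswith("BLAT_EST_"):
            continue
        match = TARGET_RE.search(cols[8])
        if match is None:
            continue
        yield {
            "chromosome": cols[0],
            "source": cols[1],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "score": float(cols[5]),
            "est": match.group(1),
            "est_start": int(match.group(2)),
            "est_end": int(match.group(3)),
        }


if __name__ == "__main__":
    sample = ('CHROMOSOME_II BLAT_EST_OTHER similarity 5755968 5755971 98.4 '
              '. . Target "Sequence:yk4g4.5" 116 119')
    for hit in blat_est_hits([sample]):
        print(hit)
```

To process a whole chromosome file, pass the open file object to `blat_est_hits` instead of the sample list.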
  
=== How can I retrieve timestamps of Acedb from the command line, and do I have to use Perl? ===
+
=== How can I retrieve timestamps of Acedb from the command line, and do I have to use Perl? ===
  
We use AcePerl to retrieve some timestamp information...this is done via an AQL query.
+
We use AcePerl to retrieve some timestamp information...this is done via an AQL query.  
  
E.g. if you wanted to find the timestamp of a tag in a particular object, belonging to a particular class, you could do:
+
E.g. if you wanted to find the timestamp of a tag in a particular object, belonging to a particular class, you could do:  
  
 
  my $aql_query = "select s,s-&gt;$tag.node_session from s in object(\"$class\",\"$object\")";
 
  my $aql_query = "select s,s-&gt;$tag.node_session from s in object(\"$class\",\"$object\")";
Line 557: Line 721:
 
  my $timestamp = $aql[0]-&gt;[1];
 
  my $timestamp = $aql[0]-&gt;[1];
  
=== How can I search for pseudogenes in Wormbase? ===
+
=== How can I search for pseudogenes in Wormbase? ===
  
AQL queries for this will take a long time. However, there is a different way to query if you want to retrieve the information from the WormBase website.
+
AQL queries for this will take a long time. However, there is a different way to query if you want to retrieve the information from the WormBase website.
  
From More search -&gt; Advanced search at http://www.wormbase.org/db/searches/query
+
From More search -&gt; Advanced search at http://www.wormbase.org/db/searches/query  
  
In Query Acedb, type in
+
In Query Acedb, type in  
  
 
  find sequence *; pseudogene
 
  find sequence *; pseudogene
 +
  
 
You should get a result of pseudogene objects within a couple of seconds.
 
You should get a result of pseudogene objects within a couple of seconds.
  
=== Where can I find a list of classes and subclasses for Acedb? ===
+
=== Where can I find a list of classes and subclasses for Acedb? ===
  
You can find a list of Acedb classes by first clicking on the More Searches link on the upper right corner of the WormBase home page. From here, select the WormBase Class Browser, which will bring you to a searchable drop-down menu of all the Acedb classes.
+
You can find a list of Acedb classes by first clicking on the More Searches link on the upper right corner of the WormBase home page. From here, select the WormBase Class Browser, which will bring you to a searchable drop-down menu of all the Acedb classes.  
  
For performing queries, it is helpful to know the data model for each of the classes that you would like to search. The data models can be accessed from this same page by typing the class of interest into the search box and then selecting "Model" from the drop-down menu. This will lead you to a Tree Display that diagrams how data for a particular class is represented in Acedb.
+
For performing queries, it is helpful to know the data model for each of the classes that you would like to search. The data models can be accessed from this same page by typing the class of interest into the search box and then selecting "Model" from the drop-down menu. This will lead you to a Tree Display that diagrams how data for a particular class is represented in Acedb.  
  
Also, from the More Searches link, you can access the Advanced AQL Search, which has further documentation and examples for querying the database.
+
Also, from the MoreSearches link, you can access the Advanced AQL Search, which has further documentation and examples for querying the database.  
  
=== How can I retrieve the gi numbers only for a list of entries having the GO term selected? ===
+
=== How can I retrieve the gi numbers only for a list of entries having the GO term selected? ===
  
At present, you can retrieve Genbank identifiers (i.e. AAMxxxxx, AAKxxxxx, AAFxxxxx, etc.) for CDS's that are associated with a particular GO term by performing an AQL query. Here are the steps:
+
At present, you can retrieve Genbank identifiers (i.e. AAMxxxxx, AAKxxxxx, AAFxxxxx, etc.) for CDS's that are associated with a particular GO term by performing an AQL query. Here are the steps:  
  
 
     1) At the top right corner of the WormBase homepage, click on the More Searches link.
 
     1) At the top right corner of the WormBase homepage, click on the More Searches link.
Line 591: Line 756:
 
   
 
   
 
     4) You should get back a three-column table listing each CDS, the GO term you selected, and a Genbank ID.
 
     4) You should get back a three-column table listing each CDS, the GO term you selected, and a Genbank ID.
 +
  
 
If you are interested, the rationale for the query can be better understood by looking at our data model for CDS's, which is at [http://www.wormbase.org/db/misc/etree?name=%3FCDS&class=Model;expand=Visible#Visible http://www.wormbase.org/db/misc/etree?name=%3FCDS&amp;class=Model;expand=Visible#Visible]. The above query searches in the CDS class, in the attribute go_term, where we have defined the go_term to be "GO:0003700", and in the attribute protein_id for the unique text id which is the database identifier. The [1] after the letter c in the query indicates that the search will retrieve information from column 1 (the text column) of protein_id, since the sequence column is counted as column 0.
 
If you are interested, the rationale for the query can be better understood by looking at our data model for CDS's, which is at [http://www.wormbase.org/db/misc/etree?name=%3FCDS&class=Model;expand=Visible#Visible http://www.wormbase.org/db/misc/etree?name=%3FCDS&amp;class=Model;expand=Visible#Visible]. The above query searches in the CDS class, in the attribute go_term, where we have defined the go_term to be "GO:0003700", and in the attribute protein_id for the unique text id which is the database identifier. The [1] after the letter c in the query indicates that the search will retrieve information from column 1 (the text column) of protein_id, since the sequence column is counted as column 0.
  
=== How can I download ''C. briggsae'' 3' UTRs in bulk? ===
+
=== How can I download ''C. briggsae'' 3' UTRs in bulk? ===
  
We don't really have a strictly empirical set of 3' UTRs (3' flanking sequences taken from cDNA). However, what you probably really want are predicted 3' UTR regions. You can get those by going to the Genome Dumper:
+
We don't really have a strictly empirical set of 3' UTRs (3' flanking sequences taken from cDNA). However, what you probably really want are predicted 3' UTR regions. You can get those by going to the Genome Dumper:
  
http://wormbase.org/db/searches/advanced/dumper
+
http://wormbase.org/db/searches/advanced/dumper  
  
selecting the species "C. briggsae", and then filling in the options for the download that you want. The main stumbling block is that the list you need for briggsae genomic sequences is rather long. However, I've tried typing this:
+
selecting the species "C. briggsae", and then filling in the options for the download that you want. The main stumbling block is that the list you need for briggsae genomic sequences is rather long. However, I've tried typing this:  
  
cb25.*
+
cb25.*  
  
into the window "Type in a list of sequence or chromosome names", and that seems to successfully prompt the Genome Dumper to search through all available C. briggsae genomic contigs.
+
into the window "Type in a list of sequence or chromosome names", and that seems to successfully prompt the Genome Dumper to search through all available C. briggsae genomic contigs.  
  
Also, you should pick the "Integrated ('hybrid') briggsae gene set", and select some reasonable value (e.g., 1000 bp) for the length of the 3' flank sequences that you want.
+
Also, you should pick the "Integrated ('hybrid') briggsae gene set", and select some reasonable value (e.g., 1000 bp) for the length of the 3' flank sequences that you want.  
  
=== How can I find the coding sequences of alleles for particular gene(s) having SNPs from ''C. elegans''? ===
+
=== How can I find the coding sequences of alleles for particular gene(s) having SNPs from ''C. elegans''? ===
  
For instance, if you want to find out SNP sequences for H39E23.1a gene, you can use the following AQL query: select a-&gt;predicted_gene, a, a-&gt;flanking_sequences[1],
+
For instance, if you want to find out SNP sequences for H39E23.1a gene, you can use the following AQL query: select a-&gt;predicted_gene, a, a-&gt;flanking_sequences[1],  
  
 
  a-&gt;flanking_sequences[2], a-&gt;substitution from a in class allele where
 
  a-&gt;flanking_sequences[2], a-&gt;substitution from a in class allele where
 
  a-&gt;predicted_gene = "H39E23.1a" and a-&gt;method = "snp"
 
  a-&gt;predicted_gene = "H39E23.1a" and a-&gt;method = "snp"
 +
  
The output of the query (in text mode) looks like this:
+
The output of the query (in text mode) looks like this:  
  
 
  H39E23.1a snp_AH10.2 tgaaaaaaactaatttttaatgtga tcttggccacaattgacctagtttg [A/G]
 
  H39E23.1a snp_AH10.2 tgaaaaaaactaatttttaatgtga tcttggccacaattgacctagtttg [A/G]
 
  H39E23.1a snp_AH10.3 ctgaacaactgaaaaaggaaagaaa agggaaaaagttcgaccacaaaaaa [G/A]
 
  H39E23.1a snp_AH10.3 ctgaacaactgaaaaaggaaagaaa agggaaaaagttcgaccacaaaaaa [G/A]
 +
  
 
Here the first column is the gene name, second is the allele name, third and fourth are sequences flanking the allele and the last one is the actual allele sequence change. You can modify the query to retrieve information for genes that you're interested in.
 
Here the first column is the gene name, second is the allele name, third and fourth are sequences flanking the allele and the last one is the actual allele sequence change. You can modify the query to retrieve information for genes that you're interested in.
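
If you save the text-mode output, the five columns described above can be split apart with a short script. What follows is only a minimal Python sketch, not a WormBase tool. It assumes the columns are separated by whitespace exactly as in the example output, and the names `SnpRow` and `parse_snp_line` are made up for illustration.

```python
from typing import NamedTuple


class SnpRow(NamedTuple):
    """One row of the AQL text output (field names are illustrative)."""
    gene: str
    allele: str
    left_flank: str
    right_flank: str
    change: str


def parse_snp_line(line: str) -> SnpRow:
    """Split one line of the text-mode output into its five columns."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    return SnpRow(*fields)


# The two example rows from the output shown above.
example_output = """\
H39E23.1a snp_AH10.2 tgaaaaaaactaatttttaatgtga tcttggccacaattgacctagtttg [A/G]
H39E23.1a snp_AH10.3 ctgaacaactgaaaaaggaaagaaa agggaaaaagttcgaccacaaaaaa [G/A]
"""

rows = [parse_snp_line(line) for line in example_output.splitlines() if line.strip()]
for row in rows:
    print(row.allele, row.change)
```

The length check is there so that a row with a missing column fails loudly instead of silently shifting the fields.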
  
=== How can I download the spliced and non-spliced regions for all ''C. elegans'' or ''C. briggsae'' genes? ===
+
=== How can I download the spliced and non-spliced regions for all ''C. elegans'' or ''C. briggsae'' genes? ===
  
Using the Batch Gene tool (http://www.wormbase.org/db/searches/info_dump), you can download spliced/unspliced sequences for a list of genes. Paste the list of genes you're interested in into the search box and select the Spliced and Unspliced check boxes in the Sequence field. If you output the data as text, you'll be able to save it to your hard disk.
+
Using the Batch Gene tool (http://www.wormbase.org/db/searches/info_dump), you can download spliced/unspliced sequences for a list of genes. Paste the list of genes you're interested in into the search box and select the Spliced and Unspliced check boxes in the Sequence field. If you output the data as text, you'll be able to save it to your hard disk.
  
To get the list of C. briggsae genes (so that you can paste it into the search box), you can use the following query: select a from a in class cds where a-&gt;species like "*briggsae*".
+
To get the list of C. briggsae genes (so that you can paste it into the search box), you can use the following query: select a from a in class cds where a-&gt;species like "*briggsae*".  
  
 +
<br>
  
=== How do I find all genes with transmembrane or signalp domains? ===
+
=== How do I find all genes with transmembrane or signalp domains? ===
  
Go to the [http://www.wormbase.org/db/searches/wb_query WB Query page] and enter this query.
+
Go to the [http://www.wormbase.org/tools/queries WB Query page] and enter this query.  
  
'find Wormpep where Feature AND NEXT = "signalp"; follow Corresponding_CDS; follow Gene'
+
'find Wormpep where Feature AND NEXT = "signalp"; follow Corresponding_CDS; follow Gene'  
  
substitute "tmhmm" for "signalp" to get genes with transmembrane domains.
+
substitute "tmhmm" for "signalp" to get genes with transmembrane domains.  
  
 
Alternatively, all genes with transmembrane domains are automatically assigned the GO term GO:0016021
 
Alternatively, all genes with transmembrane domains are automatically assigned the GO term GO:0016021
  
== [http://www.wormbase.org/biomart/martview WormMart] questions ==
+
== [http://www.wormbase.org/biomart/martview WormMart] questions ==
 
 
=== Is there a way for a large number of genes to get not just the alleles, but also the actual mutation (when known)? ===
 
  
Here's what you need to do:
+
=== Is there a way for a large number of genes to get not just the alleles, but also the actual mutation (when known)?  ===
  
* Open [http://wormbase.org/biomart/martview/ WormMart]
+
Here's what you need to do:  
* SELECT RELEASE, SELECT DATABASE, SELECT DATASET [Variation]
 
* Click Filters in the left menu, then expand the Other Annotation filter and under Annotated with select [Sequence] Flanking Sequence
 
* Upload your file of GeneIDs to the Specified identifier of type field.
 
* Hit Count then Results to check your file is being read. nb. if you have chosen WS176 you should get n/80791.
 
* Click Attributes in the left menu and under Identification select Variation (Name), Method, Variation Type (merged) and Mutation Type. Under Affects select Gene (WB Gene ID) (merged), Gene (CGC name) and under Description select Sense, Sense Text, Splice site, Splice Site Text and Frameshift
 
* Hit Results
 
  
--Tuli 04:42, 6 July 2007 (EDT)
+
*Open [http://wormbase.org/biomart/martview/ WormMart]
 +
*SELECT RELEASE, SELECT DATABASE, SELECT DATASET [Variation]
 +
*Click Filters in the left menu, then expand the Other Annotation filter and under Annotated with select [Sequence] Flanking Sequence
 +
*Upload your file of GeneIDs to the Specified identifier of type field.
 +
*Hit Count then Results to check your file is being read. nb. if you have chosen WS176 you should get n/80791.
 +
*Click Attributes in the left menu and under Identification select Variation (Name), Method, Variation Type (merged) and Mutation Type. Under Affects select Gene (WB Gene ID) (merged), Gene (CGC name) and under Description select Sense, Sense Text, Splice site, Splice Site Text and Frameshift
 +
*Hit Results
  
<br />
+
--Tuli 04:42, 6 July 2007 (EDT)
  
=== How can I retrieve 1.5 kb promoter region upstream of a bunch of genes? ===
+
<br>
  
* Open WormMart at http://www.wormbase.org/biomart/martview/
+
=== How can I retrieve 1.5 kb promoter region upstream of a bunch of genes?  ===
* SELECT RELEASE, SELECT DATABASE, SELECT DATASET [Gene]
 
* Click Filters in the left menu, then expand the "Identification" filter, tick "[Gene] ID(s) of Type", select "[Gene] Any Name", upload (using "Browse") or type in a list of genes
 
* Hit Count to check your file is being read correctly
 
* Click Attributes in the left menu, tick "Gene Sequences", expand "Sequence Type", tick "Flank (Gene Coding Region)" (for upstream of translation start site), expand "Flanking Regions", tick "Upstream flank", type 1500 in the box
 
* Hit Results
 
* Results can be exported or e-mailed
 
  
=== How can I find all genes expressed somewhere with a particular GO term? ===
+
*Open WormMart at http://www.wormbase.org/biomart/martview/
 +
*SELECT RELEASE, SELECT DATABASE, SELECT DATASET [Gene]
 +
*Click Filters in the left menu, then expand the "Identification" filter, tick "[Gene] ID(s) of Type", select "[Gene] Any Name", upload (using "Browse") or type in a list of genes
 +
*Hit Count to check your file is being read correctly
 +
*Click Attributes in the left menu, tick "Gene Sequences", expand "Sequence Type", tick "Flank (Gene Coding Region)" (for upstream of translation start site), expand "Flanking Regions", tick "Upstream flank", type 1500 in the box
 +
*Hit Results
 +
*Results can be exported or e-mailed
  
(E.G., ''Find all genes expressed in the '''Vulva''' that have '''signal transducer activity'''''?)
+
=== How can I find all genes expressed somewhere with a particular GO term? ===
  
Firstly you need to identify the exact ontology identifiers that correspond to your query. Search for the term in the main search box on the home page with the appropriate category selected.
+
(E.G., ''Find all genes expressed in the '''Vulva''' that have '''signal transducer activity'''''?)
  
    vulva  -  WBbt:0006748
+
Firstly you need to identify the exact ontology identifiers that correspond to your query. Search for the term in the main search box on the home page with the appropriate category selected.
    signal transducer activity - GO:0003779
+
<pre>  vulva  -  WBbt:0006748
 +
</pre><pre>  signal transducer activity - GO:0004871 </pre>
 +
Armed with these you can start your [http://www.wormbase.org/biomart/martview WormMart] query.
  
Armed with these you can start your [http://www.wormbase.org/biomart/martview WormMart] query.
+
*Select which version of the database you are interested in (current release or a recent frozen one)
 +
*Select 'Gene' DATASET
 +
*In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
 +
*Expand 'Annotation'
 +
*Check the ' [Annotation] IDs of Type: ' box and select '[Function] GO term ID' from the dropdown menu
 +
*put the GO term found earlier (GO:0004871) in to the box (or upload from file). (NOTE:You must include the GO: part)
  
* Select which version of the database you are interested in (current release or a recent frozen one)
+
This will find all genes annotated with GO term GO:0004871. Hit Count to see the results so far (111 at the time of writing). Now we will add a second dataset to cross-reference this result with.
* Select 'Gene' DATASET
 
* In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
 
* Expand 'Annotation'
 
* Check the ' [Annotation] IDs of Type: ' box and select '[Function] GO term ID [eg GO:0003779 ]' from the dropdown menu
 
* put the GO term found earlier (GO:004871) in to the box (or upload from file). (NOTE:You must include the GO: part)
 
  
This will find all genes annotated with GO term GO:0019901 - hit count for results so far (at time of writing = 7) Now we will add a second dataset to cross reference this result with.
+
*Back in the left panel, click 'Dataset' to get a drop down menu of other datasets.
 +
*Select 'Expression Pattern'
 +
*In the left panel, under 'Dataset' there is another 'Filters' option. Click to get a similar list as above.
 +
*Select ' Expressed in '
 +
*Check ' Specified identifiers of type ' box and choose ' Anatomy Term [eg WBbt:0006748]' from drop down list.
 +
*Enter the Anatomy term found earlier (or upload from file). (NOTE:You must include the WBbt: part)
  
* Back in the left panel, click 'Dataset' to get a drop down menu of other datasets.
+
This completes the querying part; you now need to select what information you want about the genes that the search finds.
* Select 'Expression Pattern'
 
* In the left panel, under 'Dataset' there is another 'Filters' option. Click to get a similar list as above.
 
* Select ' Expressed in '
 
* Check ' Specified identifiers of type ' box and choose ' Anatomy Term [eg WBbt:0006748]' from drop down list.
 
* Enter the Anatomy term found earlier (or upload from file). (NOTE:You must include the WBbt: part)
 
  
This completes the querying part; you now need to select what information you want about the genes that the search finds.
+
*Click the 'Attributes' section in the left panel and expand the boxes as you need to select output categories of the Gene.
 +
*Click the 'Attributes' under 'Datasets' section to select attributes to do with the expression pattern.
  
* Click the 'Attributes' section in the left panel and expand the boxes as you need to select output categories of the Gene.
+
This [http://www.wormbase.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=wormbase_expr_pattern.default.attributes.expr_pattern|wormbase_expr_pattern.default.attributes.pattern|wormbase_expr_pattern.default.attributes.anatomy_term|wormbase_gene.default.attributes.gene|wormbase_gene.default.attributes.public_name|wormbase_gene.default.attributes.go_term|wormbase_gene.default.attributes.go_term_term&FILTERS=wormbase_expr_pattern.default.filters.anatomy_term.%22WBbt:0003891%22|wormbase_gene.default.filters.species_selection.%22Caenorhabditis%20elegans%22|wormbase_gene.default.filters.go_term.%22GO:0004871%22|wormbase_gene.default.filters.identity_status.%22Live%22&VISIBLEPANEL=resultspanel link]&nbsp;goes to the completed query. You can click on the relevant sections as described above to change any of GO or Anatomy terms and output data. You may need to click 'Results' to see the output of the query.  
* Click the 'Attributes' under 'Datasets' section to select attributes to do with the expression pattern.
 
  
This [http://www.wormbase.org/biomart/martview/a0a51b33af571096ffd2436c15c29a2b link] goes to the completed query. You can click on the relevant sections as described above to change any of GO or Anatomy terms and output data. You may need to click 'Results' to see the output of the query.
+
=== How do I pull out all operon details and the names of genes contained in an Operon?  ===
  
=== How do I pull out all operon details and the names of genes contained in an Operon? ===
+
You can retrieve this data through WormMart
  
You can retrieve this data through WormMart
+
*Select which version of the database you are interested in (current release or a recent frozen one)
 +
*Select 'Gene' DATASET
 +
*In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
  
* Select which version of the database you are interested in (current release or a recent frozen one)
+
Select these Filters:  
* Select 'Gene' DATASET
 
* In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
 
 
 
Select these Filters:
 
  
 
  [Gene] Species: Caenorhabditis elegans
 
  [Gene] Species: Caenorhabditis elegans
 
  [Gene] Status: Live
 
  [Gene] Status: Live
 
  [Location]: Operon:Only (Annotation Tab - Limit to Entries Annotated with:)
 
  [Location]: Operon:Only (Annotation Tab - Limit to Entries Annotated with:)
 +
  
* In the left panel click 'Attributes'
+
*In the left panel click 'Attributes'  
* In the right panel select these Attributes:
+
*In the right panel select these Attributes:
  
Gene WB ID [IDs tab] Operon [Location tab] Operon Start [Location tab] Operon End [Location tab]
+
Gene WB ID [IDs tab] Operon [Location tab] Operon Start [Location tab] Operon End [Location tab]  
  
This will give you a table like this:
+
This will give you a table like this:
  
 
  Gene WB ID Gene Public Name  Operon Operon Start (bp)  Operon End (bp)
 
  Gene WB ID Gene Public Name  Operon Operon Start (bp)  Operon End (bp)
Line 727: Line 895:
 
  WBGene00000038 ace-4           CEOP2632 14197942   14210076
 
  WBGene00000038 ace-4           CEOP2632 14197942   14210076
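
Once the table has been exported as text, it is often handy to group the genes by operon. The sketch below is one possible way to do that in Python. It assumes five whitespace-separated columns in the order shown above and that public names contain no spaces; `genes_by_operon` is an illustrative name, not part of WormMart. The only data row used is the one shown in the example table.

```python
from collections import defaultdict


def genes_by_operon(table_text):
    """Map each operon ID to the list of (gene ID, public name) pairs in it."""
    operons = defaultdict(list)
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row: data rows have exactly five
        # fields and start with a WBGene identifier.
        if len(fields) != 5 or not fields[0].startswith("WBGene"):
            continue
        gene_id, public_name, operon, _start, _end = fields
        operons[operon].append((gene_id, public_name))
    return dict(operons)


example_table = """\
Gene WB ID Gene Public Name  Operon Operon Start (bp)  Operon End (bp)
WBGene00000038 ace-4           CEOP2632 14197942   14210076
"""

for operon, genes in genes_by_operon(example_table).items():
    print(operon, [name for _gene_id, name in genes])
```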
  
=== How do you retrieve all the protein sequences of genes within Operons? ===
+
=== How do you retrieve all the protein sequences of genes within Operons? ===
  
You can retrieve this data through WormMart
+
You can retrieve this data through WormMart  
  
* Select which version of the database you are interested in (current release or a recent frozen one)
+
*Select which version of the database you are interested in (current release or a recent frozen one)  
* Select 'Gene' DATASET
+
*Select 'Gene' DATASET  
* In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
+
*In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
  
Select these Filters:
+
Select these Filters:  
  
 
  [Gene] Species: Caenorhabditis elegans
 
  [Gene] Species: Caenorhabditis elegans
 
  [Gene] Status: Live
 
  [Gene] Status: Live
 
  [Location]: Operon:Only (Annotation Tab - Limit to Entries Annotated with:)
 
  [Location]: Operon:Only (Annotation Tab - Limit to Entries Annotated with:)
 +
  
* In the left panel click 'Attributes'
+
*In the left panel click 'Attributes'  
* In the right hand window click 'Gene Sequences'
+
*In the right hand window click 'Gene Sequences'
  
Select These Attributes:
+
Select These Attributes:  
  
 
  Sequence Type:
 
  Sequence Type:
Line 752: Line 921:
 
   Gene Public Name
 
   Gene Public Name
 
   WB Wormpep ID
 
   WB Wormpep ID
 +
  
This should give you ~2800 on count and results like:
+
This should give you ~2800 on count and results like:  
  
* Click on 'Results'
+
*Click on 'Results'
  
 
  &gt; WBGene00000814|csn-2|WP:CE27562
 
  &gt; WBGene00000814|csn-2|WP:CE27562
Line 768: Line 938:
 
  RIDSIQSNIGTRIKF*
 
  RIDSIQSNIGTRIKF*
 
  etc. etc.
 
  etc. etc.
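
The header attributes selected above end up in the FASTA header line, separated by "|". A minimal Python sketch for reading such a file back in might look like the following. It assumes the three header fields appear in the order Gene WB ID, Gene Public Name, WB Wormpep ID, as in the example, and `read_fasta` is an illustrative name. The sample input reuses the header and the last sequence line shown above; the intervening sequence lines are omitted here just as they are in the text.

```python
def read_fasta(text):
    """Yield (gene_id, public_name, wormpep_id, sequence) for each record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (*header, "".join(chunks))
            # Header fields are separated by "|"; tolerate a space after ">".
            header = tuple(line[1:].strip().split("|"))
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield (*header, "".join(chunks))


sample = """\
> WBGene00000814|csn-2|WP:CE27562
RIDSIQSNIGTRIKF*
"""

records = list(read_fasta(sample))
for gene_id, public_name, wormpep_id, sequence in records:
    print(gene_id, public_name, wormpep_id, len(sequence))
```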
 +
 +
=== How do I retrieve a list of transcription factors which when mutated or targeted by RNAi cause embryonic lethal phenotype?  ===
 +
 +
*Open [http://wormbase.org/biomart/martview/ WormMart]
 +
*Select Database, e.g. "WormBase Release 188"
 +
*Select Dataset - "Phenotype"
 +
*Click on "Filter" link on the left and then on the "+" next to "Phenotype Annotation"
 +
*Select "Phenotype Inc. Descendents" checkbox and select "embryonic_lethal" from the pull down menu
 +
 +
If you click "Count" button at this point, you should see the number of entries that are annotated with this phenotype (this is not necessary).
 +
 +
Now add a second dataset:
 +
 +
*Click on second "Dataset" link on the left
 +
*Choose Additional Dataset - "Gene"
 +
*Click on "Filter" link on the left (for the second dataset) and then on the "+" next to "Annotation"
 +
*Select "[Annotation] IDs of Type" checkbox, select "[Function] GO Term ID" and enter GO:0003700 in the box below (corresponds to transcription factor activity)
 +
*Click on "Attributes" link on the left (for the second dataset) and then on the "+" next to "Function" and select "GO Term Info (merged)" checkbox (if you want to see GO annotations in addition to attributes selected by default, which you can change for each dataset through the Attributes dialog)
 +
*Press Results button and Export all results to File (also check Unique results only)
 +
 +
[http://www.wormbase.org/biomart/martview?VIRTUALSCHEMANAME=default&FILTERS=wormbase_gene.default.filters.go_term.%22GO:0003700%22|wormbase_gene.default.filters.identity_status.%22Live%22|wormbase_gene.default.filters.species_selection.%22Caenorhabditis%20elegans%22|wormbase_phenotype.default.filters.phenotype_name_ancestor_options.%22embryonic_lethal%22&ATTRIBUTES=wormbase_gene.default.attributes.gene|wormbase_gene.default.attributes.go_term_dminfo|wormbase_gene.default.attributes.public_name|wormbase_phenotype.default.attributes.name_primaryname_phenotypename|wormbase_phenotype.default.attributes.phenotype Here] is what you should see.
 +
 +
<br>
 +
 +
=== How do I download/generate a file containing the unspliced transcripts like I see on the sequence pages of WormBase?  ===
 +
 +
I would like to download the sequences that I see on the sequence summary pages eg.<br>
 +
 +
[[Image:Unspliced transcript.jpg|Image:Unspliced_transcript.jpg]]
 +
 +
To do this, replicate the following WormMart query.<br>
 +
 +
<br>
 +
 +
Dataset:&nbsp;&nbsp;&nbsp; CHOOSE DATABASE: WormBase WS195<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; CHOOSE DATASET: Gene<br> <br> Filters: leave as default and add (*)
 +
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [Gene] Species&nbsp;: Caenorhabditis elegans
 +
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [Gene] Status&nbsp;: Live
 +
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * Annotation: [Transcript] Type: Coding
 +
 +
<br> Attributes: select [Gene Sequences] at top of page
 +
 +
&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Sequence Type: Unspliced (Transcript)
 +
 +
&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Header Attributes: Whatever the user requires
 +
 +
<br>
 +
 +
This should give you a count in the region of 22,000 objects and yield ~27,000 sequence objects in your file.
 +
 +
 +
&nbsp;If you have a specific list of genes you want sequence data for, you can upload a file of IDs. <br> e.g. WBGene IDs file format is:
 +
 +
WBGene00000001<br>WBGene00000002<br>WBGene00000003
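
Before uploading, it can save a round trip to check that every line of the file really is a WBGene ID. The Python sketch below does this under one assumption inferred from the examples above: that an ID is "WBGene" followed by exactly eight digits. `check_id_file` is an illustrative name, not a WormBase utility.

```python
import re

# Assumed ID shape, inferred from the examples: "WBGene" plus eight digits.
WBGENE_ID = re.compile(r"WBGene\d{8}")


def check_id_file(text):
    """Return (valid_ids, rejected_lines) for a one-ID-per-line file."""
    valid, rejected = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if WBGENE_ID.fullmatch(entry):
            valid.append(entry)
        else:
            rejected.append(entry)
    return valid, rejected


valid, rejected = check_id_file(
    "WBGene00000001\nWBGene00000002\nWBGene00000003\nace-4\n"
)
print(len(valid), "valid;", "rejected:", rejected)
```

Here the CGC name `ace-4` is rejected because the file is meant to hold WBGene IDs only; a file of CGC or sequence names would need a different check.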
 +
 +
Go back to your WormMart session and on the filters tab select ([Annotation] IDs of Type:) and upload your file.<br>
 +
 +
WormBase will provide a pre-computed file under the sequence directory on the ftp site:
 +
 +
ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/sequences/dna/
 +
 +
 +
== Which GFF source and feature (method) should I use? ==
 +
The terms feature and method are used interchangeably.
 +
[[GFF_source_methods]]
 +
 +
 +
 +
 +
[[Category:User Guide]]

Latest revision as of 09:22, 19 February 2013

Warning: Most of this page's content has been transferred to the FAQ pages on the new www.wormbase.org site.



General Questions

How should I cite WormBase?

see Citing and Acknowledging WormBase

What database technology are you using for WormBase?

1) At the back end, WormBase data are deposited in an object-oriented database, ACeDB, which is the "master" database containing all data. ACeDB can be accessed both remotely and locally, through both commandline and web server.

2) Some data (especially sequence data, including genomic sequence, ESTs, OSTs, SNPs, genes, RNAs, etc.) are extracted from ACeDB and deposited in a "slave" MySQL database to support some key features like gbrowse (see below).

3) At the front end sits the Apache server with mod_perl. The WormBase software package, containing configuration files and a series of CGI scripts, runs on the Apache server. The CGI scripts provide users with a number of ways to browse and search WormBase.

4) Some key features of the WormBase package: i. gbrowse (http://www.wormbase.org/db/seq/gbrowse?source=wormbase): developed by Lincoln Stein for the GMOD consortium and widely used for other model organisms. It allows users to browse through the whole genome for feature tracks corresponding to specific genome regions. gbrowse is highly configurable and supports multiple languages. ii. synteny browser (http://www.wormbase.org/db/seq/ebsyn?name=CBG22984): recently developed by Lincoln Stein for the GMOD consortium as well. It allows a comparative view of two genomes side by side, focusing on the syntenic regions.

How are the WormBase entries created and maintained?

There is no simple answer to that. WormBase has a team of about 30 people who generate and curate data in many different ways. The genome sequence of C. elegans was determined at two of the four WormBase groups, and so a lot of data pertaining to gene predictions and other features annotated on to the genome are created and maintained by those groups.

The group at Caltech do a lot of literature curation and extract all sorts of information from the published literature (from hand-curated descriptions of gene function to details of individual RNAi experiments).

Also a lot of data comes from 3rd party collaborators who submit bulk datasets direct to WormBase (e.g. Orfeome data, 'knockout' deletion alleles). In contrast we also get directly submitted data from users at a very small level, e.g. individual allele submissions.

Finally, we also generate data de novo as part of the database build procedure, e.g. calculating molecular weights of proteins.

Can you give me medical advice on how to deal with infectious or parasitic worms?

Unfortunately, no; WormBase is specifically dedicated to the biology of Caenorhabditis elegans, is staffed by Ph.D.s rather than physicians, and would not lawfully be able to provide medical advice over the Internet even if it were an M.D.-staffed database oriented towards pathogenic worms. Please consult your local physician for all medical advice.

Gene Summary Page questions

What does the "% length" mean in the Best Blast Hits table?

BLAST queries can have matches with multiple regions on the same hit. WormBase attempts to reconcile this information and present a value which represents the extent of coverage of all matches on the target sequence.

-- Tharris 15:13, 9 February 2006 (EST)

What do the different gene model Status lines (e.g. confirmed) shown on the Gene Page and in the GFF files mean?

confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.

confirmed_est - an intron confirmed by EST transcript sequence data

confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data

confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.

other confirmed_* types that a curator can add are:

confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.

confirmed_Homology - although I don't think this has ever been used.

Fetching Sequences

I'd like to fetch the DNA sequence of (some feature or coordinates). How do I go about this?

WormBase offers many ways to fetch sequences of features.

  • Using the Genome Browser
  1. Enter the desired coordinates into the Genome Browser using the format "CHROMOSOME:START..STOP" (eg. X:10..1000). If you don't know the coordinates of the feature of interest, just search for the feature itself.
  2. Select "Display Decorated FASTA File" from the "Reports & Analysis" pop-up menu. Click Go to retrieve the sequence of the region. You can also specify optional formatting of features contained within the sequence by clicking "configure...".
  3. HINT: You can adjust the coordinates of the segment to be retrieved manually or by zooming in or out.

-- Tharris 11:15, 16 February 2006 (EST)

  • Using the Genome Browser for a batch of sequences
  1. On the Genome Browser page, under "Reports & Analysis:", select "Download Sequence File" from the pull-down list; click on "Configure"; paste your coordinates (e.g. X:10..1000, one per line) into the "Sequence IDs" box; select the output options you want; then hit Go and enjoy.

--Raymond 13:01, 14 July 2009 (EDT)

  • Using WormMart

Click Here for some example WormMart Queries.

-- Tharris 11:15, 16 February 2006 (EST)
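
The "CHROMOSOME:START..STOP" coordinate format used in the Genome Browser steps above is easy to check or generate with a script before pasting a long list into the "Sequence IDs" box. The following Python sketch is one way to do it. The function names are illustrative, and the pattern only assumes what the example (X:10..1000) shows: a name, a colon, and two integers separated by two dots.

```python
import re

# Pattern for "CHROMOSOME:START..STOP", e.g. "X:10..1000".
REGION = re.compile(r"([^:\s]+):(\d+)\.\.(\d+)")


def parse_region(text):
    """Return (chromosome, start, stop) for a string like 'X:10..1000'."""
    match = REGION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not in CHROMOSOME:START..STOP format: {text!r}")
    chromosome, start, stop = match.groups()
    return chromosome, int(start), int(stop)


def parse_region_list(block):
    """Parse one region per line, skipping blank lines."""
    return [parse_region(line) for line in block.splitlines() if line.strip()]


print(parse_region("X:10..1000"))  # ('X', 10, 1000)
```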

How can I download all the [3' UTR] sequences from the C. elegans genome?

The best way to download all sequences (for example the 3' UTR sequences) is through WormMart.

Here are the steps: Go to:

http://www.wormbase.org/biomart/martview

- Select the most recent version of WormBase and then select the 'Gene' dataset.  Hit 'next'.
- In the identification section on the next page, check the box to Limit to Gene IDs of Type:
  and select the appropriate gene identifier for how the genes are represented in your list,
  i.e. CGC names such as pal-1, sequence names such as R12B2.4, or the WBGene IDs, such
  as WBGene00000200.  Hit 'next'.
- At the top of the next page, select 'Gene, CDS and Protein Sequences' from the attribute page menu.
- Let the page reload and then select '3' UTR' from the sequence type menu.  Hit 'export'.

Where can I get repeatmasked genomic sequences for C. elegans, C. briggsae, or C. remanei?

How are the repeats determined?

For C. elegans, using the current (17 July 2007) most recent archival release of the database, you can get repeatmasked chromosomal sequences here:

   ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_I_masked.dna.gz
   ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_II_masked.dna.gz
   ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_III_masked.dna.gz
   ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_IV_masked.dna.gz
   ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_V_masked.dna.gz
   ftp://ftp.sanger.ac.uk/pub2/wormbase/WS170/CHROMOSOMES/CHROMOSOME_X_masked.dna.gz

You will need to uncompress the relevant files with "gunzip CHROMOSOME_*_masked.dna.gz" or a similar command.

The C. briggsae repeatmasked genomic sequence is here, for assembly cb25.agp8:

   ftp://ftp.wormbase.org/pub/wormbase/genomes/briggsae/sequences/dna/cb25.agp8.supercontigs.masked.fa.gz

At some point there should be a repeatmasked version of the cb3 assembly, but as of 17 July 2007 there isn't yet.

C. remanei is here:

   ftp://ftp.wormbase.org/pub/wormbase/genomes/remanei/preliminary_analysis/assembly/C_remanei_masked.tar.gz

Note that in all of these cases, the sequences are hardmasked: i.e., the repeat sequences have been replaced by stretches of "N" residues, instead of being marked in some less information-destroying way. By contrast, softmasked sequences would keep the repeat sequences but distinguish them by changing their case: non-repeat sequences would be UPPERCASE, while repeat sequences embedded between the non-repeat sequences would be lowercase.
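
To make the difference between the two masking styles concrete, here is a small sketch. The function names are invented for this example. It converts a softmasked sequence to a hardmasked one and reports how much of a sequence is masked.

```python
def hardmask(softmasked):
    """Replace lowercase (repeat) bases with N, leaving uppercase bases untouched."""
    return "".join("N" if base.islower() else base for base in softmasked)


def masked_fraction(sequence):
    """Fraction of bases that are hardmasked (N) or softmasked (lowercase)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base == "N" or base.islower())
    return masked / len(sequence)
```

Note that the conversion only works in one direction: once the repeats have been replaced by "N", the original bases cannot be recovered from the hardmasked file.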

How do I get the sequence of a Brugia malayi protein?

For instance, sna-1 is annotated as being orthologous to B. malayi 13258.m00169, and paralogous to B. malayi 14704.m00455; but where do I go to get these sequences?

For the time being, the best method for getting B. malayi sequences quickly (and without having to download the entire predicted B. malayi proteome by 'licensed FTP') is to do a BlastP search, against the "BMA1_pep" protein sequence set, on TIGR's B. malayi Blast server at:

http://tigrblast.tigr.org/er-blast/index.cgi?project=bma1

A successful BlastP search will give a report that has hypertext links to individual protein sequences such as 13258.m00169 and 14704.m00455.

This workaround should stop being needed when TIGR and WormBase have worked out some agreement for WormBase to present individual protein sequences through WormBase's own FTP site and interface; but, as of 20 July 2007, this is the best we can (legally) do.

Another option would be to do a bulk download of the relevant sequence data from TIGR itself. See TIGR's data release policy at http://www.tigr.org/tdb/e2k1/bma1/ for more details.

To give you a glimpse of things to come, Brugia malayi data imported from GenBank, which will soon appear on the main site, can be found at:

  • cds sequences (fasta) based on WS187
  • GFF3 genome annotation based on WS187
  • Protein Set based on translations of the GenBank submissions (first imported for WS185)
  • GBrowse of the updated data is soon to come(tm)
  • orthology predictions of Brugia malayi are included in the TreeView pages of C.elegans / C.briggsae and C.remanei genes starting with WS187

Gene structures and gene predictions

I think there should be a gene at the end of clone 'X' but WormBase doesn't show any genes in this region. Why not?

A full and complete description of all C. elegans genes is not known (and may not accurately be known for many years). WormBase attempts to represent all genes that have good experimental evidence plus a number of genes which have less experimental evidence but which were generated using gene finding software. If there are any publicly available transcript data (EST, mRNA etc.) then WormBase should nearly always have attempted to make a gene prediction in that region. However, many poorly expressed genes may not have any transcript evidence and so may not be represented in WormBase at this time. Please help us by letting us know if you have any evidence for a gene that is currently not displayed in WormBase. Aside from transcript evidence (which we always encourage people to submit to GenBank/EMBL/DDBJ), a strong case can be made for creating a new gene if there is good conservation with other species (particularly C. briggsae or C. remanei) and if there are other supporting data (such as a positive RNAi phenotype).

Please also note that your gene may be there but may not be represented in the standard set of tracks in the Genome Browser. Check alternative gene predictions by turning on tracks for the 'GeneFinder' and 'Twinscan' gene predictions. Also consider turning on the 'Obsolete gene models' track as the gene may have existed in WormBase in the past but may have been removed.

-- Kbradnam 17:47, 12 February 2006 (EST)

I have found experimentally that the transcript of gene X is different from the gene model reported in WormBase. We would therefore like to update the gene model on WormBase.


Please send the new transcript sequence with a brief description of the required gene model change to wormbase-help@wormbase.org and a curator will make the appropriate change.

Please also submit your sequence to the EMBL/GenBank/DDBJ database. This helps to confirm, and provides evidence for, the WormBase gene prediction, as we routinely retrieve sequence data from the public databases. It also makes the data public, allowing appropriate reference and acknowledgement to yourself.

-- Gw3 09:22, 17 August 2007 (GMT)

What criteria does WormBase use to classify a gene as a Pseudogene?

See entry in Glossary_of_terms#P

I'd like to create a diagram of a genomic region similar to that shown on the Genome Browser. How can I do this?

One approach is to write your own scripts in Perl using the Bio::Graphics modules that are part of BioPerl. A second approach is to use the web interface to this software.

In the gff files I see confirmed_inconsistent and other confirmed statuses for introns, what do they all mean?

confirmed_est - an intron confirmed by EST transcript sequence data

confirmed_cdna - an intron confirmed by cDNA/mRNA transcript sequence data

confirmed_inconsistent - means that a curator has decided that the intron doesn't fit with what we consider to be a valid transcript or there appears to be something wrong with the Transcript that confirms the intron.

confirmed_false - these are where curators have confirmed that the confirmed intron is false or artefactual.

confirmed_UTR - used when a confirmed intron looks like it is in the UTR of a gene.

confirmed_Homology - where protein homology looks to confirm the intron, this has seen limited use.

What are seg, signalp and tmhmm motifs on the Protein report page?

  • seg - low complexity regions e.g. homopolymer runs - explanation

Old database releases

How do I remap the chromosomal coordinates between releases?

There is a page describing a perl script and the data to change the coordinates of GFF files here

Gene Model Naming

What do all the different gene names mean?

  • All genes have a corresponding sequence name, which are derived from the cosmid, fosmid or YAC clone on which they reside.
For instance the gene bli-4 has a sequence name of K04F10.4, indicating it was identified when the cosmid K04F10 was sequenced and annotated, and there are at least 3 other genes associated with that cosmid.
  • Any gene can code for multiple proteins (CDS) as a result of alternative splicing. In the case of bli-4 there are 6 known isoforms, called K04F10.4a, K04F10.4b, ..... K04F10.4f.
  • The corresponding transcripts for the isoforms are called K04F10.4a.1, K04F10.4b.1, ..... K04F10.4f.1
  • However if there is alternative splicing in the UTRs, which doesn't change the protein sequence, the alternatively-spliced transcripts are named K04F10.4a.1 and K04F10.4a.2.
  • ... and if there are no isoforms of the coding gene, for example AC3.5, but there is alternative splicing in the UTRs, there will be multiple transcripts named AC3.5.1 and AC3.5.2, etc.
  • However, if there are no alternate UTR transcripts, the single coding_transcript is named the same as the CDS and does not have the .1 appended, as in the case of K04F10.4f.
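
The naming rules above can be sketched as a small parser. This is only an illustration: the function name and the keys of the returned dictionary are invented here, and the pattern covers just the forms shown in the examples above (clone.number, an optional isoform letter, and an optional transcript number).

```python
import re

# clone "." gene number, optional isoform letter, optional "." transcript number
NAME_RE = re.compile(
    r"^(?P<clone>[A-Za-z0-9_]+)\.(?P<number>\d+)"
    r"(?P<isoform>[a-z])?(?:\.(?P<transcript>\d+))?$"
)


def parse_sequence_name(name):
    """Break a name such as K04F10.4a.1 into its parts."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised sequence name: {name!r}")
    gene = f"{match['clone']}.{match['number']}"
    isoform = match["isoform"]
    transcript = match["transcript"]
    return {
        "clone": match["clone"],
        "gene": gene,                     # e.g. K04F10.4
        "cds": gene + (isoform or ""),    # e.g. K04F10.4a
        "isoform": isoform,               # None when the gene has no isoforms
        "transcript_number": int(transcript) if transcript else None,
    }
```

For K04F10.4f the transcript number comes back as None, matching the last rule above: the single transcript carries the same name as the CDS.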

C. remanei

state of C.remanei integration

Interpolated Map Positions

For genes without known genetic map positions, we calculate a theoretical position by linear interpolation between the surrounding genetic markers. A gene is considered a genetic marker if:

  • it has a physical map position
  • it has a CGC name
  • it has a genetic map position (experimental or promoted)

Promoted map positions are made for genes that fulfil all other genetic marker requirements and were interpolated during previous builds. The logical order of the genetic markers is checked and curated during the build process to accommodate new experimental data as well as changes in the genomic sequence.
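
As an illustration of what linear interpolation between flanking markers means, here is a minimal sketch. It is not the WormBase build code: the function name and the marker values in the usage note are made up, and it assumes the markers are given as (physical position in bp, genetic position in cM) pairs sorted by physical position.

```python
from bisect import bisect_left


def interpolate_position(physical_bp, markers):
    """Estimate a genetic position (cM) from a physical position (bp).

    markers: list of (physical_bp, genetic_cM) pairs, sorted by physical_bp.
    """
    positions = [bp for bp, _ in markers]
    index = bisect_left(positions, physical_bp)
    # A gene sitting exactly on a marker takes that marker's position.
    if index < len(markers) and positions[index] == physical_bp:
        return markers[index][1]
    if index == 0 or index == len(markers):
        raise ValueError("position is not flanked by markers on both sides")
    (bp_left, cm_left), (bp_right, cm_right) = markers[index - 1], markers[index]
    fraction = (physical_bp - bp_left) / (bp_right - bp_left)
    return cm_left + fraction * (cm_right - cm_left)
```

With two hypothetical markers at (1000 bp, 0.0 cM) and (2000 bp, 1.0 cM), a gene at 1500 bp is placed halfway between them, at 0.5 cM.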

Obtaining WormBook articles

WormBook articles are available directly from the WormBook website, the development site, the Sanger WormBook mirror, as well as via Textpresso searches in WormBase.

The entire WormBook may also be downloaded as a zip archive (~400 Mb).

How do I get a list of genes with alternatively spliced isoforms?

Go to the WormBase Query Language Search page and enter the query 'find CDS; Isoform; follow Gene '

If you want to do further investigation of this group of genes set the output to 'Text' and copy the list to use in WormMart

How do I get a list of coding gene transcripts together with the supporting cDNA evidence for them?

Go to the WormBase Query Language Search page and enter the query:

'select l, cdna from l in class Transcript where l->Method = "Coding_transcript", cdna in l->Matching_cDNA '


What are WormBase's precomputed BLAST parameters?

These parameters refer to the precomputed BLAST results shown on the gene and protein report pages.

For BLASTP and BLASTX we use WU-BLAST2.0 with the following parameters

Z=10000000 (sets size of database in letters)
V=1000 (sets the number of one line summaries)
B=1000000 (sets number of database hits to report)
E=0.1 (E from the Karlin-Altschul equation - will not report hits with E-value greater than this)
cpus=1 (sets number of processors to use)
hitdist=40 (Max distance between word hits for 2-word seeding algorithm)

No low-complexity filtering is done for BLASTP. DNA sequences are masked with RepeatMasker and TRF before BLASTing.
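
If you want to run a comparable search yourself, the parameters listed above can be collected into an argument list. This is a sketch under stated assumptions: the helper name is invented, the database and query file names are placeholders, and the argument order (program, database, query file, then key=value options) is an assumption about the WU-BLAST command line that you should check against your own installation.

```python
# Parameter values copied from the list above.
PRECOMPUTED_BLAST_PARAMS = {
    "Z": "10000000",
    "V": "1000",
    "B": "1000000",
    "E": "0.1",
    "cpus": "1",
    "hitdist": "40",
}


def wublast_args(program, database, query_file, params=PRECOMPUTED_BLAST_PARAMS):
    """Return an argument list: program, database, query, then key=value options.

    The positional order is an assumption, not taken from the text above.
    """
    options = [f"{key}={value}" for key, value in params.items()]
    return [program, database, query_file] + options
```

The resulting list could be passed to subprocess.run from the Python standard library. The helper itself does not run anything.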

About Cell and Anatomy

How do I find out cell lineage pedigrees?

There are two kinds of pedigree display: the Cell pedigree tree (located on the Cell Page) and the Lineage pedigree tree (located on the Pedigree Browser). The Cell Page is simple and easy to use, with a full description of the cell layout, while the advantage of the Pedigree Browser is that it displays complete lineage pathways (from P0) with the cell(s) of interest highlighted.

Start from the Search on the WormBase home page: select "Cell" from the pull-down menu and enter the cell name. A "cell summary" display will appear with a Cell pedigree display box showing three generations of cells. Your cell will appear red on the pedigree. You can move the pedigree tree up or down in the lineage by clicking on the parent cell or daughter cells. Another way to access the pedigree is from "Cell and Pedigree Search" (under the More Searches menu), which searches for specific cells, cell groups, or lineages.

What's the nomenclature for C. elegans cells?

There is a very good article explaining everything about embryonic cell lineage and nomenclature:

Sulston JE et al (1983) Dev Biol. "The embryonic cell lineage of the nematode C. elegans."

That article is the "dictionary" everyone refers to.

P0 is the founder cell for C. elegans. It is the zygote after fertilization. The first few rounds of divisions produce six "founder cells": E, MS, AB, C, D and P4. Each of these founder cells generates different tissues. From then on, cells are named after these founder cells. For example, the daughters of AB are called ABa ('a' means anterior) and ABp ('p' means posterior). ABa will generate daughters ABal ('l' means left) and ABar ('r' means right)... If a cell divides dorsal-ventrally, 'd' or 'v' will be added to the names of the daughters.

Now you know when you see ABalppp , it comes from:

P0->AB->ABa->ABal->ABalp->ABalpp->ABalppp

Not only will you see the lineage pathway from the cell name, you will also see in which direction cells have divided and what the sister cells are for each step of the division.
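
The way a lineage name encodes its own history can be sketched in a few lines. The function names below are invented for this example. The code starts at the founder cell and does not model the early divisions between P0 and the founder cells.

```python
# Longer founder names are tried first.
FOUNDERS = ("AB", "MS", "P4", "E", "C", "D")
DIRECTIONS = {
    "a": "anterior", "p": "posterior",
    "l": "left", "r": "right",
    "d": "dorsal", "v": "ventral",
}


def split_lineage_name(cell_name):
    """Return (founder, division letters) for a name such as ABalppp or AB.plappapp."""
    name = cell_name.replace(".", "")
    for founder in FOUNDERS:
        if name.startswith(founder):
            divisions = name[len(founder):]
            unknown = set(divisions) - set(DIRECTIONS)
            if unknown:
                raise ValueError(f"unexpected letters in {cell_name!r}: {sorted(unknown)}")
            return founder, divisions
    raise ValueError(f"{cell_name!r} does not start with a founder cell name")


def lineage_path(cell_name):
    """List the founder cell and every ancestor down to the cell itself."""
    founder, divisions = split_lineage_name(cell_name)
    return [founder + divisions[:count] for count in range(len(divisions) + 1)]


def division_directions(cell_name):
    """Spell out the direction of each division after the founder cell."""
    _, divisions = split_lineage_name(cell_name)
    return [DIRECTIONS[letter] for letter in divisions]
```

For ABalppp, lineage_path gives AB, ABa, ABal, ABalp, ABalpp, ABalppp, which is the pathway shown above once P0 is put in front, and division_directions gives anterior, left, posterior, posterior, posterior.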

How can I know each C. elegans cell's function and exactly at which stage of the embryonic lineage it appears?

Most of the information you need for a cell should be contained on Cell Report, which can be located by "Cell and Pedigree" search. In WormBase, if you read the Tree Display of a Cell Report, there is a tag called "Embryo_division_time"; it is the time when the cell divides or dies. Unfortunately, for cells generated after hatch, there is no such information in WormBase.

What is the connection between the cell P0, and the cells P1, P2, P3, ..., P7, P8, etc?

There are two sets of P cells. One set arises from early embryonic divisions and is called P0, P1', P2', P3' ... in WormBase; these are the lineage names. The other set is called P1, P2, P3, ... These are postembryonic blast cells, which are not related to the embryonic founder cells.

P1, P2, P3, ... are adult names for postembryonic blast cells present from hatching until the middle of the first larval stage (L1). A lot of cells have two names: a lineage name and an adult name. An adult name is the name people give to cells that become terminal and differentiate (such as neurons), or that do not differentiate but divide to give an important lineage (such as the P1, P2, ... lineages). Adult names are given by cell position and function, so it is a different naming system. Cells with the same adult name can come from different lineages depending on how bilateral symmetry is broken; for example, P7 can develop from either AB.plappapp or AB.prappapp.

Lineage names are accurate and unique, but hard for most people to remember, so researchers usually use adult names when running queries. That is why, in WormBase cell nomenclature, whenever an adult name exists we use it to name the cell and store its lineage name inside a data field.

How to get all the cell types (neurons, actually) in which a gene is expressed?

When we curate a gene, we enter all the cells and cell groups that express the gene. This information can be easily viewed by clicking the "details" button at the gene page. For example, if you search for eat-16, which is expressed in neurons:

  1. At the WormBase home page, select "Any gene", search for "eat-16" and select "Exact Match". This will take you to the Gene Summary page for eat-16.
  2. In the Function section, you will see "Anatomic Expression Pattern", with some information about the eat-16 expression pattern. At the very end of the entry, you will see a link "Details".
  3. Click this link to go to the Expression Pattern page for eat-16. On this page you will see the detailed cell and cell group information associated with eat-16. (To keep annotation easy, when a gene is expressed in lots of cells, we enter the cell group name instead of all the cell names one by one. Each cell group includes the list of its associated cells.)

Is there a file showing the lineage map of the worm?

Leon Avery has something like that on the Web: http://elegans.swmed.edu/parts/

About Orthologs and Homologs

How do I find the ortholog / paralog / etc. of gene X?

We do not explicitly make ortholog assignments in WormBase. This is a non-trivial task and something that we leave to external experts whose results we try to make available. There are several sources that may be useful in WormBase. NCBI COGs, InParanoid and TreeFam are all programs that attempt to predict orthologous relationships. InParanoid and TreeFam are visible from the gene pages (see the cdk-1 page for an example). The COGs are found on the respective protein page.

inparanoid

KOGs

TreeFam


There are also the precomputed BLAST results that are summarised on the gene pages. Each release we also produce a file of best blastp hits for each worm protein which can be found on the FTP site called best_blastp_hits.WSXXX.gz

In addition, since WS164 we have included predicted orthologue assignments based on Ensembl COMPARA, which predicts orthology of the longest isoform based on homology as well as conserved gene order.

You can run this prepared query in WormMart for compara orthologs

How do I get a list of all C. elegans orthologs of H. sapiens disease genes?

One possible solution is to use EnsMart from Ensembl to query the EnsEMBL databases, which include C.elegans. Go to BioMart, pick Caenorhabditis elegans (homology), and set C.elegans and H.sapiens as the species. Select orthologues_gene and filter as needed by the different types of orthology (one2one and/or one2many). Then pick H.sapiens as the second dataset and select "associated with disease". That should return a list of all orthologues of human disease genes. The only possible problem is that, due to the different release cycles of EnsEMBL and WormBase, the EnsEMBL data might be slightly out of date (the CELXXX on the EnsEMBL pages refers to the corresponding WSXXX WormBase release).

How can I retrieve nematode specific genes with no homology to yeast, fly, mouse, and human?

/// out dated (Raymond 20080214) /// From the "advanced query" WormBase page, construct the following query:

find predicted_gene NOT Pep_homol

What is the meaning of several abbreviations for proteins that are used by WormBase, like "Protein SW"?

Or "Protein TR" and "Protein WP"? In addition, using the TR Database, sometimes the species origin (e.g., C. elegans) is missing - how can I find out? Furthermore, how can I get from a TR Database entry to the corresponding predicted gene in the C. elegans genome?

SW stands for Swiss-Prot, TR stands for TrEMBL and WP stands for WormPep. In case you're not familiar with any of these protein databases, you can go to http://www.expasy.org/sprot/ and http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ for an explanation and access to them.

Inside Protein SW or Protein TR, you may find the accession number of Swiss-Prot or TrEMBL. You can get all details of the protein (including species origin) by going to http://www.expasy.org/sprot/ and entering the accession numbers.

What do those colorful bars for C. briggsae alignments mean?

Dark blue bars are regions of strong similarity. Light blue bars are regions of weak similarity. Dashed areas don't match.

When there are multiple bars in the same region, it means that there are several C. briggsae clones that all match the region.

How can I retrieve the best blast_p scored homologies of worm genes?

(I.e., the homologies produced automatically with each WormBase build -- roughly every 2 weeks?)

a. Go to the WormBase ftp site by following the "Bulk Downloads" link in the "Web Site Directory" section of the WormBase homepage, or by entering the URL ftp://ftp.sanger.ac.uk/pub/wormbase in your browser. Select the most current WormBase release (e.g. WS130).

b. Download the two best blastp files in this folder: best_blastp_hits.WS130.gz (elegans homologies) and best_blastp_hits_brigpep.WS130.gz (briggsae homologies).

c. Unpack the compressed files using suitable software, e.g. gunzip (Linux).

d. The files have 15 columns delimited by commas. The contents of the columns are as follows:

   Column 1 : Wormbase peptide accession number for elegans peptide

   Column 2 : Wormbase peptide accession number for highest homology elegans peptide

   Column 3 : e value for best elegans peptide/worm peptide hit

   Column 4 : Ensembl accession number for highest homology Ensembl sequence

   Column 5 : e value for best elegans peptide/Ensembl sequence hit

   Column 6 : Wormbase peptide accession number for highest homology briggsae peptide

   Column 7 : e value for best elegans peptide/briggsae peptide hit

   Column 8 : Flybase accession number for highest homology fly protein

   Column 9 : e value for best elegans peptide/fly protein hit

   Column 10: Saccharomyces Genome Database accession number from highest homology yeast protein

   Column 11: e value for best elegans peptide/yeast protein hit

   Column 12: Swissprot/Uniprot name from highest homology sequence

   Column 13: e value for best elegans peptide/swissprot sequence hit

   Column 14: TrEMBL accession number from highest homology sequence

   Column 15: e value for best elegans peptide/TrEMBL sequence hit

e. You might also want a file that maps WormBase peptide accession numbers to the corresponding Gene in WormBase (warning: a single gene may correspond to multiple peptides). For this you will have to perform an AQL query on WormBase:

- On the banner at the top of the WormBase homepage, select "Searches".
- Select the top search from the resulting list, "Acedb Searches(AQL)".
- Copy and paste the following text into the search text box:

   select a, a->Cgc_name, c from a in class Gene, c in a->Molecular_name where c like "CE*" order by :1 asc

- Choose the "Text output" radio button and click Query ACeDB (the search may take a few minutes).
- The resulting file contains a tab-delimited mapping of WormBase gene accession numbers to the CGC-approved name for that gene (if it has one) and to the peptide accession number for that gene. Save the results file to your hard drive.
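
Once unpacked, the best blastp file can be read with a few lines of script. The sketch below assumes the 15 comma-delimited columns described in step d. The function name and the field names are invented here to mirror that description and are not part of the file itself.

```python
import csv

# One made-up field name per column, in the order given in step d.
COLUMNS = [
    "elegans_peptide",
    "best_elegans_hit", "best_elegans_evalue",
    "best_ensembl_hit", "best_ensembl_evalue",
    "best_briggsae_hit", "best_briggsae_evalue",
    "best_fly_hit", "best_fly_evalue",
    "best_yeast_hit", "best_yeast_evalue",
    "best_swissprot_hit", "best_swissprot_evalue",
    "best_trembl_hit", "best_trembl_evalue",
]


def read_best_hits(lines):
    """Yield one dictionary per non-empty line of the best blastp hits file."""
    for row in csv.reader(lines):
        if not row:
            continue
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, found {len(row)}")
        yield dict(zip(COLUMNS, (field.strip() for field in row)))
```

Any iterable of text lines will do as input. To read the compressed file directly, gzip.open(path, "rt") from the Python standard library returns a suitable file object.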

How can I download the C. elegans-human gene homology map?

You can download a file that lists the best blastp match to human, fly, yeast, C. briggsae, SwissProt, and TrEMBL proteins for every C. elegans protein from the WormBase ftp site:

ftp://ftp.sanger.ac.uk/pub/wormbase/current_release

The file name is best_blastp_hits.WSXXX.gz where XXX is the release number.

How can I download C. elegans-C.briggsae orthologs and their protein-coding DNA sequences?

One possible way to retrieve those would be to download a C. elegans-C.briggsae ortholog file:

ftp://ftp.wormbase.org/pub/wormbase/briggsae/supporting_data_stein_2003/orthologs_and_orphans/orthologs.txt

and C. briggsae gene sequences in fasta format (briggenes.fa.gz):

ftp://ftp.wormbase.org/pub/wormbase/briggsae/

and write a script that would parse C. briggsae ortholog sequences based on C. elegans gene names.
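
A script of that kind could start from something like the sketch below. It only covers the FASTA side: it assumes you have already extracted the C. briggsae identifiers you want from the ortholog file (whose layout is not described here), and that each identifier is the first word on a FASTA header line. The function names are made up for this example.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs; the identifier is the first word after '>'."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def select_sequences(lines, wanted_ids):
    """Return {identifier: sequence} for the records whose identifier is wanted."""
    wanted = set(wanted_ids)
    return {name: seq for name, seq in read_fasta(lines) if name in wanted}
```

The input can be any iterable of text lines, such as a file opened with gzip.open(path, "rt") for the compressed sequence file.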

Another way would be to use WormMart to get a list of genes with orthologs (filter by Homolog/Ortholog -> Homolog[Compara Ortholog]). In the Attribute part you can select whether you want the sequences or just a table of orthologs.

About the User Interface

What's the difference between the sequence displayed in "Sequence Report" and that in "Genome Browser"?

(For example: Sequence F35E8) The coordinates given in the Sequence Report under 'Genomic Location' are for the sequence F35E8, which is not the full sequence of the clone F35E8. The clone is represented under the diagram of the sequence features and has an arrow pointing off the left end, indicating that the clone extends to the left.

When you click through to the Genome Browser, you're seeing the sequence of F35E8 with the clone again represented under the diagram of sequence features with an arrow pointing left. You have to zoom out to see the full extent of the clone F35E8.

About Nomenclature

How can I register a new lab?

New lab and allele designations should be registered directly with Jonathan Hodgkin (jah@bioch.ox.ac.uk), WormBase genetic nomenclature coordinator.

About Reagents (such as cosmid clones)

What are the different types of clones in WormBase?

1. There are seven different types of "Clone" objects in WormBase:

Type                  Nomenclature
Cosmid                A*, B*, C*, D*, F*, J*, K*, M*, R*, T*, W*, Z*
Fosmid                H*, WRM*
YAC                   Y*
cDNA                  yk*, EC*, EB*, OST*, CK*, EF*, CEE*, CEM*, CES*, CB*, CN*, cm*
Plasmid: PCR clones   V*, EGAP*
Other                 telo clones, 1 BAC, plasmid


Most cosmids, fosmids, YACs can be requested from Sanger, cDNA (yk*) from Dr. Yuji Kohara. The EGAP* plasmids can be obtained from MRC Geneservice. The V* plasmids are no longer available.

Whom could I contact about getting a cDNA clone?

All of the cDNA clones with a yk prefix can be ordered by the following method. All other cDNA clones will have to be requested from the submitting  party (found by looking at the EMBL/GenBank entry.)

Please go to NextDB(http://nematode.lab.nig.ac.jp), Yuji Kohara cDNA database repository. You can obtain cDNA clones from Yuji Kohara at the National Institute of Genetics, Mishima, Japan: ykohara@LAB.nig.ac.jp

Useful Clone info

1) How do I find out about the vectors used in the genome sequencing project?

If you want the actual sequences of the vectors used, they are on the Sanger FTP site, and this should help you identify the vector for the clone you are interested in.

How do I order C. elegans cosmids, fosmids and YACs?

Cosmids and Fosmids are available to the community via these routes:

1) How do I obtain C.elegans cosmids/Yacs?

Information can be found here: Cosmids/YACs

2) How do I obtain C.elegans fosmid from the Moerman fosmid library?

Information can be found here: Fosmids

3) How do I obtain C. elegans fosmid from the Incyte Genomics Inc. fosmid library?

Information can be found Here

How do I view alternate C. elegans clones that contain my gene? Similar to a physical map containing sequences and un-sequenced clones from the Sequencing project.

There is a view available on the website that displays the ACeDB graphical display for the physical map.

It is available from the clone summary page for the clone that your gene resides on.

Long Example:

  • I'm interested in lgc-4 (WBGene00017580)
  • Finally we are on the Clone report page.
  • This takes you to the view you require (The highlighted clones in yellow and red are the "golden path" the rest are unsequenced clones not used in the assembly).

or

Alternatively you can just input a url into your browser directly.

Example:

If you are interested in AH6.1 then this url would return a cosmid map view of the old cosmids, 
fosmids and yacs including un-sequenced clones.

http://www.wormbase.org/db/misc/epic?name=AH6;class=Clone


How can I find allele information?

We do have information on many thousands of alleles in WormBase. We have also tried to extract the molecular details of the mutations (where known) and add those to WormBase. Some examples:

  1. Go to a gene page: http://www.wormbase.org/db/gene/gene?name=unc-71;class=Locus
  2. Click on the link to the 'ay47' allele (near the bottom of the page). This takes you to: http://www.wormbase.org/db/gene/allele?name=ay47;class=Allele
  3. You can see that there is a 'c' to 't' substitution in this gene.
  4. Go to the genome browser display for this gene: http://www.wormbase.org/db/seq/gbrowse/wormbase?name=unc-71
  5. Turn on the 'SNPs, Knockouts, and other Alleles' track and you will see the positions of the alleles in this gene.

To find other alleles, you can go to the query page: http://www.wormbase.org/db/searches/wb_query and type the following queries (everything between the single quotes): 'Find Allele; Substitution' 'Find Allele; Deletion' 'Find Allele; Insertion'

I'm interested in the CB4858 pas* SNP data. Can I get a bulk download?

A complete dataset of pas snp data is available from Here

Explanation of dataset:

Substitution - the snp sits between Flank1 and Flank2 (gggtAtcg) and this makes up the N2 genomic hit. This is a 1bp feature as the snp is contained in the N2 genomic.

ID              Type            N2/CB   Chrom   Coordinate1     Coordinate2     Flank1             Flank2
pas10021        Substitution    A/G     V       19689782        19689782        -cut-aattttgggt    tcgaccttgaaa-cut-

Deletion - the snp sits between Flank1 and Flank2 (ttttCacacttt) and this makes up the N2 genomic hit. This is a 1bp feature as the snp is contained in the N2 genomic.

ID              Type            N2      Chrom   Coordinate1     Coordinate2     Flank1             Flank2
pas44643        Deletion        C       X       193667          193667          -cut-aaccattttt    acactttttggctta-cut-

Insertion - The insertion is in CB4858 so the 2 flanking sequences abut each other (accttaaaaaaaa) and so you get a 2bp feature as the N2 base to the left and right are marked up (Notice the pair of coordinates) In this case, CB4858 has an A between the relative N2 positions 116070 and 116071.

ID              Type            CB4858  Chrom   Coordinate1     Coordinate2     Flank1             Flank2
pas44644        Insertion       A       I       116070          116071          -cut-aactcaaaacctt aaaaaaaa-cut-
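
To check your reading of the three row types, the sketch below rebuilds the N2 sequence around each SNP from the two flanks. It assumes the eight whitespace-separated columns shown in the example rows and treats "-cut-" as a marker for trimmed flank sequence. The function names are invented for this example.

```python
CUT_MARKER = "-cut-"


def parse_pas_row(line):
    """Split one data row (not the header) into named fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    snp_id, kind, allele, chrom, start, stop, flank1, flank2 = fields
    return {
        "id": snp_id, "type": kind, "allele": allele, "chrom": chrom,
        "start": int(start), "stop": int(stop),
        "flank1": flank1.replace(CUT_MARKER, ""),
        "flank2": flank2.replace(CUT_MARKER, ""),
    }


def n2_context(row):
    """Rebuild the N2 sequence around the SNP, with the N2 base in uppercase."""
    if row["type"] == "Substitution":
        n2_base = row["allele"].split("/")[0]   # the column is N2/CB
    elif row["type"] == "Deletion":
        n2_base = row["allele"]                 # base present in N2 only
    elif row["type"] == "Insertion":
        n2_base = ""                            # inserted base is in CB4858 only
    else:
        raise ValueError(f"unknown type: {row['type']}")
    return row["flank1"] + n2_base.upper() + row["flank2"]
```

For pas10021 the rebuilt context contains gggtAtcg, and for the insertion pas44644 the two flanks simply abut, as described above.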

What's the easiest way for users to find the 'TRUE' ends, and thus the insert size, of a clone?

The set of clone ends is dumped as part of the gff files:

http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/GFF_files.shtml

This is the source for the extents displayed in WormBase.

The caveat with this is that the 'true' end is not marked up for all clones. The early cosmids do not have such annotations because nobody thought about marking them up. Later cosmids do have clone left and right ends as this became part of the standard procedure. Finally, many of the YACs do not have clone ends because the segment submitted to GenBank/EMBL is much smaller than the full clone, and hence the true ends lie within sequences already finished at that stage of the sequencing (i.e. we never went back to update clone ends in sequence already finished).

How can I FTP-download the genomic DNA database and the EST database for C. elegans?

Our underlying database for WormBase is built on the acedb software (available freely from www.acedb.org). If you have acedb installed locally, you can download the entirety of our database from: ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release

However, a simpler approach may be to just download a GFF file and DNA file for each chromosome from: ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release/CHROMOSOMES/

Where are the flat files for the gene annotation of each chromosome of C. elegans?

You should take a look at the Feature Tables (GFF), which you can pick up from the same 'WormBase Downloads' page where you found the "Summary Tables" (http://www.wormbase.org/downloads.html).

You should also look at the 'Batch Downloads' page at WormBase (http://www.wormbase.org/db/searches/info_dump), where you can build your own tables of gene annotations.

One other WormBase page you should look at is the "Genome Dumper" (http://www.wormbase.org/db/searches/advanced/dumper).

Where can I find details of the methods used to create wormpep?

We make WormPep during each release of WormBase and the starting point is always a translation of our latest set of gene predictions. Gene predictions are initially based on the GeneFinder prediction program with human modification as is deemed necessary. The level of human involvement really depends on what other supporting data is available. Aside from routine inspection of gene predictions based on EST/mRNA data we also evaluate our predictions based on information from published papers and direct contact from the worm community. All gene predictions have been looked at by a human to some level.

We have started to distinguish subsets of WormPep. Thus all WormPep proteins can be thought of as either 'CONFIRMED', 'PARTIALLY CONFIRMED', or 'PREDICTED'. The first set contains all genes where there is transcript evidence for every base of every exon of the gene (note that this can still - in theory - mean that there are unpredicted exons in a 'CONFIRMED' gene). The second set contains genes for which there is some transcript evidence but the whole gene is not yet supported...either due to lack of transcript evidence or errors in our current gene prediction. The last set is everything else, i.e. genes with no transcript support. In the future we may expand this classification system to take account of other evidence (e.g. homology info from C. briggsae).
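
The three categories can be illustrated with a short sketch. It is not the code used in the WormBase build: the function name is invented, and it assumes exons and transcript alignments are given as inclusive (start, end) coordinate pairs on the same sequence.

```python
def classify_gene(exons, transcript_alignments):
    """Classify a gene model by how much of its exons has transcript support.

    exons, transcript_alignments: lists of inclusive (start, end) pairs.
    """
    exon_bases = set()
    for start, end in exons:
        exon_bases.update(range(start, end + 1))
    supported = set()
    for start, end in transcript_alignments:
        supported.update(range(start, end + 1))
    covered = exon_bases & supported
    if not covered:
        return "PREDICTED"            # no transcript support at all
    if covered == exon_bases:
        return "CONFIRMED"            # every base of every exon is supported
    return "PARTIALLY CONFIRMED"      # some support, but not for the whole gene
```

Storing every base in a set keeps the sketch short. For whole-genome use an interval-based comparison would be more economical.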

Each new build usually sees a slight increase in the first two categories and a drop in the third category. The relevant status of each Wormpep entry is added into the FASTA header of every entry in each WormPep release.

I cannot find information for a gene mentioned in early microarray or RNAi studies: why?

In early versions of microarray and RNAi libraries, clone and gene names were often used synonymously. Because gene models and names change between versions of WormBase and the history has not been carefully preserved, this has caused much confusion. Fortunately, for clones whose sequence we have, we provide an up-to-date mapping from each clone to the current gene models. Sometimes, however, we do not have sufficient sequence information for a given clone and are thus unable to provide any information about its identity; in that case you must ask the primary generators of that clone (the corresponding authors of the publication) for more information. Below is an example of a 'lost' clone:

"what is the present location of Y41D4A_2491.a?"

Simple answer: there is none. A simple search for "Anything" "Y41D4A_2491.a" produces hits indicating that Y41D4A_2491.a was used as a clone/gene name in the Stanford microarray library. For sequence information, WormBase177 only has sequences for the oligos, not a PCR_product. The pair of oligo sequences (Oligo: sjj_Y41D4A_2491.a_b ; Oligo: sjj_Y41D4A_2491.a_f) fails to produce an ePCR product, and each oligo individually fails to map to the genome when searched with the Genome Browser oligo mapping tool.

We keep up-to-date mapping files here: ftp://caltech.wormbase.org/pub/annots/rnai/


I am trying to figure out what convention was used when the gene names were changed from a letter code to a number code (for example Y17G9B.a-i to Y17G9B.1-9).

This naming convention change occurred following the initial annotation phase back in the 90's. Genomic clones were originally submitted with cosmid.letter annotations prior to 1998 but this was changed to increase the depth of the nomenclature as some clones started to have more than 26 genes.

There are 3 approaches to identify the current gene that corresponds to an original letter code gene locus.

1) Search through the wormpep.history file within the Wormpep190.tar.gz archive found here.

If you look for your gene eg. Y17G9B.g you get:
Y17G9B.g CE21394 17 18

Then if you look for all occurrences of the CE21394 number:
Y17G9B.g CE21394 17 18 
Y17G9B.4 CE21394 18 72

From this you can assume that .g was renamed .4 as the gene encodes the same protein. 
This doesn't always work as the gene may have undergone some annotation changes which breaks this link. 


2) If the above doesn't work and you have a small list you can blast the old peptide against the genome and see which gene it overlaps.

Look in the wormpep.fasta190 file for the peptide sequence.

grep Y17G9B.a wormpep.history190 as before
Y17G9B.a CE21388 17 18 

wormpep.fasta190:
CE21388 MLRLKNFSNLRELSTDS--snip--PVDDLISFLETFELDEEDE
 
TBLASTN the peptide sequence against the elegans genome and see where it hits the current assembly. 

3) If you are interested in genes used in microarray and RNAi libraries see the previous FAQ.
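Approach 1 above can be sketched in a few lines of Python. This is only an illustration, assuming that wormpep.history has four whitespace-separated columns as in the sample lines shown (gene name, CE protein ID, and two release numbers); the function names `parse_history` and `find_renamed` are made up for this example.

```python
from collections import defaultdict


def parse_history(lines):
    """Parse wormpep.history-style lines into (gene, protein_id) pairs.

    Assumes whitespace-separated columns as in the FAQ sample:
    gene name, CE protein ID, then release numbers (ignored here).
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        records.append((fields[0], fields[1]))
    return records


def find_renamed(records, old_name):
    """Return other gene names that share a protein ID with old_name."""
    by_protein = defaultdict(list)
    for gene, protein_id in records:
        by_protein[protein_id].append(gene)

    candidates = []
    for gene, protein_id in records:
        if gene != old_name:
            continue
        for other in by_protein[protein_id]:
            if other != old_name and other not in candidates:
                candidates.append(other)
    return candidates


# Sample lines copied from the FAQ text above.
sample = [
    "Y17G9B.g CE21394 17 18",
    "Y17G9B.4 CE21394 18 72",
    "Y17G9B.a CE21388 17 18",
]
records = parse_history(sample)
print(find_renamed(records, "Y17G9B.g"))
```

An empty result (as for Y17G9B.a in this small sample) means the protein link is broken or absent, in which case approach 2 is the fallback.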

Where can I obtain C. elegans strains?

Please refer to the following link to obtain C. elegans strains:

http://www.cbs.umn.edu/CGC/strains/

About Database Queries

How do I get AQL to search data in hashes?

Like this example:

select p->Standard_name, a[Institution], a[Email] from p in class Person, a in p->Address[0] where exists p->Supervised

More documentation is available at http://www.acedb.org/Software/whelp/AQL/examples_worm.shtml; scroll down to "Queries on objects containing hash structures."

How can I obtain all the abstracts on Wormbase and the particular genes that they are associated with?

There are two ways:

1) go to ftp://ftp.sanger.ac.uk/pub/wormbase/current_release/ and get the acedb data files.

2) Use AcePerl to get the abstracts. You can do this easily with an AcePerl script:

   use strict;
   use warnings;
   use Ace;

   # Replace 'yourhost.com' with the host of your own ACeDB server.
   my $db = Ace->connect(-host => 'yourhost.com')
       or die "Connection failed: ", Ace->error, "\n";
   my $iterator = $db->fetch_many(-query => qq(find Paper where Abstract));
   while (my $obj = $iterator->next) {
       # grab info from the object
       my @genes = $obj->Gene;
       # ... etc ...
       print join(' ', @genes), "\n";
   }

How can I find out how many genes contain expression patterns generated with a specific method?

(For example, by in situ hybridization?)

Type the following command into the box under "Advanced Search" in the menu. The line below searches for Expr_pattern objects that have all three types of methods. If each '&' is replaced by '|', the command will instead search for Expr_pattern objects with In_situ OR Antibody OR Reporter_gene data.

find Expr_pattern Type="In_situ" & Type="Antibody" & Type="Reporter_gene"

You may change the words following the same syntax to search for other objects.
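If you need to build several such queries, the AND/OR pattern can be generated rather than typed. The helper below is a hypothetical sketch (the name `expr_pattern_query` is not part of WormBase); it only assembles the query string shown above and does not contact any server.

```python
def expr_pattern_query(methods, match_all=True):
    """Build an ACeDB query string over Expr_pattern Type values.

    match_all=True joins the clauses with '&' (all methods required);
    match_all=False joins them with '|' (any method is enough).
    """
    if not methods:
        raise ValueError("at least one method is required")
    operator = " & " if match_all else " | "
    clauses = [f'Type="{method}"' for method in methods]
    return "find Expr_pattern " + operator.join(clauses)


methods = ["In_situ", "Antibody", "Reporter_gene"]
print(expr_pattern_query(methods))
print(expr_pattern_query(methods, match_all=False))
```

The resulting string can be pasted into the same search box.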

How can I download the alignments of EST sequences to genomic sequences?

You can extract it from the GFF files that we provide with every release of WormBase. For more information on GFF files see:

http://www.sanger.ac.uk/Software/formats/GFF/

Basically, we release one GFF file per chromosome and this contains the coordinates and details of most features that we can map onto chromosome base pair coordinates.

These files are accessible from the main WormBase page (see the Feature table links on the right) and should also be on the WormBase and Sanger Institute FTP sites.

You will need to extract only a subset of these files, i.e. lines that match the pattern 'BLAT_EST_'. This is very easy to do if you have access to a UNIX/Linux system (use the 'grep' command).

E.g. here are two sample lines from the Chromosome II file (these will probably wrap around your screen):

CHROMOSOME_II BLAT_EST_BEST similarity 5754433 5755008 100
. . Target "Sequence:yk776e12.5" 21 596
CHROMOSOME_II BLAT_EST_OTHER similarity 5755968 5755971 98.4
. . Target "Sequence:yk4g4.5" 116 119

Within these lines are details of the chromosome coordinates, the BLAT score, the matching sequence name, and the coordinates within the matching sequence.
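If grep alone is not enough, the same filtering can be done in a short script that also splits out the fields described above. The following is a sketch, assuming the GFF files are tab-delimited with nine columns and that the last column has the `Target "Sequence:..." start end` form shown in the sample lines; `parse_blat_est` is a name invented for this example.

```python
import re

# Matches the attribute column of an alignment line, e.g.
#   Target "Sequence:yk776e12.5" 21 596
TARGET_RE = re.compile(r'Target "Sequence:([^"]+)" (\d+) (\d+)')


def parse_blat_est(lines):
    """Yield one dict per GFF line whose source column starts with BLAT_EST_."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or not fields[1].startswith("BLAT_EST_"):
            continue  # not an EST alignment line
        match = TARGET_RE.search(fields[8])
        if match is None:
            continue  # attribute column not in the expected form
        yield {
            "chromosome": fields[0],
            "source": fields[1],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "score": float(fields[5]),
            "est": match.group(1),
            "est_start": int(match.group(2)),
            "est_end": int(match.group(3)),
        }


sample = [
    # First sample line from the FAQ, written with tabs between columns.
    'CHROMOSOME_II\tBLAT_EST_BEST\tsimilarity\t5754433\t5755008\t100'
    '\t.\t.\tTarget "Sequence:yk776e12.5" 21 596',
    # A made-up non-EST line, included only to show that it is skipped.
    'CHROMOSOME_II\tother_source\texon\t1\t100\t.\t+\t.\tNote "placeholder"',
]
hits = list(parse_blat_est(sample))
for hit in hits:
    print(hit["est"], hit["chromosome"], hit["start"], hit["end"], hit["score"])
```

To run it over a real file, pass an open file handle instead of the `sample` list.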

How can I retrieve timestamps of Acedb from the command line, and do I have to use Perl?

We use AcePerl to retrieve some timestamp information...this is done via an AQL query.

E.g. if you wanted to find the timestamp of a tag in a particular object, belonging to a particular class, you could do:

my $aql_query = "select s,s->$tag.node_session from s in object(\"$class\",\"$object\")";
my @aql = $db->aql($aql_query);
my $timestamp = $aql[0]->[1];

How can I search for pseudogenes in Wormbase?

AQL queries for this will take a long time. However, a different kind of query can be run from the WormBase website.

From More search -> Advanced search at http://www.wormbase.org/db/searches/query

In Query Acedb, type in

find sequence *; pseudogene

You should get a result of pseudogene objects within a couple of seconds.

Where can I find a list of classes and subclasses for Acedb?

You can find a list of Acedb classes by first clicking on the More Searches link on the upper right corner of the WormBase home page. From here, select the WormBase Class Browser, which will bring you to a searchable drop-down menu of all the Acedb classes.

For performing queries, it is helpful to know the data model for each of the classes that you would like to search. The data models can be accessed from this same page by typing the class of interest into the search box and then selecting "Model" from the drop-down menu. This will lead you to a Tree Display that diagrams how data for a particular class is represented in Acedb.

Also, from the More Searches link, you can access the Advanced AQL Search, which has further documentation and examples for querying the database.

How can I retrieve the gi numbers only for a list of entries having the GO term selected?

At present, you can retrieve Genbank identifiers (i.e. AAMxxxxx, AAKxxxxx, AAFxxxxx, etc.) for CDS's that are associated with a particular GO term by performing an AQL query. Here are the steps:

   1) At the top right corner of the WormBase homepage, click on the More Searches link.

   2) Under the general heading, select the Advanced AQL Search link.

   3) Type the following query into the box:
   select a, b, c[1] from a in class CDS,
   b in a->go_term,
   c in a->protein_id where b = "GO:0003700"

   4) You should get back a three-column table listing each CDS, the GO term you selected, and a Genbank ID.

If you are interested, the rationale for the query can be better understood by looking at our data model for CDS's, which is at http://www.wormbase.org/db/misc/etree?name=%3FCDS&class=Model;expand=Visible#Visible. The above query searches in the CDS class, in the attribute go_term, where we have defined the go_term to be "GO:0003700", and in the attribute protein_id for the unique text id which is the database identifier. The [1] after the letter c in the query indicates that the search will retrieve information in the 1, or text, column of protein_id, since the sequence column is considered column 0.

How can I download C. briggsae 3' UTRs in bulk?

We don't really have a strictly empirical set of 3' UTRs (3' flanking sequences taken from cDNA). However, what you probably really want are predicted 3' UTR regions. Those, you can get by going to the Genome Dumper:

http://wormbase.org/db/searches/advanced/dumper

selecting the species "C. briggsae", and then filling in the options for the download that you want. The main stumbling block is that the list you need for briggsae genomic sequences is rather long. However, I've tried typing this:

cb25.*

into the window "Type in a list of sequence or chromosome names", and that seems to successfully prompt the Genome Dumper to search through all available C. briggsae genomic contigs.

Also, you should pick the "Integrated ('hybrid') briggsae gene set", and select some reasonable value (e.g., 1000 bp) for the length of the 3' flank sequences that you want.

How can I find the coding sequences of alleles for particular gene(s) having SNPs from C. elegans?

For instance, if you want to find the SNP sequences for the H39E23.1a gene, you can use the following AQL query:

select a->predicted_gene, a, a->flanking_sequences[1],
a->flanking_sequences[2], a->substitution from a in class allele where
a->predicted_gene = "H39E23.1a" and a->method = "snp"

The output of the query (in text mode) looks like this:

H39E23.1a snp_AH10.2 tgaaaaaaactaatttttaatgtga tcttggccacaattgacctagtttg [A/G]
H39E23.1a snp_AH10.3 ctgaacaactgaaaaaggaaagaaa agggaaaaagttcgaccacaaaaaa [G/A]

Here the first column is the gene name, second is the allele name, third and fourth are sequences flanking the allele and the last one is the actual allele sequence change. You can modify the query to retrieve information for genes that you're interested in.
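To turn one row of that output into actual sequences, the flanks and the substitution can be joined. The sketch below assumes that the bracket lists the original base first and the changed base second, and that both flanks are on the same strand; neither assumption is stated in the output itself, and the function names are made up for this example.

```python
import re


def expand_snp(left_flank, right_flank, substitution):
    """Return (original, changed) sequences for a substitution like "[A/G]".

    The substituted base is written in upper case so that it stands out
    from the lower-case flanking sequence.
    """
    match = re.fullmatch(r"\[([ACGTacgt]+)/([ACGTacgt]+)\]", substitution)
    if match is None:
        raise ValueError(f"unrecognised substitution: {substitution!r}")
    original, changed = match.group(1).upper(), match.group(2).upper()
    return (left_flank + original + right_flank,
            left_flank + changed + right_flank)


def parse_row(line):
    """Split one text-mode output row into its five columns."""
    gene, allele, left, right, change = line.split()
    return gene, allele, left, right, change


# First output row from the FAQ text above.
row = ("H39E23.1a snp_AH10.2 tgaaaaaaactaatttttaatgtga "
       "tcttggccacaattgacctagtttg [A/G]")
gene, allele, left, right, change = parse_row(row)
original_seq, changed_seq = expand_snp(left, right, change)
print(allele, original_seq)
print(allele, changed_seq)
```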

How can I download the spliced and non-spliced regions for all C. elegans or C. briggsae genes?

You can download spliced/unspliced sequences for a list of genes using the Batch Gene tool: http://www.wormbase.org/db/searches/info_dump. You can paste a list of the genes you're interested in into the search box and select the Spliced and Unspliced check boxes in the Sequence field. If you output the data as text, you'll be able to save it to your hard disk.

To get the list of C. briggsae genes (so that you can paste it into the search box), you can use the following query: select a from a in class cds where a->species like "*briggsae*".


How do I find all genes with transmembrane or signalp domains?

Go to the WB Query page and enter this query.

'find Wormpep where Feature AND NEXT = "signalp"; follow Corresponding_CDS; follow Gene'

Substitute "tmhmm" for "signalp" to get genes with transmembrane domains.

Alternatively, all genes with transmembrane domains are automatically assigned the GO term GO:0016021.

WormMart questions

Is there a way for a large number of genes to get not just the alleles, but also the actual mutation (when known)?

Here's what you need to do:

  • Open WormMart
  • SELECT RELEASE, SELECT DATABASE, SELECT DATASET [Variation]
  • Click Filters in the left menu, then expand the Other Annotation filter and under Annotated with select [Sequence] Flanking Sequence
  • Upload your file of GeneIDs to the Specified identifier of type field.
  • Hit Count then Results to check your file is being read. nb. if you have chosen WS176 you should get n/80791.
  • Click Attributes in the left menu and under Identification select Variation (Name), Method, Variation Type (merged) and Mutation Type. Under Affects select Gene (WB Gene ID) (merged), Gene (CGC name) and under Description select Sense, Sense Text, Splice site, Splice Site Text and Frameshift
  • Hit Results

--Tuli 04:42, 6 July 2007 (EDT)


How can I retrieve 1.5 kb promoter region upstream of a bunch of genes?

  • Open WormMart at http://www.wormbase.org/biomart/martview/
  • SELECT RELEASE, SELECT DATABASE, SELECT DATASET [Gene]
  • Click Filters in the left menu, then expand the "Identification" filter, tick "[Gene] ID(s) of Type", select "[Gene] Any Name", upload (using "Browse") or type in a list of genes
  • Hit Count to check your file is being read correctly
  • Click Attributes in the left menu, tick "Gene Sequences", expand "Sequence Type", tick "Flank (Gene Coding Region)" (for upstream of translation start site), expand "Flanking Regions", tick "Upstream flank", type 1500 in the box
  • Hit Results
  • Results can be exported or e-mailed

How can I find all genes expressed somewhere with a particular GO term?

(E.G., Find all genes expressed in the Vulva that have signal transducer activity?)

First you need to identify the exact ontology identifiers that correspond to your query. Search for the term in the main search box on the home page with the appropriate category selected.

   vulva  -  WBbt:0006748
   signal transducer activity - GO:0004871 

Armed with these you can start your WormMart query.

  • Select which version of the database you are interested in (current release or a recent frozen one)
  • Select 'Gene' DATASET
  • In the left panel click 'Filters'. This will give an expandable list of filters in the main window.
  • Expand 'Annotation'
  • Check the ' [Annotation] IDs of Type: ' box and select '[Function] GO term ID' from the dropdown menu
  • Put the GO term found earlier (GO:0004871) into the box (or upload from file). (NOTE: You must include the GO: part)

This will find all genes annotated with GO term GO:0004871. Hit Count to see the results so far (111 at the time of writing). Now we will add a second dataset to cross-reference this result with.

  • Back in the left panel, click 'Dataset' to get a drop down menu of other datasets.
  • Select 'Expression Pattern'
  • In the left panel, under 'Dataset' there is another 'Filters' option. Click to get a similar list as above.
  • Select ' Expressed in '
  • Check ' Specified identifiers of type ' box and choose ' Anatomy Term [eg WBbt:0006748]' from drop down list.
  • Enter the Anatomy term found earlier (or upload from file). (NOTE:You must include the WBbt: part)

This completes the querying part, you now need to select what information you want about the genes that the search finds.

  • Click the 'Attributes' section in the left panel and expand the boxes as you need to select output categories of the Gene.
  • Click the 'Attributes' under 'Datasets' section to select attributes to do with the expression pattern.

This link goes to the completed query. You can click on the relevant sections as described above to change any of GO or Anatomy terms and output data. You may need to click 'Results' to see the output of the query.

How do I pull out all operon details and the names of genes contained in an Operon?

You can retrieve this data through WormMart

  • Select which version of the database you are interested in (current release or a recent frozen one)
  • Select 'Gene' DATASET
  • In the left panel click 'Filters'. This will give an expandable list of filters in the main window.

Select these Filters:

[Gene] Species : Caenorhabditis elegans
[Gene] Status : Live
[Location] : Operon:Only (Annotation Tab - Limit to Entries Annotated with:)

  • In the left panel click 'Attributes'
  • In the right panel select these Attributes:

Gene WB ID [IDs tab]
Operon [Location tab]
Operon Start [Location tab]
Operon End [Location tab]

This will give you a table like:

Gene WB ID	Gene Public Name  Operon	Operon Start (bp)  Operon End (bp)
WBGene00000001	aap-1	          CEOP1906	5106224	           5111008
WBGene00000037	ace-3	          CEOP2632	14197942	   14210076
WBGene00000038	ace-4	          CEOP2632	14197942	   14210076
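Once the table is exported, the genes can be grouped by operon with a few lines of Python. This is a sketch that assumes a tab-delimited export whose column headers match the table above; the function name `genes_by_operon` is invented for this example.

```python
import csv
import io
from collections import defaultdict


def genes_by_operon(tsv_text):
    """Group gene public names by operon from a tab-delimited export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    operons = defaultdict(list)
    for row in reader:
        operons[row["Operon"]].append(row["Gene Public Name"])
    return dict(operons)


# The three rows shown in the FAQ table, written with tabs between columns.
sample = (
    "Gene WB ID\tGene Public Name\tOperon\tOperon Start (bp)\tOperon End (bp)\n"
    "WBGene00000001\taap-1\tCEOP1906\t5106224\t5111008\n"
    "WBGene00000037\tace-3\tCEOP2632\t14197942\t14210076\n"
    "WBGene00000038\tace-4\tCEOP2632\t14197942\t14210076\n"
)
grouped = genes_by_operon(sample)
for operon, genes in grouped.items():
    print(operon, ", ".join(genes))
```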

How do you retrieve all the protein sequences of genes within Operons?

You can retrieve this data through WormMart

  • Select which version of the database you are interested in (current release or a recent frozen one)
  • Select 'Gene' DATASET
  • In the left panel click 'Filters'. This will give an expandable list of filters in the main window.

Select these Filters:

[Gene] Species : Caenorhabditis elegans
[Gene] Status : Live
[Location] : Operon:Only (Annotation Tab - Limit to Entries Annotated with:)

  • In the left panel click 'Attributes'
  • In the right hand window click 'Gene Sequences'

Select These Attributes:

Sequence Type:
 Peptide
Header Attributes
 Gene WB ID
 Gene Public Name
 WB Wormpep ID

  • Click on 'Results'

This should give you a count of ~2800 and results like:

> WBGene00000814|csn-2|WP:CE27562
MGDEYMDDDEDYGFEYEDDSGSEPDVDMENQYYTAKGLRSDGKLDEAIKSFEKVLELEGE
KGEWGFKALKQMIKITFGQNRLEKMLEYYRQLLTYIKSAVTKNYSEKSINAILDYISTSR
QMDLLQHFYETTLDALKDAKNERLWFKTNTKLGKLFFDLHEFTKLEKIVKQLKVSCKNEQ
GEEDQRKGTQLLEIYALEIQMYTEQKNNKALKWVYELATQAIHTKSAIPHPLILGTIREC
GGKMHLRDGRFLDAHTDFFEAFKNYDESGSPRRTTCLKYLVLANMLIKSDINPFDSQEAK
PFKNEPEIVAMTQMVQAYQDNDIQAFEQIMAAHQDSIMADPFIREHTEELMNNIRTQVLL
RLIRPYTNVRISYLSQKLKVSQKEVIHLLVDAILDDGLEAKINEESGMIEMPKNKKKMMV
TSLVVPNAGDQGTTKSDSKPGTSSEPSTTTSVTSSILQGPPATSSCHQELSMDGLRVWAE
RIDSIQSNIGTRIKF*
etc. etc.
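The exported file can then be read back into header fields and sequences. The reader below is a minimal sketch, assuming the header attributes are separated by '|' in the order selected above and that a trailing '*' marks the stop; `read_fasta` is a name made up for this example, and the sample keeps only the first and last sequence lines of the entry shown.

```python
def read_fasta(lines):
    """Yield (header_fields, sequence) pairs from FASTA-formatted lines.

    Header fields are split on '|'; a trailing '*' (stop) is removed.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).rstrip("*")
            header = line[1:].strip().split("|")
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).rstrip("*")


# Header plus the first and last sequence lines of the entry above
# (the middle lines are omitted here to keep the example short).
sample_lines = [
    "> WBGene00000814|csn-2|WP:CE27562",
    "MGDEYMDDDEDYGFEYEDDSGSEPDVDMENQYYTAKGLRSDGKLDEAIKSFEKVLELEGE",
    "RIDSIQSNIGTRIKF*",
]
entries = list(read_fasta(sample_lines))
for fields, sequence in entries:
    gene_id, public_name, wormpep_id = fields
    print(gene_id, public_name, wormpep_id, len(sequence))
```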

How do I retrieve a list of transcription factors which when mutated or targeted by RNAi cause embryonic lethal phenotype?

  • Open WormMart
  • Select Database, e.g. "WormBase Release 188"
  • Select Dataset - "Phenotype"
  • Click on "Filter" link on the left and then on the "+" next to "Phenotype Annotation"
  • Select "Phenotype Inc. Descendents" checkbox and select "embryonic_lethal" from the pull down menu

If you click "Count" button at this point, you should see the number of entries that are annotated with this phenotype (this is not necessary).

Now add a second dataset:

  • Click on second "Dataset" link on the left
  • Choose Additional Dataset - "Gene"
  • Click on "Filter" link on the left (for the second dataset) and then on the "+" next to "Annotation"
  • Select "[Annotation] IDs of Type" checkbox, select "[Function] GO Term ID" and enter GO:0003700 in the box below (corresponds to transcription factor activity)
  • Click on "Attributes" link on the left (for the second dataset) and then on the "+" next to "Function" and select "GO Term Info (merged)" checkbox (if you want to see GO annotations in addition to attributes selected by default, which you can change for each dataset through the Attributes dialog)
  • Press Results button and Export all results to File (also check Unique results only)

Here is what you should see.


How do I download/generate a file containing the unspliced transcripts like I see on the sequence pages of WormBase?

I would like to download the sequences that I see on the sequence summary pages eg.

Image:Unspliced_transcript.jpg

To do this, replicate the following WormMart query.


Dataset:    CHOOSE DATABASE: WormBase WS195
                 CHOOSE DATASET: Gene

Filters leave as default and add (*)

       [Gene] Species : Caenorhabditis elegans

       [Gene] Status : Live

        * Annotation: [Transcript] Type: Coding


Attributes: select [Gene Sequences] at top of page

        Sequence Type: Unspliced (Transcript)

        Header Attributes: Whatever the user requires


This should give you a count in the region of 22,000 objects and yield ~27,000 sequence objects in your file.

If you have a specific list of genes you want sequence data for, you can upload a file of IDs.
For example, a file of WBGene IDs has one ID per line:

WBGene00000001
WBGene00000002
WBGene00000003

Go back to your WormMart session and on the Filters tab select ([Annotation] IDs of Type:) and upload your file.
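Before uploading, it can help to check that the ID file is clean. The sketch below assumes that a WBGene ID is the prefix "WBGene" followed by eight digits, which is inferred from the examples above rather than from a format specification; the function names are hypothetical.

```python
import re

# Assumed ID shape, based on the examples above: "WBGene" + eight digits.
WBGENE_RE = re.compile(r"^WBGene\d{8}$")


def clean_id_list(names):
    """Return unique, well-formed WBGene IDs in their original order."""
    seen = set()
    cleaned = []
    for name in names:
        candidate = name.strip()
        if WBGENE_RE.match(candidate) and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


def to_upload_text(ids):
    """Format IDs one per line, as in the example file above."""
    return "".join(f"{gene_id}\n" for gene_id in ids)


raw = ["WBGene00000001", " WBGene00000002 ", "aap-1", "WBGene00000001"]
ids = clean_id_list(raw)
print(to_upload_text(ids), end="")
```

Names that are not WBGene IDs (such as aap-1 here) are dropped by this check, so they would need to be uploaded under a different ID type.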

WormBase will provide a pre-computed file under the sequence directory on the ftp site:

ftp://ftp.sanger.ac.uk/pub/wormbase/live_release/genomes/c_elegans/sequences/dna/


Which GFF source and feature (method) should I use?

The terms feature and method are used interchangeably. See GFF_source_methods.