Searching WormBase for Information About C. elegans (Wiley)
- 1 INTRODUCTION
- 2 BASIC PROTOCOL 1: NAVIGATING THE WormBase HOME PAGE
- 3 BASIC PROTOCOL 2: PERFORMING A DATABASE SEARCH
- 4 BASIC PROTOCOL 3: EXAMINING A GENE IN C. ELEGANS
- 5 BASIC PROTOCOL 4: EXAMINING A MOLECULAR SEQUENCE IN C. ELEGANS
- 6 BASIC PROTOCOL 5: FINDING PROTEIN FEATURES
- 7 BASIC PROTOCOL 6: SEARCHING FOR GENE PRODUCTS WITH PARTICULAR SEQUENCE MOTIFS
- 8 BASIC PROTOCOL 7: USING THE GENOME BROWSER
- 9 BASIC PROTOCOL 8: VIEWING THE C. BRIGGSAE GENOME AND ITS SYNTENY WITH C. ELEGANS
- 10 BASIC PROTOCOL 9: FINDING SEQUENCE SIMILARITIES WITH BLAST
- 11 BASIC PROTOCOL 10: MINING GENE DATA WITH WORMMART
- 12 BASIC PROTOCOL 11: DOWNLOADING A BATCH OF SEQUENCES
- 13 BASIC PROTOCOL 12: EXAMINING THE GENOMIC CONTENT OF A CLASSICAL GENETIC INTERVAL
- 14 ALTERNATE PROTOCOL 1: INSTALLING AND RUNNING WormBase LOCALLY
- 15 COMMENTARY
- 16 KEY REFERENCES
- 17 INTERNET RESOURCES
- 18 FIGURE(S)
WormBase is the major public biological database for the nematode Caenorhabditis elegans (Chen et al., 2005). It is meant to be useful to any biologist who wants to use C. elegans, whatever his or her specialty. WormBase contains information about the genomic sequence of C. elegans, its genes and their products, and its higher-level traits such as gene expression patterns and neuronal connectivity. Also, WormBase contains sequence and gene structures of C. briggsae and C. remanei, two closely related worms. These data are interconnected, so that a search beginning with one object (such as a gene) can be directed to related objects of a different type (such as the DNA sequence of the gene, or the cells in which the gene is active). One can also do searches for complex data sets.
WormBase is constantly being changed and expanded, both by curation of newly available data and by modifications to the user interface. The entire database is updated and rebuilt into a new release every 3 weeks throughout the year, in releases named "WSnnn" (with "nnn" being 147 as of this writing). To give bioinformaticians reproducible data sets for their analyses, each tenth release is made permanently available as a frozen online archive, roughly once every seven months. Releases WS100 through WS140 have been archived so far. All of the information in this chapter is based on the version of WormBase available in September, 2005 (WS147 release; BioPerl/Generic Model Organism Database software).
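The release arithmetic above can be checked with a few lines of Python. This is only a sketch: the 3-week cadence and the "each tenth release" rule come from the text, while the assumption that freezes fall on round release numbers (WS100, WS110, and so on) is inferred from the examples given and is not stated outright. The function names are illustrative.

```python
WEEKS_PER_RELEASE = 3   # "every 3 weeks", from the text
FREEZE_INTERVAL = 10    # "each tenth release", from the text


def release_name(number: int) -> str:
    """Format a release number in the WSnnn style."""
    return f"WS{number}"


def frozen_releases(first: int, current: int) -> list[str]:
    """List releases between first and current that fall on a multiple of ten.

    Assumption: freezes land on round numbers (WS100, WS110, ...).
    """
    return [
        release_name(n)
        for n in range(first, current + 1)
        if n % FREEZE_INTERVAL == 0
    ]


# Ten releases at three weeks each, converted to months
# using a mean month length of 30.44 days.
weeks_between_freezes = WEEKS_PER_RELEASE * FREEZE_INTERVAL
months_between_freezes = weeks_between_freezes * 7 / 30.44

print(frozen_releases(100, 147))
print(weeks_between_freezes, round(months_between_freezes, 1))
```

Thirty weeks is a little under seven months, which agrees with the "roughly once every seven months" figure.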
The protocols described in this chapter include the following: general searches of WormBase with single search terms; studying a gene, sequence, or protein with its individual web page, or with the Genome Browser; searching for proteins by BLAST hits, sequence motifs, or Gene Ontology terms; aligning C. elegans with C. briggsae genomic sequences; detailed, user-customized searches with WormMart or AceDB Query Language; batch downloads of many sequences at once; identifying the genomic regions, genetic contents, or molecular clones spanning a defined chromosomal interval; electronic PCR; finding expression patterns, and the cell types or developmental origins from which these patterns arise; and searching for genome-wide RNAi results yielding particular phenotypes. Some advice is also provided for installing and running WormBase on a local computer.
A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)
Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)
The home page for the main WormBase site, http://wormbase.org, allows general searches, gives links to specialized searches, and has news about improvements to WormBase.
This page is divided into several parts. At the top is a large, simple menu bar giving quick access to six popular search pages, a Web form for submitting comments or new data, an all-purpose Searches page, and a Site Map. This bar appears at the top of all of the WormBase pages and also includes a link back to the home page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). On the home page immediately below this menu bar are the serial number of the data release currently in use, and the official WormBase logo.
The next section of the main page has fields and menus that are used for a basic search of the full database (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). A user can choose searches from any of over 20 different data types from the "Search for" drop-down menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). The check boxes immediately beneath the search fields offer choices to require strict identity for search terms, to give results in XML format, or to search primary research articles in depth with Textpresso (http://www.textpresso.org; Müller et al., 2004).
Below the basic search section are a directory for the entire WormBase site (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png), and news about C. elegans bioinformatics. Near the bottom of the page is a Links section (with connections to several other sites about C. elegans which complement WormBase in some way), and a list of other sites running WormBase itself (the development server, a data-mining server, official mirror sites, and archival data freeze sites for http://wormbase.org). Finally, the bottom of the page gives a link for users to make comments or ask questions to the WormBase staff, and other links to WormBase policies on copyright and privacy; these links are given on every WormBase page.
WormBase has a development site (http://dev.wormbase.org) that uses the very latest data release and site software, while the main site lags by one release. The two sites are closely similar, but the main site focuses on stability and the development site on novelty.
The examples of searches given in the following protocols are intended to be illustrative, but not exhaustive; many other searches are possible.
This protocol presents how to conduct a general search of WormBase.
A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)
Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)
1. Go to the main page, either by entering its URL http://wormbase.org, or by clicking on the Home link at the top left of any WormBase Web page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png).
2. Type a word or phrase into the search form (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). For a search of the entire database, select Anything in the "Search for" pull-down menu and then click the Search button.
3. Once the search has run, examine the results (if any). If a single data record is found, the search routine will sometimes go directly to the Web page for that record (it will automatically do this for Gene pages). However, it will usually instead have 0 or ≥2 results. In the latter case, the search will give one or more summary pages on which the data records found are listed.
4. Look at the page of summarized results, and consider whether there are too few or too many. If there are too few results, try different key words for the search; computers are literal-minded, and a particular search word may be almost but not quite recognized by the database. Conversely, if there are too many results, select a restricted subset of the database for the search from the "Search for" pull-down menu (e.g., "Any Gene"; (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png)), or check the "Exact match" box and resubmit the search by clicking the Search button.
5. If the search results look reasonable, then follow the hypertext links to the data records themselves.
6. To get a list of many data records all falling into a specific class, pick a subset of the database and search with a wildcard. For instance, if the "Any Gene" search is selected, and run with unc-* as the search term, WormBase will return links to 117 Locus records (snt-1 through vab-8, with 114 unc-* hits). When possible, WormBase recognizes and returns synonyms in searches (which is why a search for unc-* genes returns three non-unc-* gene names; these three genes have aliases of unc-107, unc-110, and unc-121).
7. Remember that search words can be anything of interest. While there is no guarantee that the database will give hits for any given search string, specific topics of interest can have useful results.
For instance, consider a search for hyperplasia under all categories (i.e., by choosing Anything from the pull-down menu in http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Hyperplasia is a topic normally associated with cancer biology or endocrinology (Merke and Bornstein 2005; Simpson et al., 2005), but one might want to see the potential for C. elegans to serve as a model for the control of tissue proliferation. A search with this term reveals nine hits, including six genetic loci that either have hyperplastic phenotypes or are homologs of human disease genes with roles in deregulated proliferation.
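The wildcard behavior described in step 6, including the way an alias can pull in a gene whose primary name does not match, can be sketched with Python's standard `fnmatch` module. The records below are hypothetical toy data, not real WormBase entries, and the `search` function is only an illustration of the idea; it says nothing about how WormBase implements its own search.

```python
from fnmatch import fnmatchcase

# Hypothetical toy records; the names and aliases are invented for illustration.
GENES = [
    {"name": "unc-1", "aliases": []},
    {"name": "unc-22", "aliases": []},
    {"name": "abc-1", "aliases": ["unc-900"]},  # matches only through its alias
    {"name": "xyz-3", "aliases": []},
]


def search(pattern: str, genes: list[dict]) -> list[str]:
    """Return primary names of genes whose name or any alias matches pattern."""
    hits = []
    for gene in genes:
        candidates = [gene["name"], *gene["aliases"]]
        if any(fnmatchcase(candidate, pattern) for candidate in candidates):
            hits.append(gene["name"])
    return hits


print(search("unc-*", GENES))  # wildcard search, aliases included
print(search("unc-1", GENES))  # no wildcard: behaves like an exact match
```

Because the result lists primary names, a gene found through an alias appears under a name that does not itself match the pattern, which is the effect the text describes for the three non-unc-* hits.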
A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)
From the front WormBase page, enter the name of a gene of interest and click Search (leaving the "Search for" menu set to its default, "Any Gene"). A successful search will lead directly to a Gene page for the gene requested.
A Gene page is intended to summarize everything of biological importance known about a given gene; it can thus be quite complex, although it is also meant to be concise (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png).
The page can describe any or all of the following information, when available: identity of the gene's product; normal function of the gene's product; orthologs of the gene (if any); the gene's meiotic and physical location within the genome; its phenotype in classical mutations or RNAi screens; its spatial and temporal expression pattern, with microarray data; domains of its encoded protein; Gene Ontology (UNIT 7.2) terms describing its function; mutant alleles and strains carrying them; homologs identified by best BLASTP (UNIT 3.4) scores versus other proteomes; cDNA, STS, and antibody reagents; microarray probes; SAGE oligonucleotides; and references from the primary C. elegans research literature. All of these are given with hypertext links allowing the user to find more information on a given datum of interest.
Most of this information is given as tabulated text. However, for cloned genes, there will also be a schematic diagram of the gene's physical organization. For instance, the diagram on the zyg-1 Gene Page reveals that zyg-1 actually has a functional gene, bli-2, nested within itself (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). This diagram provides a graphical link to the Genome Browser (see Basic Protocol 7).
C. elegans is currently known to have ~20,100 protein-coding genes, as well as ~900 genes producing non-protein-coding RNA transcripts. A gene in C. elegans can be studied in several different ways. The original method was genetics, in which a gene was equated with a classical locus at which two or more different alleles had been mapped (Brenner, 1974). Despite all advances in functional genomics, classical genetics remains uniquely powerful in uncovering new aspects of worm biology (Jorgensen and Mango, 2002).
There are, accordingly, 44,606 Gene objects in the WS147 release of WormBase, marking either current or obsolete gene predictions for C. elegans. In the database, they are given uninformative serial numbers such as "WBGene00021622". However, in real use these genes have either names like "xyz-N" (a three-letter lower-case abbreviation, where N represents an Arabic numeral) or names like "cosmid.number" (where "cosmid" is the name of a genomic clone, usually a cosmid, sequenced in the C. elegans genome project [C. elegans Sequencing Consortium, 1998], and "number" denotes an otherwise anonymous gene embedded in the clone).
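The three kinds of gene identifier just described can be told apart with simple pattern matching. The sketch below is illustrative only: the regular expressions are assumptions inferred from the examples in this unit (an eight-digit WBGene serial number, a three-letter classical name, and a clone name followed by a dot and a number), not an official WormBase grammar, and real names may not always fit them.

```python
import re

# Assumed patterns, inferred from examples such as WBGene00021622,
# zyg-1, and F59E12.2; they are not an official specification.
PATTERNS = {
    "database ID": re.compile(r"WBGene\d{8}"),
    "classical name": re.compile(r"[a-z]{3}-\d+"),
    "sequence name": re.compile(r"[A-Z0-9]+\.\d+"),
}


def classify(identifier: str) -> str:
    """Return which naming scheme an identifier appears to follow."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(identifier):
            return label
    return "unrecognized"


for example in ["WBGene00021622", "zyg-1", "F59E12.2", "hyperplasia"]:
    print(example, "->", classify(example))
```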
There are fewer classical (three-letter) gene names for C. elegans genes (~6,100) than names for genes identified through genomic sequencing (~21,000). An increasing number of three-letter names are given to genes identified purely through genomic sequencing and analysis. Conversely, a significant number of genes identified through classical mutagenesis have not yet been linked to the genome through cloning. So, the rule of thumb is that, while Gene objects in WormBase will often have both a classical and a sequence name (as zyg-1 does), this is not inevitable; many genes in C. elegans either lack a classical name or have a classical (mutant) name without a known sequence name. At the same time, functional genomics is being applied to all genes in C. elegans through chromosome-wide RNAi screens, microarray analyses, and protein interaction mapping (Ge et al., 2003; Gunsalus et al., 2005). WormBase therefore annotates a good deal of information about "anonymous" genes with only nondescript sequence names.
If a set of Gene records is wanted, do the search for "Any Gene" with a wildcard entry (such as unc-*), or do a basic search (for "Anything"). With a well chosen query, this will produce a summary page having one or more entries in the database with the general format Gene:xyz-N. Examine the entries and click on whichever is most useful.
Each Gene page for a genomic protein-coding sequence (CDS), even if it lacks mutant alleles mapped through classical genetics, has an interpolated genetic map position for that coding sequence, with a link to a list of nearby genes. This map position is formatted and handled in the same way as the genetic map position for a classical genetic locus (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). In the case of a gene like zyg-1, which already has mutant alleles, the Genetic Position link (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png) will mainly help design tests for allelism with other mutant loci. However, for a gene identified solely as a CDS, interpolated map information can indicate which classical mutant alleles might reside within the gene.
WormBase has Sequence Pages as well as Gene Pages. Sequence Pages are focused on details of the nucleotide and predicted protein-coding sequences of genes.
Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)
1. From any page where a molecular sequence is given as relevant to some other object (such as zyg-1, which has the synonym F59E12.2; http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png), click on the relevant Sequence Report or Sequence name link. This will lead to a Sequence page.
Like a Gene page, a Sequence page will have diverse information, with a somewhat greater emphasis on genomic sequence rather than classical biology.
The F59E12.2 Sequence page in WS147 gives a schematic diagram of F59E12.2's organization in the genome somewhat different from that on the Gene page, a BLAST (UNIT 3.3) search launcher, external links of the protein sequence to GenPept and SwissProt, and precise numerical coordinates for the F59E12.2 transcription unit along with the identities of cDNAs matching it. The BLAST search is automatically set up against WormBase's genomic sequence data by clicking on the "Run BLAST or BLAT" button (see Basic Protocol 9). This Sequence page is also a convenient place to get the sequence in FASTA format (Apweiler, 2005; Appendix 1B) for BLAST searches outside WormBase. From the F59E12.2 Sequence page, one can cut and paste the sequence in FASTA format directly into one of the search pages at the NCBI (http://www.ncbi.nlm.nih.gov/BLAST; UNITS 3.3 & 3.4) or into a more specialized server. The Sequence page also gives links to the Protein page for zyg-1/F59E12.2 and to the C. briggsae ortholog of zyg-1.
2. Click the "Click Here to Browse" link to display the corresponding region in the Genome Browser (see Basic Protocol 7).
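For readers who end up with a bare sequence rather than a ready-made FASTA record, the format mentioned in step 1 is simple to produce: a header line beginning with ">" followed by the sequence in fixed-width lines. The sketch below is a minimal illustration; the 60-character line width is a common convention rather than a requirement, the function name is invented here, and the sequence in the usage example is a made-up placeholder, not the real F59E12.2 sequence.

```python
def to_fasta(identifier: str, sequence: str, description: str = "", width: int = 60) -> str:
    """Format a sequence as a single FASTA record.

    Whitespace inside the sequence is removed and letters are upper-cased
    before the sequence is split into lines of at most `width` characters.
    """
    header = f">{identifier} {description}".rstrip()
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header, *lines]) + "\n"


# Placeholder sequence for illustration only.
print(to_fasta("F59E12.2", "atgc" * 20, description="example record"))
```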
WormBase has, in addition to Gene and Sequence pages, Protein records that give the known or predicted details of a gene's protein product(s).
1. From the front WormBase page, select "Protein, C. elegans" from the "Search for" pull-down menu, then enter the name of a protein of interest in the adjacent text field and click the Search button.
2. A successful search will lead directly to a "Protein report" for the protein requested.
This is the Protein page for that polypeptide in WormBase.
3. Alternatively, from a Gene or Sequence page, click on the link given for Corresponding Protein(s).
For example, the Gene page for zyg-1 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png) and the Sequence page for F59E12.2 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_05.png) both link to the Protein Report page for WP:CE28571.
WormBase uses a numbering system for C. elegans proteins with the format WP:CExxxxx, where the last five digits are numbered 00001 to 99999. (For a search of WormBase, CExxxxx is sufficient.) A CE number is unique to a particular peptide sequence, and new CE numbers are generated every time that a change or expansion in predicted gene structures implies a new conceptual protein sequence; where there is the possibility of confusion, CE numbers prevent any problems with changes in Sequence nomenclature.
4. Examine the page for protein characteristics of interest.
A Protein page contains several different classes of information (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png). For a given protein, where applicable, WormBase will show the following: motifs from InterPro (Mulder et al., 2005) and PFAM (Bateman et al., 2004; UNIT 2.5); regions of low-complexity sequence (detected with SEG; Wootton, 1994); predicted transmembrane domains (via TMHMM; Krogh et al., 2001) or coiled-coil domains (through NCOILS; Lupas, 1997); the amino acid sequence; length, estimated isoelectric point and molecular weight, and residue composition.
In addition to features inherent to a protein sequence, WormBase also gives the protein's precomputed BLASTP search hits against a number of databases. These databases include C. elegans itself (to detect paralogs within the worm genome); Saccharomyces cerevisiae (from SGD; Balakrishnan et al., 2005); Drosophila melanogaster (from Gadfly/FlyBase; Drysdale et al., 2005); human beings (from Ensembl; Hubbard et al., 2005); and a nonredundant subset of proteins from SWISS-PROT and TrEMBL (Bairoch et al., 2005). Similarities detected by BLASTP are diagrammed alongside motifs and other sequence features (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png).
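The WP:CExxxxx numbering described under step 3 accepts two spellings for the same protein (with or without the "WP:" prefix), which can be awkward when comparing lists of identifiers. The sketch below normalizes both to the full form. It assumes exactly five digits, as the text states; the function name is invented for illustration, and nothing here queries WormBase.

```python
import re

# Five digits after "CE", with an optional "WP:" prefix, as described in the text.
CE_PATTERN = re.compile(r"(?:WP:)?CE(\d{5})")


def normalize_protein_id(identifier: str) -> str:
    """Return the identifier in full WP:CExxxxx form, or raise ValueError."""
    match = CE_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not a CE protein identifier: {identifier!r}")
    return f"WP:CE{match.group(1)}"


print(normalize_protein_id("CE28571"))
print(normalize_protein_id("WP:CE28571"))
```

Both calls give the same normalized identifier, so either form can be used as a dictionary key or for deduplication.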
One way to identify all C. elegans proteins sharing a given PFAM or InterPro motif is to check that motif's WormBase entry. One advantage of such searches is that hypertext links are given for each motif to a short description of its biological properties (if any; many motifs are evolutionarily conserved but functionally uncharacterized).
Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)
1. From the front page, enter a term for the desired motif, such as "ribonucleoprotein", select "Protein Family or Motif" from the pull-down menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png), click Search, and study the results.
Click on the hyperlinked motifs and examine them individually to see which are most useful.
2. Alternatively, pick a gene whose protein product is known to be in the class of interest. Find its Gene page by a standard search (see Basic Protocol 3). Click from there to its Protein page(s) and check for motifs (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png).
3. Using either method, if one or more motifs are found, click on them to see whether they have other proteins associated with them.
An example of a motif found by searching for ribonucleoprotein is the InterPro motif IPR001163, "Like-Sm ribonucleoprotein, core." The motif has 17 associated proteins; in cases where a motif's gene has a classical three-letter name, this is shown.
All the sections of WormBase described so far consist mainly of written text. However, organisms have a text of their own, the genomic DNA sequence, within which important biological features reside. WormBase thus provides schematic diagrams of genomic DNA, as well as textual descriptions of genes, with a Generic Genome Browser that allows customizable views of C. elegans DNA (Stein et al., 2002).
Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)
1. Click the Genome link at the top left of the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png) to start the Genome Browser (http://wormbase.org/db/seq/gbrowse/wormbase/).
2. Select an appropriate search term. The Genome Browser requires a "Landmark or Region" (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png), but this can include genome-wide scans for nucleotide sequences, gene classes, or key words in gene descriptions. Examples are given on the Genome Browser page. The Browser can view both the linear nuclear chromosomes (autosomes I through V, plus the sex chromosome X; http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png) and the circular mitochondrial chromosome (MtDNA, shown as a line).
3. Click the Search button and examine the results. If a search term is used that has multiple hits in the genome (e.g., zyg-*), one will get a schematic view displaying their positions in the genome, with hypertext links.
Study a particular genomic region
4. Where there is more than one search result, click one link to see an individual genomic region and decide which of its features to further examine.
Features of the given region are displayed by selecting specific bands or "tracks" for viewing the sequence from left to right. The view of a given region can be customized by many different "tracks." The default view is to have only a few tracks (e.g., gene models, GenBank entries, and operons), in order to make the initial view clear. However, the view can be expanded by adding more tracks (such as ESTs, restriction sites and alignments to C. briggsae) until it is quite elaborate. From a C. briggsae alignment, one can hop to the relevant segment of the C. briggsae genome. ESTs can be ordered from the research group of Yuji Kohara (email@example.com).
5. Select the most useful tracks, and click on the Update Image link.
The view will then show the added information tracks. One revised view is given in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png). In this case, alignments to C. briggsae, to "best" ESTs, and to "best" mRNAs have been invoked. Most ESTs in the region, and a single mRNA (identified by cDNA sequencing), support the structure of zyg-1. Two ESTs overlap the nested bli-2 gene, and bli-2-like sequences within zyg-1-like sequences are highly conserved in C. briggsae. Moreover, small regions of conserved sequence are visible on the 5′ flanks of the zyg-1 and bli-2 transcription units; these regions are reasonable candidates for transcriptional regulatory sequences (Nardone et al., 2004). Well-chosen tracks can thus show complex data in a compact, vivid way.
6. Optionally, adjust the scale of the genome view by clicking on the appropriate links in the Scroll/Zoom menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png).
This is a small but powerful menu; it can expand the view to subchromosomal ranges of 1 Mb or focus the view in to single-nucleotide resolution.
To get views of individual nucleotides, set the zoom for "100 bp" and switch on the track "DNA/GC Content." (At scales greater than 100 bp, this gives a graphical view of the percentage of GC nucleotides.) In either case, the location of the sequence shown within its entire chromosome is also diagrammed.
7. Another way to examine desired slices of the genome is to type in their exact coordinates into the "Landmark or Region" text field (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png) and then click the Search button. Genomic coordinates are given in the format [chromosome]:[start nucleotide]..[final nucleotide]; for instance, the coordinates of zyg-1 are II:5649706..5654391 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). By choosing coordinates and then adjusting them to improve the graphical view, it is possible to incrementally refine the region viewed to very high precision, starting with large changes and ending with small ones.
8. Given a particular region, one may want to extract its nucleotides for analysis. The Genome Browser enables this with the "Display Sequence File" option in the "Reports & Analysis" menu. When the viewer is set to the exact nucleotides wanted, select this option and then click the Go button to the right. This will produce a text file in FASTA format, which can then be saved to disk and manipulated with other programs such as EMBOSS (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/).
9. If a particular object in the sequence view is itself worth examining, click on it to go to its record in the database. For instance, in the original view of zyg-1, one can click on the bli-2 gene and go to the Sequence page for bli-2/F59E12.12.
Normally, WormBase gives its views of the genome as ordinary bit-by-bit (raster-based) PNG diagrams in a Web browser. However, to publish a figure in a professional research journal, it is often preferable to have figures that are line drawings (scalable vector graphics, or SVG) instead of PNG diagrams. It is possible to generate publication-quality line drawings (SVG figures) from WormBase. Viewing these figures requires the SVG plugin for one's Web browser. See http://plugindoc.mozdev.org/linux.html for Mozilla/Linux plugins. Adobe SVG plugins for Macintosh and Windows systems are available at http://www.adobe.com/support/downloads/. SVG figures can then be further worked up for publication with a commercial program such as Adobe Illustrator, or open-source programs such as Inkscape (http://www.inkscape.org) or Batik (http://xml.apache.org/batik).
10. To get line drawings of a genomic region, click the "High-res Image" link in the Genome Browser menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png). Follow the resulting instructions to save the newly generated SVG file to disk. To view the diagram immediately, one will need to have an appropriate SVG plugin for one's Web browser (see http://plugindoc.mozdev.org/linux.html for Mozilla/Linux plugins).
11. WormBase provides advanced functions to allow collaborative annotation of the genome by outside researchers. If it is desirable to provide annotations for a given region, open the "Add your own tracks" menu at the bottom of the Genome Viewer page, and then use the "Upload your own annotations" or "Add remote annotations" options. Help documents for these are available on-line.
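The coordinate format used in step 7, [chromosome]:[start nucleotide]..[final nucleotide], is easy to parse and adjust programmatically, which can help when refining a region in small increments. The sketch below is illustrative only. It assumes that autosomes are written as Roman numerals (as in the zyg-1 example, II:5649706..5654391), that X and MtDNA are the other accepted names, and that both endpoints are included in the region. The `Region`, `parse_region`, and `widen` names are invented here.

```python
import re
from typing import NamedTuple

# Assumed chromosome names: I-V, X, and MtDNA (inferred from the text).
REGION_PATTERN = re.compile(
    r"(?P<chrom>I|II|III|IV|V|X|MtDNA):(?P<start>\d+)\.\.(?P<end>\d+)"
)


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        """Number of nucleotides, assuming both endpoints are included."""
        return self.end - self.start + 1

    def __str__(self) -> str:
        return f"{self.chrom}:{self.start}..{self.end}"


def parse_region(text: str) -> Region:
    """Parse a string such as 'II:5649706..5654391' into a Region."""
    match = REGION_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError("start coordinate exceeds end coordinate")
    return Region(match["chrom"], start, end)


def widen(region: Region, flank: int) -> Region:
    """Extend a region by `flank` nucleotides on each side (never below 1)."""
    return Region(region.chrom, max(1, region.start - flank), region.end + flank)


zyg1 = parse_region("II:5649706..5654391")
print(zyg1, zyg1.length)
print(widen(zyg1, 1000))
```

The widened string can be pasted back into the "Landmark or Region" field to view the gene with 1 kb of flanking sequence on each side.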
At least six species in the Caenorhabditis genus have morphology and genetics close enough to C. elegans that they can be classed as its siblings (Cho et al., 2004; Kiontke et al., 2004). Of these sibling species, three (C. briggsae, C. remanei, and C. sp. CB5161) have been isolated as laboratory strains. C. briggsae and C. remanei have undergone whole-genome shotgun sequencing so that they can be compared to C. elegans (Stein et al., 2003; J. Spieth, pers. comm.). As of mid-September 2005, WormBase includes both the C. briggsae genome and a preliminary assembly of the C. remanei genome. C. briggsae's genome can be viewed either by itself or in syntenic linkages to C. elegans.
Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)
1. Click on the "C. briggsae Genome" link in the Sequences section of the Web site directory on the main WormBase page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This will give a Genome Browser window, from which one can then pick C. briggsae sequences for viewing in the same way as with C. elegans (http://wormbase.org/db/seq/gbrowse/briggsae).
At this writing, the whole-genome shotgun sequence exists as 5,341 contigs, organized into 142 supercontigs with names like "cb25.fpc0058"; the predicted genes have names like "CBG02462". Such a view is useful for extracting nucleotide sequences, but is rather nondescript.
Usually the C. briggsae sequences are interesting because of their relationship to C. elegans orthologs. Such orthologies can be viewed directly via the Synteny Viewer.
2. Click on the Synteny Viewer link in the Sequences section of the Web site directory on the main WormBase page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This gives a search window in which one can ask for a given C. elegans or C. briggsae sequence by its sequence name. A search for either type of sequence gives a syntenic alignment of it with its predicted ortholog (if any) in the opposite species. For instance, searching for zyg-1 by its sequence name F59E12.2 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_05.png) in the Synteny Viewer gives an alignment of the zyg-1/bli-2 cluster in both species.
Click on the Shown Alignment link to get the two-species sequence alignment in a new browser window.
3. Alternatively, start with a Genome Browser view of some interesting C. elegans gene, and invoke the "Briggsae alignments" track in the viewer (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png). Then, click directly on the blue "Briggsae alignments" line in the diagram. This will move the viewer from a single-genome view to a syntenic view of the region that was clicked on (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_14.png). This approach is particularly useful if the C. elegans genomic region one wants to study is complicated, and getting the correct homology for a specific site is crucial.
4. The Synteny Viewer itself can be zoomed outward or inward to view large or small regions, just as the Genome Browser can. Moreover, one genome can be visually flipped with respect to another so that its homology can be disentangled and made easy to see. With some care, it is possible to get pleasingly clear diagrams of what would otherwise be murky homologies.
5. Analysis of the C. briggsae genome identified ~12,000 predicted genes as putative orthologs of C. elegans genes (Stein et al., 2003). These orthologs, where they exist, are listed as hypertext links on each C. elegans Gene page, with the choice of going directly to the C. briggsae gene or to its syntenic alignment (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png). On some C. elegans Gene pages, there will also be BLASTP hits given against the C. briggsae genome; clicking on these hits takes one to a C. briggsae protein page. In general, most WormBase functions work for C. briggsae in the same way as for C. elegans.
This protocol describes how to BLAST a sequence against the sequences in WormBase. Further discussion of the BLAST algorithm can be found in UNITS 3.3, 3.4, & 3.11.
Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org)
1. Click the Blast/Blat link at the top middle left of the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This will lead to http://blast.wormbase.org/db/searches/blat. This is the method of choice for searching the C. elegans genome with a non-worm protein or nucleic acid sequence.
For instance, consider the human gene DYMECLIN (DYM), which when mutated leads to Dyggve-Melchior-Clausen or Smith-McCort dysplasia (Cohn et al., 2003; El Ghouzzi et al., 2003). Suppose one would like to use C. elegans as a model system for dissecting DYM's function. Which, if any, C. elegans genes are significantly similar to DYM? Performing BLASTP on WormPep with the DYM protein sequence yields one strong hit, the gene C47D12.2, so far uncharacterized.
In addition to the plain orthology between DYM and C47D12.2, a weaker but also significant similarity is visible to two protein products of the hid-1 gene, required for normal muscular activity and insulin signaling (Ailion and Thomas, 2003). BLAST searches can thus reveal not only orthologies but also paralogies that may suggest further clues to gene function.
While BLAST is designed to efficiently search genomes for subtle matches to protein-coding sequences, BLAT is aimed at quickly finding strong (95% to 100%) identities of genomic DNA to (≥40-residue) nucleotide sequences (Kent, 2002).
2. To examine the exact match between a query sequence and a given BLAST hit, click on its Alignment link (in the Details section of the tabulated BLAST outputs; http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_15.png). This will yield BLAST results in the standard high-scoring segment pair format (Korf et al., 2003).
3. To examine the location of a BLAST hit in the context of the C. elegans genome, go to its Genome View link instead of its Alignment link (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_15.png). This will give a Genome Browser view with the BLAST hit mapped onto the genome.
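The BLAT criterion mentioned in step 1 (strong identity over nucleotide queries of at least 40 residues) can be made concrete with a small calculation. The following is only a sketch and is not part of WormBase, BLAST, or BLAT: the function names `percent_identity` and `suits_blat` are hypothetical, and the function expects two already-aligned strings of equal length in which `-` marks a gap.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Both strings must be the same length; '-' marks a gap and never counts
    as a match. Comparison is case-insensitive.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def suits_blat(query_length: int, identity: float) -> bool:
    """Rule of thumb taken from the text: >=40 residues and 95-100% identity."""
    return query_length >= 40 and 95.0 <= identity <= 100.0
```

A query that falls outside these bounds (for example, a short or divergent sequence) is a better candidate for BLAST than for BLAT.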
Web pages for a single Gene, Sequence, or Protein only allow relatively small amounts of data to be analyzed at a time; yet the rise of functional genomics has made it useful and necessary for biologists to handle large, complex information about tens to thousands of genes in their everyday work. To support this, WormBase provides tools for wrangling large data sets. The most recently designed and generally usable of these is WormMart, based on the BioMart search engine used by Ensembl (Kasprzyk et al., 2004). WormMart is a Web interface that allows users to design and run complex database searches without having to use (or even know) complex database query languages.
1. Go to the WormMart page by clicking its link on the top left center of the front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Once on the first WormMart page, choose a version of WormBase (either the most recent release, or a permanently archived release such as WS140). Then choose a Dataset to search, such as genes, expression patterns, research papers, phenotypes, or "variations" (i.e., mutations or RNAi). Click the "next" button, and go to the Filter page.
2. Decide on how to filter the data set chosen.
For instance, the Genes data set can be filtered by species, gene names or classes, the reliability of the gene's structural prediction, whether the gene is protein-coding or not, genomic location and DNA strand orientation, inclusion or exclusion of 5' and 3' flanking sequences, and RNAi phenotype. Other data sets can be filtered in analogous ways. Moreover, WormMart allows more than one data set to be filtered in a single search. Although the Filter page gives many choices, it is easy to undo one's choices and explore a different query, by clicking the "back" button. After the data filters are satisfactory, click the "next" button and go to the Output page.
3. On the Output page, decide how to best select and present the filtered data. First, choose an "Attribute". For genes, these attributes can be "Features", "Structures", or "Sequences". Features include short identifiers, genomic locations, functional annotations, available reagents, mutant alleles, or references; Structures include exons, introns, and 5' or 3' untranslated regions; and Sequences are pure nucleotide sequences selected from structural elements or from DNA-based reagents such as cDNAs.
After selecting the details for such an output, choose an output format such as HTML, plain text, Excel table, or compressed file, and click "export".
4. Examine the results.
If they seem promising but not as well-chosen as necessary, go back one or more steps and try different options, and see how the results come out with somewhat different filter and output choices. The best way to get an intuitive sense for this interface is probably to tinker with it at first. With practice, however, it can be a very powerful and flexible tool.
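Once results have been exported as plain text, the same filter-then-select logic can be repeated locally on the saved file. The sketch below assumes a tab-delimited export with a header row; the column names (`gene`, `chromosome`, `strand`) and the three example rows are invented for illustration and do not reproduce actual WormMart output.

```python
import csv
import io


def filter_rows(tsv_text, **criteria):
    """Return rows (as dicts) whose columns equal every given criterion."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row
        for row in reader
        if all(row.get(col) == val for col, val in criteria.items())
    ]


def select_columns(rows, columns):
    """Keep only the named columns, mirroring the choice of output attributes."""
    return [{col: row[col] for col in columns} for row in rows]


# Invented example export (hypothetical column names and gene names).
EXAMPLE = (
    "gene\tchromosome\tstrand\n"
    "geneA\tI\t+\n"
    "geneB\tII\t-\n"
    "geneC\tI\t-\n"
)

# Filter first (chromosome I only), then choose which attributes to keep.
hits = select_columns(filter_rows(EXAMPLE, chromosome="I"), ["gene", "strand"])
```

Separating the filter from the column selection mirrors the two WormMart pages (Filter, then Output), so either stage can be changed without redoing the other.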
In analyses of the C. elegans genome, it can be useful to have such things as the complete set of 5' flanking regions for all known or predicted genes in C. elegans. One way WormBase provides these data is through its Genome Dumper. To download a single segment of the genome, use the Genome Browser (see Basic Protocol 7, steps 7-8).
Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla
1. From the middle top of the front page, click on the Batch Sequences link; this leads to http://wormbase.org/db/searches/advanced/dumper.
2. The Genome Dumper can extract data from either the C. elegans or the C. briggsae genome. Pick the species to be scanned from the Species menu.
3. In the Input Options box, enter a list of gene, sequence or chromosome names, either by cut-and-paste into the large box, or by uploading a text file with space-separated names. Note that this can automatically translate classical gene names (such as zyg-1) into their genomic sequence equivalents (such as F59E12.2), so translating classical gene lists into molecular outputs is easy. This is of particular use if WormMart or the Batch Genes server has first been used to produce the full list of classical gene names sharing some characteristic of interest; this list can then be fed directly to the Batch Sequences service.
Also note that, because a single gene may have two or more alternatively spliced transcripts, entering one classical gene name may produce two or more sequence outputs.
4. Pick a choice from the "Select one feature to retrieve" pull-down menu. The default, Gene Models, will base a search on the structures of known or predicted worm genes.
5. In "Output options," choose whichever alternatives make sense. The Dumper will provide sequences in FASTA format. Unless it is desirable to have both a flanking sequence and the gene sequence linked to it, select the radio button for "flanking sequences only." For many purposes, the default of plain text is preferable to HTML. Saving to disk is an option, and is probably preferable for large outputs (such as chromosome- or genome-wide ones).
6. Click the DUMP button. The resulting download will be into a window of the Web browser being used. If the download is large (e.g., flanking sequences for all C. elegans genes), it will take some time (up to 1 hr). It is important not to interrupt the download by prematurely quitting the browser or using the Stop button of the browser; doing this will give a partial download, useless except for giving an example of the correct output. Again, for large downloads, consider saving the output to disk as a file to avoid this problem.
7. Check the format of the resulting text. It may be necessary to reformat the headers of the sequences; this can be done with Perl, or with the text editor or word processor of one's choice.
One example of a Genome Dumper search would be to run it with the following settings:
Species: "C. elegans"
1. Input Options: "I II III IV V X" [i.e., all nuclear chromosomes]
2. Select one feature to retrieve: "Gene models"
3. Output options: "flanking sequences only"; show 0 bp, 5' flank; show 1000 bp, 3' flank; coordinates relative to chromosome; sequence orientation relative to feature; Save to disk (plain TEXT).
A complete download, from the WS147 release of WormBase, gives 28,145 sequences from the 3' flanks of transcription units. A general use for such data sets is to acquire large sets of possible regulatory sequences (promoters or 3' UTRs) for computational analysis of cis-regulatory motifs through Gibbs sampling or hidden Markov models (e.g., GuhaThakurta et al., 2004). Prediction of such motifs can, in turn, be used to formulate experimentally testable hypotheses for transcriptional or translational gene regulation.
8. The Genome Dumper can do recursive searches. To use this, check the "paste features back for further searches" radio button in the "Output options" menu before carrying out a genome dump.
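Step 7 notes that the FASTA headers of a download may need reformatting, and suggests Perl or a text editor for the job. The same kind of cleanup can be sketched in Python. The function below keeps only the first whitespace-delimited token of each header line; that rule is an assumption chosen for illustration, since the actual header layout produced by the Genome Dumper is not described here, and `simplify_fasta_headers` is a hypothetical name.

```python
def simplify_fasta_headers(fasta_text: str) -> str:
    """Rewrite each '>' header to keep only its first whitespace-delimited token.

    Sequence lines are passed through unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            out.append(">" + (fields[0] if fields else ""))
        else:
            out.append(line)
    return "".join(item + "\n" for item in out)
```

For a large download saved to disk, the same function can be applied to the file's contents and the result written to a new file, leaving the original download untouched.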
Within a defined physical or genetic interval, what gene markers and SNPs are available for mapping a newly identified mutation? WormBase provides a page specifically for this problem.
1. Click the Markers link at the right center top of the WormBase front page (Fig. 1.8.1) to begin a search for genetic markers and their strains (http://wormbase.org/db/searches/strains).
2. Choose the appropriate markers within an interval of 2 cM or less of a chosen target. Full instructions are given on the page; there are options to exclude subtle, male-sterile, or lethal phenotypes, to select SNPs (Swan et al., 2002) or snip-SNPs (Wicks et al., 2001), or to only show publicly available strains as sources of markers.
3. Click the Search link and examine the results. In Figure 1.8.19, an example is shown of gene markers in the vicinity of hid-3, an uncloned gene identified by screening for mutants defective in the control of dauer larval development during severe heat (Ailion and Thomas, 2003). A table (Fig. 1.8.20) as well as a graphical map is given.
Note that instructions for ordering strains are available from the CGC Strains link in the Worm Reagent section of the front page's Links directory (Fig. 1.8.1); presently, this link leads to the e-mail address of Theresa Stiernagle (firstname.lastname@example.org) at the University of Minnesota.
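The interval rule in step 2 (markers within 2 cM or less of a chosen target) amounts to a simple distance filter on genetic map positions. The sketch below illustrates that rule only; it does not query WormBase. The function name `markers_within` is hypothetical, the marker names and positions in the usage note are invented, and all markers are assumed to lie on the same chromosome as the target.

```python
def markers_within(markers, target_cm, window_cm=2.0):
    """Return (name, position) pairs no farther than window_cm from target_cm.

    `markers` is an iterable of (name, genetic_position_in_cM) pairs, all
    assumed to be on the same chromosome. Results are sorted nearest first.
    """
    nearby = [
        (name, pos) for name, pos in markers if abs(pos - target_cm) <= window_cm
    ]
    return sorted(nearby, key=lambda item: abs(item[1] - target_cm))
```

For example, with invented markers `[("m1", -1.5), ("m2", 0.4), ("m3", 3.1)]` and a target at 0.0 cM, only `m2` and `m1` fall inside the default 2 cM window.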
All of the protocols above assume that a user will work with one of the main WormBase servers, with his or her own computer operating merely as a client. However, for people who want to use WormBase intensively, it can be highly useful to set up a local installation of WormBase on one's own desktop or laptop computer. Doing this frees one from needing a network connection and allows a single machine to be dedicated to one's own research needs. Local installations of WormBase are not for the faint of heart, because WormBase is UNIX-based and requires installing several different, complex software packages for it to run. However, work is underway to make installation more reliable.
WormBase software should be portable to any standard Unix-like operating system. At present it has been ported to Linux (Welsh et al., 2002) and Mac OS X (Pogue, 2005). The minimum suggested specifications for the computer's hardware are: a CPU running at 1 GHz; 1 Gb RAM; 1 Gb of swap space; 25 Gb of free disk space; and a high-speed connection with the Internet. In practice this minimum of CPU speed and RAM will yield a usable but annoyingly sluggish database. Better specifications are one 2 GHz CPU, 2 Gb of RAM. Better still is to have two processors running at 1 GHz or more, 2+ Gb of RAM, and two hard drives, preferably SCSI rather than IDE. The advantage of two hard drives is that WormBase uses two separate databases (ACeDB and MySQL) which work better if put on separate disks.
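Before attempting an installation, it can help to compare a machine against the minimum figures quoted above. The following is a minimal sketch under stated assumptions: "Gb" in the text is read as gigabytes, the CPU, RAM, and swap figures are supplied by hand (the Python standard library has no portable way to read them), and the names `MINIMUM`, `shortfalls`, and `free_disk_gb` are hypothetical.

```python
import shutil

# Minimum suggested specifications quoted in the text ("Gb" read as gigabytes).
MINIMUM = {"cpu_ghz": 1.0, "ram_gb": 1.0, "swap_gb": 1.0, "free_disk_gb": 25.0}


def shortfalls(spec, required=MINIMUM):
    """Return the sorted names of requirements that `spec` fails to meet."""
    return sorted(
        name for name, need in required.items() if spec.get(name, 0.0) < need
    )


def free_disk_gb(path="."):
    """Free space, in units of 10^9 bytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9
```

An empty list from `shortfalls` means only that the stated minimum is met; as noted above, a machine at the minimum will still run sluggishly.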
The computer's operating system must have a usable version of the GNU Compiler Collection (gcc: http://www.gnu.org/software/gcc/gcc.html). It must also have basic utilities for compiling and running open-source software. A standard distribution of Linux will satisfy these criteria, as will Mac OS X if auxiliary XCode (http://www.apple.com/developer) and Fink (http://fink.sourceforge.net) utilities are installed. Several other software packages are also required, foremost among them MySQL (http://www.mysql.com; Reese et al., 2002), GMOD (http://www.gmod.org) and BioPerl (http://bioperl.org; Stajich et al., 2002; Tisdall, 2003). Full installation instructions, with a complete list of required software packages, are available at http://wormbase.org/docs. Almost all of the software required for running WormBase is freely available in open-source form (Stone et al., 1999). One exception is the Washington University distribution of BLAST maintained by the laboratory of Warren R. Gish. While this program is available free of charge to academic users in precompiled (executable binary) form, a license from Washington University is required before it can be downloaded and used. See http://blast.wustl.edu for details.
The data files for each release of WormBase are assembled and archived by curators at the Wellcome Trust Sanger Institute, at ftp://ftp.sanger.ac.uk/pub2/wormbase/live_release.
Installing WormBase on a local computer
Installing WormBase locally requires skill with Unix system administration, Perl, Apache, and MySQL. Even if one has these skills, the installation is fairly complicated. Painstakingly follow the instructions at http://wormbase.org/docs. For a standard, prepackaged Linux distribution, one will need to compile and reinstall several basic packages from source code, such as Apache (http://httpd.apache.org) and mod_perl (http://perl.apache.org). If defects in the installation instructions or incompatibilities in the required software are found, please describe the bugs to WormBase via the Feedback page (http://wormbase.org/db/misc/feedback), or send E-mail to email@example.com.
If installing WormBase on a computer with two hard drives, place the ACeDB component of WormBase on one drive and the MySQL component on the other; this should significantly improve performance.
WormBase contains information about the genomic sequence of C. elegans, its genes and their products, and its higher-level traits such as gene expression patterns and neuronal connectivity. These data are interconnected, so that a search beginning with one object (such as a gene) can be directed to related objects of a different type (such as the DNA sequence of the gene, or the cells in which the gene is active). WormBase contains a list of literature pertinent to C. elegans, comprising over 11,000 published articles, along with abstracts containing prepublication data from annual research meetings. This literature can be searched not only with simple word tags but with the Textpresso software package, which extracts complex subsets of information from over 6,400 research papers (http://www.textpresso.org; Müller et al., 2004). Expression patterns reported for any C. elegans gene in the published literature can be found in WormBase, by reference either to genes or to cell or tissue types. There are results from RNAi screens with over 100 discrete phenotypes, inactivating over 3100 distinct genes. The genomic sequence has been aligned to orthologous sites in the genome of C. briggsae, a sibling species of C. elegans (Stein et al., 2003). Finally, there are an increasing number of functional annotations for genes, provided both in free text and as Gene Ontology terms (Harris et al., 2004; UNIT 7.2).
WormBase is aimed at Web use, which favors individual over batch queries. However, with some effort, one can also do searches for complex data sets. ACeDB, the engine of WormBase, can be used for large-scale bioinformatics; some searches allow uploading large files for queries. Searches using the ACeDB Query Language (AQL) are also feasible. Information about how to use WormBase is available through e-mail at firstname.lastname@example.org and through the on-line WormBase Users' Guide. The WormBase developers group actively invites suggestions for improvement from users, and WormBase's source code is freely available for local installation and improvement.
There is an on-line User's Guide maintained by WormBase curators, available at http://www.wormbase.org/docs/index.html. It contains a full index, a set of answers to frequently asked questions, a downloadable PDF version of the Guide (for easy reading in printed form), and explanations of many individual parts of WormBase. It was written independently of this unit, and is recommended as a complementary source of information about WormBase.
At this writing, sequencing and analysis of the C. remanei genome is still underway. However, a preliminary C. remanei genome sequence assembly, with protein-coding gene predictions, is available at http://dev.wormbase.org/db/seq/gbrowse/remanei. From that site, carry out genome sequence searches in the same way as for C. elegans or C. briggsae Genome Browser pages (Basic Protocols 7-8). In the near future, this should move from the development site to the main site, and be syntenically aligned to the C. elegans and C. briggsae genomes.
WormBase data sources
WormBase is derived from the previous C. elegans database ACeDB (Eeckman and Durbin, 1995). ACeDB was designed to archive and correlate classical genetic maps of chromosomes, clone maps (primarily of cosmids, but also of YACs, phage clones, and cDNAs), and sequences of genomic DNA; it also had telegraphic descriptions of some genes, with references and abstracts from C. elegans literature and meetings. ACeDB was crucial in aligning the genetic and physical maps and in allowing collaboration between the C. elegans genome project and its research community. Designed in the early 1990s, ACeDB required local client-server interactions between several personal computers and one server (typically Unix). The rise of the World Wide Web made this unnecessary, while making it desirable to have a central server present the most up-to-date version of ACeDB through a Web interface. This was accomplished in 1998 with AcePerl (Stein and Thierry-Mieg, 1998). AcePerl and ACeDB became the core of WormBase at its founding in 2000 (Chen et al., 2005), although part of WormBase was later moved to MySQL (http://www.mysql.com; UNIT 9.2). More details of WormBase's design are discussed elsewhere (Schwarz et al., 2002).
The main reason for using WormBase is that it represents the most extensive C. elegans database publicly available. It is currently designed to make several different kinds of information available. There are deficiencies in WormBase, however, that must be pointed out. Several data sets still remain to be curated by the WormBase staff and included in WormBase. Functional annotations, while currently covering most named genes, need to be continually corrected, revised, and expanded, both in text and in Gene Ontology form. While results for large-scale RNAi (UNIT 12.3) experiments are generally up to date, RNAi results from individual researchers are still being curated. New data sets (e.g., SAGE and whole-proteome interaction maps; Jones et al., 2001; Li et al., 2004) continue to be generated and to require annotation. While it is possible to do a local installation of WormBase, it is not easy; local installations are complex and error-prone. Finally, while it is possible to extract complex patterns from the data in WormBase, and increasingly straightforward tools such as WormMart exist to support this, such extraction is still sometimes clumsy and ad hoc. The curators of WormBase continually strive to diminish such defects, but users should be on guard for them.
Local versus remote access
One issue to be considered in using WormBase is whether to access it remotely or whether to set up a local installation. The authors' experience is that local installations are preferable if they are on reasonably fast hardware (the "optimal" specifications described in Necessary Resources for Alternate Protocol 1), simply because running the program locally reduces lags due to network overload, sharing of a single server by many users, or temporary bugs in WormBase itself. WormBase currently has two official mirror sites, and welcomes any investigators who would like to run public mirrors of their own, since increasing the number of WormBase sites increases reliability and helps debug flaws in the site software.
Alternatives to WormBase
While there is no exact substitute for WormBase, there are alternatives to it. Users who want a C. elegans database that is easy to install locally and that has some of WormBase's general features can try ACeDB, which can be compiled and run on any Linux/Unix operating system. The source code for ACeDB is available, with documentation, at http://www.acedb.org. Another option is to use the C. elegans (or C. briggsae) track of the UCSC Genome Browser (Kent et al., 2002; http://genome.ucsc.edu/cgi-bin/hgGateway?org=C.+elegans; UNIT 1.4). The WormGenes database (http://www.wormgenes.org) has convenient links to the gene expression database of Kohara and coworkers (Tabara et al., 1996). Finally, the fee-only database Proteome (Costanzo et al., 2001) is maintained by Incyte (http://www.incyte.com); this database includes C. elegans genes.
Using mirror sites
The two main Web sites for WormBase, http://wormbase.org and http://dev.wormbase.org, are maintained by Lincoln Stein at Cold Spring Harbor Laboratory. While generally very stable, they are not always accessible (because of heavy demand or blockage of the network). If the main sites are unavailable, try the mirrors at California Institute of Technology (http://caltech.wormbase.org) or at the Institute of Molecular Biology and Biotechnology (IMBB) in Crete (http://imbb.wormbase.org). The development site has newer data releases and software versions; these are put up on the development site for three weeks, to check for defects, before being transferred to the main site.
To find what mirrors are available at any given time, check the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Hypertext links to any official mirrors are given near the bottom of the front page.
Subscribing to a mailing list
WormBase has E-mail list servers, allowing worldwide participation in the development of WormBase. One of these listservs (wormbase-dev) is a technical list used by official WormBase developers. The others are open to any interested member of the research community. The two listservs most likely to be useful are wormbase-announce and wormbase-help (alias email@example.com). The former is used for general news about WormBase; the latter is an open forum where anyone can ask questions (or make complaints) about WormBase. Any E-mail to wormbase-help is likely to be read by a WormBase curator within a few hours, and may well be read and replied to within minutes.
To subscribe, simply click on the Mailing Lists section of the WormBase front page's site directory. This will lead to the WormBase Mailing Lists page (http://www.wormbase.org/mailarch). One can then subscribe to any E-mail listservs that are useful by sending an E-mail to firstname.lastname@example.org. In the body of the E-mail (not the subject line) write subscribe [listserv name]. Some of these listservs are restricted to internal WormBase use.
Asking for help, information, or new features
WormBase is not a static resource: it is maintained by a team of bioinformaticians, programmers, and curators who collectively work to explain its features, help make sense of its search results, correct errors, fix defects, add new features on request, and generally try to keep WormBase improving. Investigators can submit feedback from any Web page in WormBase simply by clicking the "Send comments or questions to WormBase" link in the lower left-hand corner (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This leads to the Feedback page (http://wormbase.org/db/misc/feedback). Type in comments, questions, or complaints, and send them off by clicking the Submit Comments button. Alternatively, one can send e-mail to email@example.com.
Submitting data to WormBase
Click on the Submit link at the top of the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This will lead to the on-line data submission form at http://wormbase.org/db/curate/base. If in doubt, use the basic feedback link, which is given at the top of the data submission page; it leads to http://wormbase.org/db/misc/feedback. The basic feedback page is always a good choice. Specific areas of interest are given on the Submit Data page to let individual researchers steer their queries in a specific direction if they want to, but doing so is purely optional.
If there is a specific topic that one is particularly interested in, choose an area of interest on the data submission page. There are several areas that one can choose: sequence and gene structure; the worm proteome; cells and anatomy; gene mutations, map locations, and functions; or nomenclature. To direct information to any one area, click on its link (which will read either "Fill out online form" or "Email Individual Person"). E-mails to individuals go directly to a WormBase curator officially responsible for a given topic.
Suggestions for Further Analysis
Using other web sites relevant to C. elegans
The Resources link in the Links section of the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png) leads to http://wormbase.org/about/about_Celegans.html. From there, one can choose one or more links to different C. elegans projects.
These include three on-line reference works on C. elegans biology, and links to several other databases concerned with C. elegans genetics, genomics, or anatomy. Particularly noteworthy is Leon Avery's Web site at the University of Texas (http://elegans.swmed.edu). This site gives convenient, up-to-date access to the abstracts of C. elegans meetings, the Worm Breeder's Gazette, and news announcements by researchers in academia, government, and industry.
Another important resource is WormAtlas (http://www.wormatlas.org). The WormAtlas database is intended to provide a comprehensive atlas of C. elegans anatomy, like the atlases that have long been available for human anatomy in medicine. It includes a collection of serial electron microscope sections, and several guides to cellular anatomy, including beautiful schematics of neuronal morphology.
Finally, WormBook (http://www.wormbook.org) provides an on-line compendium of over 70 articles about C. elegans biology, covering genetics, genomics, molecular biology, cell biology, sex determination, developmental biology, neurobiology, and evolution.
Other model organism databases
Go to http://wormbase.org/about/mods.html. From here, one can reach public databases for several eukaryotic species (such as Saccharomyces cerevisiae, Drosophila melanogaster, Arabidopsis thaliana, and Mus musculus). There are also links to the Gene Ontology (UNIT 7.2), Genome KnowledgeBase, and Generic Model Organism Database projects.
Downloading bulk data
The WormBase front page not only allows a basic search, but also has links to several other search pages (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). One of these provides a general collection of data for intensive analysis on one's own computer.
Go to the WormBase Downloads page http://wormbase.org/downloads.html, which in turn gives access to several large data sets. These sets include genomic features in the General Feature Format (GFF; http://www.sanger.ac.uk/Software/formats/GFF/index.shtml); full genomic and EST sequences in FASTA format; the set of all protein sequences known or predicted to exist in C. elegans (wormpep), and the set of all known non-protein-coding RNA transcripts (wormrna). There are also sequence and gene prediction data for C. briggsae, allowing comparative studies such as phylogenetic footprinting (Nardone et al., 2004). Extensive documentation for how GFF is used in WormBase is available as part of the documentation for the Generic Genome Browser (Stein et al., 2002; http://www.gmod.org, GBrowse, CONFIGURE_HOWTO.pod). Other data sets include: classical genetic maps; movies of mutant embryos in mass RNAi assays; technical details of standard transgenic vectors; translation tables linking classical gene to genomic sequence names; and EST sequences from other nematode species (such as animal or plant parasites) which may include genes lost from the C. elegans genome (such as the proto-oncogene EMSY; Hughes-Davies et al., 2003).
To obtain a specific data set, click on its hypertext link. Data will be given as a Web page (which can be saved to one's hard drive) or, alternatively, one will be prompted to save a file immediately.
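Downloaded GFF files are tab-delimited text and are straightforward to read with a short script. The sketch below assumes the classic nine-column GFF layout (sequence name, source, feature, start, end, score, strand, frame, and a free-text attributes or group column); the names `GffRecord` and `parse_gff_line` are hypothetical, and the example line in the usage note is invented rather than taken from a WormBase file.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    score: Optional[float]
    strand: str
    frame: str
    attributes: str       # left unparsed; its syntax varies between files


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and '#' comments."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )
```

For an invented line such as `"I\tcurated\texon\t100\t250\t.\t+\t.\tgroup1"`, the parsed record has `start == 100` and `end == 250`, so the feature length is `end - start + 1`, since both coordinates are inclusive.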
Ailion, M. and Thomas, J.H. 2003. Isolation and characterization of high-temperature-induced dauer formation mutants in Caenorhabditis elegans. Genetics 165:127-144.
Apweiler, R. 1995. Sequence databases. In Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd. ed. (A.D. Baxevanis and B.F.F. Ouellette, eds.) pp. 3-24. John Wiley & Sons, Inc., New York.
Ashrafi, K., Chang, F.Y., Watts, J.L., Fraser, A.G., Kamath, R.S., Ahringer, J., and Ruvkun, G. 2003. Genome-wide RNAi analysis of Caenorhabditis elegans fat regulatory genes. Nature 421:268-272.
Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., and Yeh, L.S. 2005. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33:D154-D159.
Balakrishnan, R., Christie, K.R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Fisk, D.G., Hirschman, J.E., Hong, E.L., Nash, R., Oughtred, R., Skrzypek, M., Theesfeld, C.L., Binkley, G., Dong, Q., Lane, C., Sethuraman, A., Weng, S., Botstein, D., and Cherry, J.M. 2005. Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD). Nucleic Acids Res. 33:D374-D377.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., Studholme, D.J., Yeats, C., and Eddy, S.R. 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138-D141.
Brenner, S. 1974. The genetics of Caenorhabditis elegans. Genetics 77:71-94.
C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282:2012-2018.
Chalfie, M. and White, J. 1988. The nervous system. In The Nematode Caenorhabditis elegans (W.B. Wood., ed.) pp. 337-391. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
Chen, N., Harris, T.W., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Bradnam, K., Canaran, P., Chan, J., Chen, C.K., Chen, W.J., Cunningham, F., Davis, P., Kenny, E., Kishore, R., Lawson, D., Lee, R., Müller, H.M., Nakamura, C., Pai, S., Ozersky, P., Petcherski, A., Rogers, A., Sabo, A., Schwarz, E.M., Van Auken, K., Wang, Q., Durbin, R., Spieth, J., Sternberg, P.W., and Stein, L.D. 2005. WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res. 33:D383-D389.
Cho, S., Jin, S.W., Cohen, A., and Ellis, R.E. 2004. A phylogeny of Caenorhabditis reveals frequent loss of introns during nematode evolution. Genome Res. 14:1207-1220.
Cohn, D.H., Ehtesham, N., Krakow, D., Unger, S., Shanske, A., Reinker, K., Powell, B.R., and Rimoin, D.L. 2003. Mental retardation and abnormal skeletal development (Dyggve-Melchior-Clausen dysplasia) due to mutations in a novel, evolutionarily conserved gene. Am. J. Hum. Genet. 72:419-428.
Costanzo, M.C., Crawford, M.E., Hirschman, J.E., Kranz, J.E., Olsen, P., Robertson, L.S., Skrzypek, M.S., Braun, B.R., Hopkins, K.L., Kondu, P., Lengieza, C., Lew-Smith, J.E., Tillberg, M., and Garrels, J.I. 2001. YPD, PombePD and WormPD: Model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res. 29:75-79.
Drysdale, R.A., Crosby, M.A., and FlyBase Consortium. 2005. FlyBase: genes and gene models. Nucleic Acids Res. 33:D390-D395.
Eeckman, F.H. and Durbin, R. 1995. ACeDB and Macace. Methods Cell Biol. 48:583-605.
El Ghouzzi, V., Dagoneau, N., Kinning, E., Thauvin-Robinet, C., Chemaitilly, W., Prost-Squarcioni, C., Al-Gazali, L.I., Verloes, A., Le Merrer, M., Munnich, A., Trembath, R.C., and Cormier-Daire, V. 2003. Mutations in a novel gene Dymeclin (FLJ20071) are responsible for Dyggve-Melchior-Clausen syndrome. Hum. Mol. Genet. 12:357-364.
Fire, A., Xu, S., Montgomery, M.K., Kostas, S.A., Driver, S.E., and Mello, C.C. 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391:806-811.
Ge, H., Walhout, A.J., and Vidal, M. 2003. Integrating 'omic' information: A bridge between genomics and systems biology. Trends Genet. 19:551-560.
GuhaThakurta, D., Schriefer, L.A., Waterston, R.H., and Stormo, G.D. 2004. Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. Genome Res. 14:2457-2468.
Gunsalus, K.C., Ge, H., Schetter, A.J., Goldberg, D.S., Han, J.D., Hao, T., Berriz, G.F., Bertin, N., Huang, J., Chuang, L.S., Li, N., Mani, R., Hyman, A.A., Sonnichsen, B., Echeverri, C.J., Roth, F.P., Vidal, M., and Piano, F. 2005. Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature 436:861-865.
Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., Richter, J., Rubin, G.M., Blake, J.A., Bult, C., Dolan, M., Drabkin, H., Eppig, J.T., Hill, D.P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J.M., Christie, K.R., Costanzo, M.C., Dwight, S.S., Engel, S., Fisk, D.G., Hirschman, J.E., Hong, E.L., Nash, R.S., Sethuraman, A., Theesfeld, C.L., Botstein, D., Dolinski, K., Feierbach, B., Berardini, T., Mundodi, S., Rhee, S.Y., Apweiler, R., Barrell, D., Camon, E., Dimmer, E., Lee, V., Chisholm, R., Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E.M., Sternberg, P., Gwinn, M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de la Cruz, N., Tonellato, P., Jaiswal, P., Seigfried, T., and White, R. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32:D258-D261.
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., Fernandez-Suarez, X.M., Gilbert, J., Hammond, M., Herrero, J., Hotz, H., Howe, K., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Kokocinski, F., London, D., Longden, I., McVicker, G., Melsopp, C., Meidl, P., Potter, S., Proctor, G., Rae, M., Rios, D., Schuster, M., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C., and Birney, E. 2005. Ensembl 2005. Nucleic Acids Res. 33:D447-D453.
Hughes-Davies, L., Huntsman, D., Ruas, M., Fuks, F., Bye, J., Chin, S.F., Milner, J., Brown, L.A., Hsu, F., Gilks, B., Nielsen, T., Schulzer, M., Chia, S., Ragaz, J., Cahn, A., Linger, L., Ozdag, H., Cattaneo, E., Jordanova, E.S., Schuuring, E., Yu, D.S., Venkitaraman, A., Ponder, B., Doherty, A., Aparicio, S., Bentley, D., Theillet, C., Ponting, C.P., Caldas, C., and Kouzarides, T. 2003. EMSY links the BRCA2 pathway to sporadic breast and ovarian cancer. Cell 115:523-535.
Jones, S.J., Riddle, D.L., Pouzyrev, A.T., Velculescu, V.E., Hillier, L., Eddy, S.R., Stricklin, S.L., Baillie, D.L., Waterston, R., and Marra, M.A. 2001. Changes in gene expression associated with developmental arrest and longevity in Caenorhabditis elegans. Genome Res. 11:1346-1352.
Jorgensen, E.M. and Mango, S.E. 2002. The art and design of genetic screens: Caenorhabditis elegans. Nat. Rev. Genet. 3:356-369.
Kamath, R.S., Martinez-Campos, M., Zipperlen, P., Fraser, A.G., and Ahringer, J. 2001. Effectiveness of specific RNA-mediated interference through ingested double-stranded RNA in Caenorhabditis elegans. Genome Biol. 2:RESEARCH0002.
Kamath, R.S., Fraser, A.G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., Kanapin, A., Le Bot, N., Moreno, S., Sohrmann, M., Welchman, D.P., Zipperlen, P., and Ahringer, J. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231-237.
Kasprzyk, A., Keefe, D., Smedley, D., London, D., Spooner, W., Melsopp, C., Hammond, M., Rocca-Serra, P., Cox, T., and Birney, E. 2004. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 14:160-169.
Kent, W.J. 2002. BLAT: The BLAST-like alignment tool. Genome Res. 12:656-664.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The human genome browser at UCSC. Genome Res. 12:996-1006.
Kiontke, K., Gavin, N.P., Raynes, Y., Roehrig, C., Piano, F., and Fitch, D.H. 2004. Caenorhabditis phylogeny predicts convergence of hermaphroditism and extensive intron loss. Proc. Natl. Acad. Sci. U.S.A. 101:9003-9008.
Korf, I., Yandell, M., and Bedell, J. 2003. BLAST. O'Reilly & Associates, Inc., Sebastopol, Calif.
Krause, M. 1995. Techniques for analyzing transcription and translation. Methods Cell Biol. 48:513-529.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305:567-580.
Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.-O., Han, J.-D.J., Chesneau, A., Hao, T., Goldberg, D.S., Li, N., Martinez, M., Rual, J.-F., Lamesch, P., Xu, L., Tewari, M., Wong, S.L., Zhang, L.V., Berriz, G.F., Jacotot, L., Vaglio, P., Reboul, J., Hirozane-Kishikawa, T., Li, Q., Gabel, H.W., Elewa, A., Baumgartner, B., Rose, D.J., Yu, H., Bosak, S., Sequerra, R., Fraser, A., Mango, S.E., Saxton, W.M., Strome, S., van den Heuvel, S., Piano, F., Vandenhaute, J., Sardet, C., Gerstein, M., Doucette-Stamm, L., Gunsalus, K.C., Harper, J.W., Cusick, M.E., Roth, F.P., Hill, D.E., and Vidal, M. 2004. A map of the interactome network of the metazoan C. elegans. Science 303:540-543.
Lippincott-Schwartz, J. and Patterson, G.H. 2003. Development and use of fluorescent protein markers in living cells. Science 300:87-91.
Lupas, A. 1997. Predicting coiled-coil regions in proteins. Curr. Opin. Struct. Biol. 7:388-393.
Mello, C. and Fire, A. 1995. DNA transformation. Methods Cell Biol. 48:451-482.
Merke, D.P. and Bornstein, S.R. 2005. Congenital adrenal hyperplasia. Lancet 365:2125-2136.
Miller, D.M. and Shakes, D.C. 1995. Immunofluorescence microscopy. Methods Cell Biol. 48:365-394.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Pagni, M., Ponting, C.P., Quevillon, E., Selengut, J., Sigrist, C.J., Silventoinen, V., Studholme, D.J., Vaughan, R., and Wu, C.H. 2005. InterPro, progress and status in 2005. Nucleic Acids Res. 33:D201-D205.
Müller, H.M., Kenny, E.E., and Sternberg, P.W. 2004. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2:e309.
Nardone, J., Lee, D.U., Ansel, K.M., and Rao, A. 2004. Bioinformatics for the 'bench biologist': how to find regulatory regions in genomic DNA. Nat. Immunol. 5:768-774.
O'Connell, K.F., Leys, C.M., and White, J.G. 1998. A genetic screen for temperature-sensitive cell-division mutants of Caenorhabditis elegans. Genetics 149:1303-1321.
Pogue, D. 2005. Mac OS X: The Missing Manual, Tiger Edition. O'Reilly & Associates, Inc., Sebastopol, Calif.
Reese, G., Yarger, R.J., and King, T. 2002. Managing and Using MySQL, 2nd ed. O'Reilly & Associates, Inc., Sebastopol, Calif.
Riddle, D.L., Blumenthal, T., Meyer, B.J., and Priess, J.R. (eds.). 1997. C. elegans II. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
Schuler, G.D. 1997. Sequence mapping by electronic PCR. Genome Res. 7:541-550.
Schwartz, R.L., Phoenix, T., and Foy, B.D. 2005. Learning Perl, 4th ed. O'Reilly & Associates, Inc., Sebastopol, Calif.
Schwarz, E.M., Stein, L.D., and Sternberg, P.W. 2002. Caenorhabditis elegans databases. Curr. Genomics 3:111-119.
Sieburth, D., Ch'ng, Q., Dybbs, M., Tavazoie, M., Kennedy, S., Wang, D., Dupuy, D., Rual, J.F., Hill, D.E., Vidal, M., Ruvkun, G., and Kaplan, J.M. 2005. Systematic analysis of genes required for synapse structure and function. Nature 436:510-517.
Simmer, F., Moorman, C., Van Der Linden, A.M., Kuijk, E., Van Den Berghe, P.V., Kamath, R., Fraser, A.G., Ahringer, J., and Plasterk, R.H. 2003. Genome-wide RNAi of C. elegans using the hypersensitive rrf-3 strain reveals novel gene functions. PLoS Biol. 1:E12.
Simpson, P.T., Reis-Filho, J.S., Gale, T., and Lakhani, S.R. 2005. Molecular evolution of breast cancer. J. Pathol. 205:248-254.
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C.J., Osborne, B.I., Pocock, M.R., Schattner, P., Senger, M., Stein, L.D., Stupka, E., Wilkinson, M.D., and Birney, E. 2002. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12:1611-1618.
Stein, L.D. and Thierry-Mieg, J. 1998. Scriptable access to the Caenorhabditis elegans genome sequence and other ACEDB databases. Genome Res. 8:1308-1315.
Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., and Lewis, S. 2002. The Generic Genome Browser: A building block for a model organism system database. Genome Res. 12:1599-1610.
Stein, L.D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M.R., Chen, N., Chinwalla, A., Clarke, L., Clee, C., Coghlan, A., Coulson, A., D'Eustachio, P., Fitch, D.H., Fulton, L.A., Fulton, R.E., Griffiths-Jones, S., Harris, T.W., Hillier, L.W., Kamath, R., Kuwabara, P.E., Mardis, E.R., Marra, M.A., Miner, T.L., Minx, P., Mullikin, J.C., Plumb, R.W., Rogers, J., Schein, J.E., Sohrmann, M., Spieth, J., Stajich, J.E., Wei, C., Willey, D., Wilson, R.K., Durbin, R., and Waterston, R.H. 2003. The genome sequence of Caenorhabditis briggsae: A platform for comparative genomics. PLoS Biol. 1:E45.
Stone, M., Ockman, S., and DiBona, C. (eds.). 1999. Open Sources: Voices From the Open Source Revolution. O'Reilly & Associates, Inc., Sebastopol, Calif.
Sulston, J.E., Schierenberg, E., White, J.G., and Thomson, J.N. 1983. The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev. Biol. 100:64-119.
Swan, K.A., Curtis, D.E., McKusick, K.B., Voinov, A.V., Mapa, F.A., and Cancilla, M.R. 2002. High-throughput gene mapping in Caenorhabditis elegans. Genome Res. 12:1100-1105.
Tabara, H., Motohashi, T., and Kohara, Y. 1996. A multi-well version of in situ hybridization on whole mount embryos of Caenorhabditis elegans. Nucleic Acids Res. 24:2119-2124.
Tisdall, J.D. 2003. Mastering Perl for Bioinformatics. O'Reilly & Associates, Inc., Sebastopol, Calif.
Welsh, M., Dalheimer, M.K., Dawson, T., and Kaufman, L. 2002. Running Linux, 4th ed. O'Reilly & Associates, Inc., Sebastopol, Calif.
White, J.G., Southgate, E., Thomson, J.N., and Brenner, S. 1986. The structure of the nervous system of Caenorhabditis elegans. Philos. Trans. R. Soc. Lond. B Biol. Sci. 314:1-340.
Wicks, S.R., Yeh, R.T., Gish, W.R., Waterston, R.H., and Plasterk, R.H. 2001. Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat. Genet. 28:160-164.
Wood, W.B. (ed.). 1988. The nematode Caenorhabditis elegans. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
Wootton, J.C. 1994. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput. Chem. 18:269-285.
Brenner, 1974. See above.
The beginning of modern C. elegans research, this article remains vividly readable and gives a clear introduction to the goals and tactics of worm genetics.
C. elegans Sequencing Consortium, 1998. See above.
This summarizes the first findings from the near-completion (~98%) of the C. elegans genome, and gives useful background information about how the genomic sequence was acquired and organized. (Final closure of the last 2% of gaps in the genome sequence was achieved in November 2002, four years later.)
Harris et al., 2004. See above.
Gene Ontology has become a recognized common vocabulary for functionally annotating gene products in both WormBase and many other genomic databases.
Mulder et al., 2005. See above.
Like Gene Ontology terms, InterPro motifs have become a common vocabulary for genome annotation in WormBase and elsewhere.
Stein et al., 2002. See above.
The genome browser described here is extensively used in WormBase.
Sulston et al., 1983. See above.
The lineage browser in WormBase is an attempt to provide a Web interface for small slices of the entire set of findings given here.
White et al., 1986. See above.
The neuronal connections and anatomy stored in WormBase are largely derived from this work.
The main Web site for WormBase. It is meant to be slightly less bleeding-edge than the development site, but to be maximally stable.
The development Web site for WormBase. This site runs the latest release of the WormBase data, allowing any bugs in the data or their presentation to be caught before being put on the main site. If one wants the very latest information on C. elegans, this is therefore the site to use (the main site lags by three weeks, the interval between successive data releases). New additions to the site software are tested here first as well.
WormBook is an on-line anthology of over 70 articles reviewing a great deal of current knowledge about C. elegans biology in 2005. All articles are provided in both HTML and PDF format, and are freely downloadable. Along with its two predecessors (Wood, 1988; Riddle et al., 1997), WormBook is strongly recommended for anybody starting work on this organism or studying it.
The Web site for the archival WS150 release of WormBase's data. The advantage of this site is that the data do not change, and thus can be used for reliable cross-comparison of bioinformatical analyses by different research groups at different times. Similar sites exist for several previous tenth releases (e.g., WS140: http://ws140.wormbase.org) and future archival sites are planned for roughly once every 7 months.
The User's Guide for WormBase.
Web Resources for C. elegans - this link is not part of the original Wiley chapter
WormBook's WormMethod chapter by Raymond Lee containing core web resources for C. elegans research
This FTP site contains archives of the wormpep and wormrna files, and the core software for running WormBase as a local installation.
The California Institute of Technology mirror site for WormBase maintained by Erich Schwarz (firstname.lastname@example.org).
The Greek mirror site for WormBase, maintained by Nektarios Tavernarakis (email@example.com).
This FTP site contains the complete releases of WormBase's data (typically the most recent two, along with permanently archived releases such as WS100 through WS150).
The public atlas of C. elegans anatomy, with several key references for worm anatomy on line, including White et al. (1986).
A useful site with links to worm literature and genetic strain information, maintained by Leon Avery (firstname.lastname@example.org).
Links to several other C. elegans databases are collected here.
This site gives extensive documentation for the Gene Ontology system, increasingly used for functional annotation in WormBase.
Erich M. Schwarz California Institute of Technology Pasadena, California
Paul W. Sternberg Howard Hughes Medical Institute California Institute of Technology Pasadena, California
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png) The front Web page of WormBase, showing a general database search for zyg-1 and the Web Site Directory. This page gives several different entry points for WormBase's diverse data. An example is shown of the simplest and broadest search (for "Anything") with a single keyword. A menu of the most-used database searches lines the top of the page, while a list of more specialized data fills the Web Site Directory on the page's left side.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_02.png) Results of the database search in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Having searched the entire database for anything matching zyg-1, one sees a plethora of disparate results: genes with zyg-1* names, protein-coding sequences (CDSes), expression patterns, and archived research papers. The advantage of this sort of search is that it lowers the chance that one will miss a wanted item, but it necessarily requires that one then pick and choose among this sort of data slurry. Alternatively, one could pick a specific data class in the search menu (such as "Any Gene" or "Cell"; (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png)) and get narrower, but better-focused results.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png) The top of the Gene page for zyg-1. WormBase organizes its data around a few key hubs. Gene pages are perhaps the most important single such hub; they are intended to give a compact but full summary of everything known about a given gene in C. elegans. Even in this excerpt, one can get summarized gene function and orthology, a list of transcripts and their experimental evidence, and links to DNA and protein sequences, a C. briggsae ortholog, and external database records.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png) Genetic and genomic information from the Gene page for zyg-1. Further down the same page as in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png) is a small but detailed diagram of the gene's DNA structure, with links to transcripts, sequenced clones, and alleles. Along with this are given exact nucleotide coordinates and the meiotic gene map position.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_05.png) The Sequence page of F59E12.2 (linked to zyg-1's Gene page). Most data for the exact nucleotide sequence are too detailed to be of immediate interest on a Gene page, so they are given their own Sequence page instead (linked to the Gene and Protein pages). These data are most useful in designing cloning experiments or direct perturbations of DNA function such as RNAi. Further down this page are another schematic diagram, a BLAST search launcher, exact coordinates of exons and introns in the genomic sequences, and a list of available cDNA clones.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png) Part of the CE28571 (zyg-1) Protein page, with a schematic diagram of CE28571's exons, protein motifs, low-complexity domains (detected by SEG), and similarities to proteins in other eukaryotic species. As with nucleotide sequences, proteins have enough detailed information to require their own specialized pages. WormBase's Protein pages give both text and diagrams to let a user map individual sequence features with respect to one another and to the protein's exonic coding sequences. The sequence features shown range from very generic (signal, low-complexity, and predicted transmembrane) to broadly distributed but specific motifs (e.g., "tyrosine protein kinase") and then to individual BLAST matches with highly similar proteins in other organisms. Diagramming all of these allows the user to quickly see what parts of the protein are likely to have distinct functions.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_07.png) Protein motifs identified by a "ribonucleoprotein" search term. WormBase has an extensive catalog of protein motifs, taken from both the PFAM and the InterPro compilations. Keyword searches of these motifs are one way to subdivide a general protein type into several types with detailed functional differences.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_08.png) Proteins identified as sharing a single motif. Motifs are evolutionarily mobile; they can be spread among homologous proteins or transferred horizontally between nonhomologous ones. Accordingly, each motif in WormBase is listed with the full set of proteins containing it. This gives one way of identifying every gene product in C. elegans likely to participate in a shared biochemical function.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png) A view of the entire X chromosome in the Genome Browser. Like the Gene Page, the Genome Browser provides a central hub around which complex data can be economically organized. Here we see its view expanded to an entire chromosome. The view is customizable with many different user-selected tracks (a few of which are visible at the figure's bottom).
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png) An expanded Genome Browser view of the F59E12.2 (zyg-1) sequence, with added tracks for ESTs, mRNAs, and C. briggsae homologies. Where the Gene Page gives a text-oriented, human-readable summary of zyg-1, the Genome Browser here gives a view rooted in its DNA structure. Picking just a few tracks allows this view to link gene coexpression (through operons), likely regulatory sequences (i.e., non-coding DNA highly conserved in C. briggsae), direct evidence for gene activity (ESTs and a cDNA), a genomic clone (archived in Genbank), and complexities of the gene's structure (including a nested gene with an entirely dissimilar mutant phenotype).
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_11.png) A view of 1 Mb of genomic DNA, centered on the F59E12.2 (zyg-1) sequence. Genome Browser views are customizable not only in their contents but in their size. Here we see a tracked view spanning one megabase of genomic DNA. As the view grows, fine details are merged into a general map; this works best when one is looking for features that vary over a scale of tens or hundreds of thousands of nucleotides.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_12.png) A view of 100 base pairs of genomic DNA immediately to the 5' side of F59E12.2 (zyg-1). The opposite extreme of size selection is this 100-nt view of zyg-1's 5'-flank. This view lists individual nucleotides, and is ideal for fine resolution of transgenic construct or cis-regulatory sites. As in larger views, multiple tracks can be chosen to make easy comparisons of diverse features (such as cDNAs versus predicted start sites).
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_13.png) The Genome Browser showing the C. briggsae ortholog of zyg-1. C. briggsae's genome is also available through the Genome Browser. This view of zyg-1 confirms that its complex structure is indeed conserved in C. briggsae, while also showing small differences in intron size.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_14png) The Synteny Viewer showing the zyg-1/bli-2 cluster in C. elegans and C. briggsae. Here the zyg-1 loci from two Caenorhabditis species are shown in syntenic alignment, making their precise similarities and differences obvious. Like the Genome Browser, this view can be expanded to take in large chromosomal spans or contracted to single DNA sites. A particularly good use of this viewer is in working out the clearest possible view of an evolutionarily complex syntenic region.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_15.png) A BLASTP search of WormPep release 147 with the human dymeclin (DYM) protein, which when mutated leads to Dyggve-Melchior-Clausen or Smith-McCort dysplasia. BLAST searches in WormBase not only give hit results, but also give hyperlinks to their database records, making it easy to go from a positive search result to its Gene Page or to a view of its genomic region. Both strong and weak hits can be informative, since they can identify both orthologs and paralogs of a query sequence. Searches have a default cut-off E-value of 0.01, but this can be adjusted by the user for more or less stringency (and hits).
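The effect of adjusting the E-value cut-off can be illustrated with a minimal Python sketch. The hit records are invented placeholders rather than real WormPep results, and `filter_hits` is a hypothetical helper, not part of WormBase's BLAST interface; only the default threshold of 0.01 is taken from the legend above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str   # placeholder identifier, not a real WormPep accession
    evalue: float  # expected number of chance matches at this score


def filter_hits(hits, max_evalue=0.01):
    """Keep hits at or below the E-value cut-off, strongest hit first."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    return sorted(kept, key=lambda h: h.evalue)


# Invented example hits, for illustration only
hits = [Hit("protein_A", 1e-50), Hit("protein_B", 0.005), Hit("protein_C", 0.5)]

default_hits = filter_hits(hits)                  # default cut-off of 0.01
relaxed_hits = filter_hits(hits, max_evalue=1.0)  # less stringent, more hits

print([h.subject for h in default_hits])
print([h.subject for h in relaxed_hits])
```

Raising the cut-off admits weaker hits that may include distant paralogs, at the cost of more chance matches to sort through.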
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_16.png) The Filter menu of WormMart, with filters set to select for pqn-* genes in C. elegans with uncoordinated RNAi phenotypes. WormMart gives the user a menu with which one or more of a great many different conditions can be imposed on data. Each condition is itself simple, but the freedom of users to choose and mix them with a graphical interface makes highly complex searches practical. This particular search started by choosing the WS140 data release (shown in the Summary on the right-hand side) and its Gene data set. This still leaves the user with over 40,000 objects to sort through. In this simple search, the user has selected only those genes falling into the pqn class, which includes ~100 genes encoding prion-like proteins with domains highly enriched for glutamine (Q) or asparagine (N).
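The idea of combining simple conditions into a complex search can be sketched in Python as follows. This is only an illustration of the principle: the gene records and field names are invented, and the helper functions are hypothetical rather than part of WormMart.

```python
def name_starts_with(prefix):
    """Condition: the gene name begins with the given prefix."""
    return lambda gene: gene["name"].startswith(prefix)


def has_rnai_phenotype(phenotype):
    """Condition: the gene has the given RNAi phenotype recorded."""
    return lambda gene: phenotype in gene["rnai_phenotypes"]


def apply_filters(genes, filters):
    """Return only the genes that satisfy every condition in `filters`."""
    return [g for g in genes if all(f(g) for f in filters)]


# Invented records, for illustration only (not real WormBase data)
genes = [
    {"name": "pqn-x1", "rnai_phenotypes": {"uncoordinated"}},
    {"name": "pqn-x2", "rnai_phenotypes": set()},
    {"name": "abc-x3", "rnai_phenotypes": {"uncoordinated"}},
]

selected = apply_filters(
    genes, [name_starts_with("pqn-"), has_rnai_phenotype("uncoordinated")]
)
print([g["name"] for g in selected])
```

Because each condition is an independent function, further conditions can be added to the list without changing the others.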
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_17.png) The Output menu for Sequence attributes, showing several different choices of gene substructure. After filtering, data in WormMart need to be exported somehow. Again, many different choices of output contents and format exist. One particularly useful form is sequence output, in which the user picks some type of gene structure (such as 5' flanks, introns, or exons) for mass export from a selected gene set (selected by choices like those shown in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_16.png)). As a given option for sequence export is picked, a small schematic diagram of the gene is marked in red to clarify what the option means in practice. Since the sequences are exported in FASTA format, the headers for these FASTA records can themselves be loaded with user-selected data (such as gene names).
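Once sequences have been exported, the FASTA headers can be read back programmatically. The following Python sketch assumes a pipe-delimited header layout chosen purely for illustration; the actual header layout depends on the attributes selected at export, and the example records are invented.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented example export; the "name|region" header layout is an assumption
example = """>gene_1|5_prime_flank
ACGTACGT
ACGT
>gene_2|5_prime_flank
TTGACA
"""

records = parse_fasta(example)
for header, sequence in records:
    name, region = header.split("|")
    print(name, region, len(sequence))
```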
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_18.png) Final results of the search in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_16.png). Another option for user-selected output is to have tables listing gene features rather than nucleotide sequences. This output was generated from the pqn-* search shown in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_16.png) by selecting (in addition to the pqn gene class) for molecular and classical gene names, RNAi phenotypes, and conserved orthologous protein groups (KOGs). As with the Genome Browser, a strength of these user-selected outputs is the ability to quickly compare disparate data sets in an easily scanned, well-aligned format.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_19.png) The graphical output from a search for genetic markers in the vicinity of hid-3. Classical genetics in C. elegans remains crucial for finding new biological functions. Here the user has a gene map for the region around the uncloned hid-3 gene which integrates cloned genes, uncloned loci, predicted genes, and STS markers. Such a view makes it straightforward to design fine-scale STS mapping and to identify other loci that might be allelic to hid-3.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_20.png) Part of the tabular output from a search for hid-3 markers. Graphic and tabular gene maps have complementary uses. The graphical map in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_19.png) lets the user take in a genetic region intuitively at a glance; this table lists the exact identity and details of its contents. Details include the meiotic map position, alleles, and laboratory strains for each gene in a region.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_21.png) Results of a Gene Ontology (GO) search for "RNA splicing". GO allows genes to be classified by their shared biochemical or biological roles whether or not their products have any similarity to each other. While this classification is powerful, it can be difficult to decipher because there are a great many GO terms, most with complex meanings. To help make sense of this complexity, searches in WormBase for GO terms give tables listing not only the names of terms, but also their definitions and the genes associated with them. Searching with a simple phrase such as "RNA splicing" can give many different results with highly detailed meanings.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_22.png) Summary of the "RNA splicing" GO term in WormBase, with its connections to genes and protein motifs. Each GO term has its own summary page, accessible either through a term search (as in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_21.png)) or through gene or protein motif pages. The broadly-defined "RNA splicing" term is seen here to encompass two different genes and three different protein motifs. One link on this page leads to a browsable version of the entire Gene Ontology system.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_23.png) A detailed view of the "RNA splicing, via transesterification reactions with bulged adenosine as nucleophile" GO term in WormBase: defined, shown in its context of other GO terms, and connected to genes and protein motifs. Another view of a GO term, this time with a browsable context. As in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_22.png), links are given for associated genes and protein motifs, but here one can also see how this rather specialized term fits into the overall Gene Ontology. Note that this term is not actually the most narrow one possible, but is itself a parent term for three even more specialized terms.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_24.png) An expanded view of neuronal lineages. WormBase gives a graphically browsable diagram of C. elegans' entire developmental lineage, from the fertilized egg to the adult body. Here is shown a small subset of that lineage, starting from the progenitor cell P1. Each node can be either collapsed or expanded by clicking on it, to give simplified or elaborated views; here all the nodes have been expanded. Each cell type is given a hypertext link to its own Cell page.
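The collapse-and-expand behavior of such a lineage view can be modeled as a tree in which each node carries an expanded flag. The Python sketch below is a generic illustration, not WormBase's own implementation; the node labels are placeholders rather than real C. elegans cell names.

```python
from dataclasses import dataclass, field


@dataclass
class LineageNode:
    name: str
    children: list = field(default_factory=list)
    expanded: bool = True  # a collapsed node hides all of its descendants


def visible_names(node, depth=0):
    """List (depth, name) for every node visible in the current view."""
    rows = [(depth, node.name)]
    if node.expanded:
        for child in node.children:
            rows.extend(visible_names(child, depth + 1))
    return rows


# Placeholder lineage, for illustration only
root = LineageNode("progenitor", [
    LineageNode("daughter_a", [
        LineageNode("granddaughter_a1"),
        LineageNode("granddaughter_a2"),
    ]),
    LineageNode("daughter_p"),
])

fully_expanded = visible_names(root)

# "Clicking" on daughter_a collapses it and hides its descendants
root.children[0].expanded = False
partly_collapsed = visible_names(root)

for depth, name in partly_collapsed:
    print("  " * depth + name)
```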
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_25.png) The Cell page for AS1, a neuron in the P1 lineage seen in (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_24.png). Clicking on the AS1 link in P1's lineage leads to this report, summarizing developmental and functional traits for this cell type. A single cell can belong to more than one group, defined either by cell class or by organ or tissue. Cells can be major progenitors of a lineage branch (blast cells), intermediates during development, or terminally differentiated. They can also have many different gene expression patterns associated with them, either generically (e.g., a gene expressed in "neurons" will implicitly be expressed in AS1) or more or less specifically (some genes may be expressed in only AS1, while others may be expressed in some well-defined set of cells including AS1). Although WormBase has tended so far to emphasize a gene-centric view of the organism, Cell pages are likely to become increasingly detailed hubs of information rivalling the Gene pages and Genome Browser as WormBase's contents extend to integrative, physiological data.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_26.png) Diagram of the ADAL neuron. Another Cell page, for a sensory neuron of the head. Pages for sensory and some other neurons include small diagrams of their structure in the body, with the pharynx given as a background for orientation. More specialized and fully detailed anatomical views are available in WormAtlas (http://www.wormatlas.org). Each neuron page also gives a detailed list of neuron-by-neuron connections, determined from electron-microscopic serial sectioning of an entire worm's nervous system.
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_27.png) Summary for a chosen set of neurons. Neurons in C. elegans have somewhat cryptic names (e.g., "AFD" for "amphid finger neuron"). The tabular output from a Neuron search decodes these names by listing their human-readable identities, their membership in neuronal groups (by shared ganglia or shared traits), and their developmental lineage abbreviations (totally different from their differentiated neuron abbreviations). Numbers of their synaptic connections and gap junctions are also given, whose identities are detailed on a neuron's Cell page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_26.png).
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_28.png) Results for an Expression Pattern search with "AS neurons". Gene expression patterns can be searched with terms for pre-defined cell groups, extracted from the primary literature. The resulting table gives, for each pattern found, the gene driving it and a summary of the cells it includes. Since these expression patterns are usually driven by a gene's entire promoter, and since metazoan promoters can have complex, multiple cis-regulatory elements, the patterns can be heterogeneous and extensive. However, some genes are expressed solely in a single cell type in the whole animal, while others appear to have truly ubiquitous expression in all somatic cells. Hypertext links from these search results can lead to a gene, its DNA sequence, or a detailed report about a single expression pattern (e.g., "Expr217"). The expression pattern report, in turn, will list the exact reagents used to determine the pattern (such as a defined DNA region, or an antibody).
(http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_29.png) A diagram of overlaps between zyg-1's canonical (sequenced) genomic cosmid clone and other clones that are not sequenced but may have more useful termini for experimental use. There are many more clones produced by the C. elegans genome project than have actually been sequenced. The default view in the Genome Browser gives only those cosmid or YAC clones that were actually used in genomic sequencing; however, for actual experiments on individual genes, a "non-canonical" cosmid's insert may better encompass the gene's full 5' and 3' flanks. The clone map search allows users to see the entire set of clones available for a gene region.