Difference between revisions of "Searching WormBase for Information About C. elegans (Wiley)"

==[[BASIC_PROTOCOL_10:_MINING_GENE_DATA_WITH_WORMMART|BASIC PROTOCOL 10: MINING GENE DATA WITH WORMMART]]==
 
Web pages for a single Gene, Sequence, or Protein only allow relatively small amounts of data to be analyzed at a time; yet the rise of functional genomics has made it useful and necessary for biologists to handle large, complex information about tens to thousands of genes in their everyday work. To support this, WormBase provides tools for wrangling large data sets. The most recently designed and generally usable of these is WormMart, based on the BioMart search engine used by Ensembl (Kasprzyk et al., 2004). WormMart is a Web interface that allows users to design and run complex database searches without having to use (or even know) complex database query languages.
  
'''''Necessary Resources'''''
  
''Hardware''

    A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

''Software''

    Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)

1. Go to the WormMart page by clicking its link at the top left center of the front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Once on the first WormMart page, choose a version of WormBase (either the most recent release, or a permanently archived release such as WS140). Then choose a Dataset to search, such as genes, expression patterns, research papers, phenotypes, or "variations" (i.e., mutations or RNAi). Click the "next" button, and go to the Filter page.

2. Decide on how to filter the data set chosen.

http://www.wormbase.org/wiki/images/Fig_1_8_16.png

For instance, the Genes data set can be filtered by species, gene names or classes, the reliability of the gene's structural prediction, whether the gene is protein-coding or not, genomic location and DNA strand orientation, inclusion or exclusion of 5' and 3' flanking sequences, and RNAi phenotype. Other data sets can be filtered in analogous ways. Moreover, WormMart allows more than one data set to be filtered in a single search. Although the Filter page gives many choices, it is easy to undo one's choices and explore a different query by clicking the "back" button. After the data filters are satisfactory, click the "next" button and go to the Output page.

3. On the Output page, decide how best to select and present the filtered data. First, choose an "Attribute". For genes, these attributes can be "Features", "Structures", or "Sequences". Features include short identifiers, genomic locations, functional annotations, available reagents, mutant alleles, or references; Structures include exons, introns, and 5' or 3' untranslated regions; and Sequences are pure nucleotide sequences selected from structural elements or from DNA-based reagents such as cDNAs.

http://www.wormbase.org/wiki/images/Fig_1_8_17.png

After selecting the details for such an output, choose an output format such as HTML, plain text, Excel table, or compressed file, and click "export".

4. Examine the results.

http://www.wormbase.org/wiki/images/Fig_1_8_18.png

If they seem promising but not as well chosen as necessary, go back one or more steps and try somewhat different filter and output choices to see how the results change. The best way to get an intuitive sense for this interface is probably to tinker with it at first. With practice, however, it can be a very powerful and flexible tool.
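
Once a WormMart result has been exported as plain text, it can be post-processed with a short script. The following Python sketch assumes, hypothetically, that the export is tab-delimited with a header row. The column names ("Gene ID", "Public Name", "Chromosome") and the sample rows are invented for illustration and are not taken from WormMart itself, so they would need to be adjusted to match a real export.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-delimited export. The column names and rows are
# placeholders for illustration; a real WormMart export may differ.
SAMPLE_EXPORT = (
    "Gene ID\tPublic Name\tChromosome\n"
    "ID_0001\tabc-1\tII\n"
    "ID_0002\tabc-2\tX\n"
    "ID_0003\txyz-7\tII\n"
)


def genes_by_chromosome(handle):
    """Group public gene names by chromosome from a tab-delimited export."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["Chromosome"]].append(row["Public Name"])
    return dict(groups)


if __name__ == "__main__":
    for chromosome, names in genes_by_chromosome(io.StringIO(SAMPLE_EXPORT)).items():
        print(chromosome, ", ".join(names))
```

To use this on a saved file, pass an open file handle (for example, `open("export.txt", newline="")`, where the file name is again only an example) instead of the in-memory sample.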
  
 
==[[BASIC_PROTOCOL_11:_DOWNLOADING_A_BATCH_OF_SEQUENCES|BASIC PROTOCOL 11: DOWNLOADING A BATCH OF SEQUENCES]]==
 

Revision as of 22:48, 29 November 2010


INTRODUCTION

WormBase is the major public biological database for the nematode Caenorhabditis elegans (Chen et al., 2005; PMID: 15608221). It is meant to be useful to any biologist who wants to use C. elegans, whatever his or her specialty. WormBase contains information about the genomic sequence of C. elegans, its genes and their products, and its higher-level traits such as gene expression patterns and neuronal connectivity. Also, WormBase contains sequence and gene structures of C. briggsae and C. remanei, two closely related worms. These data are interconnected, so that a search beginning with one object (such as a gene) can be directed to related objects of a different type (such as the DNA sequence of the gene, or the cells in which the gene is active). One can also do searches for complex data sets.

WormBase is constantly being changed and expanded, both by curation of newly available data and by modifications to the user interface. The entire database is updated and rebuilt into a new release every 3 weeks throughout the year, in releases named "WSnnn" (with "nnn" being 147 as of this writing). To give bioinformaticians reproducible data sets for their analyses, each tenth release is made permanently available as a frozen online archive, roughly once every seven months. Releases WS100 through WS140 have been archived so far. All of the information in this chapter is based on the version of WormBase available in September, 2005 (WS147 release; BioPerl/Generic Model Organism Database software).
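
The release-naming and archiving scheme can be made concrete with a few lines of Python. This is only a sketch: it assumes that "each tenth release" means release numbers that are multiples of ten, starting from WS100, which is an interpretation of the text rather than a documented rule. The function names are hypothetical.

```python
def release_name(number):
    """Format a release number in the 'WSnnn' style."""
    return f"WS{number}"


def frozen_releases(first, current, step=10):
    """List release names between first and current (inclusive) whose
    number is a multiple of step.

    Assumption: frozen archives fall on multiples of ten.
    """
    start = first + (-first) % step  # round up to the next multiple of step
    return [release_name(n) for n in range(start, current + 1, step)]


if __name__ == "__main__":
    # With WS147 as the current release:
    print(frozen_releases(100, 147))
```

At one release every 3 weeks, ten releases span 30 weeks, which matches the "roughly once every seven months" figure given above.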

The protocols described in this chapter include the following: general searches of WormBase with single search terms; studying a gene, sequence, or protein with its individual web page, or with the Genome Browser; searching for proteins by BLAST hits, sequence motifs, or Gene Ontology terms; aligning C. elegans with C. briggsae genomic sequences; detailed, user-customized searches with WormMart or AceDB Query Language; batch downloads of many sequences at once; identifying the genomic regions, genetic contents, or molecular clones spanning a defined chromosomal interval; electronic PCR; finding expression patterns, and the cell types or developmental origins from which these patterns arise; and searching for genome-wide RNAi results yielding particular phenotypes. Some advice is also provided for installing and running WormBase on a local computer.


BASIC PROTOCOL 1: NAVIGATING THE WormBase HOME PAGE

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

The home page for the main WormBase site, http://wormbase.org, allows general searches, gives links to specialized searches, and has news about improvements to WormBase.


Fig_1_8_01.png


This page is divided into several parts. At the top is a large, simple menu bar giving quick access to six popular search pages, a Web form for submitting comments or new data, an all-purpose Searches page, and a Site Map. This bar appears at the top of all of the WormBase pages and also includes a link back to the home page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). On the home page immediately below this menu bar are the serial number of the data release currently in use, and the official WormBase logo.

The next section of the main page has fields and menus that are used for a basic search of the full database (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). A user can choose searches from any of over 20 different data types from the "Search for" drop-down menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). The check boxes immediately beneath the search fields offer choices to require strict identity for search terms, to give results in XML format, or to search primary research articles in depth with Textpresso (http://www.textpresso.org; Muller et al., 2004).

Below the basic search section are a directory for the entire WormBase site (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png), and news about C. elegans bioinformatics. Near the bottom of the page is a Links section (with connections to several other sites about C. elegans which complement WormBase in some way), and a list of other sites running WormBase itself (the development server, a data-mining server, official mirror sites, and archival data freeze sites for http://wormbase.org). Finally, the bottom of the page gives a link for users to make comments or ask questions to the WormBase staff, and other links to WormBase policies on copyright and privacy; these links are given on every WormBase page.

WormBase has a development site (http://dev.wormbase.org) which uses the very latest data release and site software, while the main site lags by one release. The two sites are closely similar, but the main site focuses on stability and the development site on novelty.

The examples of searches given in the following protocols are intended to be illustrative, but not exhaustive; many other searches are possible.


BASIC PROTOCOL 2: PERFORMING A DATABASE SEARCH

This protocol describes how to conduct a general search of WormBase.

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

1. Go to the main page, either by entering its URL http://wormbase.org, or by clicking on the Home link at the top left of any WormBase Web page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png).

2. Type a word or phrase into the search form (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). For a search of the entire database, click the Search link after selecting Anything in the "Search for" pull-down menu.

3. Once the search has run, examine the results (if any). If a single data record is found, the search routine will sometimes go directly to the Web page for that record (it will automatically do this for Gene pages). However, it will usually instead have 0 or ≥2 results. In the latter case, the search will give one or more summary pages on which the data records found are listed.

Fig_1_8_02.png

4. Look at the page of summarized results, and consider whether there are too few or too many. If there are too few results, try different key words for the search; computers are literal-minded, and a particular search word may be almost but not quite recognized by the database. Conversely, if there are too many results, select a restricted subset of the database for the search from the "Search for" pull-down menu (e.g., "Any Gene"; (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png)), or check the "Exact match" box and resubmit the search by clicking the Search button.

5. If the search results look reasonable, then follow the hypertext links to the data records themselves.

6. To get a list of many data records all falling into a specific class, pick a subset of the database and search with a wildcard. For instance, if the "Any Gene" search is selected, and run with unc-* as the search term, WormBase will return links to 117 Locus records (snt-1 through vab-8, with 114 unc-* hits). When possible, WormBase recognizes and returns synonyms in searches (which is why a search for unc-* genes returns three non-unc-* gene names; these three genes have aliases of unc-107, unc-110, and unc-121).

7. Remember that search words can be anything of interest. While there is no guarantee that the database will give hits for any given search string, specific topics of interest can have useful results.

For instance, consider a search for "hyperplasia" under all categories (i.e., by choosing Anything from the pull-down menu in http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Hyperplasia is a topic normally associated with cancer biology or endocrinology (Merke and Bornstein 2005; Simpson et al., 2005), but one might want to see the potential for C. elegans to serve as a model for the control of tissue proliferation. A search with this term reveals nine hits, including six genetic loci that either have hyperplastic phenotypes or are homologs of human disease genes with roles in deregulated proliferation.
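
The behavior of wildcard searches that also match synonyms (step 6) can be illustrated with a toy model. The Python sketch below is not WormBase's search code; it only shows why a pattern such as unc-* can return genes whose primary names do not begin with "unc-". The gene table and alias pairings are invented for the example.

```python
from fnmatch import fnmatchcase

# Toy records: primary name -> list of aliases. The pairings are invented
# for illustration and do not reflect real WormBase data.
GENES = {
    "unc-1": [],
    "abc-3": ["unc-900"],
    "xyz-2": [],
}


def wildcard_search(pattern, genes):
    """Return primary names whose name or any alias matches the pattern."""
    hits = []
    for name, aliases in genes.items():
        if any(fnmatchcase(candidate, pattern) for candidate in [name, *aliases]):
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    print(wildcard_search("unc-*", GENES))
```

Here "abc-3" is returned because one of its aliases matches the pattern, mirroring the three non-unc-* names reported for the real search.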


BASIC PROTOCOL 3: EXAMINING A GENE IN C. ELEGANS

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software


Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

From the front WormBase page, enter the name of a gene of interest and click Search (leaving the Find menu set to its default, "Any Gene"). A successful search will lead directly to a Gene page for the gene requested.

Fig_1_8_03.png

A Gene page is intended to summarize everything of biological importance known about a given gene; it can thus be quite complex, although it is also meant to be concise (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png and http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png).

Fig_1_8_04.png

The page can describe any or all of the following information, when available: identity of the gene's product; normal function of the gene's product; orthologs of the gene (if any); the gene's meiotic and physical location within the genome; its phenotype in classical mutations or RNAi screens; its spatial and temporal expression pattern, with microarray data; domains of its encoded protein; Gene Ontology (UNIT 7.2) terms describing its function; mutant alleles and strains carrying them; homologs identified by best BLASTP (UNIT 3.4) scores versus other proteomes; cDNA, STS, and antibody reagents; microarray probes; SAGE oligonucleotides; and references from the primary C. elegans research literature. All of these are given with hypertext links allowing the user to find more information on a given datum of interest.

Most of this information is given as tabulated text. However, for cloned genes, there will also be a schematic diagram of the gene's physical organization. For instance, the diagram on the zyg-1 Gene Page reveals that zyg-1 actually has a functional gene, bli-2, nested within itself (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). This diagram provides a graphical link to the Genome Browser (see Basic Protocol 7).

C. elegans is currently known to have ~20,100 protein-coding genes, as well as ~900 genes producing non-protein-coding RNA transcripts. A gene in C. elegans can be studied in several different ways. The original method was genetics, in which a gene was equated with a classical locus at which two or more different alleles had been mapped (Brenner, 1974). Despite all advances in functional genomics, classical genetics remains uniquely powerful in uncovering new aspects of worm biology (Jorgensen and Mango, 2002).

There are 44,606 Gene objects in the WS147 release of WormBase, marking either current or obsolete gene predictions for C. elegans. In the database, they are given uninformative serial numbers such as "WBGene00021622". However, in real use these genes have either names like "xyz-N" (a three-letter lower-case abbreviation, where N represents an Arabic numeral) or names like "cosmid.number" (where "cosmid" is the name of a genomic clone, usually a cosmid, sequenced in the C. elegans genome project (C. elegans Sequencing Consortium, 1998), and "number" denotes an otherwise anonymous gene embedded in the clone).

There are fewer classical (three-letter) gene names for C. elegans genes (~6,100) than names for genes identified through genomic sequencing (~21,000). An increasing number of three-letter names are given to genes identified purely through genomic sequencing and analysis. Conversely, a significant number of genes identified through classical mutagenesis have not yet been linked to the genome through cloning. So, the rule of thumb is that, while Gene objects in WormBase will often have both a classical and a sequence name (as zyg-1 does), this is not inevitable; many genes in C. elegans either lack a classical name or have a classical (mutant) name without a known sequence name. At the same time, functional genomics is being applied to all genes in C. elegans through chromosome-wide RNAi screens, microarray analyses, and protein interaction mapping (Ge et al., 2003; Gunsalus et al., 2005). WormBase therefore annotates a good deal of information about "anonymous" genes with only nondescript sequence names.
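
The three naming styles described above can be told apart mechanically. The Python sketch below uses regular expressions inferred from the examples in the text ("WBGene00021622", "xyz-N", "cosmid.number"). The exact patterns, especially the one for clone names, are assumptions and may not cover every identifier WormBase actually uses.

```python
import re

# Patterns inferred from the examples in the text; they are assumptions,
# not an official specification of WormBase identifiers.
PATTERNS = {
    "gene_id": re.compile(r"WBGene\d{8}"),      # e.g. WBGene00021622
    "classical": re.compile(r"[a-z]{3}-\d+"),   # e.g. zyg-1
    "sequence": re.compile(r"[A-Z0-9]+\.\d+"),  # e.g. F59E12.2
}


def classify_gene_name(name):
    """Return the naming style of a gene identifier, or 'unknown'."""
    for style, pattern in PATTERNS.items():
        if pattern.fullmatch(name):
            return style
    return "unknown"


if __name__ == "__main__":
    for name in ["WBGene00021622", "zyg-1", "F59E12.2", "not a gene"]:
        print(name, "->", classify_gene_name(name))
```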

If a set of Gene records is wanted, do the search for "Any Gene" with a wildcard entry (such as unc-*), or do a basic search (for "Anything"). With a well chosen query, this will produce a summary page having one or more entries in the database with the general format Gene:xyz-N. Examine the entries and click on whichever is most useful.

Each Gene page for a genomic protein-coding sequence (CDS), even if it lacks mutant alleles mapped through classical genetics, has an interpolated genetic map position for that coding sequence, with a link to a list of nearby genes. This map position is formatted and handled in the same way as the genetic map position for a classical genetic locus (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). In the case of a gene like zyg-1, which already has mutant alleles, the Genetic Position link (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png) will mainly help design tests for allelism with other mutant loci. However, for a gene identified solely as a CDS, interpolated map information can indicate which classical mutant alleles might reside within the gene.


BASIC PROTOCOL 4: EXAMINING A MOLECULAR SEQUENCE IN C. ELEGANS

WormBase has Sequence Pages as well as Gene Pages. Sequence Pages are focused on details of the nucleotide and predicted protein-coding sequences of genes.

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

1. From any page where a molecular sequence is given as relevant to some other object (such as zyg-1, which has the synonym F59E12.2; http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png), click on the relevant Sequence Report or Sequence name link. This will lead to a Sequence page.

Fig_1_8_05.png

Like a Gene page, a Sequence page will have diverse information, with a somewhat greater emphasis on genomic sequence rather than classical biology.

The F59E12.2 Sequence page in WS147 gives a schematic diagram of F59E12.2's organization in the genome somewhat different from that on the Gene page, a BLAST (UNIT 3.3) search launcher, external links of the protein sequence to GenPept and SwissProt, and precise numerical coordinates for the F59E12.2 transcription unit along with the identities of cDNAs matching it. The BLAST search is automatically set up against WormBase's genomic sequence data by clicking on the "Run BLAST or BLAT" button (see Basic Protocol 9). This Sequence page is also a convenient place to get the sequence in FASTA format (Apweiler, 2005; Appendix 1B) for BLAST searches outside WormBase. Given the F59E12.2 Sequence page, one can cut and paste the sequence in FASTA format directly into one of the search pages at the NCBI (http://www.ncbi.nlm.nih.gov/BLAST; UNITS 3.3 & 3.4) or into a more specialized server. The Sequence page also gives links to the Protein page for zyg-1/F59E12.2 and to the C. briggsae ortholog of zyg-1.

2. Click the "Click Here to Browse" link to display the corresponding region in the Genome Browser (see Basic Protocol 7).
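
For readers who handle copied sequences in scripts rather than by cut and paste, the FASTA format mentioned in step 1 is simple to write and read. The Python sketch below is a minimal, generic implementation (a ">" header line followed by wrapped sequence lines). The identifier and sequence in the example are placeholders, not data from WormBase.

```python
def to_fasta(identifier, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


def read_fasta(text):
    """Parse FASTA text into a dict of identifier -> sequence.

    The identifier is the first whitespace-delimited word of each header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


if __name__ == "__main__":
    record = to_fasta("demo_sequence", "ACGT" * 20)  # placeholder data
    print(record, end="")
    print(read_fasta(record))
```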


BASIC PROTOCOL 5: FINDING PROTEIN FEATURES

WormBase has, in addition to Gene and Sequence pages, Protein records that give the known or predicted details of a gene's protein product(s).

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

1. From the front WormBase page, select "Protein, C. elegans" from the "Search for" pull-down menu, then enter the name of a protein of interest in the adjacent text field and click the Search button.

2. A successful search will lead directly to a "Protein report" for the protein requested.

Fig_1_8_06.png

This is the Protein page for that polypeptide in WormBase.

3. Alternatively, from a Gene or Sequence page, click on the link given for Corresponding Protein(s).

For example, the Gene page for zyg-1 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png) and the Sequence page for F59E12.2 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_05.png) both link to the Protein Report page for WP:CE28571.

WormBase uses a numbering system for C. elegans proteins with the format WP:CExxxxx, where the last five digits are numbered 00001 to 99999. (For a search of WormBase, CExxxxx is sufficient.) A CE number is unique to a particular peptide sequence, and new CE numbers are generated every time that a change or expansion in predicted gene structures implies a new conceptual protein sequence; where there is the possibility of confusion, CE numbers prevent any problems with changes in Sequence nomenclature.

4. Examine the page for protein characteristics of interest.

A Protein page contains several different classes of information (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png). For a given protein, where applicable, WormBase will show the following: motifs from InterPro (Mulder et al., 2005) and PFAM (Bateman et al., 2004; UNIT 2.5); regions of low-complexity sequence (detected with SEG; Wootton, 1994); predicted transmembrane domains (via TMHMM; Krogh et al., 2001) or coiled-coil domains (through NCOILS; Lupas, 1997); the amino acid sequence; and its length, estimated isoelectric point and molecular weight, and residue composition.

In addition to features inherent to a protein sequence, WormBase also gives the protein's precomputed BLASTP search hits against a number of databases. These databases include C. elegans itself (to detect paralogs within the worm genome); Saccharomyces cerevisiae (from SGD; Balakrishnan et al., 2005), Drosophila melanogaster (from Gadfly/FlyBase; Drysdale et al., 2005); human beings (from Ensembl; Hubbard et al., 2005); and a nonredundant subset of proteins from SWISS-PROT and TrEMBL (Bairoch et al., 2005). Similarities detected by BLASTP are diagrammed alongside motifs and other sequence features (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png).
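
Two of the points above lend themselves to a short illustration: the WP:CExxxxx identifier format, and the residue composition reported on a Protein page. The Python sketch below normalizes a CE number with or without its "WP:" prefix and computes the fractional composition of a peptide. The function names are hypothetical, the example peptide is made up, and this is not how WormBase itself computes these values.

```python
import re
from collections import Counter

# "CE" followed by exactly five digits, with an optional "WP:" prefix.
CE_PATTERN = re.compile(r"(?:WP:)?(CE\d{5})")


def normalize_ce(identifier):
    """Return the identifier in full 'WP:CExxxxx' form, or raise ValueError."""
    match = CE_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not a CE number: {identifier!r}")
    return "WP:" + match.group(1)


def residue_composition(protein):
    """Return the fraction of each residue in a one-letter peptide string."""
    if not protein:
        raise ValueError("empty sequence")
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {residue: n / total for residue, n in sorted(counts.items())}


if __name__ == "__main__":
    print(normalize_ce("CE28571"))
    print(residue_composition("MKKA"))  # made-up peptide
```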


BASIC PROTOCOL 6: SEARCHING FOR GENE PRODUCTS WITH PARTICULAR SEQUENCE MOTIFS

One way to identify all C. elegans proteins sharing a given PFAM or InterPro motif is to check that motif's WormBase entry. One advantage of such searches is that hypertext links are given for each motif to a short description of its biological properties (if any; many motifs are evolutionarily conserved but functionally uncharacterized).

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)

1. From the front page, enter a term for the desired motif, such as "ribonucleoprotein", select "Protein Family or Motif" from the pull-down menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png), click Search, and study the results.

Fig_1_8_07.png

Click on the hyperlinked motifs and examine them individually to see which are most useful.

2. Alternatively, pick a gene whose protein product is known to be in the class of interest. Find its Gene page by a standard search (see Basic Protocol 3). Click from there to its Protein page(s) and check for motifs. (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_06.png)

3. Using either method, if one or more motifs are found, click on them to see whether they have other proteins associated with them.

An example of a motif found by searching for ribonucleoprotein is the InterPro motif IPR001163, "Like-Sm ribonucleoprotein, core." The motif has 17 associated proteins; in cases where a motif's gene has a classical three-letter name, this is shown.

Fig_1_8_08.png


BASIC PROTOCOL 7: USING THE GENOME BROWSER

All the sections of WormBase described so far consist mainly of written text. However, organisms have a text of their own, the genomic DNA sequence, within which important biological features reside. WormBase thus provides schematic diagrams of genomic DNA, as well as textual descriptions of genes, with a Generic Genome Browser that allows customizable views of C. elegans DNA (Stein et al., 2002).

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)

1. Click the Genome link at the top left of the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png) to start the Genome Browser (http://wormbase.org/db/seq/gbrowse/wormbase/).

Fig_1_8_09.png

2. Select an appropriate search term. The Genome Browser requires a "Landmark or Region" (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png), but this can include genome-wide scans for nucleotide sequences, gene classes, or key words in gene descriptions. Examples are given on the Genome Browser page. The Browser can view both linear nuclear chromosomes (autosomes I through V, plus the sex chromosome X; http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png) and the circular mitochondrial chromosome (MtDNA, shown as a line).

3. Click the Search button and examine the results. If a search term is used that has multiple hits in the genome (e.g., zyg-*), one will get a schematic view displaying their positions in the genome, with hypertext links.

Study a particular genomic region

4. Where there is more than one search result, click one link to see an individual genomic region and decide which of its features to further examine.

Fig_1_8_10.png

Features of the given region are displayed by selecting specific bands or "tracks" for viewing the sequence from left to right. The view of a given region can be customized by many different "tracks." The default view is to have only a few tracks (e.g., gene models, GenBank entries, and operons), in order to make the initial view clear. However, the view can be expanded by adding more tracks (such as ESTs, restriction sites and alignments to C. briggsae) until it is quite elaborate. From a C. briggsae alignment, one can hop to the relevant segment of the C. briggsae genome. ESTs can be ordered from the research group of Yuji Kohara (ykohara@lab.nig.ac.jp).

5. Select the most useful tracks, and click on the Update Image link.

The view will then show the added information tracks. One revised view is given in http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png. In this case, alignments to C. briggsae, to "best" ESTs, and to "best" mRNAs have been invoked. Most ESTs in the region, and a single mRNA (identified by cDNA sequencing), support the structure of zyg-1. Two ESTs overlap the nested bli-2 gene, and bli-2-like sequences within zyg-1-like sequences are highly conserved in C. briggsae. Moreover, there are small regions of conserved sequence on the 5' flanks of the zyg-1 and bli-2 transcription units; these regions are reasonable candidates for transcriptional regulatory sequences (Nardone et al., 2004). Well-chosen tracks can thus show complex data in a compact, vivid way.

6. Optionally, adjust the scale of the genome view by clicking on the appropriate links in the Scroll/Zoom menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png).

This is a small but powerful menu; it can expand the view to subchromosomal ranges of 1 Mb (first figure below) or focus the view in to single-nucleotide resolution (second figure below).

Fig_1_8_11.png

Fig_1_8_12.png

To get views of individual nucleotides, set the zoom for "100 bp" and switch on the track "DNA/GC Content." (At scales greater than 100 bp, this gives a graphical view of the percentage of GC nucleotides.) In either case, the location of the sequence shown within its entire chromosome is also diagrammed.

7. Another way to examine desired slices of the genome is to type in their exact coordinates into the "Landmark or Region" text field (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_09.png) and then click the Search button. Genomic coordinates are given in the format [chromosome]:[start nucleotide]..[final nucleotide]; for instance, the coordinates of zyg-1 are II:5649706..5654391 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). By choosing coordinates and then adjusting them to improve the graphical view, it is possible to incrementally refine the region viewed to very high precision, starting with large changes and ending with small ones.

8. Given a particular region, one may want to extract its nucleotides for analysis. The Genome Browser enables this with the "Display Sequence File" option in the "Reports & Analysis" menu. When the viewer is set to the exact nucleotides wanted, select this option and then click the Go button to the right. This will produce a text file in FASTA format, which can then be saved to disk and manipulated with other programs such as EMBOSS (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/).

9. If a particular object in the sequence view is itself worth examining, click on it to go to its record in the database. For instance, in the original view of zyg-1, one can click on the bli-2 gene and go to the Sequence page for bli-2/F59E12.12.

Normally, WormBase gives its views of the genome as ordinary bit-by-bit (raster-based) PNG diagrams in a Web browser. However, to publish a figure in a professional research journal, it is often preferable to have figures that are line drawings (scalable vector graphics, or SVG) instead of PNG diagrams. It is possible to generate publication-quality line drawings (SVG figures) from WormBase. Viewing these figures requires the SVG plugin for one's Web browser. See http://plugindoc.mozdev.org/linux.html for Mozilla/Linux plugins. Adobe SVG plugins for Macintosh and Windows systems are available at http://www.adobe.com/support/downloads/. SVG figures can then be further worked up for publication with a commercial program such as Adobe Illustrator, or open-source programs such as Inkscape (http://www.inkscape.org) or Batik (http://xml.apache.org/batik).

10. To get line drawings of a genomic region, click the "High-res Image" link in the Genome Browser menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png). Follow the resulting instructions to save the newly generated SVG file to disk. To view the diagram immediately, one will need to have an appropriate SVG plugin for one's Web browser (see http://plugindoc.mozdev.org/linux.html for Mozilla/Linux plugins).

11. WormBase provides advanced functions to allow collaborative annotation of the genome by outside researchers. If it is desirable to provide annotations for a given region, open the "Add your own tracks" menu at the bottom of the Genome Viewer page, and then use the "Upload your own annotations" or "Add remote annotations" options. Help documents for these are available on-line.
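
The coordinate format from step 7 and the GC percentage shown by the "DNA/GC Content" track in step 6 are both easy to reproduce offline on a downloaded sequence. The Python sketch below parses a [chromosome]:[start]..[final] string and computes the GC percentage of a nucleotide string. It assumes that coordinates are 1-based and inclusive at both ends, which is an assumption rather than something stated above. The names are hypothetical and the test sequence is a placeholder.

```python
import re
from typing import NamedTuple

REGION_PATTERN = re.compile(r"([^:]+):(\d+)\.\.(\d+)")


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int

    @property
    def length(self):
        # Assumption: coordinates are 1-based and inclusive at both ends.
        return self.end - self.start + 1


def parse_region(text):
    """Parse a 'chromosome:start..end' string into a Region."""
    match = REGION_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(match.group(1), start, end)


def gc_percent(sequence):
    """Return the percentage of G and C nucleotides in a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    region = parse_region("II:5649706..5654391")  # zyg-1, from step 7
    print(region, region.length)
    print(gc_percent("GGCCAATT"))  # placeholder sequence
```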


BASIC PROTOCOL 8: VIEWING THE C. BRIGGSAE GENOME AND ITS SYNTENY WITH C. ELEGANS

At least six species in the Caenorhabditis genus have morphology and genetics close enough to C. elegans that they can be classed as its siblings (Cho et al., 2004; Kiontke et al., 2004). Of these sibling species, three (C. briggsae, C. remanei, and C. sp. CB5161) have been isolated as laboratory strains. C. briggsae and C. remanei have undergone whole-genome shotgun sequencing so that they can be compared to C. elegans (Stein et al., 2003; J. Spieth, pers. comm.). As of mid-September 2005, WormBase includes both the C. briggsae genome and a preliminary assembly of the C. remanei genome. C. briggsae's genome can be viewed either by itself or in syntenic linkages to C. elegans.

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)

1. Click on the "C. briggsae Genome" link in the Sequences section of the Web site directory on the main WormBase page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This will give a Genome Browser window, from which one can then pick C. briggsae sequences for viewing in the same way as with C. elegans (http://wormbase.org/db/seq/gbrowse/briggsae).

Fig_1_8_13.png

At this writing, the whole-genome shotgun sequence exists as 5,341 contigs, organized into 142 supercontigs with names like "cb25.fpc0058"; the predicted genes have names like "CBG02462". Such a view is useful for extracting nucleotide sequences, but is rather nondescript.
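When handling lists of C. briggsae names outside the browser, it can help to tell supercontig names from predicted gene names automatically. The sketch below is in Python and is only an illustration: the two patterns are inferred solely from the example names "cb25.fpc0058" and "CBG02462" given above, so other valid names may not match, and the function name is hypothetical.

```python
import re

# Patterns inferred only from the two example names in the text;
# real identifiers may take other forms.
SUPERCONTIG_RE = re.compile(r"cb25\.fpc\d{4}")
GENE_RE = re.compile(r"CBG\d{5}")


def classify_briggsae_name(name: str) -> str:
    """Return 'supercontig', 'gene', or 'unknown' for a C. briggsae name."""
    if SUPERCONTIG_RE.fullmatch(name):
        return "supercontig"
    if GENE_RE.fullmatch(name):
        return "gene"
    return "unknown"
```

Names that fit neither pattern, such as the C. elegans sequence name F59E12.2, are reported as "unknown" rather than guessed at.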

Usually the C. briggsae sequences are interesting because of their relationship to C. elegans orthologs. Such orthologies can be viewed directly via the Synteny Viewer.

2. Click on the Synteny Viewer link in the Sequences section of the Web site directory on the main WormBase page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This gives a search window in which one can ask for a given C. elegans or C. briggsae sequence by its sequence name. A search for either type of sequence gives a syntenic alignment of it with its predicted ortholog (if any) in the other species. For instance, searching for zyg-1 by its sequence name F59E12.2 (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_05.png) in the Synteny Viewer gives an alignment of the zyg-1/bli-2 cluster in both species.

Fig_1_8_14.png

Click on the Shown Alignment link to get the two-species sequence alignment in a new browser window.

3. Alternatively, start with a Genome Browser view of some interesting C. elegans gene, and invoke the "Briggsae alignments" track in the viewer (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_10.png). Then, click directly on the blue "Briggsae alignments" line in the diagram. This will move the viewer from a single-genome view to a syntenic view of the region that was clicked on (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_14.png). This approach is particularly useful if the C. elegans genomic region one wants to study is complicated, and getting the correct homology for a specific site is crucial.

4. The Synteny Viewer itself can be zoomed outward or inward to view large or small regions, just as the Genome Browser can. Moreover, one genome can be visually flipped with respect to another so that its homology can be disentangled and made easy to see. With some care, it is possible to get pleasingly clear diagrams of what would otherwise be murky homologies.

5. Analysis of the C. briggsae genome identified ~12,000 predicted genes as putative orthologs of C. elegans genes (Stein et al., 2003). These orthologs, where they exist, are listed as hypertext links on each C. elegans Gene page, with the choice of going directly to the C. briggsae gene or to its syntenic alignment (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png). On some C. elegans Gene pages, there will also be BLASTP hits given against the C. briggsae genome; clicking on these hits takes one to a C. briggsae protein page. In general, most WormBase functions work for C. briggsae in the same way as for C. elegans.

BASIC PROTOCOL 9: FINDING SEQUENCE SIMILARITIES WITH BLAST

This protocol describes how to BLAST a sequence against the sequences in WormBase. Further discussion of the BLAST algorithm can be found in UNITS 3.3, 3.4, & 3.11.

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org)

1. Click the Blast/Blat link at the top middle left of the WormBase front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). This will lead to http://blast.wormbase.org/db/searches/blat. This is the method of choice for searching the C. elegans genome with a non-worm protein or nucleic acid sequence.

For instance, consider the human gene DYMECLIN (DYM), which when mutated leads to Dyggve-Melchior-Clausen or Smith-McCort dysplasia (Cohn et al., 2003; El Ghouzzi et al., 2003). Suppose one would like to use C. elegans as a model system for dissecting DYM's function. Which, if any, C. elegans genes are significantly similar to DYM? Performing BLASTP on WormPep with the DYM protein sequence yields one strong hit, the gene C47D12.2, so far uncharacterized.

Fig_1_8_15.png

In addition to the plain orthology between DYM and C47D12.2, a weaker but also significant similarity is visible to two protein products of the hid-1 gene, required for normal muscular activity and insulin signaling (Ailion and Thomas, 2003). BLAST searches can thus reveal not only orthologies but also paralogies that may suggest further clues to gene function.

While BLAST is designed to efficiently search genomes for subtle matches to protein-coding sequences, BLAT is aimed at quickly finding strong (95% to 100%) identities of genomic DNA to (≥40-residue) nucleotide sequences (Kent, 2002).
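The rule of thumb above (at least 40 residues, 95% to 100% identity) can be written as a quick check on a pair of already-aligned nucleotide sequences. This Python sketch is an illustration only: it does not run BLAT or reproduce its algorithm, it assumes an ungapped alignment of equal-length strings, and the function names and default thresholds are assumptions taken from the figures quoted in the text.

```python
def percent_identity(query: str, target: str) -> float:
    """Percent of identical positions in two equal-length aligned sequences."""
    if len(query) != len(target):
        raise ValueError("aligned sequences must have equal length")
    if not query:
        return 0.0
    # Compare position by position, ignoring letter case.
    matches = sum(1 for q, t in zip(query.upper(), target.upper()) if q == t)
    return 100.0 * matches / len(query)


def in_blat_range(query: str, target: str,
                  min_length: int = 40, min_identity: float = 95.0) -> bool:
    """True if the pair meets the length and identity figures quoted above."""
    return (len(query) >= min_length
            and percent_identity(query, target) >= min_identity)
```

A pair that fails this check (a short query, or a more diverged match) is the kind of search for which BLAST rather than BLAT is the better starting point.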

2. To examine the exact match between a query sequence and a given BLAST hit, click on its Alignment link in the Details section of the tabulated BLAST outputs (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_15.png). This will yield BLAST results in the standard high-scoring segment pair format (Korf et al., 2003).

3. To examine the location of a BLAST hit in the context of the C. elegans genome, go to its Genome View link instead of its Alignment link (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_15.png). This will give a Genome Browser view with the BLAST hit mapped onto the genome.

BASIC PROTOCOL 10: MINING GENE DATA WITH WORMMART

Web pages for a single Gene, Sequence, or Protein only allow relatively small amounts of data to be analyzed at a time; yet the rise of functional genomics has made it useful and necessary for biologists to handle large, complex information about tens to thousands of genes in their everyday work. To support this, WormBase provides tools for wrangling large data sets. The most recently designed and generally usable of these is WormMart, based on the BioMart search engine used by Ensembl (Kasprzyk et al., 2004). WormMart is a Web interface that allows users to design and run complex database searches without having to use (or even know) complex database query languages.

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org/firefox)

1. Go to the WormMart page by clicking its link on the top left center of the front page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Once on the first WormMart page, choose a version of WormBase (either the most recent release, or a permanently archived release such as WS140). Then choose a Dataset to search, such as genes, expression patterns, research papers, phenotypes, or "variations" (i.e., mutations or RNAi). Click the "next" button, and go to the Filter page.

2. Decide on how to filter the data set chosen.

Fig_1_8_16.png

For instance, the Genes data set can be filtered by species, gene names or classes, the reliability of the gene's structural prediction, whether the gene is protein-coding or not, genomic location and DNA strand orientation, inclusion or exclusion of 5' and 3' flanking sequences, and RNAi phenotype. Other data sets can be filtered in analogous ways. Moreover, WormMart allows more than one data set to be filtered in a single search. Although the Filter page gives many choices, it is easy to undo one's choices and explore a different query by clicking the "back" button. After the data filters are satisfactory, click the "next" button and go to the Output page.

3. On the Output page, decide how to best select and present the filtered data. First, choose an "Attribute". For genes, these attributes can be "Features", "Structures", or "Sequences". Features include short identifiers, genomic locations, functional annotations, available reagents, mutant alleles, or references; Structures include exons, introns, and 5' or 3' untranslated regions; and Sequences are pure nucleotide sequences selected from structural elements or from DNA-based reagents such as cDNAs.

Fig_1_8_17.png

After selecting the details for such an output, choose an output format such as HTML, plain text, Excel table, or compressed file, and click "export".

4. Examine the results.

Fig_1_8_18.png

If they seem promising but not as well-chosen as necessary, go back one or more steps and try different options, and see how the results come out with somewhat different filter and output choices. The best way to get an intuitive sense for this interface is probably to tinker with it at first. With practice, however, it can be a very powerful and flexible tool.
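Results exported in plain-text format (step 3) can also be narrowed further on one's own computer. The Python sketch below assumes the export is tab-delimited with a header row, which may not match the actual WormMart output; the column names and sample rows are invented placeholders, not real WormBase data, and the function name is hypothetical.

```python
import csv
import io


def select_rows(tsv_text: str, column: str, wanted: str) -> list:
    """Return the rows (as dicts) whose value in `column` equals `wanted`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Rows lacking the column simply never match.
    return [row for row in reader if row.get(column) == wanted]


# Placeholder table standing in for an exported result set.
sample = (
    "gene_id\tstatus\n"
    "GENE_A\tconfirmed\n"
    "GENE_B\tpredicted\n"
    "GENE_C\tconfirmed\n"
)
confirmed = select_rows(sample, "status", "confirmed")
```

This kind of local filtering complements, rather than replaces, the Filter page: it is most useful for small follow-up questions that do not justify rerunning the whole query.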

BASIC PROTOCOL 11: DOWNLOADING A BATCH OF SEQUENCES

BASIC PROTOCOL 12: EXAMINING THE GENOMIC CONTENT OF A CLASSICAL GENETIC INTERVAL

ALTERNATE PROTOCOL 1: INSTALLING AND RUNNING WormBase LOCALLY

COMMENTARY

KEY REFERENCES

INTERNET RESOURCES

FIGURE(S)