Searching WormBase for Information About C. elegans (Wiley)


INTRODUCTION

WormBase is the major public biological database for the nematode Caenorhabditis elegans (Chen et al., 2005; PMID: 15608221). It is meant to be useful to any biologist who wants to use C. elegans, whatever his or her specialty. WormBase contains information about the genomic sequence of C. elegans, its genes and their products, and its higher-level traits such as gene expression patterns and neuronal connectivity. Also, WormBase contains sequence and gene structures of C. briggsae and C. remanei, two closely related worms. These data are interconnected, so that a search beginning with one object (such as a gene) can be directed to related objects of a different type (such as the DNA sequence of the gene, or the cells in which the gene is active). One can also do searches for complex data sets.

WormBase is constantly being changed and expanded, both by curation of newly available data and by modifications to the user interface. The entire database is updated and rebuilt into a new release every 3 weeks throughout the year, in releases named "WSnnn" (with "nnn" being 147 as of this writing). To give bioinformaticians reproducible data sets for their analyses, each tenth release is made permanently available as a frozen online archive, roughly once every seven months. Releases WS100 through WS140 have been archived so far. All of the information in this chapter is based on the version of WormBase available in September, 2005 (WS147 release; BioPerl/Generic Model Organism Database software).
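The release arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of WormBase: the function names are invented here, and the sketch assumes that the frozen releases are exactly those whose serial number is a multiple of ten.

```python
import re

RELEASE_INTERVAL_WEEKS = 3  # from the text: a new release every 3 weeks
FREEZE_EVERY = 10           # from the text: each tenth release is frozen


def release_number(name: str) -> int:
    """Parse a release name such as 'WS147' into its serial number."""
    match = re.fullmatch(r"WS(\d+)", name)
    if match is None:
        raise ValueError(f"not a WSnnn release name: {name!r}")
    return int(match.group(1))


def latest_frozen_release(name: str) -> str:
    """Return the most recent frozen release at or before `name`.

    Assumption: frozen releases are those numbered in multiples of ten.
    """
    number = release_number(name)
    return f"WS{number - number % FREEZE_EVERY}"


def weeks_between_freezes() -> int:
    """Weeks separating two consecutive frozen releases."""
    return RELEASE_INTERVAL_WEEKS * FREEZE_EVERY


print(latest_frozen_release("WS147"))  # WS140
print(weeks_between_freezes())         # 30
```

Thirty weeks is a little under seven months, which matches the "roughly once every seven months" figure given above.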

The protocols described in this chapter include the following: general searches of WormBase with single search terms; studying a gene, sequence, or protein with its individual web page, or with the Genome Browser; searching for proteins by BLAST hits, sequence motifs, or Gene Ontology terms; aligning C. elegans with C. briggsae genomic sequences; detailed, user-customized searches with WormMart or AceDB Query Language; batch downloads of many sequences at once; identifying the genomic regions, genetic contents, or molecular clones spanning a defined chromosomal interval; electronic PCR; finding expression patterns, and the cell types or developmental origins from which these patterns arise; and searching for genome-wide RNAi results yielding particular phenotypes. Some advice is also provided for installing and running WormBase on a local computer.


BASIC PROTOCOL 1: NAVIGATING THE WormBase HOME PAGE

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

The home page for the main WormBase site, http://wormbase.org, allows general searches, gives links to specialized searches, and has news about improvements to WormBase.


Fig_1_8_01.png


This page is divided into several parts. At the top is a large, simple menu bar giving quick access to six popular search pages, a Web form for submitting comments or new data, an all-purpose Searches page, and a Site Map. This bar appears at the top of all of the WormBase pages and also includes a link back to the home page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). On the home page immediately below this menu bar are the serial number of the data release currently in use, and the official WormBase logo.

The next section of the main page has fields and menus that are used for a basic search of the full database (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). A user can choose searches from any of over 20 different data types from the "Search for" drop-down menu (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). The check boxes immediately beneath the search fields offer choices to require strict identity for search terms, to give results in XML format, or to search primary research articles in depth with Textpresso (http://www.textpresso.org; Muller et al., 2004).

Below the basic search section are a directory for the entire WormBase site (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png), and news about C. elegans bioinformatics. Near the bottom of the page is a Links section (with connections to several other sites about C. elegans which complement WormBase in some way), and a list of other sites running WormBase itself (the development server, a data-mining server, official mirror sites, and archival data freeze sites for http://wormbase.org). Finally, the bottom of the page gives a link for users to make comments or ask questions to the WormBase staff, and other links to WormBase policies on copyright and privacy; these links are given on every WormBase page.

WormBase also has a development site (http://dev.wormbase.org) that runs the very latest data release and site software, whereas the main site lags by one release. The two sites are closely similar, but the main site emphasizes stability and the development site emphasizes novelty.

The examples of searches given in the following protocols are intended to be illustrative, but not exhaustive; many other searches are possible.


BASIC PROTOCOL 2: PERFORMING A DATABASE SEARCH

This protocol describes how to conduct a general search of WormBase.

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

1. Go to the main page, either by entering its URL http://wormbase.org, or by clicking on the Home link at the top left of any WormBase Web page (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png).

2. Type a word or phrase into the search form (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). For a search of the entire database, select Anything in the "Search for" pull-down menu, then click Search.

3. Once the search has run, examine the results (if any). If a single data record is found, the search routine will sometimes go directly to the Web page for that record (it will automatically do this for Gene pages). However, it will usually instead have 0 or ≥2 results. In the latter case, the search will give one or more summary pages on which the data records found are listed.

Fig_1_8_02.png

4. Look at the page of summarized results, and consider whether there are too few or too many. If there are too few results, try different key words for the search; computers are literal-minded, and a particular search word may be almost but not quite recognized by the database. Conversely, if there are too many results, select a restricted subset of the database for the search from the "Search for" pull-down menu (e.g., "Any Gene"; http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png), or check the "Exact match" box and resubmit the search by clicking the Search button.

5. If the search results look reasonable, then follow the hypertext links to the data records themselves.

6. To get a list of many data records all falling into a specific class, pick a subset of the database and search with a wildcard. For instance, if the "Any Gene" search is selected, and run with unc-* as the search term, WormBase will return links to 117 Locus records (snt-1 through vab-8, with 114 unc-* hits). When possible, WormBase recognizes and returns synonyms in searches (which is why a search for unc-* genes returns three non-unc-* gene names; these three genes have aliases of unc-107, unc-110, and unc-121).

7. Remember that search words can be anything of interest. While there is no guarantee that the database will give hits for any given search string, specific topics of interest can have useful results.

For instance, consider a search for hyperplasia under all categories (i.e., by choosing Anything from the pull-down menu in http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_01.png). Hyperplasia is a topic normally associated with cancer biology or endocrinology (Merke and Bornstein 2005; Simpson et al., 2005), but one might want to see the potential for C. elegans to serve as a model for the control of tissue proliferation. A search with this term reveals nine hits, including six genetic loci that either have hyperplastic phenotypes or are homologs of human disease genes with roles in deregulated proliferation.
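The behavior of wildcard and exact-match searches described in steps 4 and 6 can be mimicked on a small local list. The sketch below is not WormBase code and does not reproduce its search engine; the gene records, including the gene "abc-1" and its alias "unc-900", are invented purely to show why a search for unc-* can return a gene whose primary name does not begin with "unc-".

```python
from fnmatch import fnmatchcase

# Hypothetical records: primary gene name -> list of aliases (synonyms).
# None of these entries is taken from WormBase.
GENES = {
    "unc-1": [],
    "unc-2": [],
    "abc-1": ["unc-900"],  # invented gene with an invented unc-style alias
    "xyz-5": [],
}


def search(pattern: str, exact: bool = False) -> list[str]:
    """Return primary names whose name or any alias matches `pattern`.

    With exact=True the pattern must equal a name or alias literally;
    otherwise shell-style wildcards such as '*' are honored.
    """
    hits = []
    for name, aliases in GENES.items():
        candidates = [name, *aliases]
        if exact:
            matched = pattern in candidates
        else:
            matched = any(fnmatchcase(c, pattern) for c in candidates)
        if matched:
            hits.append(name)
    return sorted(hits)


print(search("unc-*"))              # ['abc-1', 'unc-1', 'unc-2']
print(search("unc-1", exact=True))  # ['unc-1']
```

Here "abc-1" is returned by the wildcard search only because of its alias, in the same way that the three non-unc-* genes appear in the WormBase results described in step 6.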


BASIC PROTOCOL 3: EXAMINING A GENE IN C. ELEGANS

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Firefox (http://www.mozilla.org/firefox)

From the WormBase home page, enter the name of a gene of interest and click Search (leaving the "Search for" menu set to its default, "Any Gene"). A successful search will lead directly to a Gene page for the gene requested.

Fig_1_8_03.png

A Gene page is intended to summarize everything of biological importance known about a given gene; it can thus be quite complex, although it is also meant to be concise (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_03.png and http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png).

Fig_1_8_04.png

The page can describe any or all of the following information, when available: identity of the gene's product; normal function of the gene's product; orthologs of the gene (if any); the gene's meiotic and physical location within the genome; its phenotype in classical mutations or RNAi screens; its spatial and temporal expression pattern, with microarray data; domains of its encoded protein; Gene Ontology (UNIT 7.2) terms describing its function; mutant alleles and strains carrying them; homologs identified by best BLASTP (UNIT 3.4) scores versus other proteomes; cDNA, STS, and antibody reagents; microarray probes; SAGE oligonucleotides; and references from the primary C. elegans research literature. All of these are given with hypertext links allowing the user to find more information on a given datum of interest.

Most of this information is given as tabulated text. However, for cloned genes, there will also be a schematic diagram of the gene's physical organization. For instance, the diagram on the zyg-1 Gene Page reveals that zyg-1 actually has a functional gene, bli-2, nested within itself (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). This diagram provides a graphical link to the Genome Browser (see Basic Protocol 7).

C. elegans is currently known to have ~20,100 protein-coding genes, as well as ~900 genes producing non-protein-coding RNA transcripts. A gene in C. elegans can be studied in several different ways. The original method was genetics, in which a gene was equated with a classical locus at which two or more different alleles had been mapped (Brenner, 1974). Despite all advances in functional genomics, classical genetics remains uniquely powerful in uncovering new aspects of worm biology (Jorgensen and Mango, 2002).

There are, accordingly, 44,606 Gene objects in the WS147 release of WormBase, marking either current or obsolete gene predictions for C. elegans. In the database, they are given uninformative serial numbers such as "WBGene00021622". However, in real use these genes have either names like "xyz-N" (a three-letter lower-case abbreviation, where N represents an Arabic numeral) or names like "cosmid.number" (where "cosmid" is the name of a genomic clone, usually a cosmid, sequenced in the C. elegans genome project (C. elegans Sequencing Consortium, 1998), and "number" denotes an otherwise anonymous gene embedded in the clone).
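The three naming styles just described can be told apart with simple pattern matching. The Python sketch below is an illustration only: the regular expressions are simplified assumptions based on the formats given in the text (for example, eight digits after "WBGene", as in the example above), the labels are invented here, and "ABC12.3" is a made-up clone-style name rather than a real sequence.

```python
import re

# Simplified, assumed patterns; real WormBase names may be more varied.
PATTERNS = [
    ("gene_id", re.compile(r"WBGene\d{8}")),                 # e.g. WBGene00021622
    ("classical_name", re.compile(r"[a-z]{3}-\d+")),         # xyz-N
    ("sequence_name", re.compile(r"[A-Z][A-Z0-9]*\.\d+")),   # cosmid.number
]


def classify(identifier: str) -> str:
    """Return a label for the naming style of `identifier`, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for name in ["WBGene00021622", "zyg-1", "ABC12.3", "hyperplasia"]:
    print(name, "->", classify(name))
```

A check of this kind can help decide whether a search term is better submitted as a gene name, a sequence name, or a free-text query.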

There are fewer classical (three-letter) gene names for C. elegans genes (~6,100) than names for genes identified through genomic sequencing (~21,000). An increasing number of three-letter names are given to genes identified purely through genomic sequencing and analysis. Conversely, a significant number of genes identified through classical mutagenesis have not yet been linked to the genome through cloning. So, the rule of thumb is that, while Gene objects in WormBase will often have both a classical and a sequence name (as zyg-1 does), this is not inevitable; many genes in C. elegans either lack a classical name or have a classical (mutant) name without a known sequence name. At the same time, functional genomics is being applied to all genes in C. elegans through chromosome-wide RNAi screens, microarray analyses, and protein interaction mapping (Ge et al., 2003; Gunsalus et al., 2005). WormBase therefore annotates a good deal of information about "anonymous" genes with only nondescript sequence names.

If a set of Gene records is wanted, do the search for "Any Gene" with a wildcard entry (such as unc-*), or do a basic search (for "Anything"). With a well chosen query, this will produce a summary page having one or more entries in the database with the general format Gene:xyz-N. Examine the entries and click on whichever is most useful.

Each Gene page for a genomic protein-coding sequence (CDS), even if it lacks mutant alleles mapped through classical genetics, has an interpolated genetic map position for that coding sequence, with a link to a list of nearby genes. This map position is formatted and handled in the same way as the genetic map position for a classical genetic locus (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png). In the case of a gene like zyg-1, which already has mutant alleles, the Genetic Position link (http://www.wormbase.org/wiki/index.php/Image:Fig_1_8_04.png) will mainly help design tests for allelism with other mutant loci. However, for a gene identified solely as a CDS, interpolated map information can indicate which classical mutant alleles might reside within the gene.
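To make the idea of an interpolated map position concrete, the following Python sketch estimates a genetic position from a physical position by linear interpolation between flanking markers. This is an assumption made for illustration: the procedure WormBase actually uses is not described in this unit, and the marker coordinates below are invented.

```python
from bisect import bisect_right

# Hypothetical markers on one chromosome, sorted by physical position:
# (physical position in bp, genetic position in cM). Values are invented.
MARKERS = [
    (1_000_000, -5.0),
    (3_000_000, 0.0),
    (4_000_000, 4.0),
]


def interpolate_genetic_position(physical_bp: int) -> float:
    """Linearly interpolate a genetic position between flanking markers."""
    positions = [bp for bp, _ in MARKERS]
    if not positions[0] <= physical_bp <= positions[-1]:
        raise ValueError("position lies outside the mapped markers")
    # Index of the right-hand flanking marker, clamped so that a position
    # exactly on the last marker still has a valid interval.
    right = min(bisect_right(positions, physical_bp), len(MARKERS) - 1)
    (bp_left, cm_left), (bp_right, cm_right) = MARKERS[right - 1], MARKERS[right]
    fraction = (physical_bp - bp_left) / (bp_right - bp_left)
    return cm_left + fraction * (cm_right - cm_left)


print(interpolate_genetic_position(2_000_000))  # -2.5
print(interpolate_genetic_position(3_500_000))  # 2.0
```

An estimate of this kind places a CDS on the same scale as classically mapped loci, which is what makes it possible to ask which mutant alleles might lie nearby.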

BASIC PROTOCOL 4: EXAMINING A MOLECULAR SEQUENCE IN C. ELEGANS

BASIC PROTOCOL 5: FINDING PROTEIN FEATURES

BASIC PROTOCOL 6: SEARCHING FOR GENE PRODUCTS WITH PARTICULAR SEQUENCE MOTIFS

BASIC PROTOCOL 7: USING THE GENOME BROWSER

BASIC PROTOCOL 8: VIEWING THE C. BRIGGSAE GENOME AND ITS SYNTENY WITH C. ELEGANS

BASIC PROTOCOL 9: FINDING SEQUENCE SIMILARITIES WITH BLAST

BASIC PROTOCOL 10: MINING GENE DATA WITH WORMMART

BASIC PROTOCOL 11: DOWNLOADING A BATCH OF SEQUENCES

BASIC PROTOCOL 12: EXAMINING THE GENOMIC CONTENT OF A CLASSICAL GENETIC INTERVAL

ALTERNATE PROTOCOL 1: INSTALLING AND RUNNING WormBase LOCALLY

COMMENTARY

KEY REFERENCES

INTERNET RESOURCES

FIGURE(S)