BASIC PROTOCOL 11: DOWNLOADING A BATCH OF SEQUENCES

In analyses of the C. elegans genome, it can be useful to have a complete set of sequences of one kind, such as the 5′ flanking regions of all known or predicted genes. One way WormBase provides these data is through its Genome Dumper. To download a single segment of the genome, use the Genome Browser instead (see Basic Protocol 7, steps 7-8).

Necessary Resources

Hardware

   A standard computer with a reasonably fast connection to the Internet (cable modem, DSL, or Ethernet recommended)

Software

   Web browser such as Internet Explorer (http://www.microsoft.com) or Mozilla (http://www.mozilla.org)

1. Near the top center of the WormBase front page, click on the Batch Sequences link; this leads to http://wormbase.org/db/searches/advanced/dumper.

2. The Genome Dumper can extract data from either the C. elegans or the C. briggsae genome. Pick the species to be scanned from the Species menu.

3. In the Input Options box, enter a list of gene, sequence, or chromosome names, either by cutting and pasting into the large box or by uploading a text file of space-separated names. Note that the Genome Dumper automatically translates classical gene names (such as zyg-1) into their genomic sequence equivalents (such as F59E12.2), so converting classical gene lists into molecular outputs is easy. This is of particular use if WormMart or the Batch Genes server has first been used to produce the full list of classical gene names sharing some characteristic of interest; this list can then be fed directly to the Batch Sequences service.

Also note that, because a single gene may have two or more alternatively spliced transcripts, entering one classical gene name may produce two or more sequence outputs.
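A list of names produced elsewhere often needs tidying before upload. The following Python sketch writes a de-duplicated, space-separated name file of the kind described in step 3. The function name and the output file name gene_names.txt are hypothetical, and the sketch assumes the names are already available as a Python list.

```python
from pathlib import Path


def write_name_list(names, path):
    """Write unique names, in first-seen order, as one space-separated line.

    Returns the list of names actually written.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = name.strip()
        # Skip blanks and names that were already seen.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    Path(path).write_text(" ".join(unique) + "\n", encoding="ascii")
    return unique


if __name__ == "__main__":
    # Hypothetical input; classical and sequence names can be mixed.
    write_name_list(["zyg-1", "F59E12.2", "zyg-1"], "gene_names.txt")
```

The resulting file can then be uploaded through the Input Options box.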

4. Pick a choice from the "Select one feature to retrieve" pull-down menu. The default, Gene Models, will base a search on the structures of known or predicted worm genes.

5. In "Output options," choose the settings that suit the analysis. The Dumper provides sequences in FASTA format. Unless both a flanking sequence and the gene sequence linked to it are wanted, select the radio button for "flanking sequences only." For many purposes, the default of plain text is preferable to HTML. Saving to disk is an option, and is probably preferable for large outputs (such as chromosome- or genome-wide ones).

6. Click the DUMP button. The output will appear in a window of the Web browser being used. If the download is large (e.g., flanking sequences for all C. elegans genes), it will take some time (up to 1 hr). It is important not to interrupt the download by quitting the browser prematurely or by using the browser's Stop button; doing so yields a partial download, useful only as an example of the output format. Again, for large downloads, consider saving the output to disk as a file to avoid this problem.

7. Check the format of the resulting text. It may be necessary to reformat the headers of the sequences; this can be done with Perl, or with the text editor or word processor of one's choice.
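As an alternative to Perl, header reformatting can be sketched in Python. The example below assumes, purely for illustration, that the desired header is the first whitespace-delimited token after the ">" character; the actual layout of Genome Dumper headers should be checked in the downloaded file before adapting it. The function and file names are hypothetical.

```python
def simplify_headers(lines):
    """Yield FASTA lines with each header cut down to its first token.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # An empty header stays empty rather than raising an error.
            yield ">" + (fields[0] if fields else "") + "\n"
        else:
            yield line


def simplify_fasta_file(in_path, out_path):
    """Stream a FASTA file to a new file with simplified headers."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(simplify_headers(src))


if __name__ == "__main__":
    # Hypothetical record, not real Genome Dumper output.
    example = [">F59E12.2 some other fields\n", "ACGTACGT\n"]
    print("".join(simplify_headers(example)), end="")
```

Because the file is processed line by line, this approach also works for chromosome- or genome-wide outputs that are too large to edit comfortably in a word processor.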

One example of a Genome Dumper search would be to run it with the following settings:

Species: "C. elegans"

Input Options: "I II III IV V X" [i.e., all nuclear chromosomes]

Select one feature to retrieve: "Gene models"

Output options: "flanking sequences only"

                  show 0 bp, 5′ flank
                  show 1000 bp, 3′ flank
                  coordinates relative to chromosome
                  sequence orientation relative to feature
                  Save to disk (plain TEXT).

A complete download, from the WS147 release of WormBase, gives 28,145 sequences from the 3′ flanks of transcription units. A general use for such data sets is to acquire large sets of possible regulatory sequences (promoters or 3′ UTRs) for computational analysis of cis-regulatory motifs through Gibbs sampling or hidden Markov models (e.g., GuhaThakurta et al., 2004). Prediction of such motifs can, in turn, be used to formulate experimentally testable hypotheses for transcriptional or translational gene regulation.
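Because an interrupted download yields only a partial file, it can be worth counting the records in a saved file before analysis. The Python sketch below tallies FASTA records and their sequence lengths; the file name flanks.fasta is hypothetical, and the expected values (28,145 records of 1000 bp for the WS147 example above) will differ for other releases and settings.

```python
def fasta_lengths(lines):
    """Return a list of (header, sequence_length) pairs for FASTA lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Start a new record with a running length of zero.
            records.append([line[1:], 0])
        elif records:
            records[-1][1] += len(line)
    return [(header, length) for header, length in records]


if __name__ == "__main__":
    import sys

    # Path to the saved dump; "flanks.fasta" is a hypothetical default.
    path = sys.argv[1] if len(sys.argv) > 1 else "flanks.fasta"
    with open(path) as handle:
        records = fasta_lengths(handle)
    print("records:", len(records))
    # Records that differ from the requested flank length merit a closer look.
    odd = [header for header, length in records if length != 1000]
    print("records not 1000 bp long:", len(odd))
```

A record count well below the expected number suggests that the download was cut short and should be repeated.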

8. The Genome Dumper can do recursive searches. To use this feature, select the "paste features back for further searches" radio button in the "Output options" menu before carrying out a genome dump.