UserGuide:Blast Search

OVERVIEW:

I_blast1.jpg

BLAST is a method for searching either protein or DNA databases with either a protein or a DNA query sequence. Positive matches in the database to the query sequence are shown with an alignment of the positive to the query, and with a "P value" that represents the probability (given the numerical score of the alignment) that such a query-positive match would occur purely by chance. In other words, small P values are good (very unlikely to be random, i.e., very likely to be meaningful). P values become unreliable above ~0.01, but it is worth noting that a hit which is statistically insignificant in isolation can nevertheless be a real hint of a subtle connection between two highly diverged, yet evolutionarily related sequences.
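The link between "expected by chance" and the P value can be made concrete with a short calculation. The sketch below assumes the standard Karlin-Altschul relationship P = 1 - exp(-E), where E is the expectation value (the quantity that the Expected Threshold setting described later works with). The function name is illustrative only and is not part of WormBase or of any BLAST distribution.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Convert an expectation value E into a P value.

    Assumes the Karlin-Altschul relationship P = 1 - exp(-E):
    the probability of seeing at least one chance hit that scores
    this well when E such hits are expected on average.
    """
    if e_value < 0:
        raise ValueError("E values cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for tiny E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1e-50, 0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.6g}")
```

For small values the two numbers are practically identical, which is why they are often used interchangeably; they only diverge noticeably as E approaches 1, where P levels off below 1.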

Full details of how to effectively search databases with BLAST cannot be given here. A short, by no means complete, list of references follows:

Jambeck, P. and Gibas, C. (2001). Developing Bioinformatics Computer Skills. O'Reilly, Sebastopol, CA. ISBN: 1-56592-664-1.

http://www.oreilly.com/catalog/bioskills

Mount, D.W. (2001). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. ISBN: 0-87969-608-7.

http://www.bioinformaticsonline.org/

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd ed. A.D. Baxevanis and B.F.F. Ouellette, eds. Wiley-Interscience, New York. ISBN: 0-471-38391-0.

http://www.wiley.com/Corporate/Website/Objects/Products/0,9049,39021,00.html

Also, information about BLAST is available from the NCBI in several forms:

   * an overview -- http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html
   * frequently asked questions -- http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html
   * a tutorial -- http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
   * a formal course -- http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

BLAT, which is short for "BLAST-like alignment tool", is designed to quickly find sequences of 95% or greater similarity that are 40 bases or longer. It may miss more divergent or shorter sequence alignments. It was written by James Kent (e-mail: kent@biology.ucsc.edu).

BLAT is similar in many ways to BLAST. The program rapidly scans for relatively short matches (hits), and extends these into high-scoring pairs (HSPs). However, BLAT differs from BLAST in some significant ways. For instance, where BLAST returns each area of homology between two sequences as a separate alignment, BLAT stitches them together into a larger alignment. BLAT also has special code to handle introns in RNA/DNA alignments. Therefore, whereas BLAST delivers a list of exons sorted by exon size, with alignments extending slightly beyond the edge of each exon, BLAT effectively "unsplices" mRNA onto the genome, giving a single alignment that uses each base of the mRNA only once and that correctly positions splice sites.

More information about BLAT can be found in: Kent, W.J. (2002). BLAT -- the BLAST-like alignment tool. Genome Res. 12(4):656-64. http://www.genome.org/cgi/content/full/12/4/656

HOW TO USE WORMBASE BLAST SEARCH

The WormBase site provides BLAST/BLAT searches that are specifically directed to C. elegans DNA or protein sequences; positive hits are then linked by hypertext to their information in WormBase. The query (search) sequence can be either DNA or protein, and can come either from WormBase itself or from the user. Sequences are automatically provided from WormBase when a BLAST search is specifically requested for a given sequence by clicking on "BLAST against WormPep/Elegans genome".

The following is an example of the "Blast" button on Sequence Report pages:

I_blast2.jpg

1. Sequence Input:

On the Blast Search page itself, query sequences must either be completely plain (nothing but one-letter residues) or in the FASTA format. A FASTA format sequence starts with one header line, followed by the residues, in the format:

>Sequence_name
[plain residues from here on]
[end of text]

Keep in mind that only nucleotide sequences can be used as queries for BLAT searches.
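The two accepted input forms can be illustrated with a small parser. This is only a sketch of the rules stated above, not the code WormBase runs: the function names are hypothetical, and the nucleotide check is a simple heuristic (only the letters A, C, G, T, U and N) that is assumed here for illustration.

```python
from typing import Tuple

NUCLEOTIDE_LETTERS = set("ACGTUN")  # assumed alphabet for the heuristic


def parse_query(text: str) -> Tuple[str, str]:
    """Return (name, residues) from plain or single-record FASTA text."""
    lines = [line.strip() for line in text.strip().splitlines()]
    name = ""
    if lines and lines[0].startswith(">"):
        # FASTA: the first word after '>' is taken as the sequence name.
        header_words = lines[0][1:].split()
        name = header_words[0] if header_words else ""
        lines = lines[1:]
    # Join the remaining lines and drop any internal whitespace.
    residues = "".join("".join(line.split()) for line in lines).upper()
    if not residues.isalpha():
        raise ValueError("query must contain only one-letter residues")
    return name, residues


def looks_like_nucleotide(residues: str) -> bool:
    """Heuristic check before sending a query to a BLAT search."""
    return bool(residues) and set(residues) <= NUCLEOTIDE_LETTERS


if __name__ == "__main__":
    name, seq = parse_query(">my_query example\nATGGCC\nTAA\n")
    print(name, seq, looks_like_nucleotide(seq))
```

A query that fails the nucleotide check would need one of the protein-capable BLAST programs rather than BLAT.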

The following is an example of a BLAST input.

I_blast3.jpg

2. Expected Threshold:

This is very important but easily forgotten by users: the number you enter for Expected Threshold controls which hits are displayed in the results. As noted above, BLAST provides a quantitative estimate of how likely a given hit is to have arisen purely by chance. These estimates are reported as E values (the expected number of chance hits scoring at least as well; for small values, nearly identical to the P value described in the overview), and the smaller they are, the better from the standpoint of somebody looking for a new, nontrivial sequence similarity. The WormBase BLAST interface shows the user only the subset of BLAST hits that have E values less than the given threshold value. Note that there can be many hits *not* shown under the default threshold that nevertheless have significant E values (e.g., E=0.002) or weak but possibly interesting E values (e.g., E=0.1). Often, when one does a BLAST search and 'fails' to see any hits, it is worth raising the threshold to check whether there are in fact hits whose E values were greater than 0.001.
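The effect of the threshold can be sketched as a simple filter. The hit names and E values below are invented for illustration, and the `Hit` type and `split_by_threshold` function are hypothetical; the only rule taken from the text is that a hit is shown when its E value is less than the threshold.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    subject: str    # name of the matched database sequence (made up here)
    e_value: float  # expectation value reported for the hit


def split_by_threshold(
    hits: List[Hit], threshold: float
) -> Tuple[List[Hit], List[Hit]]:
    """Return (shown, hidden): hits below the threshold and the rest."""
    shown = [hit for hit in hits if hit.e_value < threshold]
    hidden = [hit for hit in hits if hit.e_value >= threshold]
    return shown, hidden


if __name__ == "__main__":
    hits = [Hit("hit_A", 1e-50), Hit("hit_B", 0.002), Hit("hit_C", 0.1)]
    for threshold in (0.001, 1.0):
        shown, hidden = split_by_threshold(hits, threshold)
        print(threshold, [h.subject for h in shown], [h.subject for h in hidden])
```

With a threshold of 0.001 only the first hit is displayed; the E=0.002 and E=0.1 hits exist but stay hidden until the threshold is raised.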

3. Program Type:

The name "BLAST" is generically used to describe several different varieties of search programs that vary depending on the type of query (search) sequence and the type of database being searched:

   * blastP -- protein query sequence, protein database (e.g.: a given protein versus WormPep)
   * blastN -- DNA query sequence, DNA database (e.g.: a given DNA sequence versus worm genomic DNA, or versus worm ESTs)
   * blastX -- DNA query sequence, protein database (e.g.: a given DNA sequence versus WormPep)
   * TblastN -- protein query sequence, DNA database (e.g.: a given protein versus worm genomic DNA, or versus worm ESTs)

As mentioned before, BLAT is used to search worm genomic DNA or EST databases solely with DNA query sequences.
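The choice of program follows directly from the type of the query and the type of the database, which can be written as a lookup table. The sketch below only restates the list above; the `choose_program` function and the "dna"/"protein" labels are hypothetical and do not correspond to any WormBase interface.

```python
from typing import Dict, Tuple

# (query type, database type) -> BLAST program, as listed above.
PROGRAMS: Dict[Tuple[str, str], str] = {
    ("protein", "protein"): "blastP",
    ("dna", "dna"): "blastN",
    ("dna", "protein"): "blastX",
    ("protein", "dna"): "TblastN",
}


def choose_program(query_type: str, database_type: str) -> str:
    """Pick the BLAST program for a query/database type combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return PROGRAMS[key]


if __name__ == "__main__":
    print(choose_program("dna", "protein"))  # DNA query against a protein database
```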

Where proteins and DNA are compared, six-frame translations of the DNA sequence are compared to protein.
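A six-frame translation reads the DNA in three reading frames on the forward strand and three on the reverse complement. The following is a minimal sketch of that idea, assuming the standard genetic code and marking stop codons with "*"; it illustrates the concept and is not the translation code used inside BLAST. The function names and frame labels ("+1" to "-3") are chosen for illustration.

```python
from typing import Dict

BASES = "TCAG"
# Standard genetic code, in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> Dict[str, str]:
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six conceptual proteins is then compared against the protein sequence, so a match can be found regardless of strand or reading frame.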

4. Database Type:

BLAST on WormBase can search three built-in databases, which complement the BLAST/BLAT program types described above:

   * WormPep -- the most recent compilation of predicted, conceptual protein sequences, derived from the most recent set of gene models.
   * elegans ESTs -- the current set of expressed sequence tag sequences available in WormBase; these are predominantly, though not entirely, from the laboratory of Yuji Kohara.
   * elegans genomic -- the most recent version of the C. elegans genomic DNA sequence.

5. Show Maximum Hits:

In typical BLAST/BLAT searches, only a small number of hits will be of interest -- for example, if one is searching for the single C. elegans ortholog of a given gene of interest. On the other hand, it is also possible to have many hits that are both significant and important (e.g., if one is examining a diverse protein family). On the WormBase BLAST/BLAT server, the default maximum number of hits shown is 20, but this can be set to smaller values, greater values, or to show all hits.
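The maximum-hits setting can be pictured as truncating a ranked list. The sketch below assumes hits are ranked from the smallest E value upward, which is an assumption about presentation rather than something stated above; the `Hit` type, the `top_hits` function and the generated hit data are hypothetical. Passing `None` stands in for the "show all hits" choice.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    subject: str    # name of the matched database sequence (made up here)
    e_value: float  # expectation value reported for the hit


def top_hits(hits: List[Hit], max_hits: Optional[int] = 20) -> List[Hit]:
    """Return at most max_hits hits, best (smallest E value) first."""
    ranked = sorted(hits, key=lambda hit: hit.e_value)
    return ranked if max_hits is None else ranked[:max_hits]


if __name__ == "__main__":
    # 25 invented hits with E values from 1 down to 1e-24.
    hits = [Hit(f"hit{i}", 10.0 ** -i) for i in range(25)]
    print(len(top_hits(hits)))        # default limit of 20
    print(len(top_hits(hits, None)))  # all hits
    print(top_hits(hits, 1))          # single best hit
```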

WHERE TO GET THE SOFTWARE:

There are two somewhat different versions of the BLAST search software. A publicly available, and reasonably compilable, open-source version is available from the NCBI.

ftp://ftp.ncbi.nlm.nih.gov/blast

The version used in WormBase, however, is a closed-source version provided by Washington University under a free academic license.

http://blast.wustl.edu

The latter version has been chosen for use in WormBase because it is thought to be superior to NCBI BLAST.