previous section previous page next page next section
CMB

Online Lectures on Bioinformatics

navigation


Database searching


BLAST

The other widely used program to search a database is called BLAST (Basic Local Alignment Search Tool) [3][4]. Blast follows a similar scheme in that it relies on a core similarity, although with less emphasis on the occurrence of exact matches. This program also aims at identifying core similarities for later extension. The core similarity is defined by a window with a certain match density on DNA or with an amino acid similarity score above some threshold for proteins. Independent of the exact definition of the core similarity, BLAST rests on the precomputation of all strings which are in the given sense similar to any position in the query. The resulting list may be on the order of thousand or more words long, each of which if detected in a database give rise to a core similarity. In Blast nomenclature this set of strings is called the neighborhood of the query. The code to generate this neighborhood is in fact exceedingly fast.

Given the neighborhood, a so-called finite automaton is used to detect occurrences in the database of any string from the neighborhood. This automaton is a program, constructed on the fly and specifically for the particular word neighborhood that has been computed for a query. Upon reading through a database of sequences, the automaton is given an additional letter at a time and decides whether the string that ends in this letter is part of the neighborhood. If so, BLAST attempts to extend the similarity around the neighborhood and if this is successful reports a match.







Like with the FASTA, BLAST has also been adapted to connect good diagonals and report local alignments with gaps. BLAST additionally converts the database file into its own format to allow for faster reading. This makes it somewhat unwieldy to use in a local installation unless someone takes care of the installation. FASTA, on the other hand, is slower but easier to use. There exist excellent web servers that offer these programs, in particular at the NCBI, where BLAST can be used on up-to-date DNA and protein databases.

Progress in computational speed using either specially designed or massively parallel hardware have led to the availability of extremely fast versions of the Smith-Waterman algorithm. The EBI, among other institutions, is offering a service where this algorithm is executed on a massively parallel computer resulting in search times of a few seconds. Companies like Compugen or Paracel have developed special hardware to do this job. With the availability of EST sequences it has become very important to match DNA sequence with protein sequence in such a way that a possible translation is maintained throughout the alignment. Both the FASTA and the BLAST package contain programs for this and related task. When coding DNA is compared to proteins, gaps are inserted in such a way as to maintain a reading frame. Likewise, a protein sequence can be searched versus a DNA sequence database, and DNA can be searched versus DNA, too.


Comments are very welcome.
luz@molgen.mpg.de