Online Lectures on Bioinformatics
|
Database searchingIntroductionThis section takes a first look at the problem of identifying those sequences in a sequence database that are similar to a given sequence. This task arises, e.g., when a gene has been newly sequenced and one wants to determine whether a related sequence already exists in a database. Generally, two settings can be distinguished. The starting point for the search may either be a single sequence with the goal of identifying its relatives, or a family of sequences with the goal of identifying further members of that family. Searching a data base needs to be fast and sensitive but the two objectives counteract each other. Fast methods have been developed primarily for searching with a single sequence and this shall be the topic of this section. When searching a database with a newly determined DNA or amino acid sequence - the so-called query sequence - the user will typically lack knowledge of whether an expected similarity might span the entire query or just part of it. Likewise, he will be ignorant of whether the match will extend along the full length of some database sequence or only part of it. Therefore, one needs to look for a local alignment between the query and any sequence in the database. This immediately suggests the application of the Smith-Waterman algorithm to each database sequence. One should take care, though, to apply a fairly stringent gap penalty such that the algorithm focuses on the regions that really match. After sorting the resulting scores the top scoring database sequences are the candidates one is interested in. Several implementations of this procedure are available, most prominently the SSEARCH program from the FASTA package [1]. There exist versions of this program that are tuned for speed like the one due to Phil Green, one that runs especially fast on SUN computers [2], and one by Geoff Barton. Depending on implementation, computer and database size, a search with such program will take on the order of several minutes. The motivation behind the development of other database search programs has been to emulate the Smith-Waterman algorithm's ability to discern related sequences as closely as possible while at the same time performing the job in much less time. To this end, one usually makes the assumption that any good alignment as one wishes to identify, contains, in particular, some stretch of ungapped similarity. Furthermore this stretch will tend to contain a certain number of identically matching residues and not only conservative replacements. Based on these assumptions most heuristic programs rely on identifying a well-matching core and then extending it or combining several of these. With hindsight, the different developments in this area can further be classified according to a traditional distinction in computer science according to which one either preprocesses the query or the text (i.e., the database). Preprocessing means that the string is represented in a different form, that allows for faster answer to particular questions like, e.g., whether it contains a certain subword. Comments are very welcome. luz@molgen.mpg.de |