previous section previous page next page next section
CMB

Online Lectures on Bioinformatics

navigation


Database searching


Multiple alignments and database searching

Information about which residues are conserved and thus important for a particular family are crucial not only for the purpose of multiply aligning a set of sequences. Also in the context of identifying related sequences in a database this information is very valuable. Thus, a multitude of methods has been developed that aim at identifying sequences in a database which are related to a given family.

Historically, the first such method had introduced the profiles described above in the context of multiple sequence alignment. Like in this application, profiles help in emphasizing conserved regions in a database search. Thus, a sequence that matches the query profile in a conserved region will receive a higher score than a database sequence matching only in a divergent part of an alignment. This feature is of enormous help in distinguishing truly related sequences.

Algorithmically, profile searching simply uses the dynamic programming alignment algorithm for aligning a sequence to a profile on each sequence in the database. Of course, this is computationally quite demanding and much slower than the heuristic database search algorithms like BLAST or FASTA. Typically, the multiple alignment underlying the profile will describe a conserved domain which one expects to find within a database sequence. Therefore, in this context it is important that end gaps should not be penalized. Gap penalties for profile matching frequently vary along the profile in order to reflect the existence of gaps within the underlying multiple alignment. Through this mechanism one attempts to preferentially include new gaps in regions where gaps have been observed already. However, different suggestions exist as to the choice and derivation method for these gap penalties (Bucher, Taylor).

In 1993 Hidden Markov Models were introduced for the purpose of identifying family members in a database [5] [6]. Hidden Markov Models are a class of mathematical models well-suited for describing the relevant parameters in matching a given multiple alignment against a database. For HMMs there exist automatic learning algorithms that adapt the parameters of the HMM for best identification of family members. Thus, they offer in particular a solution to the question of gap penalty settings along a profile. Sequences are matched to HMMs in much the same way as they are aligned to a profile although the interpretation of the procedure is different. The HMM is thought of as producing sequences by going through different states. Aligning a sequence to an HMM amounts to delineating the series of states that is most likely to have produced the sequence.

exercises
exercises


Based on this interpretation one can make an interesting distinction between the optimal alignment of a sequence with a Hidden Markov Model and the computation of the probability that a model has produced a particular sequence. The optimal alignment is computed with an algorithm exactly analogous to the dynamic programming algorithm and maximizing the probabilty that a series of states as given rise to a particular sequence. In contrast hereto, in absence of knowledge of the correct path of states, the probability that a model has given rise to a particular sequence should rather be computed as the sum over the different sets of states that could have produced the sequence. This interpretation leads to a summation over all paths instead of the choice of the best one. Practically, there is little known about the difference in performance between the two approaches.


Comments are very welcome.
luz@molgen.mpg.de