Online Lectures on Bioinformatics
|
Database searchingMultiple alignments and database searchingInformation about which residues are conserved and thus important for a particular family are crucial not only for the purpose of multiply aligning a set of sequences. Also in the context of identifying related sequences in a database this information is very valuable. Thus, a multitude of methods has been developed that aim at identifying sequences in a database which are related to a given family. Historically, the first such method had introduced the profiles described above in the context of multiple sequence alignment. Like in this application, profiles help in emphasizing conserved regions in a database search. Thus, a sequence that matches the query profile in a conserved region will receive a higher score than a database sequence matching only in a divergent part of an alignment. This feature is of enormous help in distinguishing truly related sequences. Algorithmically, profile searching simply uses the dynamic programming alignment algorithm for aligning a sequence to a profile on each sequence in the database. Of course, this is computationally quite demanding and much slower than the heuristic database search algorithms like BLAST or FASTA. Typically, the multiple alignment underlying the profile will describe a conserved domain which one expects to find within a database sequence. Therefore, in this context it is important that end gaps should not be penalized. Gap penalties for profile matching frequently vary along the profile in order to reflect the existence of gaps within the underlying multiple alignment. Through this mechanism one attempts to preferentially include new gaps in regions where gaps have been observed already. However, different suggestions exist as to the choice and derivation method for these gap penalties (Bucher, Taylor).
In 1993 Hidden Markov Models were introduced
for the purpose of identifying family members in a database [5] [6].
Hidden Markov Models are a class of mathematical models well-suited
for describing the relevant parameters in matching a given multiple
alignment against a database. For HMMs there exist automatic learning
algorithms that adapt the parameters of the HMM for best identification
of family members. Thus, they offer in particular a solution to
the question of gap penalty settings along a profile.
Sequences are matched to HMMs in much the same way
as they are aligned to a profile although the interpretation
of the procedure is different.
The HMM is thought of as producing sequences by going through different states.
Aligning a sequence to an HMM amounts to delineating the series of states
that is most likely to have produced the sequence.
![]() exercises
Based on this interpretation one can make an interesting
distinction between the optimal alignment of a sequence with
a Hidden Markov Model and the computation of the probability
that a model has produced a particular sequence.
The optimal alignment is computed with an algorithm exactly analogous
to the dynamic programming algorithm and maximizing the probabilty
that a series of states as given rise to a particular sequence.
In contrast hereto, in absence of knowledge of the correct path of states,
the probability that a model has given rise to a particular
sequence should rather be computed as the sum over the different
sets of states that could have produced the sequence.
This interpretation leads to a summation over all paths
instead of the choice of the best one. Practically,
there is little known about the difference in performance between
the two approaches.
Comments are very welcome. luz@molgen.mpg.de |