CMB

Online Lectures on Bioinformatics



Alignment statistics



Introduction

An alignment score is the result of an optimization, usually a maximization. As such it tends to be a large number, which can suggest biological relatedness where there is none. In a pairwise comparison the user can still inspect the alignment by eye to reach a conclusion, but when, for example, an entire database is searched, automatic methods are needed to attach a statistical significance to an alignment score.

In the early days of sequence alignment, the statistical significance of the score of a given pairwise alignment was assessed as follows. The letters of the sequences are permuted randomly and a new alignment score is calculated. This procedure is repeated roughly 100 times, and the mean and standard deviation of the resulting sample are calculated. The significance of the given alignment score is then reported as the number of standard deviations above the mean, also called the Z-value. Studying large numbers of random alignments is correct in principle. However, the significance should then be reported as the fraction of random alignments that score at least as high as the given alignment. Reporting a Z-value instead implicitly assumes that the scores are normally distributed. Since the random variable under study, the score of an optimal alignment, is the maximum over a large number of values, this is not a reasonable assumption. In fact, when one tries to fit a normal distribution to the data, the lack of fit quickly becomes obvious. The second argument against this way of calculating significance is a pragmatic one: the procedure has to be repeated for every alignment under study because the effect of sequence length cannot be accounted for.
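The shuffling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring scheme (match +1, mismatch -1, linear gap penalty -2), the use of a local (Smith-Waterman style) alignment score, and the function names local_alignment_score and shuffle_significance are all choices made for this example and are not taken from the lecture. The sketch returns both the Z-value and the empirical fraction of shuffled alignments that score at least as high as the observed one.

```python
import random
import statistics


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (score only, no traceback).

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_significance(a, b, n_shuffles=100, seed=0):
    """Return (observed score, Z-value, empirical fraction).

    Both sequences are permuted n_shuffles times; the empirical fraction
    is the share of shuffled pairs scoring at least as high as the
    observed alignment.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    la, lb = list(a), list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(la)
        rng.shuffle(lb)
        null_scores.append(local_alignment_score("".join(la), "".join(lb)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    # The Z-value is undefined if all shuffled scores are identical.
    z = (observed - mean) / sd if sd > 0 else float("nan")
    fraction = sum(s >= observed for s in null_scores) / n_shuffles
    return observed, z, fraction


if __name__ == "__main__":
    s1 = "ACGTTGCAAGCTTAGCCGATAGCTAGGCTA"
    s2 = "ACGTTGCTAGCTTAGCGGATAGCTAGCCTA"
    print(shuffle_significance(s1, s2))
```

With only about 100 shuffles the empirical fraction cannot resolve values below 0.01, and the whole computation has to be rerun for each new pair of sequences, which illustrates the pragmatic objection raised above.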

A closer study of the problem reveals that the score actually follows a different distribution. We proceed to show how to derive a formula for the statistical significance of alignment scores under a given parameter setting. The lengths of the sequences then enter the formula as parameters. The result of such a computation is the probability that a score at least this high arises by chance, where "chance" means an alignment of two randomly generated and thus unrelated sequences.


Comments are very welcome.
luz@molgen.mpg.de