CMB

Online Lectures on Bioinformatics



Physical Mapping and Sequence Assembly


Sequence Assembly

Current technology permits experimentalists to directly determine the sequence of a DNA strand of approximately 500 nucleotides in length. To sequence a longer piece of DNA, many such reads are taken and subsequently reassembled to reproduce the original sequence. Computationally, this gives rise to the problem of assembling the fragments using the overlap information among them. Overlaps are deduced from sequence similarity. These overlaps cannot be determined perfectly, for two reasons: 1-10% of the nucleotides in the fragment data are missing or incorrect, and a fragment may be read from either strand, so its sequence can appear reversed (and complemented) with respect to the others.
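To make the overlap problem concrete, the following is a minimal Python sketch of scoring a suffix-prefix overlap between two fragments in both orientations. The function names, the `min_len` threshold and the `max_error_rate` parameter are illustrative assumptions, not part of the lecture. The sketch tolerates substitutions only; real read data also contain insertions and deletions, which would require an alignment-based comparison instead.

```python
def reverse_complement(seq):
    """Return the sequence as it would be read from the opposite strand."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def best_overlap(a, b, min_len=3, max_error_rate=0.1):
    """Longest suffix of a that matches a prefix of b within the mismatch budget.

    Returns (overlap_length, mismatches), or (0, 0) if no overlap of at
    least min_len positions is acceptable. Substitutions only (assumption).
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_error_rate * length:
            return length, mismatches
    return 0, 0


def oriented_overlap(a, b, **kwargs):
    """Try b as given and reverse-complemented; keep the longer overlap."""
    forward = best_overlap(a, b, **kwargs)
    reverse = best_overlap(a, reverse_complement(b), **kwargs)
    if forward[0] >= reverse[0]:
        return ("forward",) + forward
    return ("reverse",) + reverse


print(oriented_overlap("ACGTTGCA", "TGCAGGTC"))
print(oriented_overlap("ACGTTGCA", reverse_complement("TGCAGGTC")))
```

In the second call the fragment comes from the opposite strand, so an overlap is found only after it has been reverse-complemented.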

The standard approach to sequence assembly has three steps. (1) All pairwise overlaps between fragments are determined. (2) The fragments are laid out in approximate positions, with an orientation chosen for each fragment, so that the overlaps can be used to determine the sequence. (3) The fragments are multiply aligned according to this layout to infer the sequence.
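The three steps can be illustrated with a deliberately simplified Python sketch. It uses a greedy overlap-merge heuristic, which is one common simplification and not necessarily the method this lecture has in mind. It assumes error-free reads that all come from the same strand, so the orientation choice is skipped and the multiple alignment of step (3) collapses into plain string merging. The names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest exact suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Step 1: determine all pairwise overlaps and remember the best one.
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(frags, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # the remaining fragments cannot be joined
        # Steps 2 and 3: place b after a and merge them into one contig.
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_len:])
    return frags


print(greedy_assemble(["TACCTGA", "ATGGCGT", "GCGTACC"]))
```

The function returns a list of contigs; more than one contig remains when some fragments share no overlap of at least `min_len` characters. A greedy choice is not guaranteed to recover the true layout, in particular when the sequence contains repeats.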

On the biological side, the phenomenon that is most difficult to account for is the presence of internal repeats. Repeated sequences occur naturally and are especially common in the human genome. They can be simple repeats (several recently identified genetic diseases are caused simply by variation in the length of such repeats) or longer, highly similar repeats of about 300 nucleotides (such repeats make up nearly 15% of the human genome).
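Why repeats are so troublesome can be seen in a small constructed example, sketched here in Python. The sequences and names are invented for illustration and are not biological data. Two different sequences that contain the same repeat three times, with the unique segments between the copies swapped, yield exactly the same collection of error-free reads whenever the repeat is at least one character shorter than the read or longer, so no assembler could tell them apart from those reads alone.

```python
from collections import Counter


def read_multiset(genome, read_len):
    """All substrings of length read_len, counted with multiplicity."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


# Hypothetical toy data: one 10-character repeat and four unique segments.
repeat = "ACGTACGGTC"
left, mid1, mid2, right = "TTTA", "CCCAT", "GGAGA", "AATT"

# The two sequences differ only in the order of the segments between repeats.
genome_1 = left + repeat + mid1 + repeat + mid2 + repeat + right
genome_2 = left + repeat + mid2 + repeat + mid1 + repeat + right

read_len = 8  # shorter than the repeat, so no read spans a whole copy
same_reads = read_multiset(genome_1, read_len) == read_multiset(genome_2, read_len)
print(genome_1 != genome_2, same_reads)
```

In this toy example the ambiguity disappears once reads are long enough to span a complete repeat copy together with flanking characters on both sides, for instance with `read_len = 12`.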

exercises



Comments are very welcome.
luz@molgen.mpg.de