Online Lectures on Bioinformatics
|
Database searchingFASTAThe FASTA program [1] sets a size k for k-tuple subwords. The program then looks for diagonals in the comparison matrix between query and search sequence along which many k-tuples match. This can be done very quickly based on a preprocessed list of k-tuples contained in the query sequence. The set of k-tuples can be identified with an array whose length corresponds to the number of possible tuples of size k. This array is linked to the indices where the particular k-tuples occur in the query sequence. Note that a matching k-tuple at index i in the query and at index j in the database sequence can be attributed to a diagonal by subtracting the one index from the other. Therefore, when inspecting a new sequence for similarity, one walks along this sequence inspecting each k-tuple. For each of them one looks up the indices where it occurs in the query, computes the index-difference to identify the diagonal and increases a counter for this diagonal. After inspecting the search sequence in this way a diagonal with a high count is likely to contain a well-matching region. In terms of the exection time, this procedure is only linear in the length of the database sequence and can easily be iterated for a whole database. Of course this rough outline needs to be adapted to focus on regions on diagonals where the match density is high and link nearby, good diagonals into alignments. Comments are very welcome. luz@molgen.mpg.de |