DIMACS Seminar on Math and CS in Biology


Finding Approximate Repetitions in Sequences


Jeanette Schmidt
Polytechnic University


DIMACS Seminar Room 431
CoRE Building
Rutgers University


1:00 PM
Tuesday, November 14, 1995


We discuss algorithms to detect repeated regions in DNA and protein sequences. Repeated patterns of various sizes, and with various degrees of similarity, make up a significant portion of DNA. These repeated regions have been associated with many important functions. Many of these repeated regions are difficult to detect because the repetitions are inexact, sometimes exhibiting a low percentage of identical symbols. We discuss why some obvious approaches fail in identifying all significant repeated regions, under a general substitution matrix. We then present an algorithm for finding all locally optimal pairs of regions in a string that exhibit a sufficiently high degree of similarity. One feature of the algorithm is a data structure that when given a sequence S of size m and a pattern of size k, can quickly report (in O(log m) time) the score of the best alignment of any prefix of the pattern with any substring of S. The construction of the data structure takes O(mk log (m+k)) time. Queries about ``the best prefix of the pattern'' matching a given substring of S can be answered in O(1) time.

11/20: Mona Singh, Princeton University 
       On Learning-based Algorithms for Protein Motif Recognition

11/28: Fred Hughson, Princeton, Chemistry
       On protein structure

12/5:  Doug Deutschman, Cornell, Ecology
       Max likelihood models of forest ecology

12/12: Alex Schaffer, NIH

Document last modified on November 8, 1995