DIMACS Seminar on Math and CS in Biology

Title:

Finding Approximate Repetitions in Sequences

Speaker:

Jeanette Schmidt

Polytechnic University

Place:

DIMACS Seminar Room 431

CoRE Building

Rutgers University

Time:

1:00 PM

Tuesday, November 14, 1995

Abstract:

We discuss algorithms to detect repeated regions in DNA and protein sequences. Repeated patterns of various sizes, and with various degrees of similarity, make up a significant portion of DNA. These repeated regions have been associated with many important functions. Many of these repeated regions are difficult to detect because the repetitions are inexact, sometimes exhibiting a low percentage of identical symbols. We discuss why some obvious approaches fail in identifying all significant repeated regions, under a general substitution matrix. We then present an algorithm for finding all locally optimal pairs of regions in a string that exhibit a sufficiently high degree of similarity. One feature of the algorithm is a data structure that when given a sequence S of size m and a pattern of size k, can quickly report (in O(log m) time) the score of the best alignment of any prefix of the pattern with any substring of S. The construction of the data structure takes O(mk log (m+k)) time. Queries about ``the best prefix of the pattern'' matching a given substring of S can be answered in O(1) time.

11/20: Mona Singh, Princeton University 
       On Learning-based Algorithms for Protein Motif Recognition

11/28: Fred Hughson, Princeton, Chemistry
       On protein structure

12/5:  Doug Deutschman, Cornell, Ecology
       Max likelihood models of forest ecology

12/12: Alex Schaffer, NIH





Document last modified on November 8, 1995