DIMACS Seminar on Math and CS in Biology
Title:
Finding Approximate Repetitions in Sequences
Speaker:
- Jeanette Schmidt
- Polytechnic University
Place:
- DIMACS Seminar Room 431
- CoRE Building
- Rutgers University
Time:
- 1:00 PM
- Tuesday, November 14, 1995
Abstract:
We discuss algorithms to detect repeated regions in DNA and
protein sequences. Repeated patterns of various sizes, and with
various degrees of similarity, make up a significant portion of
DNA. These repeated regions have been associated with many important
functions. Many of these repeated regions are difficult to detect
because the repetitions are inexact, sometimes exhibiting a low
percentage of identical symbols. We discuss why some obvious
approaches fail in identifying all significant repeated regions, under
a general substitution matrix. We then present an algorithm for
finding all locally optimal pairs of regions in a string that exhibit
a sufficiently high degree of similarity. One feature of the
algorithm is a data structure that when given a sequence S of size m
and a pattern of size k, can quickly report (in O(log m) time) the
score of the best alignment of any prefix of the pattern with any
substring of S. The construction of the data structure takes O(mk log
(m+k)) time. Queries about ``the best prefix of the pattern''
matching a given substring of S can be answered in O(1) time.
11/20: Mona Singh, Princeton University
On Learning-based Algorithms for Protein Motif Recognition
11/28: Fred Hughson, Princeton, Chemistry
On protein structure
12/5: Doug Deutschman, Cornell, Ecology
Max likelihood models of forest ecology
12/12: Alex Schaffer, NIH
Document last modified on November 8, 1995