### DIMACS/Dept. of Computer Science Colloquium Talk

Title: Locally Consistent Parsing with Applications to Approximate String Comparisons

Speaker: **S. Cenk Sahinalp**, Simon Fraser University

Date: Tuesday, January 31, 2006 11:00am - 12:15pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ

Abstract:

Locally consistent parsing (LCP) is a method for partitioning a
string from any given alphabet into short substrings in a "consistent''
fashion, i.e., identical substrings are partitioned identically (with the
possible exception of their margins). LCP was introduced more than 10 years
ago. Since then, it has been applied to solve a number of fundamental
problems in combinatorial pattern matching; well known examples include the
first work optimal parallel algorithm for suffix tree construction, the
first subquadratic algorithm for string matching under edit distance and
the first optimal communication protocol for exchanging strings that have
been subject to character and block edit operations.

In this talk we will review applications of LCP to approximately computing
the standard edit distance and some of its variants between a pair of
strings in near linear time. The standard edit distance between two strings
can be computed exactly in quadratic time for general alphabets and
slightly under quadratic time for constant-size alphabets. The more
involved block edit distance and its variants, on the other hand, are all
known to be NP-hard to compute. The problem of efficiently computing these
edit distances, even approximately, is of significance, especially in
computational biology, where the data is very large and thus fast
algorithms are highly desirable.

We will describe a number of LCP based algorithms that approximately
compute both the standard and the block edit distance variants in near
linear time. One of the most interesting results we will focus on is based
on an embedding of strings of size n into strings of size n/r for any value
of the contraction parameter r. This embedding preserves the edit distance
between two strings within a factor of O(r^{1+o(1)}), which is (almost)
optimal. Together with some annotations, the embedding is used to compute
the edit distance between two strings of size n within an approximation
factor of n^{1/3 + o(1)} in near linear time