DIMACS/Dept. of Computer Science Colloquium Talk


Title: Locally Consistent Parsing with Applications to Approximate String Comparisons

Speaker: S. Cenk Sahinalp, Simon Fraser University

Date: Tuesday, January 31, 2006 11:00am - 12:15pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ


Abstract:

Locally consistent parsing (LCP) is a method for partitioning a string from any given alphabet into short substrings in a "consistent'' fashion, i.e., identical substrings are partitioned identically (with the possible exception of their margins). LCP was introduced more than 10 years ago. Since then, it has been applied to solve a number of fundamental problems in combinatorial pattern matching; well known examples include the first work optimal parallel algorithm for suffix tree construction, the first subquadratic algorithm for string matching under edit distance and the first optimal communication protocol for exchanging strings that have been subject to character and block edit operations.

In this talk we will review applications of LCP to approximately computing the standard edit distance and some of its variants between a pair of strings in near linear time. The standard edit distance between two strings can be computed exactly in quadratic time for general alphabets and slightly under quadratic time for constant-size alphabets. The more involved block edit distance and its variants, on the other hand, are all known to be NP-hard to compute. The problem of efficiently computing these edit distances, even approximately, is of significance, especially in computational biology, where the data is very large and thus fast algorithms are highly desirable.

We will describe a number of LCP based algorithms that approximately compute both the standard and the block edit distance variants in near linear time. One of the most interesting results we will focus on is based on an embedding of strings of size n into strings of size n/r for any value of the contraction parameter r. This embedding preserves the edit distance between two strings within a factor of O(r^{1+o(1)}), which is (almost) optimal. Together with some annotations, the embedding is used to compute the edit distance between two strings of size n within an approximation factor of n^{1/3 + o(1)} in near linear time