### DIMACS Theoretical Computer Science Seminar

Title: Comparing Biological Sequences with Segment Rearrangements

Speaker: **S. Muthu Muthukrishnan**, DIMACS, Rutgers University

Date: February 2, 2004 3:30-4:30pm

Location: DIMACS Center, CoRE Bldg, Room 433, Rutgers University, Busch Campus, Piscataway, NJ

Abstract:

Computational genomics involves comparing sequences based on
``similarity'' for detecting evolutionary and functional
relationships. Until very recently, available portions of the human
genome sequence (and that of other species) were fairly short and
sparse. Most sequencing effort was focused on genes and other short
units; similarity between such sequences was measured based on
character level differences. However with the advent of whole genome
sequencing technology there is emerging consensus that the measure of
similarity between long genome sequences must capture the
rearrangements of large segments found in abundance in the human
genome.

We abstract the general problem of computing sequence similarity in
the presence of segment rearrangements. This problem is closely
related to computing the smallest grammar for a string or the block
edit distance between two strings. Our problem, like these other
problems, is NP hard. The main result I will present is a simple
$O(1)$ factor approximation algorithm for this problem. In contrast,
best known approximations for the related problems are factor
$\Omega(\log n)$ off from the optimal. Our algorithm works in linear
time, and in one pass. In proving our result, we relate sequence
similarity measures based on different segment rearrangements, to each
other, tight up to constant factors.

This is joint work with Funda Ergun (CWRU) and Cenk Sahinalp (SFU).
See Webpage: http://www.cs.rutgers.edu/~muthu/theory.html