Short Superstrings and the Structure of Overlapping Strings

Chris Armen and Clifford Stein

Department of Computer Science
Dartmouth College, Hanover, NH 03755

Given a collection S of strings over some alphabet, a superstring A is a string which contains each of the elements of S as a substring; that is, for each string s in S, there is a block of |s| consecutive characters in A which match s exactly. In the shortest superstring problem, A is further required to be of minimum length.

This problem can be viewed as an abstraction of the sequence assembly problem and as such has attracted considerable recent attention. It is NP-hard, and several constant-factor approximation algorithms have been proposed.

In this talk we describe our 2.75-approximation algorithm for the superstring problem. This result not only represents a significant improvement in the best known approximation ratio for the problem, but also reveals a great deal about the structure of overlapping strings. While previous algorithms have grown increasingly sophisticated, their gains have come in a fundamentally graph-theoretic setting. In this sense they have addressed a more general problem than the one at hand. We believe that our approach, because it deals specifically with the structure of strings with large amounts of overlap, may be better able to address some of the feature of the underlying sequence assembly problem.

DIMACS Homepage
Contacting the Center
Document last modified on March 28, 2000.