Reconstructing Strings from
Substrings
Steven S. Skiena
Department of Computer Science
State University of New York
Stony Brook, NY 11794-4400
Suppose that there was an unknown string S, over a known alphabet,
containing the secret to life. We are permitted to ask questions of
the form "is s a substring of S?'', where s is a specific query
string. We are not told where s occurs in S, nor how many times it
occurs, just whether or not s a substring of S. Our goal is to
determine the exact contents of S using as few queries as possible.
Although this tale is perhaps too dramatic, it is not completely
inaccurate. The problem arises in sequencing by hybridization, a new
and promising approach to DNA sequencing which offers the potential of
reduced cost and higher throughput over traditional gel-based
approaches. In this talk, we will discuss the application to
sequencing by hybridization and present the following results:
- We provide tight bounds on the complexity of reconstructing
unknown strings from substring queries. Our lower bound, which holds
even for a stronger model which returns the number of occurrences of s
as a substring of S, relies on interesting arguments based on
de Bruijn sequences.
- We demonstrate that subsequence queries are significantly more powerful
than substring queries, matching the information theoretic lower bound.
- In certain applications, something may already be known about the
unknown string, and hence it be determined faster than an arbitrary
string. We show that building an optimal decision tree is NP-complete,
then give an approximation algorithm which gives trees within a
constant multiplicative factor of optimal.
This is joint work with Gopalakrishnan Sundaram.
Program
DIMACS Homepage
Contacting the Center
Document last modified on March 28, 2000.