### Reconstructing Strings from
Substrings

**Steven S. Skiena**
Department of Computer Science

State University of New York

Stony Brook, NY 11794-4400

Suppose that there was an unknown string *S*, over a known alphabet,
containing the secret to life. We are permitted to ask questions of
the form "is *s* a substring of *S*?'', where *s* is a specific query
string. We are not told where *s* occurs in *S*, nor how many times it
occurs, just whether or not *s* a substring of *S*. Our goal is to
determine the exact contents of *S* using as few queries as possible.

Although this tale is perhaps too dramatic, it is not completely
inaccurate. The problem arises in sequencing by hybridization, a new
and promising approach to DNA sequencing which offers the potential of
reduced cost and higher throughput over traditional gel-based
approaches. In this talk, we will discuss the application to
sequencing by hybridization and present the following results:

- We provide tight bounds on the complexity of reconstructing
unknown strings from substring queries. Our lower bound, which holds
even for a stronger model which returns the number of occurrences of s
as a substring of
*S*, relies on interesting arguments based on
de Bruijn sequences.

- We demonstrate that subsequence queries are significantly more powerful
than substring queries, matching the information theoretic lower bound.

- In certain applications, something may already be known about the
unknown string, and hence it be determined faster than an arbitrary
string. We show that building an optimal decision tree is NP-complete,
then give an approximation algorithm which gives trees within a
constant multiplicative factor of optimal.

This is joint work with Gopalakrishnan Sundaram.

Program

DIMACS Homepage

Contacting the Center

Document last modified on March 28, 2000.