Reconstructing Strings from Substrings

Steven S. Skiena

Department of Computer Science
State University of New York
Stony Brook, NY 11794-4400

Suppose that there was an unknown string S, over a known alphabet, containing the secret to life. We are permitted to ask questions of the form "is s a substring of S?'', where s is a specific query string. We are not told where s occurs in S, nor how many times it occurs, just whether or not s a substring of S. Our goal is to determine the exact contents of S using as few queries as possible.

Although this tale is perhaps too dramatic, it is not completely inaccurate. The problem arises in sequencing by hybridization, a new and promising approach to DNA sequencing which offers the potential of reduced cost and higher throughput over traditional gel-based approaches. In this talk, we will discuss the application to sequencing by hybridization and present the following results:

We provide tight bounds on the complexity of reconstructing unknown strings from substring queries. Our lower bound, which holds even for a stronger model which returns the number of occurrences of s as a substring of S, relies on interesting arguments based on de Bruijn sequences.
We demonstrate that subsequence queries are significantly more powerful than substring queries, matching the information theoretic lower bound.
In certain applications, something may already be known about the unknown string, and hence it be determined faster than an arbitrary string. We show that building an optimal decision tree is NP-complete, then give an approximation algorithm which gives trees within a constant multiplicative factor of optimal.

This is joint work with Gopalakrishnan Sundaram.

Program

DIMACS Homepage

Contacting the Center
Document last modified on March 28, 2000.