Experiments on Large Scale DNA Sequence Assembly

S. Rao Kosaraju

Department of Computer Science, The Johns Hopkins University, Baltimore, MD 21218-2694


We consider the problem of assembling a given set of DNA sequence fragments into a single string. A simple version of this problem is known as the superstring problem. A superstring for a given set of strings {s1,...,sm} is a string which contains each si as a substring. There is an extensive literature on the construction of shortest length superstrings. We first show that the greedy heuristic for the construction of a superstring can be implemented as a linear-time algorithm. We then generalize the problem to several DNA string assembly problems and develop algorithms for them which are generalizations of the simple greedy algorithm.

We have implemented the most general problem, which closely matches the real DNA string assembly problem. This takes into consideration base-calting errors and base-calling ambiguities. We report the success of our efforts in assembling strings.

We discuss some interesting new theoretical observations that resulted from this effort.

This work is joint with Prof. Arthur L. Delcher.


Program
DIMACS Homepage
Contacting the Center
Document last modified on March 28, 2000.